On the computation of multidimensional Aggregates 데이타베이스 연구실 김 은정 982COG31@ewha.ac.kr 2018-11-27 on the computation of multidimensional aggregates
on the computation of multidimensional aggregates Contents Cube operator Sort-based methods Hash-based methods evaluation Overlap methods comparison 2018-11-27 on the computation of multidimensional aggregates
Optimizations for computing group-bys(1) Smallest-parent - 이전에 계산된 group-by중 가장 작은 것으로 부터 group-by 계산 Cache-results - disk I/o를 줄이기 위해 다른 group-by 계산된 결과 이용 Amortize-scans - disk scan 한번으로 가능한 많은 group-by 계산 Share-sorts specific to the sort-based algorithm 여러 group-by 간에 sorting-cost를 공유하는 방법 2018-11-27 on the computation of multidimensional aggregates
Optimizations for computing group-bys(2) 5.Share-partitions specific to the hash-based algorithm Hash-table이 메모리에 비해 큰 경우, 데이터를 분할하여 , 메모리에 맞는 각 부분들에 대해 aggregation 수행 Optimizations are often contradictory 2018-11-27 on the computation of multidimensional aggregates
on the computation of multidimensional aggregates Sort-based methods(1) Pipe sort -optimizations share-sorts, smallest-parent 결합 minimum total cost - optimizations cache-results, amortize-scans 포함 pipelined fasion 2018-11-27 on the computation of multidimensional aggregates
on the computation of multidimensional aggregates Sort-based methods(2) Algorithm pipesort Assumption -각각의 group-by에 대해 서로 다른 값들의 수를 추정한다고 가정 Input : a search lattice search lattice for a data cube : graph Vertex : a group-by of the cube Directed edge i>j : j는 i로 부터 생성할 수 있고, j는 i보다 반드시 하나 적은 attribute를 갖는다. (i is parent of j) Level k : 정확하게 k attributes 를 포함하는 모든 group-bys 2018-11-27 on the computation of multidimensional aggregates
on the computation of multidimensional aggregates Sort-based methods(3) -eij : labled with two costs A(eij) : I가 이미 정렬되어 있을때, I에서 j를 계 산하는 비용 S(eij) : I가 정렬되어 있지 않을때, I에서 j를 계 산하는 비용 Output : a subgragh of the search lattice each group-by Find an output O that has minimum sum of edge costs 2018-11-27 on the computation of multidimensional aggregates
on the computation of multidimensional aggregates Sort-based methods(4) Algorithm Pipesort (input : search lattice with the A() and S() edges costs ) For level k = 0 to N –1 /* find how to generate level k from level k+1 */ Generate-Plan (k+1 ->k); For each group-by g in level k+1 Fix the sort order of g as the order of the group-by connected to g by an A() edge; 2018-11-27 on the computation of multidimensional aggregates
on the computation of multidimensional aggregates Sort-based methods(5) Generate-Plan(k+1 -> k) Make k additional copies of each level k+1 vertex; Connect each copy vertex to the same set of level k vertices as the original vertex; Assign cost A(eij) to edge eij from the original vertex and S(eij) to edge from the copy vertex; Find the minimum cost matching on the transformed levels 2018-11-27 on the computation of multidimensional aggregates
on the computation of multidimensional aggregates Sort-based methods(6) 2018-11-27 on the computation of multidimensional aggregates
on the computation of multidimensional aggregates Sort-based methods(7) 2018-11-27 on the computation of multidimensional aggregates
on the computation of multidimensional aggregates Sort-based methods(8) 2018-11-27 on the computation of multidimensional aggregates
on the computation of multidimensional aggregates Hash-based methods(1) Careful memory allocation of multiple hash tables Optimizations : cache-results, amortize-scans, smallest-parent share partition Data to be aggregated too large for the hash tables to fit in memory partition the data on some attribute memory allocation - What group-bys to compute together? –memory not sufficent - NP-complete heuristic solution 2018-11-27 on the computation of multidimensional aggregates
on the computation of multidimensional aggregates Hash-based methods(2) Algorithm Pipehash (input : search lattice with group-by estimated sizes) Initialize worklist with MST of the search lattice; While worklist is not empty pick any tree T from the worklist; T’ = select-subtree of T to be executed next; compute-subtree T’; 2018-11-27 on the computation of multidimensional aggregates
on the computation of multidimensional aggregates Hash-based methods(3) 2018-11-27 on the computation of multidimensional aggregates
Experimental evaluation(1) RS/6000 250 workstation AIX 3.2.5 Memory : 256MB Buffer size : 32MB Datasets : 2GB SCSI 3.5”drive with sequential throughput of about 1.5MB/second Perfomance results 2 to 8 times faster Pipesort : close to lower bound ( max. difference is 22%) Pipehash : close to lower bound ( max. difference is 8%) For most datasets pipesort is superior 2018-11-27 on the computation of multidimensional aggregates
Experimental evaluation(2) 2018-11-27 on the computation of multidimensional aggregates
Options for computing the CUBE Cube를 계산함에 있어,다양한 cuboids의 계산을 중첩(overlap)함으로써, disk access를 최소화하는 sort-based method -> memory가 제한된 상태에서도 잘 수행. Cuboid = group-by Base cuboid = 모든 attribute 에 대한 group-by Independent method -(multiple independent group-by queries) Independently compute each cuboid from base cuboid, using standard group-by techniques -> poor performance Parent method -(hierarch of group-by queries) - better than I.M., since parent is likely to be much smaller than the base cuboid Overlap method 2018-11-27 on the computation of multidimensional aggregates
on the computation of multidimensional aggregates Overlap method(1) Sort-based overlap method - 다른 cuboids의 계산이 중첩되고, 모든 cuboid는 정렬 순서내에서 계산됨 Multi-pass method Algorithm은 base cuboid의 sort로 시작 각 cuboid는 다시 sort되지 않고, 존재하는 sorted run merge 각 cuboid 의 계산에는 memory가 필요한데, 대부분의 경우,전체 cube가 너무 커서 한번의 scan으로 불가능 서로 다른 cuboid 의 중첩을 해야 partition이 계산 단위 partition 이 되자마자 tuple들이 자손 cuboid 를 위해 pipelined되거나 disk로 쓰여지고, 그memory 는 다음 partition 을 위해 쓰여짐 모든 cuboid 가 계산될때까지 반복 2018-11-27 on the computation of multidimensional aggregates
on the computation of multidimensional aggregates Overlap method(2) DAG : node :cuboid edge : k -> k-1 attribute root : base cuboid Choosing a tree (fig.8) - minimize the size of the partitions of a cuboid minimum memory is needed for its computation Maximum match in their sort orders Choosing a set of cuboids for overlapped computation (fig.10) optimal allocation problem-> NP-hard Heuristic allocation (Breadth First) search order 사용 Computing a cuboid from its parent 2018-11-27 on the computation of multidimensional aggregates
on the computation of multidimensional aggregates Overlap method(3) Cube {A,B,C,D}에 대한 DAG 2018-11-27 on the computation of multidimensional aggregates
on the computation of multidimensional aggregates Overlap method(4) 2018-11-27 on the computation of multidimensional aggregates
on the computation of multidimensional aggregates Overlap method(5) 2018-11-27 on the computation of multidimensional aggregates
on the computation of multidimensional aggregates Overlap method(6) Available memory : 25 pages -> BF allocaion 3 subtree 생성 2018-11-27 on the computation of multidimensional aggregates
Overlap method - 성능 평가(8) 2018-11-27 on the computation of multidimensional aggregates
Overlap method - 성능 평가(9) 2018-11-27 on the computation of multidimensional aggregates
on the computation of multidimensional aggregates comparison Pipesort -single pipeline -choosing parent -> group-by size 고려 Overlap multiple pipeline(partition을 이용해 더 많이 overlap하기위함) choosing parent ->sort order의 maximum match가 중요 Future work -성능비교 - 두 방법을 결합한 새로운 algorithm 2018-11-27 on the computation of multidimensional aggregates