Download presentation
Presentation is loading. Please wait.
1
On the computation of multidimensional Aggregates
데이타베이스 연구실 김 은정 on the computation of multidimensional aggregates
2
on the computation of multidimensional aggregates
Contents Cube operator Sort-based methods Hash-based methods evaluation Overlap methods comparison on the computation of multidimensional aggregates
3
Optimizations for computing group-bys(1)
Smallest-parent - 이전에 계산된 group-by중 가장 작은 것으로 부터 group-by 계산 Cache-results - disk I/o를 줄이기 위해 다른 group-by 계산된 결과 이용 Amortize-scans - disk scan 한번으로 가능한 많은 group-by 계산 Share-sorts specific to the sort-based algorithm 여러 group-by 간에 sorting-cost를 공유하는 방법 on the computation of multidimensional aggregates
4
Optimizations for computing group-bys(2)
5.Share-partitions specific to the hash-based algorithm Hash-table이 메모리에 비해 큰 경우, 데이터를 분할하여 , 메모리에 맞는 각 부분들에 대해 aggregation 수행 Optimizations are often contradictory on the computation of multidimensional aggregates
5
on the computation of multidimensional aggregates
Sort-based methods(1) Pipe sort -optimizations share-sorts, smallest-parent 결합 minimum total cost - optimizations cache-results, amortize-scans 포함 pipelined fasion on the computation of multidimensional aggregates
6
on the computation of multidimensional aggregates
Sort-based methods(2) Algorithm pipesort Assumption -각각의 group-by에 대해 서로 다른 값들의 수를 추정한다고 가정 Input : a search lattice search lattice for a data cube : graph Vertex : a group-by of the cube Directed edge i>j : j는 i로 부터 생성할 수 있고, j는 i보다 반드시 하나 적은 attribute를 갖는다. (i is parent of j) Level k : 정확하게 k attributes 를 포함하는 모든 group-bys on the computation of multidimensional aggregates
7
on the computation of multidimensional aggregates
Sort-based methods(3) -eij : labled with two costs A(eij) : I가 이미 정렬되어 있을때, I에서 j를 계 산하는 비용 S(eij) : I가 정렬되어 있지 않을때, I에서 j를 계 산하는 비용 Output : a subgragh of the search lattice each group-by Find an output O that has minimum sum of edge costs on the computation of multidimensional aggregates
8
on the computation of multidimensional aggregates
Sort-based methods(4) Algorithm Pipesort (input : search lattice with the A() and S() edges costs ) For level k = 0 to N –1 /* find how to generate level k from level k+1 */ Generate-Plan (k+1 ->k); For each group-by g in level k+1 Fix the sort order of g as the order of the group-by connected to g by an A() edge; on the computation of multidimensional aggregates
9
on the computation of multidimensional aggregates
Sort-based methods(5) Generate-Plan(k+1 -> k) Make k additional copies of each level k+1 vertex; Connect each copy vertex to the same set of level k vertices as the original vertex; Assign cost A(eij) to edge eij from the original vertex and S(eij) to edge from the copy vertex; Find the minimum cost matching on the transformed levels on the computation of multidimensional aggregates
10
on the computation of multidimensional aggregates
Sort-based methods(6) on the computation of multidimensional aggregates
11
on the computation of multidimensional aggregates
Sort-based methods(7) on the computation of multidimensional aggregates
12
on the computation of multidimensional aggregates
Sort-based methods(8) on the computation of multidimensional aggregates
13
on the computation of multidimensional aggregates
Hash-based methods(1) Careful memory allocation of multiple hash tables Optimizations : cache-results, amortize-scans, smallest-parent share partition Data to be aggregated too large for the hash tables to fit in memory partition the data on some attribute memory allocation - What group-bys to compute together? –memory not sufficent - NP-complete heuristic solution on the computation of multidimensional aggregates
14
on the computation of multidimensional aggregates
Hash-based methods(2) Algorithm Pipehash (input : search lattice with group-by estimated sizes) Initialize worklist with MST of the search lattice; While worklist is not empty pick any tree T from the worklist; T’ = select-subtree of T to be executed next; compute-subtree T’; on the computation of multidimensional aggregates
15
on the computation of multidimensional aggregates
Hash-based methods(3) on the computation of multidimensional aggregates
16
Experimental evaluation(1)
RS/ workstation AIX 3.2.5 Memory : 256MB Buffer size : 32MB Datasets : 2GB SCSI 3.5”drive with sequential throughput of about 1.5MB/second Perfomance results 2 to 8 times faster Pipesort : close to lower bound ( max. difference is 22%) Pipehash : close to lower bound ( max. difference is 8%) For most datasets pipesort is superior on the computation of multidimensional aggregates
17
Experimental evaluation(2)
on the computation of multidimensional aggregates
18
Options for computing the CUBE
Cube를 계산함에 있어,다양한 cuboids의 계산을 중첩(overlap)함으로써, disk access를 최소화하는 sort-based method -> memory가 제한된 상태에서도 잘 수행. Cuboid = group-by Base cuboid = 모든 attribute 에 대한 group-by Independent method -(multiple independent group-by queries) Independently compute each cuboid from base cuboid, using standard group-by techniques -> poor performance Parent method -(hierarch of group-by queries) - better than I.M., since parent is likely to be much smaller than the base cuboid Overlap method on the computation of multidimensional aggregates
19
on the computation of multidimensional aggregates
Overlap method(1) Sort-based overlap method - 다른 cuboids의 계산이 중첩되고, 모든 cuboid는 정렬 순서내에서 계산됨 Multi-pass method Algorithm은 base cuboid의 sort로 시작 각 cuboid는 다시 sort되지 않고, 존재하는 sorted run merge 각 cuboid 의 계산에는 memory가 필요한데, 대부분의 경우,전체 cube가 너무 커서 한번의 scan으로 불가능 서로 다른 cuboid 의 중첩을 해야 partition이 계산 단위 partition 이 되자마자 tuple들이 자손 cuboid 를 위해 pipelined되거나 disk로 쓰여지고, 그memory 는 다음 partition 을 위해 쓰여짐 모든 cuboid 가 계산될때까지 반복 on the computation of multidimensional aggregates
20
on the computation of multidimensional aggregates
Overlap method(2) DAG : node :cuboid edge : k -> k-1 attribute root : base cuboid Choosing a tree (fig.8) - minimize the size of the partitions of a cuboid minimum memory is needed for its computation Maximum match in their sort orders Choosing a set of cuboids for overlapped computation (fig.10) optimal allocation problem-> NP-hard Heuristic allocation (Breadth First) search order 사용 Computing a cuboid from its parent on the computation of multidimensional aggregates
21
on the computation of multidimensional aggregates
Overlap method(3) Cube {A,B,C,D}에 대한 DAG on the computation of multidimensional aggregates
22
on the computation of multidimensional aggregates
Overlap method(4) on the computation of multidimensional aggregates
23
on the computation of multidimensional aggregates
Overlap method(5) on the computation of multidimensional aggregates
24
on the computation of multidimensional aggregates
Overlap method(6) Available memory : 25 pages -> BF allocaion 3 subtree 생성 on the computation of multidimensional aggregates
25
Overlap method - 성능 평가(8)
on the computation of multidimensional aggregates
26
Overlap method - 성능 평가(9)
on the computation of multidimensional aggregates
27
on the computation of multidimensional aggregates
comparison Pipesort -single pipeline -choosing parent -> group-by size 고려 Overlap multiple pipeline(partition을 이용해 더 많이 overlap하기위함) choosing parent ->sort order의 maximum match가 중요 Future work -성능비교 - 두 방법을 결합한 새로운 algorithm on the computation of multidimensional aggregates
Similar presentations