On the computation of multidimensional Aggregates

Slides:



Advertisements
Similar presentations
김예슬 김원석 김세환. Info Northcutt Bikes Northcutt Bikes The Forecasting problem The Forecasting problem The solution 1~6 The.
Advertisements

Copyright © 2012 Pearson Education, Inc. Publishing as Prentice Hall
자료구조: CHAP 7 트리(1) 순천향대학교 컴퓨터공학과 하 상 호.
1. 자료구조 개념 (1) 자료구조의 개요 1) 자료구조 ① 일련의 자료들을 조직하고 구조화하는 것 ② 자료의 표현과 그것과 관련된 연산 2) 자료구조에 따라 저장공간의 효율성과 프로그램의 실행시간이 달라짐 3) 어떠한 자료구조에서도 필요한 모든 연산들을 처리하는 것이.
Lecture 9 프로그램 실행의 비용 computation cost – 시간 time, 메모리 memory – tractable vs intractable problems.
Maximum Flow.
Shortest Path Algorithm
Internet Computing KUT Youn-Hee Han
Mesh Saliency 김 종 현.
Delivery and Routing of IP Packets
Information Technology
7장 : 캐시와 메모리.
Ch.04 Greedy Method (탐욕적 방법)
12. 데이터베이스 설계.
CHAP 10 : 그래프 C로 쉽게 풀어쓴 자료구조 생능출판사 학기 데이터구조.
CHAP 10:그래프 C로 쉽게 풀어쓴 자료구조 생능출판사 2005.
CHAP 10 : 그래프 C로 쉽게 풀어쓴 자료구조 Slide 1 (of 62).
쉽게 배우는 알고리즘 5장. 검색트리
오라클 데이터베이스 성능 튜닝.
Numerical Analysis - preliminaries -
6장. 물리적 데이터베이스 설계 물리적 데이터베이스 설계
CAVE : Channel-Aware Buffer Management Scheme for Solid State Disk
Genetic Algorithm 신희성.
쉽게 배우는 알고리즘 9장. 그래프 알고리즘.
Dynamic Programming.
Ch. 10 Algorithm Efficiency & Sorting
Ch. 10 Algorithm Efficiency & Sorting
Internet Computing KUT Youn-Hee Han
알고리즘(Algorithm) – Greedy Method (탐욕적 방법)
Chapter 10 그래프(graph) SANGJI University Kwangman KO
Cluster Analysis (군집 분석)
숭실대학교 마이닝연구실 김완섭 2009년 2월 8일 아이디어  - 상관분석에 대한 연구
for Robust Facial Landmark Localization
운영체제 (Operating Systems)
LSM-trie: An LSM-tree-based Ultra-Large Key-Value Store for Small Data
10장. 그래프 알고리즘.
그래프(graph) 연결되어 있는 객체 간의 관계를 표현하는 자료구조 가장 일반적인 자료구조 형태
Chapter 10. 파일 시스템 인터페이스(File System Interface)
파일 시스템 인터페이스(File System Interface)
Fault Diagnosis for Embedded Read-Only Memories
Computer System Architecture
SQL Server 7.0 세미나 (Performance Tuning)
CHAPTER 6 그래프.
스케줄링 (Scheduling) 시스템 내부시간(time in the system): 스케줄링 문제
제6장 교착상태 OS 컴퓨터 운영체제 Operating Systems
운영체제(Operating System)
정보 추출기술 (Data Mining Techniques ) : An Overview
Dynamic Programming.
Chapter 12 Memory Organization
그래프와 트리 (Graphs and Trees)
Internet Computing KUT Youn-Hee Han
CHAP 10 : 그래프.
How I Approach Tuning a SQL Statement
히스토그램 그리고 이진화 This course is a basic introduction to parts of the field of computer vision. This version of the course covers topics in 'early' or 'low'
스케줄링 (Scheduling) 시스템 내부시간(time in the system): 스케줄링 문제
HMM 기반 연속음성인식 베이스라인 시스템 서강대학교 음성언어처리연구실.
점화와 응용 (Recurrence and Its Applications)
1. 관계 데이터 모델 (1) 관계 데이터 모델 정의 ① 논리적인 데이터 모델에서 데이터간의 관계를 기본키(primary key) 와 이를 참조하는 외래키(foreign key)로 표현하는 데이터 모델 ② 개체 집합에 대한 속성 관계를 표현하기 위해 개체를 테이블(table)
The general form of 0-1 programming problem based on DNA computing
데이터 베이스의 내부 구조.
I/O Management and Disk Scheduling
자료구조 자료구조의 개요 배열 선형 리스트 스택 큐 트리.
4. 분자 상호 작용의 네트워크 분석 4.1 네트워크 표현과 계산
MST – Kruskal 알고리즘 (추상적)
Dynamic Graph Query Primitives for SDN-based Cloud Network Management Ramya Raghavendra, Jorge Lobo, Kang-Won Lee 2012 HotSDN 정보통신공학과.
[CPA340] Algorithms and Practice Youn-Hee Han
Aggregated K-nearest neighbor queries for High – dimensional data Eojin Yun, Dept. of Computer Science and Engineering, POSTECH. Motivation 만약.
Traditional Methods – Part 1
Chapter 7: Deadlocks.
가상 기억장치 (Virtual Memory)
Presentation transcript:

On the computation of multidimensional Aggregates 데이타베이스 연구실 김 은정 982COG31@ewha.ac.kr 2018-11-27 on the computation of multidimensional aggregates

on the computation of multidimensional aggregates Contents Cube operator Sort-based methods Hash-based methods evaluation Overlap methods comparison 2018-11-27 on the computation of multidimensional aggregates

Optimizations for computing group-bys(1) Smallest-parent - 이전에 계산된 group-by중 가장 작은 것으로 부터 group-by 계산 Cache-results - disk I/o를 줄이기 위해 다른 group-by 계산된 결과 이용 Amortize-scans - disk scan 한번으로 가능한 많은 group-by 계산 Share-sorts specific to the sort-based algorithm 여러 group-by 간에 sorting-cost를 공유하는 방법 2018-11-27 on the computation of multidimensional aggregates

Optimizations for computing group-bys(2) 5.Share-partitions specific to the hash-based algorithm Hash-table이 메모리에 비해 큰 경우, 데이터를 분할하여 , 메모리에 맞는 각 부분들에 대해 aggregation 수행 Optimizations are often contradictory 2018-11-27 on the computation of multidimensional aggregates

on the computation of multidimensional aggregates Sort-based methods(1) Pipe sort -optimizations share-sorts, smallest-parent 결합 minimum total cost - optimizations cache-results, amortize-scans 포함 pipelined fasion 2018-11-27 on the computation of multidimensional aggregates

on the computation of multidimensional aggregates Sort-based methods(2) Algorithm pipesort Assumption -각각의 group-by에 대해 서로 다른 값들의 수를 추정한다고 가정 Input : a search lattice search lattice for a data cube : graph Vertex : a group-by of the cube Directed edge i>j : j는 i로 부터 생성할 수 있고, j는 i보다 반드시 하나 적은 attribute를 갖는다. (i is parent of j) Level k : 정확하게 k attributes 를 포함하는 모든 group-bys 2018-11-27 on the computation of multidimensional aggregates

on the computation of multidimensional aggregates Sort-based methods(3) -eij : labled with two costs A(eij) : I가 이미 정렬되어 있을때, I에서 j를 계 산하는 비용 S(eij) : I가 정렬되어 있지 않을때, I에서 j를 계 산하는 비용 Output : a subgragh of the search lattice each group-by Find an output O that has minimum sum of edge costs 2018-11-27 on the computation of multidimensional aggregates

on the computation of multidimensional aggregates Sort-based methods(4) Algorithm Pipesort (input : search lattice with the A() and S() edges costs ) For level k = 0 to N –1 /* find how to generate level k from level k+1 */ Generate-Plan (k+1 ->k); For each group-by g in level k+1 Fix the sort order of g as the order of the group-by connected to g by an A() edge; 2018-11-27 on the computation of multidimensional aggregates

on the computation of multidimensional aggregates Sort-based methods(5) Generate-Plan(k+1 -> k) Make k additional copies of each level k+1 vertex; Connect each copy vertex to the same set of level k vertices as the original vertex; Assign cost A(eij) to edge eij from the original vertex and S(eij) to edge from the copy vertex; Find the minimum cost matching on the transformed levels 2018-11-27 on the computation of multidimensional aggregates

on the computation of multidimensional aggregates Sort-based methods(6) 2018-11-27 on the computation of multidimensional aggregates

on the computation of multidimensional aggregates Sort-based methods(7) 2018-11-27 on the computation of multidimensional aggregates

on the computation of multidimensional aggregates Sort-based methods(8) 2018-11-27 on the computation of multidimensional aggregates

on the computation of multidimensional aggregates Hash-based methods(1) Careful memory allocation of multiple hash tables Optimizations : cache-results, amortize-scans, smallest-parent share partition Data to be aggregated too large for the hash tables to fit in memory partition the data on some attribute memory allocation - What group-bys to compute together? –memory not sufficent - NP-complete  heuristic solution 2018-11-27 on the computation of multidimensional aggregates

on the computation of multidimensional aggregates Hash-based methods(2) Algorithm Pipehash (input : search lattice with group-by estimated sizes) Initialize worklist with MST of the search lattice; While worklist is not empty pick any tree T from the worklist; T’ = select-subtree of T to be executed next; compute-subtree T’; 2018-11-27 on the computation of multidimensional aggregates

on the computation of multidimensional aggregates Hash-based methods(3) 2018-11-27 on the computation of multidimensional aggregates

Experimental evaluation(1) RS/6000 250 workstation AIX 3.2.5 Memory : 256MB Buffer size : 32MB Datasets : 2GB SCSI 3.5”drive with sequential throughput of about 1.5MB/second Perfomance results 2 to 8 times faster Pipesort : close to lower bound ( max. difference is 22%) Pipehash : close to lower bound ( max. difference is 8%) For most datasets pipesort is superior 2018-11-27 on the computation of multidimensional aggregates

Experimental evaluation(2) 2018-11-27 on the computation of multidimensional aggregates

Options for computing the CUBE Cube를 계산함에 있어,다양한 cuboids의 계산을 중첩(overlap)함으로써, disk access를 최소화하는 sort-based method -> memory가 제한된 상태에서도 잘 수행. Cuboid = group-by Base cuboid = 모든 attribute 에 대한 group-by Independent method -(multiple independent group-by queries) Independently compute each cuboid from base cuboid, using standard group-by techniques -> poor performance Parent method -(hierarch of group-by queries) - better than I.M., since parent is likely to be much smaller than the base cuboid Overlap method 2018-11-27 on the computation of multidimensional aggregates

on the computation of multidimensional aggregates Overlap method(1) Sort-based overlap method - 다른 cuboids의 계산이 중첩되고, 모든 cuboid는 정렬 순서내에서 계산됨 Multi-pass method Algorithm은 base cuboid의 sort로 시작 각 cuboid는 다시 sort되지 않고, 존재하는 sorted run merge 각 cuboid 의 계산에는 memory가 필요한데, 대부분의 경우,전체 cube가 너무 커서 한번의 scan으로 불가능 서로 다른 cuboid 의 중첩을 해야 partition이 계산 단위 partition 이 되자마자 tuple들이 자손 cuboid 를 위해 pipelined되거나 disk로 쓰여지고, 그memory 는 다음 partition 을 위해 쓰여짐 모든 cuboid 가 계산될때까지 반복 2018-11-27 on the computation of multidimensional aggregates

on the computation of multidimensional aggregates Overlap method(2) DAG : node :cuboid edge : k -> k-1 attribute root : base cuboid Choosing a tree (fig.8) - minimize the size of the partitions of a cuboid minimum memory is needed for its computation Maximum match in their sort orders Choosing a set of cuboids for overlapped computation (fig.10) optimal allocation problem-> NP-hard Heuristic allocation (Breadth First) search order 사용 Computing a cuboid from its parent 2018-11-27 on the computation of multidimensional aggregates

on the computation of multidimensional aggregates Overlap method(3) Cube {A,B,C,D}에 대한 DAG 2018-11-27 on the computation of multidimensional aggregates

on the computation of multidimensional aggregates Overlap method(4) 2018-11-27 on the computation of multidimensional aggregates

on the computation of multidimensional aggregates Overlap method(5) 2018-11-27 on the computation of multidimensional aggregates

on the computation of multidimensional aggregates Overlap method(6) Available memory : 25 pages -> BF allocaion 3 subtree 생성 2018-11-27 on the computation of multidimensional aggregates

Overlap method - 성능 평가(8) 2018-11-27 on the computation of multidimensional aggregates

Overlap method - 성능 평가(9) 2018-11-27 on the computation of multidimensional aggregates

on the computation of multidimensional aggregates comparison Pipesort -single pipeline -choosing parent -> group-by size 고려 Overlap multiple pipeline(partition을 이용해 더 많이 overlap하기위함) choosing parent ->sort order의 maximum match가 중요 Future work -성능비교 - 두 방법을 결합한 새로운 algorithm 2018-11-27 on the computation of multidimensional aggregates