Data Organization Patterns


Data Organization Patterns
21st of May, 2015
Artificial Intelligence Lab | 조연정

Structured to Hierarchical
Partitioning
Binning
Total Order Sorting
Shuffling
21st of Mar, 2015

Structured to Hierarchical

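Structured to Hierarchical: Intent, Motivation, Applicability
A pattern that creates new records from data with different structures.
Intent: convert row-based data into a hierarchical format such as JSON or XML.
Motivation:
- Data migration from an RDBMS to a Hadoop system.
- The number of join operations can be reduced.
Applicability:
- You have data sources that are linked by foreign keys.
- You have row-based, structured data.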

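Structured to Hierarchical: Structure and Consequence
Mapper:
- Loads the data.
- Parses each record into one common format.
- key: identifier / value: data
Reducer: builds the hierarchical data structure from the list of data items.
Consequence: a hierarchical structure grouped by the specified key.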

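Structured to Hierarchical: Known Uses and Performance Analysis
Known Uses: pre-joining data.
Performance Analysis:
- Watch the memory used by the objects that the reducer builds.
- If an extremely large record is produced, it can exceed the JVM heap.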

Structured to Hierarchical: Example
Group the data by combining Post and Comment records.
Post fields: Id, PostTypeId, AcceptedAnswerId, Body, …
Comment fields: Id, PostId, Text, CreationDate, UserId

Posts (input):
    <row Id="1" PostTypeId="1" … />
    <row Id="2" PostTypeId="1" … />
    <row Id="3" PostTypeId="1" … />
    <row Id="4" PostTypeId="1" … />
    <row Id="5" PostTypeId="1" … />
    <row Id="6" PostTypeId="1" … />
    <row Id="7" PostTypeId="1" … />
    <row Id="8" PostTypeId="1" … />
    …

Comments (input):
    <row Id="1" PostId="2" … />
    <row Id="2" PostId="5" … />
    <row Id="3" PostId="6" … />
    <row Id="4" PostId="2" … />
    <row Id="5" PostId="1" … />
    <row Id="6" PostId="5" … />
    <row Id="7" PostId="8" … />
    <row Id="8" PostId="10" … />
    …

Posts + Comments = hierarchical output:
    <post Id="1" PostTypeId="1" … />
    <post Id="3" PostTypeId="1" … />
    <post Id="6" PostTypeId="1" … />
    <post Id="4" PostTypeId="1" … />
    <post Id="2" PostTypeId="1" … />
    …

Structured to Hierarchical: Main (MultipleInputs)
- Specifies a different input path and mapper class for each input.
- Can be omitted when the data is loaded from a single source.

Structured to Hierarchical: Mapper (posts)
- Parses the record.
- output: ( Id, "P" + value )

Structured to Hierarchical: Mapper (comments)
- output: ( PostId, "C" + value )

Structured to Hierarchical: Reducer
- Extracts the data without the leading flag.
- Changes the element tag.
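The mapper and reducer steps above can be sketched in plain Python. This is only an illustration of the flagging idea ("P" for posts, "C" for comments), not the Hadoop code from the slides: the sample records, the `shuffle` helper, and the dictionary output shape are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical sample data: posts are (Id, body), comments are (Id, PostId, text).
posts = [("1", "post one"), ("2", "post two")]
comments = [
    ("1", "2", "first comment"),
    ("4", "2", "second comment"),
    ("5", "1", "third comment"),
]


def post_mapper(post):
    """Emit (Id, "P" + value) so the reducer can recognise a post."""
    post_id, body = post
    return post_id, "P" + body


def comment_mapper(comment):
    """Emit (PostId, "C" + value) so the comment lands on its parent post's key."""
    _comment_id, post_id, text = comment
    return post_id, "C" + text


def shuffle(pairs):
    """Group values by key, standing in for the framework's shuffle step."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reducer(key, values):
    """Strip the one-character flag and nest the comments under their post."""
    post = None
    children = []
    for value in values:
        flag, data = value[0], value[1:]
        if flag == "P":
            post = data
        else:
            children.append(data)
    return {"post_id": key, "post": post, "comments": children}


pairs = [post_mapper(p) for p in posts] + [comment_mapper(c) for c in comments]
hierarchy = [reducer(key, values) for key, values in sorted(shuffle(pairs).items())]
```

Because the reducer holds every comment of one post in memory at once, this sketch also shows where the heap concern from the performance slide comes from: a post with a very large number of comments produces a very large `children` list.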

Partitioning

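Partitioning: Intent, Motivation, Applicability
Moves records into categories, regardless of the order of the records.
Intent: split the similar records of a dataset into separate, smaller datasets.
Motivation: you want to look at one specific subset that is spread across the whole dataset.
Applicability: you must know how many partitions the data should be split into, so run an analysis that determines the number of partitions first.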

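Partitioning: Structure and Consequence
Mapper / Reducer: the identity mapper and identity reducer are used; the partitioner does the work.
Partitioner:
- Splits the data.
- Defines the function that expresses the split condition.
- Decides which reducer each record is sent to ( reducer = partition ).
Consequence: the job's output folder contains one part file for each partition.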

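Partitioning: Known Uses and Performance Analysis
Known uses:
- Partition pruning by continuous value: when analyzing a continuous variable such as a date.
- Partition pruning by category: when analyzing records that match clearly defined categories.
- Sharding: when the data is stored in separate pieces.
Performance Analysis: when the partitions do not hold a similar number of records, assign several reducers to the one large partition.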

Partitioning: Example
Split the records by the year of each user's last access date.
Users fields: Id, Reputation, CreationDate, …

Input:
    <row Id="1" CreationDate="2012"…/>
    <row Id="2" CreationDate="2014"…/>
    <row Id="3" CreationDate="2013"…/>
    <row Id="4" CreationDate="2012"…/>
    <row Id="5" CreationDate="2012"…/>
    <row Id="6" CreationDate="2013"…/>
    <row Id="7" CreationDate="2014"…/>
    <row Id="8" CreationDate="2013"…/>
    <row Id="9" CreationDate="2012"…/>

Output partitions: 2012, 2013, 2014

Partitioning: Main
- Sets the custom partitioner.
- Defines the minimum value used for partitioning.

Partitioning: Mapper
- Extracts the year from LastAccessDate.
- output: ( LastAccessYear, Data )

Partitioning: Partitioner
- Assigns the partition (minLastAccessYear: 2012).
- key: 2012, 2013, 2014
- partition: 0, 1, 2

Partitioning: Reducer
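The key-to-partition mapping on the partitioner slide (2012, 2013, 2014 to 0, 1, 2 with minLastAccessYear 2012) is consistent with subtracting the minimum year from the key. The plain-Python sketch below assumes that rule; the record layout, the `get_partition` name, and the bounds check are illustrative assumptions, not the slide's Hadoop code.

```python
from collections import defaultdict

MIN_LAST_ACCESS_YEAR = 2012  # minimum value configured in the driver (from the slide)
NUM_PARTITIONS = 3           # one reducer per year: 2012, 2013, 2014

# Hypothetical user records with a fixed-width yyyymmdd date.
users = [
    {"Id": "1", "LastAccessDate": "20120101"},
    {"Id": "2", "LastAccessDate": "20140303"},
    {"Id": "3", "LastAccessDate": "20130202"},
    {"Id": "4", "LastAccessDate": "20120504"},
]


def mapper(record):
    """Emit (LastAccessYear, record); the year is the first four characters."""
    return int(record["LastAccessDate"][:4]), record


def get_partition(year, num_partitions=NUM_PARTITIONS):
    """Assumed rule: partition index = year - minimum year."""
    partition = year - MIN_LAST_ACCESS_YEAR
    if not 0 <= partition < num_partitions:
        raise ValueError(f"year {year} falls outside the configured partitions")
    return partition


# Identity reducers: each partition simply collects the records routed to it.
partitions = defaultdict(list)
for year, record in map(mapper, users):
    partitions[get_partition(year)].append(record["Id"])
```

Each entry of `partitions` corresponds to one part file in the job's output folder. The explicit bounds check reflects the applicability note: the number of partitions has to be known before the job runs.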

Binning

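Binning: Intent and Motivation
Moves records into categories, regardless of the order of the records.
Intent: classify the records of a dataset into categories.
Motivation:
- Very similar to Partitioning.
- The data is split in the map phase instead of in a partitioner.
- The reducer can be omitted.
- Each mapper holds one file for every possible output bin, which hurts NameNode scalability and follow-up analysis.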

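Binning: Structure and Consequence
Number of output files = number of mappers * number of bins
Mapper:
- For each record, iterates over the list of conditions, one per bin.
- If the record satisfies a condition, sends it to that bin.
No combiner, reducer, or partitioner is used.
Consequence: each mapper writes one small file per bin, so some post-processing that collects them into larger files has to be run afterwards.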

Binning: Performance Analysis
- Shows the same scalability and performance properties as other map-only jobs.
- No sort, shuffle, or reduce of any kind is performed.
- Most of the processing is done on local data.

Binning: Example
Split the records by the year of each user's last access date.
Users fields: Id, Reputation, CreationDate, …

Input:
    <row Id="1" CreationDate="2012"…/>
    <row Id="2" CreationDate="2014"…/>
    <row Id="3" CreationDate="2013"…/>
    <row Id="4" CreationDate="2012"…/>
    <row Id="5" CreationDate="2012"…/>
    <row Id="6" CreationDate="2013"…/>
    <row Id="7" CreationDate="2014"…/>
    <row Id="8" CreationDate="2013"…/>
    <row Id="9" CreationDate="2012"…/>

Output bins: 2012, 2013, 2014

Binning: Main
- Creates a MultipleOutputs output named "bin".
- Sets its OutputFormat, key format, and value format.

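Binning: Mapper
- Creates a new MultipleOutputs instance.
- Extracts the tags from a value of the form <tag1><tag2><tag3> by removing the > and < characters.
- If a record matches a condition, writes the record to the corresponding bin.
- MultipleOutputs must always be closed.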
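The mapper logic above can be sketched in plain Python as a map-only job. This is a simulation under stated assumptions rather than the MultipleOutputs code from the slide: the bin names in `BINS`, the record layout, and the use of a dictionary keyed by (mapper, bin) to stand in for output files are all hypothetical.

```python
from collections import defaultdict

BINS = ("hadoop", "pig", "hive")  # hypothetical bin conditions: one bin per tag


def extract_tags(raw):
    """Turn "<tag1><tag2>" into ["tag1", "tag2"] by dropping the angle brackets."""
    inner = raw.strip("<>")
    return inner.split("><") if inner else []


def run_mapper(mapper_id, records, outputs):
    """Check every record against every bin condition; no reducer is involved."""
    for record in records:
        for tag in extract_tags(record["Tags"]):
            if tag in BINS:
                # One "file" per (mapper, bin) pair, as in the structure slide.
                outputs[(mapper_id, tag)].append(record["Id"])


# Two input splits, so two mappers run independently.
splits = [
    [{"Id": "1", "Tags": "<hadoop><java>"}, {"Id": "2", "Tags": "<pig><hadoop>"}],
    [{"Id": "3", "Tags": "<hive>"}, {"Id": "4", "Tags": "<python>"}],
]
outputs = defaultdict(list)
for mapper_id, records in enumerate(splits):
    run_mapper(mapper_id, records, outputs)
```

The number of keys in `outputs` is bounded by the number of mappers times the number of bins, which is why the pattern usually needs a follow-up step that merges the small files.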

Total Order Sorting

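Total Order Sorting: Intent, Motivation, Requirement
Intent:
- Put all records of the dataset in order.
- Sort the data in parallel, using a sort key.
Motivation:
- When the output files of a MapReduce job are concatenated, each part is sorted but the dataset as a whole is not.
- Sorted data is rarely needed in MapReduce jobs, so use this pattern with care.
Requirement: the sort key must be comparable (Comparable).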

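Total Order Sorting: Structure (1), Analyze Phase
Determines the ranges of the data (optional step).
- If the distribution of the data does not change rapidly, run it only once.
- In some cases the partitions can be guessed directly.
Mapper: emits ( sort key, null ). The actual records are not involved, so null is used as the value.
Reducer: a single reducer is used, which collects the sort keys.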

Total Order Sorting: Structure (2), Order Phase
Sorts the data.
Mapper: emits ( sort key, record ). The record itself is stored as the value.
Partitioner ( TotalOrderPartitioner ): loads the data ranges from the partition file created in the previous phase and decides which reducer each key goes to.
Performance Analysis: the data has to be loaded and parsed twice, so the cost is high.

Total Order Sorting: Example
Sort the records by each user's last access date.
Users fields: Id, CreationDate, LastAccessDate, …

Input (unsorted):
    <row Id="1" LastAccessDate="20120101"…/>
    <row Id="2" LastAccessDate="20130202"…/>
    <row Id="3" LastAccessDate="20140303"…/>
    <row Id="4" LastAccessDate="20120504"…/>
    <row Id="5" LastAccessDate="20130405"…/>
    <row Id="6" LastAccessDate="20120305"…/>
    <row Id="7" LastAccessDate="20140206"…/>
    <row Id="8" LastAccessDate="20131107"…/>
    <row Id="9" LastAccessDate="20131208"…/>

Output (sorted by LastAccessDate):
    <row Id="21" LastAccessDate="20120101"…/>
    <row Id="267" LastAccessDate="20120202"…/>
    <row Id="33" LastAccessDate="20120303"…/>
    <row Id="15" LastAccessDate="20130405"…/>
    <row Id="64" LastAccessDate="20130607"…/>
    <row Id="79" LastAccessDate="20131106"…/>
    <row Id="800" LastAccessDate="20141017"…/>
    <row Id="90" LastAccessDate="20141208"…/>

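Total Order Sorting: Main
- inputPath: input file
- partitionFile: partition list
- outputStage: intermediate result
- outputOrder: output file
- sampleRate: sampling rate
Prepares the job used for sampling.
Sets the outputFormat file type to SequenceFile, a file format that is well suited to handing data to the next job.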

Total Order Sorting: Main (2)
Once the analyze phase has finished, the order phase is run.
- Mapper.class: identity mapper
- input: the output of the previous job
- separator: empty String
- The partition file and the staging directory are removed at the end.

Total Order Sorting: Mapper
- output: ( LastAccessDate, Record )

Total Order Sorting: Reducer
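The two phases can be sketched in plain Python. This simulates the idea behind sampling split points and range partitioning; it does not call TotalOrderPartitioner or any Hadoop API. The function names, the seeded sampler, and the fallback for very small inputs are assumptions made for the example. The input uses the nine unsorted rows from the example slide.

```python
import bisect
import random


def analyze(keys, num_partitions, sample_rate, seed=0):
    """Analyze phase: sample the sort keys and pick num_partitions - 1 split points."""
    rng = random.Random(seed)
    sample = sorted(k for k in keys if rng.random() < sample_rate)
    if len(sample) < num_partitions:
        sample = sorted(keys)  # assumed fallback: use every key for tiny inputs
    if not sample:
        return []  # no data: a single, empty partition
    step = len(sample) / num_partitions
    return [sample[int(step * i)] for i in range(1, num_partitions)]


def order(records, key_of, split_points):
    """Order phase: range-partition by sort key, then sort inside each partition."""
    partitions = [[] for _ in range(len(split_points) + 1)]
    for record in records:
        index = bisect.bisect_right(split_points, key_of(record))
        partitions[index].append(record)
    return [sorted(part, key=key_of) for part in partitions]


dates = ["20120101", "20130202", "20140303", "20120504", "20130405",
         "20120305", "20140206", "20131107", "20131208"]
records = [{"Id": str(i + 1), "LastAccessDate": d} for i, d in enumerate(dates)]


def key_of(record):
    return record["LastAccessDate"]


split_points = analyze([key_of(r) for r in records], num_partitions=3, sample_rate=1.0)
partitions = order(records, key_of, split_points)
# Concatenating the partitions in order yields a totally ordered dataset.
total_order = [record for part in partitions for record in part]
```

Every key in partition i is smaller than every key in partition i + 1, so simply concatenating the sorted part files gives a globally sorted result. Reading the keys once in `analyze` and the full records again in `order` mirrors the double load noted in the performance analysis.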

Shuffling

Shuffling: Intent and Motivation
A pattern about the order of the data, but with the opposite effect of Total Order Sorting.
Intent: obtain a set of records in random order.
Motivation: destroy the existing ordering completely, for example for
- anonymizing
- random sampling

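Shuffling: Structure and Consequence
Mapper: emits ( RandomKey, Record ).
No combiner or custom partitioner is used.
The records are sorted by RandomKey on their way to the reducers.
Consequence: the output of each reducer is a file containing records in random order.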

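Shuffling: Performance Analysis
- Very good performance characteristics.
- The file size is predictable: file size = dataset size / number of reducers.
- Files of the desired size can therefore be obtained.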
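The file-size relation above can be turned around to choose the number of reducers for a target file size. The short Python sketch below works through that arithmetic; the function names and the example sizes are assumptions for illustration, and real output sizes also depend on factors such as compression that the sketch ignores.

```python
import math


def expected_file_size(dataset_bytes, num_reducers):
    """file size = dataset size / number of reducers (keys are spread evenly at random)."""
    return dataset_bytes / num_reducers


def reducers_for_target(dataset_bytes, target_file_bytes):
    """Smallest reducer count whose average output file is at most the target size."""
    if dataset_bytes < 0 or target_file_bytes <= 0:
        raise ValueError("sizes must be non-negative, and the target must be positive")
    return max(1, math.ceil(dataset_bytes / target_file_bytes))


# Example: a 10 GiB dataset split into files of at most 128 MiB each.
dataset = 10 * 1024**3
target = 128 * 1024**2
num_reducers = reducers_for_target(dataset, target)
average_size = expected_file_size(dataset, num_reducers)
```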

Shuffling: Example
Anonymize the Comment data.
Comment fields: Id, PostId, Text, CreationDate, UserId
Remove UserId, Id, and CreationDate.

Shuffling: Main

Shuffling: Mapper
- Removes the UserId, Id, and CreationDate entries.
- output: ( randomInteger, processedData )

Shuffling: Reducer
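The anonymize-and-shuffle flow can be sketched in plain Python. It is an illustration under assumptions, not the Hadoop job from the slides: the comment records are hypothetical, an in-memory sort stands in for the framework's sort on the key, and the `seed` parameter exists only to make runs repeatable.

```python
import random

REMOVED_FIELDS = ("UserId", "Id", "CreationDate")  # fields dropped on the mapper slide


def mapper(comment, rng):
    """Drop the identifying fields and emit (randomInteger, processedData)."""
    processed = {k: v for k, v in comment.items() if k not in REMOVED_FIELDS}
    return rng.randrange(2**31), processed


def shuffle_records(comments, seed=None):
    """Sort by the random key, then discard the key (the reducer's role here)."""
    rng = random.Random(seed)
    pairs = [mapper(comment, rng) for comment in comments]
    pairs.sort(key=lambda pair: pair[0])  # stands in for the framework's sort by key
    return [value for _key, value in pairs]


# Hypothetical comment records.
comments = [
    {"Id": "1", "PostId": "2", "Text": "a", "CreationDate": "20120101", "UserId": "7"},
    {"Id": "2", "PostId": "5", "Text": "b", "CreationDate": "20130202", "UserId": "8"},
    {"Id": "3", "PostId": "6", "Text": "c", "CreationDate": "20140303", "UserId": "9"},
]
anonymized = shuffle_records(comments, seed=42)
```

The output keeps only PostId and Text, and its order no longer reflects the input order, which is what makes the result usable for anonymization or random sampling.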

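Conclusion

Pattern groups:
- Format conversion: Structured to Hierarchical
- Categorization: Partitioning, Binning
- Data ordering: Total Order Sorting, Shuffling

Structured to Hierarchical
- Purpose: convert row-based data into a hierarchical format
- Used for: RDBMS to Hadoop
- Requirement: data linked by foreign keys
- Output: a grouped hierarchical structure

Partitioning
- Purpose: split similar records into smaller datasets
- Used for: analysis of a specific dataset
- Requirement: the number of partitions must be known
- Output: one file per partition

Binning
- Purpose: split similar records into smaller datasets
- Used for: analysis of a specific dataset
- Requirement: post-processing to merge files is needed in some cases
- Output: each mapper writes one file per bin

Total Order Sorting
- Purpose: sort the entire dataset
- Used for: NoSQL, building timelines, search
- Requirement: the sort key must be comparable
- Output: sorted data

Shuffling
- Purpose: anonymize data
- Used for: anonymization, random sampling
- Requirement: -
- Output: randomly ordered data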