PySpark Review 박영택
RDD (Resilient Distributed Dataset)
RDD Creation (1): from a file or a set of files
sc.textFile() accepts a single file, a wildcard pattern, or a comma-separated list of files:
> sc.textFile("myfile.txt")
> sc.textFile("mydata/*.log")
> sc.textFile("myfile1.txt,myfile2.txt")
Each line of the file becomes one record of the RDD.
File: test.txt (loaded as RDD: newRDD)
  Time can bring you down
  Time can bend your knees
  Would you know my name
  If I saw you in heaven
> newRDD = sc.textFile("test.txt")
> newRDD.count()
4
RDD Creation (2): from data in memory
sc.parallelize() turns a local collection into an RDD.
> num = [1, 2, 3, 4]
> rdd = sc.parallelize(num)
list: num
  1 2 3 4
RDD: rdd
  1 2 3 4
RDD Creation (3): from another RDD
Applying a transformation to an existing RDD yields a new RDD.
File: test.txt (loaded as RDD: newRDD)
  Time can bring you down
  Time can bend your knees
  Would you know my name
  If I saw you in heaven
> newRDD = sc.textFile("test.txt")
> newRDD_uc = newRDD.map(lambda line: line.upper())
> newRDD_uc.count()
4
RDD: newRDD_uc
  TIME CAN BRING YOU DOWN
  TIME CAN BEND YOUR KNEES
  WOULD YOU KNOW MY NAME
  IF I SAW YOU IN HEAVEN
RDD Operations
There are two kinds of RDD functions:
  Actions: return a value
  Transformations: define a new RDD based on the current one
RDD Functions: Transformations (1)
A transformation creates a new RDD from an existing RDD.
RDDs are immutable: the data in an RDD can never be changed.
Instead, data is modified by applying a sequence of transformations as needed.
RDD Functions: Transformations (2)
map(function): applies the function to each record (line) of the given RDD and creates a new RDD from the results
File: test.txt (loaded as RDD: newRDD)
  Time can bring you down
  Time can bend your knees
  Would you know my name
  If I saw you in heaven
> newRDD = sc.textFile("test.txt")
> newRDD_uc = newRDD.map(lambda line: line.upper())
> newRDD_uc.count()
4
RDD: newRDD_uc
  TIME CAN BRING YOU DOWN
  TIME CAN BEND YOUR KNEES
  WOULD YOU KNOW MY NAME
  IF I SAW YOU IN HEAVEN
RDD Functions: Transformations (3)
filter(function): tests each record (line) of the given RDD against a condition and creates a new RDD from the records that satisfy it
File: test.txt (loaded as RDD: newRDD)
  Time can bring you down
  Time can bend your knees
  Would you know my name
  If I saw you in heaven
> newRDD = sc.textFile("test.txt")
> newRDD_ft = newRDD.filter(lambda line: line.startswith('T'))
> newRDD_ft.count()
2
RDD: newRDD_ft
  Time can bring you down
  Time can bend your knees
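The record-by-record behaviour of map and filter can be tried without a Spark installation. The sketch below is only a plain-Python analogy: the list `lines` is an assumed stand-in for the contents of test.txt, and list comprehensions play the role of the two transformations. It does not use the Spark API.

```python
# Plain-Python stand-in for the four lines of test.txt (no Spark required).
lines = [
    "Time can bring you down",
    "Time can bend your knees",
    "Would you know my name",
    "If I saw you in heaven",
]

# Analogue of newRDD.map(lambda line: line.upper())
upper_lines = [line.upper() for line in lines]

# Analogue of newRDD.filter(lambda line: line.startswith('T'))
t_lines = [line for line in lines if line.startswith("T")]

print(len(upper_lines))  # 4: map keeps one output record per input record
print(len(t_lines))      # 2: filter keeps only the records that pass the test
print(lines[0])          # the source list is untouched, as an RDD would be
```

Both operations build a new collection and leave the source as it was, which mirrors the immutability of RDDs.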
RDD Functions: Actions (1)
Common action functions
count(): returns the number of elements in the RDD
File: test.txt (loaded as RDD: newRDD)
  Time can bring you down
  Time can bend your knees
  Would you know my name
  If I saw you in heaven
> newRDD = sc.textFile("test.txt")
> newRDD.count()
4
RDD Functions: Actions (2)
Common action functions
take(n): returns the first n elements of the RDD as a list
File: test.txt (loaded as RDD: newRDD)
  Time can bring you down
  Time can bend your knees
  Would you know my name
  If I saw you in heaven
> newRDD = sc.textFile("test.txt")
> newRDD.take(2)
[u'Time can bring you down', u'Time can bend your knees']
RDD Functions: Actions (3)
Common action functions
collect(): returns all elements of the RDD as a list (it takes no argument)
File: test.txt (loaded as RDD: newRDD)
  Time can bring you down
  Time can bend your knees
  Would you know my name
  If I saw you in heaven
> newRDD = sc.textFile("test.txt")
> newRDD.collect()
[u'Time can bring you down', u'Time can bend your knees', u'Would you know my name', u'If I saw you in heaven']
RDD Functions: Actions (4)
Common action functions
saveAsTextFile(path): saves the RDD as text files under the given path
File: test.txt (loaded as RDD: newRDD)
  Time can bring you down
  Time can bend your knees
  Would you know my name
  If I saw you in heaven
> newRDD = sc.textFile("test.txt")
> newRDD_ft = newRDD.filter(lambda line: line.startswith('T'))
> newRDD_ft.saveAsTextFile("output")
RDD: newRDD_ft
  Time can bring you down
  Time can bend your knees
File: part-00000 (in the directory output)
  Time can bring you down
  Time can bend your knees
Lazy Execution (1)
Data in an RDD is not processed until an action function triggers the work.
File: test.txt
  Time can bring you down
  Time can bend your knees
  Would you know my name
  If I saw you in heaven
>
Lazy Execution (2)
Data in an RDD is not processed until an action function triggers the work.
File: test.txt
  Time can bring you down
  Time can bend your knees
  Would you know my name
  If I saw you in heaven
> newRDD = sc.textFile("test.txt")
RDD: newRDD (defined, no data read yet)
Lazy Execution (3)
Data in an RDD is not processed until an action function triggers the work.
File: test.txt
  Time can bring you down
  Time can bend your knees
  Would you know my name
  If I saw you in heaven
> newRDD = sc.textFile("test.txt")
> newRDD_uc = newRDD.map(lambda line: line.upper())
RDD: newRDD (defined, no data read yet)
RDD: newRDD_uc (defined, no data processed yet)
Lazy Execution (4)
Data in an RDD is not processed until an action function triggers the work.
File: test.txt
  Time can bring you down
  Time can bend your knees
  Would you know my name
  If I saw you in heaven
> newRDD = sc.textFile("test.txt")
> newRDD_uc = newRDD.map(lambda line: line.upper())
> newRDD_filt = newRDD_uc.filter(lambda line: line.startswith('T'))
RDD: newRDD (defined, no data read yet)
RDD: newRDD_uc (defined, no data processed yet)
RDD: newRDD_filt (defined, no data processed yet)
Lazy Execution (5)
Data in an RDD is not processed until an action function triggers the work. Here count() is the action, so the whole chain runs only at this point.
File: test.txt
  Time can bring you down
  Time can bend your knees
  Would you know my name
  If I saw you in heaven
> newRDD = sc.textFile("test.txt")
> newRDD_uc = newRDD.map(lambda line: line.upper())
> newRDD_filt = newRDD_uc.filter(lambda line: line.startswith('T'))
> newRDD_filt.count()
2
RDD: newRDD
  Time can bring you down
  Time can bend your knees
  Would you know my name
  If I saw you in heaven
RDD: newRDD_uc
  TIME CAN BRING YOU DOWN
  TIME CAN BEND YOUR KNEES
  WOULD YOU KNOW MY NAME
  IF I SAW YOU IN HEAVEN
RDD: newRDD_filt
  TIME CAN BRING YOU DOWN
  TIME CAN BEND YOUR KNEES
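The idea that defining a pipeline does no work until an action runs can be illustrated with Python generators. This is only an analogy under stated assumptions, not how Spark is implemented: `read_lines` is a hypothetical stand-in for sc.textFile("test.txt"), and `log` is an illustrative counter that records each line actually read.

```python
log = []  # one entry is appended each time a line is really read

def read_lines():
    # Hypothetical stand-in for sc.textFile("test.txt").
    for line in [
        "Time can bring you down",
        "Time can bend your knees",
        "Would you know my name",
        "If I saw you in heaven",
    ]:
        log.append("read")
        yield line

# Building the pipeline does no work, like defining transformations.
upper = (line.upper() for line in read_lines())
filtered = (line for line in upper if line.startswith("T"))
work_before_action = len(log)

# Consuming the pipeline plays the role of the count() action.
count = sum(1 for _ in filtered)
work_after_action = len(log)

print(work_before_action, count, work_after_action)  # 0 2 4
```

No line is read while the two generator expressions are being defined. All four reads happen only when the final count consumes the pipeline.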
Chaining Transformations
Transformations can be assigned step by step:
> newRDD = sc.textFile("test.txt")
> newRDD_uc = newRDD.map(lambda line: line.upper())
> newRDD_filt = newRDD_uc.filter(lambda line: line.startswith('T'))
> newRDD_filt.count()
2
or chained into a single expression with the same result:
> sc.textFile("test.txt").map(lambda line: line.upper()) \
    .filter(lambda line: line.startswith('T')).count()
2
Example: Passing Named Functions
A transformation accepts either a named function or an anonymous function.
Anonymous functions
  Inline functions without an identifier (function name)
  Best for short, one-off functions
  Python: lambda x: ...
File: test.txt
  Time can bring you down
  Time can bend your knees
  Would you know my name
  If I saw you in heaven
> def toUpper(s):
>     return s.upper()
> newRDD = sc.textFile("test.txt")
> newRDD.map(toUpper).take(2)
> newRDD.map(lambda line: line.upper()).take(2)
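That a named function and a lambda are interchangeable as arguments can be checked in plain Python with the built-in map. This is a sketch only: `to_upper` and `lines` are illustrative names, and the slice `[:2]` stands in for take(2).

```python
def to_upper(s):
    # Named function: reusable and easier to test on its own.
    return s.upper()

lines = [
    "Time can bring you down",
    "Time can bend your knees",
    "Would you know my name",
    "If I saw you in heaven",
]

# Passing the named function versus an equivalent anonymous function.
named = list(map(to_upper, lines))[:2]
anonymous = list(map(lambda line: line.upper(), lines))[:2]

print(named)               # ['TIME CAN BRING YOU DOWN', 'TIME CAN BEND YOUR KNEES']
print(named == anonymous)  # True
```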
Some Other General RDD Operations (1)
Transformations
flatMap(function): maps each record (line) of the base RDD to zero or more elements and flattens the results into a single RDD of elements
For comparison, map() with the same function keeps one list per line:
File: test.txt
  Time can bring you down
  Time can bend your knees
  Would you know my name
  If I saw you in heaven
> sc.textFile("test.txt") \
    .map(lambda line: line.split())
Resulting RDD contents:
[ ["Time", "can", "bring", "you", "down"],
  ["Time", "can", "bend", "your", "knees"],
  ["Would", "you", "know", "my", "name"],
  ["If", "I", "saw", "you", "in", "heaven"] ]
Some Other General RDD Operations (1, continued)
Transformations
flatMap(function): maps each record (line) of the base RDD to zero or more elements and flattens the results into a single RDD of elements
File: test.txt
  Time can bring you down
  Time can bend your knees
  Would you know my name
  If I saw you in heaven
> sc.textFile("test.txt") \
    .flatMap(lambda line: line.split())
Resulting RDD contents (one element per word, no nested lists):
["Time", "can", "bring", "you", "down", "Time", "can", "bend", "your", "knees", "Would", "you", "know", "my", "name", "If", "I", "saw", "you", "in", "heaven"]
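The difference between map and flatMap can be reproduced in plain Python. The sketch below is an analogy rather than Spark code: `lines` is an assumed stand-in for test.txt, and itertools.chain.from_iterable plays the role of the flattening step.

```python
from itertools import chain

lines = [
    "Time can bring you down",
    "Time can bend your knees",
    "Would you know my name",
    "If I saw you in heaven",
]

# Analogue of map(lambda line: line.split()): one list per line.
mapped = [line.split() for line in lines]

# Analogue of flatMap(lambda line: line.split()): one element per word.
flat_mapped = list(chain.from_iterable(line.split() for line in lines))

print(len(mapped))       # 4 (nested lists, one per line)
print(len(flat_mapped))  # 21 (individual words)
print(flat_mapped[:5])   # ['Time', 'can', 'bring', 'you', 'down']
```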
Some Other General RDD Operations (2)
Transformations
distinct(): removes duplicate elements
File: test.txt
  Time can bring you down
  Time can bend your knees
  Would you know my name
  If I saw you in heaven
> sc.textFile("test.txt") \
    .flatMap(lambda line: line.split()) \
    .distinct()
Before distinct():
["Time", "can", "bring", "you", "down", "Time", "can", "bend", "your", "knees", "Would", "you", "know", "my", "name", "If", "I", "saw", "you", "in", "heaven"]
After distinct() (the order of the elements is not guaranteed):
["Time", "can", "bring", "you", "down", "bend", "your", "knees", "Would", "know", "my", "name", "If", "I", "saw", "in", "heaven"]
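The effect of distinct on the word list can be checked with a Python set. This is a plain-Python analogy, not the Spark API, and `words` is an assumed stand-in for the flatMap result. Like distinct(), a set does not promise any particular order, so the sketch sorts the result only to make it printable in a stable way.

```python
words = (
    "Time can bring you down "
    "Time can bend your knees "
    "Would you know my name "
    "If I saw you in heaven"
).split()

# Analogue of .distinct(): duplicates collapse to a single element.
unique = set(words)

print(len(words))   # 21
print(len(unique))  # 17 ("Time", "can" and "you" were repeated)
print(sorted(unique)[:3])
```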
Some Other General RDD Operations (4)
Other RDD operations
first(): returns the first element of the RDD
RDD: newRDD
  Time can bring you down
  Time can bend your knees
  Would you know my name
  If I saw you in heaven
> newRDD.first()
"Time can bring you down"
Some Other General RDD Operations (5)
Other RDD operations
top(n): returns the n largest elements of the RDD in descending order (strings are compared lexicographically, so this is not the same as the first n elements)
RDD: newRDD
  Time can bring you down
  Time can bend your knees
  Would you know my name
  If I saw you in heaven
> newRDD.top(2)
["Would you know my name", "Time can bring you down"]
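Because top(n) ranks elements by their natural ordering, its result on strings depends on lexicographic comparison rather than on position in the file. The sketch below is a plain-Python analogy using heapq.nlargest on an assumed stand-in list `lines`. It is meant to show the ordering, not Spark's internals.

```python
import heapq

lines = [
    "Time can bring you down",
    "Time can bend your knees",
    "Would you know my name",
    "If I saw you in heaven",
]

# Analogue of newRDD.first(): the element in the first position.
first = lines[0]

# Analogue of newRDD.top(2): the two largest strings, largest first.
# "W" sorts after "T", and "Time can br..." sorts after "Time can be...".
top2 = heapq.nlargest(2, lines)

print(first)  # Time can bring you down
print(top2)   # ['Would you know my name', 'Time can bring you down']
```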