데이터 마이닝 소개 Introduction to Data Mining

데이터 마이닝 소개 Introduction to Data Mining

왜 데이터 마이닝 ? 데이터 홍수의 시대 엄청난 양의 데이터가 생산되고 있음 : 빅 데이터 예제
은행, 이동 통신, 기타 비즈니스 거래 ... 과학 데이터 : 천문학, 생물학, 기타 웹, 문서, 전자 상거래 빅 데이터 예제 유럽의 우주 측지 관측센터 VLBI (Very Long Baseline Interferometry) 는 16 개의 망원경을 갖고 있으며, 각 망원경은 초당 1 Gigabit/second 의 데이터를 생산 저장 및 분석의 어려움 storage and analysis a big problem 미국 이동 통신사 AT&T 는 매일 수십억 통화를 처리 so much data, it cannot be all stored -- analysis has to be done “on the fly”, on streaming data

데이터 마이닝 역사: 데이터 마이닝에 대한 여러 이름들
데이터 마이닝 역사: 데이터 마이닝에 대한 여러 이름들 Data Fishing, Data Dredging: 1960- used by Statistician (as bad name) Data Mining : used DB, business in 2003 – bad image because of TIA Knowledge Discovery in Databases (1989-) used by AI, Machine Learning Community also Data Archaeology, Information Harvesting, Information Discovery, Knowledge Extraction, ... Currently: 데이터 마이닝 또는 지식 발견 작업 이란 용어를 혼용하여 사용하고 있음

데이터 마이닝 정의 대용량 데이터 베이스에서 자동으로 숨겨져 있는 예측 정보를 추출하는 작업 3 가지 키 워드 :
자동 (Automated) 숨겨진 (Hidden) 예측 (Predictive) 통계적 방법을 사용하여 작업 수행 데이터 마이닝은 사전 대책 마련 작업 회고하기 (Retrospective) 보다는 전망하는 (Prospective) 역할 수행

데이터 마이닝 구현 방법은 .. 기타 (And more …) 결정 트리 (Decision Trees) 최소 이웃 거리 분류기
(Nearest Neighbor Classification) 신경망 (Neural Networks) 규칙 유도 (Rule Induction) 군집화 알고리즘 (K-means Clustering Algorithm) 기타 (And more …)

다음은 데이터 마이닝 작업에 해당되지 않는다 …
대용량 데이터의 단순 처리 작업 (Brute-force crunching) 존재하지 않는 관계를 찾는 작업 (relationships which don’t exist) 다른 형식으로 데이터 표현 작업 복잡한 데이터 베이스 관련 작업

데이터 마이닝 vs. 데이터베이스 DB 작업은 무엇을 찾을 지(what is looking for)를 미리 알고 있다
DM 작업은 무엇을 찾을 지(what is looking for)를 미리 알지 못한다 DB는 질의에 대한 답이 100% 정확하다 DM은 가능한 정확한 해를 구하고자 한다 DB는 저장된 형태대로 검색된다 DM은 만들어진 결과를 간혹 후처리 할 필요가 있다 DB의 처리 결과는 데이터 집합이다 DM의 처리 결과는 데이터 분석이다 DB는 처리 결과가 어떠한 의미를 갖는 가를 중요시 하지 않는다 DM은 처리 결과가 어떠한 의미를 갖는 가를 매우 중요시

DM을 위해 필요한 기술 요소들 대용량 컴퓨팅 파워 DM 통계적 & 학습 알고리즘 데이터 수집 및 관리 기술

컴퓨팅 파워 데이터 수집 무어 법칙에 의하여 컴퓨팅 파워는 매 18 개월 마다 2 배가 된다
병열 처리 가능한 값싼 서버들이 시장에 등장 데이터 수집 일반적으로 데이터 양이 많을수록 DM 결과는 더 좋다

알고리즘 항상 알고리즘 개발이 컴퓨터 성능을 앞서고 있다 통계학자들은 이미 수작업으로 데이터 마이닝을 해오고 있었다
기계 학습은 통계적 처리 방법을 보다 지능적으로 적용하는 기술이다 대부분의 데이터 마이닝 기법들은 이미 알고 있던 방법들을 조금 개선하여 성능 향상을 시도한다

데이터 마이닝 : 어떤 유형의 데이터 사용? Relational databases Data warehouses
Transactional databases Advanced DB and information repositories Object-oriented and object-relational databases Spatial databases Time-series data and temporal data Text databases and multimedia databases Heterogeneous and legacy databases WWW

데이터 마이닝 : 응용 분야 과학 분야 비즈니스 분야 웹 : 정부 기관
astronomy, bioinformatics, drug discovery, … 비즈니스 분야 advertising, CRM (Customer Relationship management), investments, manufacturing, sports/entertainment, telecom, e-Commerce, targeted marketing, health care, … 웹 : search engines, bots, … 정부 기관 law enforcement, profiling tax cheaters, anti-terror(?)

데이터 마이닝 작업 유형... 분류 (Classification) [Predictive]
군집화 (Clustering) [Descriptive] 상관 규칙 발견 (Association Rule Discovery) [Descriptive] 연속 패턴 발견 (Sequential Pattern Discovery) [Descriptive] 회귀 분석 (Regression) [Predictive] 이탈 검출 (Deviation Detection) [Predictive]

상관 규칙 (Association Rules)
Given: 고객 거래 (customer transactions) 데이터 베이스 각 거래 데이터는 아이템들의 집합 Find all rules X => Y ; X 아이템이 존재하면, Y 아이템이 존재한다 예들 들어, 기저귀와 분류를 구입하는 고객의 98% 는 맥주를 구입한다 규칙의 조건부와 결론부에는 아이템 집합이 포함될 수 있다 아이템에 좀 더 구체적인 제약을 추가할 수 있다 (e.g., 비싼 수입 제품)

신뢰도 및 지지도 (Confidence and Support)
1 & 2 => 3 has 90% confidence if when a customer bought 1 and 2, in 90% of cases, the customer also bought 3. 모든 규칙은 사용자가 지정한 최소한도의 지지도 (support)를 만족해야 한다 1 & 2 => 3 should hold in some minimum percentage of transactions to have business value

예제 (Example) Example: For minimum support = 50%, minimum confidence = 50%, we have the following rules 1 => 3 with 50% support and 66% confidence 3 => 1 with 50% support and 100% confidence

문제 분할 (Problem Decomposition) - Example
For minimum support = 50% and minimum confidence = 50% For the rule 1 => 3: Support = Support({1, 3}) = 50% Confidence = Support({1,3})/Support({1}) = 66%

The Apriori Algorithm Ck : Set of 후보 아이템 집합(candidate itemsets) of size k Fk : Set of 빈번한 아이템 집합(frequent itemsets) of size k F1 = {large items} for ( k=1; Fk != 0; k++) do { Ck+1 = New candidates generated from Fk for each transaction t in the database do Increment the count of all candidates in Ck+1 that are contained in t Fk+1 = Candidates in Ck+1 with minimum support } Answer = Uk Fk

핵심 포인트 빈번한 아이템 집합 (frequent itemset)의 모든 부분 집합 역시 빈번한 아이템 집합에 해당된다 => Ck+1 에 속한 각 후보 아이템 집합의 부분 집합은 반드시 Fk에 포함된다

Apriori - Example Database D C1 F1 Scan D C2 C2 F2 Scan D

분할 (Partitioning) Divide database into partitions D1,D2,…,Dp
Apply Apriori to each partition

분할 알고리즘 (Partitioning Algorithm)
Divide D into partitions D1,D2,…,Dp; For I = 1 to p do Li = Apriori(Di); C = L1  …  Lp; Count C on D to generate L;

Partitioning Example L1 ={{Bread}, {Jelly}, {PeanutButter}, {Bread,Jelly}, {Bread,PeanutButter}, {Jelly, PeanutButter}, {Bread,Jelly,PeanutButter}} D1 L2 ={{Bread}, {Milk}, {PeanutButter}, {Bread,Milk}, {Bread,PeanutButter}, {Milk, PeanutButter}, {Bread,Milk,PeanutButter}, {Beer}, {Beer,Bread}, {Beer,Milk}} D2 S=10%

분할의 장단점 (Partitioning Adv/Disadv)
Advantages: 주 기억 장치에서 처리 가능 병렬 처리 가능 최대 데이터베이스 스캔 횟수는 2번 Disadvantages: 2 번째 스캔 단계에서 후보 아이템 집합 수가 너무 많을 수 있다

분류 (Classification) Given: 데이터 값과 클래스 인덱스로 구성된 튜플들의 집합
Task : 각 클래스의 모델/프로파일 생성 Example profile (good credit): (25 <= age <= 40 and income > 40k) or (married = YES) 응용 분야 : 신용 카드 신용도 (good, bad) 은행 위치 평가 (good, fair, poor) 치료 효과 (good, fair, poor)

예제 분류 문제 (Classification Example)
Sal (<=50K) (>50K) Tid Job Age Salary Class Self 30 30K C Class C Age c 1 Industry 35 40K C (>40) (<=40) 2 Univ. 50 70K C 3 Self 45 60K B Job Class C (Univ., Industry) 4 Univ. 30 70K B (Self) 5 Industry 35 60K A Class B Class A 6 Self 35 60K A 7 Self 30 70K A Training Data Set Sample Decision Tree

결정 트리 (Decision Tree) 흐름도 (flow chart) 형태의 트리 구조 각 노드에서 특징값에 대한 평가 수행
각 에지는 평가 결과를 나타냄 단말 노드는 클래스 인덱스 또는 클래스 분포도를 나타냄 결정트리는 쉽게 분류 규칙으로 변환

예제 결정 트리 (Example Decision Tree)
categorical categorical continuous Splitting Attributes class Refund Yes No NO MarSt Single, Divorced Married TaxInc NO < 80K > 80K NO YES The splitting attribute at a node is determined based on the Gini index.

결정 트리의 노드 노드의 분기 질문 xi=α? 어떻게 만들 것인가?
d 개의 특징이 있고 그들이 평균 n 개의 값을 가진다면 dn 개의 후보 질문 그들 중 어느 것을 취해야 가장 유리한가?

유리한 정도의 판단 기준은? 불순도 측정 기준 XTleft와 XTright가 동질 일수록 좋다. 엔트로피 지니 불순도
오분류 불순도 노드 T에서 ωi가 발생할 확률은

예제: 불순도 측정

노드에서 질문 선택 불순도 감소량 또는 투잉 기준이 최대인 질문을 취함 불순도 감소량 투잉 기준

노드에서 질문 생성 비계량인 경우 xi=α? 계량인 경우 xi<α? 이산 이산 값에 따라 α를 결정 연속
실수 범위를 구간화 하여 α 결정 또는 샘플의 값 분포를 보고 두 값의 가운데를 α로 결정

예제: 후보 질문 생성 x3의 값의 분포를 조사하면,

예제: 불순도 감소량

결정 트리 (Decision Trees) 장점 (Pros) 빠른 수행 시간 생성된 규칙에 대한 해석이 매우 쉬움
대용량 데이터 집합에 대해 적절한 크기의 결정 트리의 크기가 만들어짐 다차원 데이터 처리가 용이함 단점 (Cons) 특징 간의 상관 관계를 활용할 수 없음 각 특징 값의 구간 나눔이 축에 수직으로만 이루어짐

회귀 분석 (Regression) 데이터 아이템 값들을 하나의 실수값으로 변환하는 함수 생성
E.g., linear regression Risk score=0.01*(Balance)-0.3*(Age)+4*(HouseOwned)

군집화 (Clustering) 고객 그룹을 만들고 각 그룹의 특성 파악 : 비지도 학습 (Unsupervised learning) : 분류 작업 (classification)과는 달리 각 데이터에 대해 클래스 인덱스가 사전에 주어지지 않고 학습 결과로 클래스 인덱스 할당 가능 같은 군집에 속한 객체들을 상호 유사한 특성 공유, 다른 군집에 속한 객체들은 서로 다른 특성 소유

K-means Ask user how many clusters they’d like. (e.g. k=5)

K-means Ask user how many clusters they’d like. (e.g. k=5)
Randomly guess k cluster Center locations

Randomly guess k cluster Center locations Each datapoint finds out which Center it’s closest to. (Thus each Center “owns” a set of datapoints)

Randomly guess k cluster Center locations Each datapoint finds out which Center it’s closest to. Each Center finds the centroid of the points it owns

Randomly guess k cluster Center locations Each datapoint finds out which Center it’s closest to. Each Center finds the centroid of the points it owns… …and jumps there …Repeat until terminated!

K-means Start Advance apologies: in Black and White this example will deteriorate Example generated by Dan Pelleg’s super-duper fast K-means system: Dan Pelleg and Andrew Moore. Accelerating Exact k-means Algorithms with Geometric Reasoning. Proc. Conference on Knowledge Discovery in Databases 1999, (KDD99) (available on

K-means continues…

K-means terminates

사례 연구 : 고객 이탈 (Customer Attrition: Case Study)
Situation: 모바일 폰 고객 이탈율은 매년 25-30% ! Task: 지난 N 개월간의 고객 정보를 갖고, 어느 고객이 다음 달에 이탈할 가능성이 높은 가를 예측 각 고객의 가치를 예측하고 비용 대비 효과가 가장 큰 상품 제안

고객 이탈 연구 결과 Verizon 이통사는 고객 데이터웨어 하우스 운영 이탈 가능성 높은 고객 선정
복수의 지역 단위 상품 모델 개발 상품 제안을 받아들일 가능성이 높은 고객군 집중 공략 고객 이탈률을 월 2%에서 1.5%로 낮춤 (huge impact, with >30 M subscribers)

사례 연구 : 신용도 평가 Assessing Credit Risk: Case Study
Situation: 대출 신청 고객에 대해 Task: 대출 허용할 것인가를 판단 Note: 신용도가 아주 높은 고객은 대출 신청을 하지 않고 신용도가 아주 낮은 고객은 대출 상환을 하지 않으므로, 은행에서 관심을 갖는 고객은 신용도가 중간 정도인 고객

신용도 평가 결과 Credit Risk - Results
다양한 기계 학습 기법을 이용하여 고객 신용도 모델 개발 고객이 채무불이행 할 가능성을 평가하여 담보대출 및 신용카드 발급 업무에 사용

사례 연구 : 전자상거래 Successful e-commerce – Case Study
고객이 Amazon.com에서 책 구입 Task: 고객이 구매할 가능성이 높은 다른 책 추천 구매가 된 책들에 대한 군집화 수행 : “Advances in Knowledge Discovery and Data Mining” 책을 구입한 고객은 “Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations” 책도 구입 매우 성공적인 책 추천 시스템

사례연구 : 유전자 미세배열 분석 Genomic Microarrays – Case Study
환자 집단으로부터 얻은 유전자 미세배열 자료를 분석하여, 정확한 질병 진단에 사용 치료 방법 제안에 활용 치료 결과 예측

사례 연구 : 보안 및 사기 방지 Security and Fraud Detection - Case Study
신용 카드 사기 감지 돈 세탁 (Money laundering) 감지 FAIS (US Treasury) 보안 사기 감지 NASDAQ KDD system 전화 사기 감지 AT&T, Bell Atlantic, British Telecom/MCI 생화학 테러 감지 Salt Lake Olympics 2002

상용 데이터 마이닝 소프트웨어 Commercial Data Mining Software
시장 규모 급속히 팽창 Depends on what you call “data mining” 응용 분야 개발보다는 마이닝 툴 기능 확대 주력 표준화 작업 XML CWM, PMML, GEML, Clinical Trial Data Model, … 웹 서비스 분야 ?

주요 데이터 마이닝 소프트웨어 Top Data Mining Vendors Today
SAS 800 Pound Gorilla in the data analysis space SPSS Insightful (formerly Mathsoft/S-Plus) Well respected statistical tools, now moving into mining Oracle Integrated data mining into the database Angoss One of the first data mining applications (as opposed to tools) HNC Very specific analytic solutions Unica Great mining technology, focusing less on analytics these days

표준화 동향 Standards In Data Mining
Predictive Model Markup Language (PMML) The Data Mining Group ( XML based (DTD) Java Data Mining API spec request (JSR ) Oracle, Sun, IBM, … Support for data mining APIs on J2EE platforms Build, manage, and score models programmatically OLE DB for Data Mining Microsoft Table based Incorporates PMML

개인 정보 보호 이슈 Privacy Issues
DM 이 고객에 관한 통계 자료를 얻는 채널은 – Credit card use – Store card – Subscription – Book, video, etc rental – and via more sources… 개인 정보가 보호되는 범위 내에서 DM이 활용할 수 있는 조치 필요

참고 문헌 Resources and References
Good overview book: Data Mining Techniques by Michael Berry and Gordon Linoff Web: Knowledge Discovery Nuggets (ref-tutorials) (ref-tutorials) DataMine Mailing List send message “subscribe datamine-l”

Questions?

데이터 마이닝 소개 Introduction to Data Mining

Similar presentations

Presentation on theme: "데이터 마이닝 소개 Introduction to Data Mining"— Presentation transcript:

Similar presentations

About project

지원

로그인

Auth with social network:

데이터 마이닝 소개 Introduction to Data Mining

Similar presentations

Presentation on theme: "데이터 마이닝 소개 Introduction to Data Mining"— Presentation transcript:

Similar presentations

About project

지원