Cluster Analysis

What Is Cluster Analysis?
(1) Primary Objective: to classify objects with diverse characteristics into relatively homogeneous groups, based on variables chosen in advance.
(2) Basic Principle: high internal (within-cluster) homogeneity and high external (between-cluster) heterogeneity. Clusters are formed so that consumers within a cluster are similar to one another and differ from consumers in other clusters.

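(3) Applications
ⅰ) Market segmentation / benefit segmentation
ⅱ) Understanding purchase behavior: identifying characteristics by classifying homogeneous buyer groups
ⅲ) Identifying new-product opportunities: clustering brands and products
ⅳ) Test market selection
ⅴ) Data reduction
(4) Cluster vs. Factor Analysis
Cluster: groups objects
Factor: groups variables
(5) Cluster vs. Discriminant Analysis (both classify objects)
Cluster: no prior information (classification criteria) about the clusters or groups is available (interdependence analysis)
Discriminant: prior information about the clusters or groups is available (dependence analysis)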

Steps in Cluster Analysis
(1) Formulating the problem
(2) Selecting a distance measure
(3) Selecting a clustering procedure
(4) Deciding on the number of clusters
(5) Interpreting and profiling clusters
(6) Assessing the validity of clustering

▣ Basic Concept
[Figure: two scatter plots of Variable 1 against Variable 2, titled "An Ideal Clustering Situation" and "A Practical Clustering Situation"]

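(1) Formulating the Problem: select the variables on which the clustering is based
ⅰ) Classify the characteristics of the objects to be clustered
ⅱ) Tie the variables to the purpose of the cluster analysis
(2) Similarity Measure: distance measures are used most often (computed from the sum of squared differences between responses to the given questions)
① Euclidean distance
$$d_{ij}^{E} = \sqrt{\sum_{k=1}^{r} (X_{ik} - X_{jk})^{2}}$$
$X_{ik}$: coordinate of object i on dimension k
$X_{jk}$: coordinate of object j on dimension k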
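As a minimal sketch of the Euclidean distance between two objects, assuming each object is given as a plain list of r coordinates (the function name `euclidean` and the sample points are illustrative and do not come from the slides):

```python
import math

def euclidean(x_i, x_j):
    """Euclidean distance between two objects given as equal-length coordinate lists."""
    if len(x_i) != len(x_j):
        raise ValueError("objects must have the same number of dimensions")
    # Sum the squared coordinate differences over all r dimensions, then take the root.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x_i, x_j)))

# Hypothetical objects measured on r = 2 variables.
print(euclidean([0.0, 0.0], [3.0, 4.0]))  # 5.0
```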

Normalized distance function: the Euclidean distance computed after normalizing the raw data (mean = 0, variance = 1), which removes the bias caused by differences in scale
② Squared Euclidean distance
$$D_{ij} = \sum_{k=1}^{r} (X_{ik} - X_{jk})^{2}$$
An example of Euclidean distance between two objects measured on two variables, X and Y:
[Figure: Object 1 at (X1, Y1) and Object 2 at (X2, Y2) on X and Y axes, with legs (X2 - X1) and (Y2 - Y1)]
$$\text{Distance} = \sqrt{(X_2 - X_1)^{2} + (Y_2 - Y_1)^{2}}$$
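The normalization step can be sketched as follows, assuming the population variance is used for scaling (the slides do not say which variance estimator is meant). The data and the helper names `standardize` and `squared_euclidean` are hypothetical.

```python
from statistics import mean, pstdev

def standardize(column):
    """Rescale one variable to mean 0 and variance 1 (population variance, an assumption)."""
    m, s = mean(column), pstdev(column)
    if s == 0:
        raise ValueError("a constant variable cannot be standardized")
    return [(v - m) / s for v in column]

def squared_euclidean(x_i, x_j):
    """Sum of squared coordinate differences between two objects."""
    return sum((a - b) ** 2 for a, b in zip(x_i, x_j))

# Hypothetical raw data: rows are objects, columns are variables on very different scales.
raw = [[170.0, 60000.0], [180.0, 52000.0], [160.0, 58000.0]]
columns = [standardize(list(col)) for col in zip(*raw)]  # standardize each variable
objects = [list(row) for row in zip(*columns)]           # back to one row per object
print(squared_euclidean(objects[0], objects[1]))
```

After standardization both variables contribute on a comparable scale, so the second column no longer dominates the distance purely because of its units.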

③ City-block distance (Manhattan distance)
$$d_{ij}^{C} = \sum_{k=1}^{r} |X_{ik} - X_{jk}|$$
[Problems]
ⅰ) Assumes that the variables are uncorrelated
ⅱ) The units (scales) in which the characteristics are measured may differ

Object   Purchase Probability (%)   Commercial Viewing Time (min)
A        60                         3.0
B        65                         3.5
C        64                         4.0

Pair   Distance (min)   City-block (second)
AB     25.25            61
AC     10.00            153
BC     4.25             40
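The scale problem can be illustrated with a small sketch. The objects P, Q, and R below are hypothetical (they are not the A, B, C of the table), and `city_block` and `closest_pair` are illustrative helper names. Re-expressing one variable in seconds instead of minutes changes which pair of objects is closest.

```python
from itertools import combinations

def city_block(x_i, x_j):
    """City-block (Manhattan) distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x_i, x_j))

def closest_pair(objects):
    """Return the pair of labels with the smallest city-block distance."""
    return min(combinations(sorted(objects), 2),
               key=lambda pair: city_block(objects[pair[0]], objects[pair[1]]))

# Hypothetical objects: (purchase probability in %, viewing time in minutes).
in_minutes = {"P": (60.0, 2.0), "Q": (64.0, 2.5), "R": (61.0, 5.0)}
# The same objects with viewing time converted to seconds.
in_seconds = {k: (p, t * 60.0) for k, (p, t) in in_minutes.items()}

print(closest_pair(in_minutes))  # ('P', 'R')
print(closest_pair(in_seconds))  # ('P', 'Q')
```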

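④ Mahalanobis distance
ⅰ) Standardizes the data by scaling with the standard deviation
ⅱ) Pools the within-group variance-covariance to adjust for intercorrelation among the variables
ⅲ) Most appropriate when the variables are correlated with one another
⑤ Minkowski distance
$$d_{ij}^{M} = \left[ \sum_{k=1}^{r} |X_{ik} - X_{jk}|^{p} \right]^{1/p}$$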
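A short sketch of the Minkowski distance, assuming the usual form in which the outer exponent is 1/p, shows how it contains the earlier measures as special cases: p = 1 gives the city-block distance and p = 2 gives the Euclidean distance. The function name and the sample points are illustrative.

```python
def minkowski(x_i, x_j, p):
    """Minkowski distance of order p (p >= 1) between two coordinate lists."""
    if p < 1:
        raise ValueError("p must be at least 1")
    return sum(abs(a - b) ** p for a, b in zip(x_i, x_j)) ** (1.0 / p)

a, b = [0.0, 0.0], [3.0, 4.0]  # hypothetical objects
print(minkowski(a, b, 1))  # 7.0 (city-block)
print(minkowski(a, b, 2))  # 5.0 (Euclidean)
```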

(3) Clustering Algorithms
Clustering Procedures
  Hierarchical
    Agglomerative
      Linkage Methods: Single Linkage, Complete Linkage, Average Linkage
      Variance Methods: Ward's Method
      Centroid Methods
    Divisive
  Nonhierarchical
    Sequential Threshold
    Parallel Threshold
    Optimizing Partitioning

1) Hierarchical Cluster Procedure
① Agglomerative Procedure: starts with each object as its own cluster and successively merges nearby objects or clusters until a single cluster remains
ⅰ) Single Linkage: minimum distance rule; clusters are merged on the smallest distance between clusters or objects
ⅱ) Complete Linkage: maximum distance rule
ⅲ) Average Linkage
ⅳ) Ward's Method
● Within-cluster variance minimization rule
● Clusters are merged so that the increase in the total within-cluster sum of squared distances is smallest
ⅴ) Centroid Method
● Merges the objects or clusters whose centroids (means) are closest
● Drawback: applicable only to metric data
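The three linkage rules differ only in how the distance between two clusters is summarized. The sketch below assumes Euclidean distance between points and uses hypothetical one-dimensional data; `linkage_distance` and `agglomerate` are illustrative names, and Ward's and the centroid method are not covered.

```python
import math
from itertools import product

def linkage_distance(cluster_a, cluster_b, rule="single"):
    """Distance between two clusters of points under a linkage rule."""
    pairwise = [math.dist(p, q) for p, q in product(cluster_a, cluster_b)]
    if rule == "single":    # minimum distance rule
        return min(pairwise)
    if rule == "complete":  # maximum distance rule
        return max(pairwise)
    if rule == "average":   # mean over all between-cluster pairs
        return sum(pairwise) / len(pairwise)
    raise ValueError(f"unknown rule: {rule}")

def agglomerate(points, rule="single"):
    """Merge the two closest clusters until one remains; return the merge distances."""
    clusters = [[p] for p in points]
    merge_distances = []
    while len(clusters) > 1:
        pairs = ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda ij: linkage_distance(clusters[ij[0]], clusters[ij[1]], rule))
        merge_distances.append(linkage_distance(clusters[i], clusters[j], rule))
        clusters[i] = clusters[i] + clusters[j]  # merge cluster j into cluster i
        del clusters[j]
    return merge_distances

points = [(0.0,), (1.0,), (3.0,), (5.0,)]  # hypothetical objects
print(agglomerate(points, "single"))
print(agglomerate(points, "complete"))
```

The sequence of merge distances is the information a dendrogram displays on its height axis.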

② Divisive Method: starts from one large cluster and splits it successively
[Figure: Dendrogram illustrating hierarchical clustering; horizontal axis: observation number (01 to 08)]

[Figure: Single Linkage, objects A, B, C, D with distance labels 1.2, 1.3, 1.4, 1.5, 1.55]

[Figure: Complete Linkage, objects A, B, C, D with distance labels 1.5, 1.55]

[Figure: Average Linkage, objects B, C, D with distance labels 1.45, 1.425]

[Figure: Ward's Method and Centroid Method]

2) Nonhierarchical Clustering Procedures (k-means clustering)
ⅰ) Sequential threshold procedure
① Select one cluster center and include in that cluster all objects within a pre-specified distance of it.
② Select a second cluster center and repeat for the objects within the pre-specified distance of it.
ⅱ) Parallel threshold procedure
① Select several cluster centers at the outset and assign each object to the nearest center.
② The threshold distance can be adjusted.
ⅲ) Optimizing partitioning method: objects can later be reassigned to other clusters according to an overall optimizing criterion (e.g., the average within-cluster distance).
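One common form of optimizing partitioning is the k-means iteration: assign each object to the nearest center, recompute the centers, and repeat until nothing changes. The following is a minimal sketch under that assumption, with hypothetical data and deliberately poor starting centers. It is not the procedure of any particular statistics package.

```python
import math

def kmeans(points, centers, max_iter=100):
    """Assign each point to its nearest center, recompute centers, repeat until stable."""
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(max_iter):
        # Assignment step: index of the nearest center for every point.
        labels = [min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
                  for p in points]
        # Update step: each center becomes the mean of its members.
        new_centers = []
        for k, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centers.append(tuple(sum(v) / len(members) for v in zip(*members)))
            else:
                new_centers.append(old)  # an empty cluster keeps its previous center
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers

# Hypothetical data: two visibly separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centers = kmeans(points, centers=[(0.0, 0.0), (2.0, 2.0)])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

In the first pass most points fall into the second cluster. The reassignment in later passes is what lets objects move to a better-fitting cluster, which a purely hierarchical procedure cannot do.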

▣ Disadvantages of Nonhierarchical Clustering
② The choice of cluster centers is arbitrary
③ The results depend on the order of the data
▣ Advantages of Nonhierarchical Clustering
① Cluster centers can be chosen nonrandomly
② Clustering is fast

3) Choosing a Clustering Method: Hierarchical vs. Nonhierarchical
ⅰ) Hierarchical (Ward's method, average linkage) ⇒ an object that is clustered wrongly early on keeps affecting every later step
ⅱ) Hierarchical + Nonhierarchical
① Use a hierarchical procedure (Ward's method, average linkage) to obtain an initial clustering
② Use the resulting number of clusters and cluster centroids as input to the optimizing partitioning method
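The second step of the combined approach can be sketched as follows. The hierarchical labels are hypothetical and hand-written here rather than produced by an actual Ward's or average-linkage run, and `centroids_from_labels` and `reassign` are illustrative names. The centroids of the hierarchical solution serve as seeds, and one reassignment pass corrects an object that was merged into the wrong cluster early on.

```python
import math

def centroids_from_labels(points, labels):
    """Mean point of each cluster, keyed by cluster label."""
    groups = {}
    for p, lab in zip(points, labels):
        groups.setdefault(lab, []).append(p)
    return {lab: tuple(sum(v) / len(g) for v in zip(*g)) for lab, g in groups.items()}

def reassign(points, seeds):
    """Assign every point to the nearest seed centroid."""
    return [min(seeds, key=lambda lab: math.dist(p, seeds[lab])) for p in points]

# Hypothetical hierarchical result in which (4.0, 4.0) joined cluster "a" early.
points = [(1.0, 1.0), (2.0, 1.0), (4.0, 4.0), (5.0, 5.0), (6.0, 5.0)]
hier_labels = ["a", "a", "a", "b", "b"]

seeds = centroids_from_labels(points, hier_labels)
print(reassign(points, seeds))  # ['a', 'a', 'b', 'b', 'b']
```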

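(4) Deciding on the Number of Clusters
ⅰ) Consider theoretical, conceptual, and practical purposes
ⅱ) Judge by the distances between clusters
ⅲ) In nonhierarchical clustering, plot the ratio (within-group variance) / (between-group variance) against the number of clusters and use the point where the curve bends as the number of clusters
ⅳ) Judge by the number of cases in each cluster (a cluster with only one case is undesirable)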
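A rough sketch of the within-to-between ratio for one candidate partition is given below. It uses sums of squares rather than variances (an assumption, since the slides do not specify degrees-of-freedom corrections) and hypothetical one-dimensional data. Computing the ratio for several numbers of clusters gives the curve whose bend is inspected.

```python
from statistics import mean

def variance_ratio(values, labels):
    """Within-group sum of squares divided by between-group sum of squares (1-D data)."""
    grand = mean(values)
    groups = {}
    for v, lab in zip(values, labels):
        groups.setdefault(lab, []).append(v)
    within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
    between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values())
    return within / between  # undefined when all group means coincide

values = [1.0, 2.0, 3.0, 11.0, 12.0, 13.0]             # hypothetical data
good = variance_ratio(values, [0, 0, 0, 1, 1, 1])      # split at the natural gap
poor = variance_ratio(values, [0, 0, 1, 1, 1, 1])      # split in the wrong place
print(good, poor)
```

A smaller ratio indicates tighter, better separated clusters, so the well-placed split scores lower than the misplaced one.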

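(5) Interpreting the Clusters
ⅰ) Usually interpreted through the cluster centroids
ⅱ) Discriminant analysis can be used
(6) Validation
ⅰ) Compare results obtained with different distance measures
ⅱ) Compare results obtained with different algorithms
ⅲ) Split the data randomly into two halves and compare the cluster centroids of each
ⅳ) Drop part of the data at random and compare the results on the remainder
ⅴ) Because nonhierarchical clustering depends on the order of the data, reorder the data, cluster several times, and select the most stable result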

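Examples
(1) Example 1
■ Purpose: to characterize the 15 car models in the existing market before launching a new model
■ Classification criteria (from a preliminary survey): exterior size and engine displacement
■ Exterior size and engine displacement are standardized
Car models: A B C D E F G H I J K L M N O
Standardized attribute scores (exterior size, engine displacement): 2.50 2.25 3.00 0.25 0.50 -0.25 -2.00 -1.50 -2.50 2.00 1.75 1.00 -0.50 -1.75 -2.25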

■ Two-dimensional plot of the car attributes
[Figure 18-4] Two-dimensional plot of car attributes; axes: X1 (exterior size), X2 (engine displacement); points A to O

■ SPSS Quick Cluster → computes the classification cluster centers, then computes the mean of each cluster and uses it again as input
[Figure 18-5] Result of single linkage (dendrogram; leaf order A D B C E G J F H I K N M O L)

[Figure 18-6] Result of complete linkage (dendrogram; leaf order A D B C E G J F H I K N M O L)

(2) Example 2
■ Purpose: to classify customers according to their ratings of the importance of company characteristics
(Stage 1) Partitioning
Step 1: Hierarchical cluster analysis
1) Similarity measure: squared Euclidean distances
2) Algorithm: Ward's method ⇒ minimizes within-cluster differences
3) Number of clusters: the two-cluster solution is chosen as the best

TABLE 7.2 Analysis of Agglomeration Coefficient for Hierarchical Cluster Analysis
Number of Clusters: 10 9 8 7 6 5 4 3 2 1
Percentage Change in Agglomeration Coefficient to Next Level: 8.9 8.5 9.2 9.3 12.1 17.0 17.6 61.9 -
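The percentage change to the next level can be computed as in the sketch below. The coefficients are hypothetical, since the agglomeration coefficients behind Table 7.2 are not shown in the slides, and the function name is illustrative. A sharp jump at the last merge is the kind of pattern that argues for stopping one step earlier.

```python
def pct_change_to_next_level(coefficients):
    """Percentage increase from each agglomeration coefficient to the next one.

    coefficients[i] is the coefficient at step i; the number of clusters
    shrinks by one at each step.
    """
    return [100.0 * (nxt - cur) / cur
            for cur, nxt in zip(coefficients, coefficients[1:])]

# Hypothetical coefficients for 4, 3, 2, and 1 clusters (not the values behind Table 7.2).
coeffs = [100.0, 110.0, 132.0, 264.0]
print(pct_change_to_next_level(coeffs))  # [10.0, 20.0, 100.0]
```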

Step 2: Nonhierarchical cluster analysis → fine-tune the results of the hierarchical procedure ⇒ confirm the results of the hierarchical procedure

Results of Nonhierarchical Cluster Analysis with Initial Seed Points from Hierarchical Results
Mean Values*
                                Cluster   X1     X2     X3     X4     X5     X6     X7     Cluster Size
Classification cluster centers  1         4.40   1.39   8.70   5.09   2.94   2.65   5.91
                                2         2.43   3.22   6.74   5.69   2.87   2.87   8.10
Final cluster centers           1         4.38   1.58   8.90   4.92   2.96   2.52   5.90   52
                                2         2.57   3.21   6.80   5.60   2.87   2.82   8.13   48

Significance Testing of Differences Between Cluster Centers
Variable                   Cluster M.S.   df   Error M.S.   df     F Ratio    Probability
X1 Delivery speed          81.5631        1    .9298        98.0   87.7172    .000
X2 Price level             66.4571        1    .7661        98.0   86.7526    .000
X3 Price flexibility       109.6372       1    .8233        98.0   133.1750   .000
X4 Manufacturer's image    11.3023        1    1.1778       98.0   9.5959     .003
X5 Overall service         .1883          1    .5682        98.0   .3314      .566
X6 Sales force's image     2.1233         1    .5786        98.0   3.6697     .058
X7 Product quality         123.3719       1    1.2797       98.0   96.4042    .000

* X1 = Delivery speed; X2 = Price level; X3 = Price flexibility; X4 = Manufacturer's image; X5 = Overall service; X6 = Sales force's image; X7 = Product quality.
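Each F ratio in this table is the cluster mean square divided by the error mean square. The sketch below recomputes the ratios from the table's own figures as a consistency check. Small differences in the later decimals are expected because the mean squares are printed in rounded form.

```python
# (variable, cluster mean square, error mean square, F ratio reported in the table)
rows = [
    ("X1", 81.5631, 0.9298, 87.7172),
    ("X2", 66.4571, 0.7661, 86.7526),
    ("X3", 109.6372, 0.8233, 133.1750),
    ("X4", 11.3023, 1.1778, 9.5959),
    ("X5", 0.1883, 0.5682, 0.3314),
    ("X6", 2.1233, 0.5786, 3.6697),
    ("X7", 123.3719, 1.2797, 96.4042),
]

def f_ratio(cluster_ms, error_ms):
    """F statistic for a one-way comparison: between-cluster MS over error MS."""
    return cluster_ms / error_ms

for name, cluster_ms, error_ms, reported in rows:
    print(f"{name}: recomputed F = {f_ratio(cluster_ms, error_ms):.2f}, reported F = {reported:.2f}")
```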

Stage Two: Interpretation
Group Means and Significance Level for Two-Group Nonhierarchical Cluster Solution
Variable                   Cluster 1   Cluster 2   F Ratio   Significance
X1 Delivery speed          4.460       2.570       105.00    .0000
X2 Price level             1.576       3.152       76.61     .0000
X3 Price flexibility       8.900       6.888       111.30    .0000
X4 Manufacturer's image    4.926       5.570       8.73      .0039
X5 Overall service         2.992       2.840       1.02      .3141
X6 Sales force's image     2.510       2.820       4.17      .0438
X7 Product quality         5.904       8.038       82.68     .0000

Stage Three: Profiling
Other variables of interest
Variable                   Cluster 1   Cluster 2   F Ratio   Significance
X9 Usage level             49.88       42.32       21.312    .0000
X10 Satisfaction level     5.16        4.38        26.545    .0000

Stage Two: Interpretation
- See Table 7.4
- X5 is judged not to differ between the two groups
- Cluster 1 focuses on ⅰ) delivery speed ⅱ) price flexibility
- Cluster 2 focuses on ⅰ) price ⅱ) manufacturer's image ⅲ) sales force image ⅳ) product quality
Stage Three: Validation
- See Table 7.5 (checking the consistency of the results)
⇒ Cluster a randomly selected subset and compare

TABLE 7.5 Results of Nonhierarchical Cluster Analysis with Randomly Selected Initial Seed Points
Mean Values*
                                Cluster   X1     X2     X3     X4     X5     X6     X7     Cluster Size
Classification cluster centers  1         4.95   1.14   9.03   6.55   3.21   3.79   5.09
                                2         1.76   2.70   6.87   5.50   1.97   2.70   8.45
Final cluster centers           1         4.47   1.57   8.93   4.99   2.99   2.57   5.78   48
                                2         2.63   3.10   6.94   5.49   2.84   2.75   8.07   52

Significance Testing of Differences Between Cluster Centers
Variable                   Cluster M.S.   df   Error M.S.   df     F Value    Probability
X1 Delivery speed          84.3339        1    .9016        98.0   93.5415    .000
X2 Price level             58.6837        1    .8454        98.0   69.4175    .000
X3 Price flexibility       98.5164        1    .9367        98.0   105.1700   .000
X4 Manufacturer's image    6.2640         1    1.2292       98.0   5.0958     .026
X5 Overall service         .5883          1    .5641        98.0   1.0428     .310
X6 Sales force's image     .7477          1    .5927        98.0   1.2616     .264
X7 Product quality         131.1200       1    1.2007       98.0   109.2055   .000

* X1 = Delivery speed; X2 = Price level; X3 = Price flexibility; X4 = Manufacturer's image; X5 = Overall service; X6 = Sales force's image; X7 = Product quality.