
1 Software Systems Practice: Machine Learning (1), Fall semester 2016

2 Basic Learning Process
 Data storage: storing factual data; uses observation, memory, and recall to provide a factual basis.
 Abstraction: data transformation; involves the translation of stored data into broader representations and concepts.
 Generalization: learning (generalization); uses abstracted data to create knowledge and models.
 Evaluation: assessment (oriented toward performance improvement); provides a feedback mechanism to measure the utility of learned knowledge and inform potential improvements.

3 Machine Learning
 Supervised Learning: Classification, Regression => derives a prediction model
 Unsupervised Learning: Clustering, Association rule mining => derives a description model
 Reinforcement Learning: an agent learns from (State, Action) -> Reward feedback

4 Machine Learning in Practice
 Data collection: in most cases, the data will need to be combined into a single source such as a text file, spreadsheet, or database.
 Data exploration and preparation:
 Data understanding -> feature selection, model selection
 Data cleansing -> fixing or cleaning so-called "messy" data, eliminating unnecessary data
 Data transformation -> recoding the data to conform to the learner's expected inputs
 Model training: machine learning algorithm selection -> model construction
 Model evaluation: evaluate the accuracy of the model using a test dataset; develop measures of performance specific to the intended application.
 Model improvement: utilize more advanced strategies to augment the performance of the model, e.g., augment the current training data or use another ML algorithm.

5 Input Data: numeric, nominal (or categorical), ordinal

6 Types of Machine Learning Algorithms
 Supervised Learning: produces a prediction model
 Classification: decides which category fits the given data; the categories are the values found in the class column.
 Regression: predicts a numeric value for the given data (e.g., income, laboratory values, test scores, or counts of items).
 Unsupervised Learning: produces a description model
 Association rule mining (pattern discovery): basket data analysis
 Clustering: segmentation analysis
 Meta-learning: the design of higher-level learning methods; focuses on learning how to learn more effectively.

7 Supervised Learning: Classification, Regression  The form of the prediction (classification) model depends on the learning algorithm: Support Vector Machines, statistical models (e.g., Bayesian networks), k-Nearest Neighbors, Decision Trees, Neural Networks.

8 Classification system architecture  Basic concept (figure: a classification/prediction model is learned from data and then applied to new cases)

9 Building a prediction model: Decision Tree (figure: a credit-analysis example; from labeled training data, a tree is learned that splits on salary < 20000 and on whether education is graduate, with leaf nodes labeled accept or reject)

10 Unsupervised Learning: Clustering (figure: example clusters such as office workers who enjoy travel, wealthy people who enjoy golf, and the elderly)

11 Unsupervised Learning: Association Mining  Given product purchase records, measure the associations between products and express the likelihood of items being purchased together as rules. Also known as market basket analysis.

12 Data Understanding

13 Exploring the structure of data

14 Exploring numeric variables

15 Visualizing numeric variables: box-plot

16 Visualizing numeric variables: interpreting the boxplot

17 Visualizing numeric variables: histogram

18 Measuring the central tendency: mode

19 Exploring categorical variables

20 Exploring relationships between variables  Visualizing relationships – scatterplots

21 Examining relationships – two-way cross-tabulations

22 k-Nearest Neighbors (instance-based learning) Supervised Learning

23 k-Nearest Neighbors

24 K-NN: example

25 Euclidean distance
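As a small check of the distance computation, here is a minimal R sketch; the feature names and values are made up for illustration:

    # Euclidean distance between two feature vectors
    euclidean <- function(x, y) sqrt(sum((x - y)^2))
    tomato <- c(sweetness = 6, crunchiness = 4)   # hypothetical data points
    orange <- c(sweetness = 7, crunchiness = 3)
    euclidean(tomato, orange)                     # ~1.41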

26 k-NN: example  When k = 1, the tomato's nearest neighbor is the orange; when k = 3, its neighbors are the orange, grape, and nuts. The predicted label comes from the class column of those neighbors.

27 Choosing an appropriate k  A larger k reduces the influence of noisy data but may miss small yet important patterns.  A smaller k can capture small but important patterns, but the possibility of over-fitting grows.

28 Weighted voting  The vote of closer neighbors is considered more authoritative than the vote of far-away neighbors.

29 Rescaling  Min-max normalization  z-score standardization
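A minimal R sketch of the two rescaling methods listed above, applied to a made-up numeric feature:

    # Min-max normalization: rescales values into the [0, 1] range
    normalize <- function(x) (x - min(x)) / (max(x) - min(x))
    # z-score standardization: zero mean, unit standard deviation
    standardize <- function(x) (x - mean(x)) / sd(x)
    x <- c(10, 20, 30, 40, 50)   # illustrative values
    normalize(x)                 # 0.00 0.25 0.50 0.75 1.00
    standardize(x)               # equivalent to scale(x)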

30 Coding  The Euclidean distance formula is not defined for nominal data.  To calculate the distance between nominal features,  we need to convert them into a numeric format.  => dummy coding, where a value of 1 indicates one category, and 0, the other.
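A short R sketch of dummy coding; the feature and its categories are hypothetical:

    # Dummy-code a two-level nominal feature: 1 for one category, 0 for the other
    temperature <- c("hot", "cold", "hot", "hot")
    temperature_hot <- ifelse(temperature == "hot", 1, 0)   # 1 0 1 1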

31 k-NN: lazy learning  Strictly speaking, lazy learning is not true learning: it merely stores the training data before the prediction phase.  Consequently, the prediction phase takes longer than for other algorithms.  Also known as instance-based learning or rote learning.

32 Example: diagnosing breast cancer  Step 1 – collecting data  Wisconsin Breast Cancer Diagnostic dataset from the UCI Machine Learning Repository ( http://archive.ics.uci.edu/ml )  measurements from digitized images of fine-needle aspirate of a breast mass.  569 examples with 32 features.

33 Example: diagnosing breast cancer  Step 2 – exploring and preparing the data  Importing the CSV file  Browsing the structure

34 Example: diagnosing breast cancer  Step 2 – exploring and preparing the data  The target feature (the class column) should be converted to a factor.
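A sketch of the factor conversion; the data frame name wbcd and the diagnosis column with values "B"/"M" are assumptions used only for illustration, not given on the slide:

    # Recode the target feature as a factor with informative labels
    wbcd$diagnosis <- factor(wbcd$diagnosis, levels = c("B", "M"),
                             labels = c("Benign", "Malignant"))
    table(wbcd$diagnosis)   # counts per class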

35 Example: diagnosing breast cancer  Step 2 – exploring and preparing the data  Transformation – normalizing numeric data

36 Example: diagnosing breast cancer  Step 2 – exploring and preparing the data  Transformation – normalizing numeric data using the normalize() function.

37 Example: diagnosing breast cancer  Step 2 – exploring and preparing the data  Transformation – normalizing numeric data using the normalize() function (a sketch follows below).
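A sketch of how a normalize() helper is typically written and applied column by column with lapply(); the data frame name and the column range 2:31 (the 30 numeric measurements) are assumptions:

    normalize <- function(x) (x - min(x)) / (max(x) - min(x))
    # Apply to every numeric column and rebuild a data frame
    wbcd_n <- as.data.frame(lapply(wbcd[2:31], normalize))
    summary(wbcd_n[[1]])   # each column now lies in [0, 1]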

38 Example: diagnosing breast cancer  Step 2 – exploring and preparing the data  Transformation – normalizing numeric data

39 Example: diagnosing breast cancer  Step 2 – exploring and preparing the data  Data preparation – creating training and test datasets
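A sketch of the split; the 469/100 row split and the object names are assumptions used only for illustration:

    # First 469 rows for training, remaining 100 for testing (rows already shuffled)
    wbcd_train <- wbcd_n[1:469, ]
    wbcd_test  <- wbcd_n[470:569, ]
    # Keep the class labels separately, for training and for evaluation
    wbcd_train_labels <- wbcd[1:469, 1]
    wbcd_test_labels  <- wbcd[470:569, 1]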

40 Example: diagnosing breast cancer  Step 3 – training a model & evaluating model performance  The knn() function is in the class package.
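A sketch of the knn() call from the class package, continuing the object names assumed above; k = 21 is an illustrative choice (roughly the square root of the training-set size):

    library(class)
    # k-NN returns a predicted label for each row of the test set
    wbcd_pred <- knn(train = wbcd_train, test = wbcd_test,
                     cl = wbcd_train_labels, k = 21)
    # Compare predictions with the true labels
    table(wbcd_pred, wbcd_test_labels)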

41 Example: diagnosing breast cancer  Step 4 – improving model performance  One method => Transformation – z-score standardization
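A sketch of the z-score alternative using R's built-in scale() function (object and column positions assumed as before):

    # scale() standardizes every column to zero mean and unit variance
    wbcd_z <- as.data.frame(scale(wbcd[-1]))   # drop the class column first
    wbcd_train_z <- wbcd_z[1:469, ]
    wbcd_test_z  <- wbcd_z[470:569, ]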

42 Example: diagnosing breast cancer  Step 4 – improving model performance  Another method => Testing alternative values of k

43 Probabilistic Learning: Naïve Bayes Classification Supervised Learning

44 Understanding probability  Joint probability: if events A and B are independent, P(A ∩ B) = P(A) * P(B).  In general, P(A ∩ B) = P(A|B) * P(B) = P(B|A) * P(A).

45 Bayes’ Theorem
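Stated generally, and following directly from the joint-probability identities on the previous slide: P(A|B) = P(A ∩ B) / P(B) = P(B|A) * P(A) / P(B), i.e., posterior = likelihood * prior / evidence.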

46 P(spam ∩ Viagra) = P(Viagra|spam) * P(spam) = (4/20) * (20/100) = 0.04  P(spam|Viagra) = P(Viagra|spam) * P(spam) / P(Viagra) = ((4/20) * (20/100)) / (5/100) = 0.80

47 Naïve Bayes algorithm

48  Classification with Naive Bayes: if an email contains the words 'Viagra' and 'Unsubscribe' but not 'Money' or 'Groceries', what is the posterior probability that the email is spam?

49 Naïve Bayes algorithm  Computed directly, the joint likelihood is too complex. By the chain rule:
P(w1, w2) = P(w1) * P(w2|w1)
P(w1, w2|s) = P(w1|s) * P(w2|w1, s)
P(w1, w2, w3|s) = P(w1|s) * P(w2, w3|w1, s) = P(w1|s) * P(w2|w1, s) * P(w3|w1, w2, s)
P(w1, w2, w3, w4|s) = P(w1|s) * P(w2, w3, w4|w1, s) = P(w1|s) * P(w2|w1, s) * P(w3, w4|w1, w2, s) = P(w1|s) * P(w2|w1, s) * P(w3|w1, w2, s) * P(w4|w1, w2, w3, s)
If the words (events) are mutually independent given the class (e.g., spam) — class-conditional independence — this reduces to
P(w1, w2, w3, w4|s) = P(w1|s) * P(w2|s) * P(w3|s) * P(w4|s)

50 Naïve Bayes algorithm  Since the denominator takes the same value regardless of the class, it can be ignored when comparing classes.

51 Naïve Bayes algorithm  In summary, the per-class scores are rescaled so that they sum to 1 and can be read as probabilities.

52 Naïve Bayes algorithm  There is one problem when computing the likelihood.  For example, given an email2 that contains the words 'Viagra', 'Groceries', 'Money', and 'Unsubscribe', what is its likelihood for spam, P('Viagra', 'Groceries', 'Money', 'Unsubscribe' | spam), under the Naïve Bayes algorithm?  If any of these words never appeared in a spam message, one factor is 0 and the whole product collapses to 0.  Solution: the Laplace estimator.

53 Naïve Bayes algorithm  Laplace estimator: add a small number (e.g., 1) to each cell of the frequency table.  (table: the word frequency/likelihood table recomputed with the Laplace-adjusted counts; the denominators become 24 for spam, 84 for ham, and 108 in total.)

54 Naïve Bayes algorithm  With the Laplace-adjusted likelihood table:  P(spam | 'Viagra', 'Groceries', 'Money', 'Unsubscribe') = 0.0004 / (0.0004 + 0.0001) = 0.8  P(ham | 'Viagra', 'Groceries', 'Money', 'Unsubscribe') = 0.0001 / (0.0004 + 0.0001) = 0.2

55 Naïve Bayes algorithm  Using numeric features with Naive Bayes  Numeric features => discretization: divide the whole range of numeric values into bins and turn them into categories.  Example: add a feature for the hour of day an email was received to help distinguish spam.
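A sketch of discretization with cut(); the hour-of-day values and the bin boundaries are illustrative assumptions:

    # Bin the hour an email arrived into four periods of the day
    hour <- c(2, 9, 14, 23, 7, 18)
    period <- cut(hour, breaks = c(0, 6, 12, 18, 24),
                  labels = c("night", "morning", "afternoon", "evening"),
                  include.lowest = TRUE)
    table(period)   # one categorical feature instead of a raw number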

56 Naïve Bayes algorithm  Example – filtering mobile phone spam

57 Naïve Bayes algorithm  exploring and preparing the data

58 Naïve Bayes algorithm  Data preparation – cleaning and standardizing text data  Use the tm package.  First, create a corpus object: a Source object is built from the sms_raw$text vector.  cf) PCorpus(): creates a permanent corpus backed by a store such as a database.
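A sketch of the corpus construction with the tm package, using the sms_raw$text vector named on the slide:

    library(tm)
    # Build an in-memory (volatile) corpus from the character vector of messages
    sms_corpus <- VCorpus(VectorSource(sms_raw$text))
    print(sms_corpus)   # number of documents in the corpus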

59 Naïve Bayes algorithm  Data preparation – cleaning and standardizing text data  To see the actual text content, inspect individual documents.

60 Naïve Bayes algorithm  Data preparation – cleaning and standardizing text data  To view several documents at once, use the lapply() function.

61 Naïve Bayes algorithm  Data preparation: remove numbers, remove punctuation, and strip white space.

62 Naïve Bayes algorithm  Data preparation: remove stopwords (a cleaning sketch covering these steps follows below).
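A sketch of the cleaning steps from the last two slides using tm_map(); lower-casing is included as a standard extra step, and object names are assumptions:

    # Lower-case, then strip numbers, stopwords, punctuation, and extra whitespace
    sms_clean <- tm_map(sms_corpus, content_transformer(tolower))
    sms_clean <- tm_map(sms_clean, removeNumbers)
    sms_clean <- tm_map(sms_clean, removeWords, stopwords())
    sms_clean <- tm_map(sms_clean, removePunctuation)
    sms_clean <- tm_map(sms_clean, stripWhitespace)
    lapply(sms_clean[1:3], as.character)   # inspect a few cleaned messages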

63 Naïve Bayes algorithm  Data preparation – cleaning and standardizing text data

64 Naïve Bayes algorithm  Data preparation – splitting text documents into words  Document-Term Matrix 생성
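A sketch of building the document-term matrix from the cleaned corpus (object names continue the assumptions above):

    # One row per message, one column per word, cells hold word counts
    sms_dtm <- DocumentTermMatrix(sms_clean)
    dim(sms_dtm)   # documents x terms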

65 Naïve Bayes algorithm  Data preparation – creating training and test datasets  Use about 70–80% of the whole dataset as training data and the rest as test data; also save the class-column labels for evaluating the model.
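A sketch of the split described above; the 75%/25% ratio and the type column holding the spam/ham labels are assumptions for illustration:

    # Split the DTM rows into training and test portions
    n_train <- round(0.75 * nrow(sms_dtm))
    sms_dtm_train <- sms_dtm[1:n_train, ]
    sms_dtm_test  <- sms_dtm[(n_train + 1):nrow(sms_dtm), ]
    # Save the class labels for later evaluation
    sms_train_labels <- sms_raw$type[1:n_train]
    sms_test_labels  <- sms_raw$type[(n_train + 1):nrow(sms_dtm)]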

66 Naïve Bayes algorithm  Visualizing text data – word clouds

67 Naïve Bayes algorithm  Data preparation – feature (word) selection  Some words do not help classification, so build a DTM that contains only the frequent words.
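A sketch of frequent-word selection with findFreqTerms(); the minimum frequency of 5 is an assumed threshold:

    # Words appearing in at least 5 training messages
    sms_freq_words <- findFreqTerms(sms_dtm_train, 5)
    # Keep only those columns in both DTMs
    sms_dtm_freq_train <- sms_dtm_train[, sms_freq_words]
    sms_dtm_freq_test  <- sms_dtm_test[, sms_freq_words]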

68 Naïve Bayes algorithm  Data preparation – data transformation  The values in the DTM are numeric counts and need to be converted to categorical values.
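A sketch of the count-to-category conversion, following the common convert_counts() pattern:

    # Turn word counts into "Yes"/"No" indicators
    convert_counts <- function(x) ifelse(x > 0, "Yes", "No")
    # apply() over the columns (MARGIN = 2) yields plain character matrices
    sms_train <- apply(sms_dtm_freq_train, MARGIN = 2, convert_counts)
    sms_test  <- apply(sms_dtm_freq_test,  MARGIN = 2, convert_counts)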

69 Naïve Bayes algorithm  Training a model on the data  Evaluating the model
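A sketch of training and evaluation; naiveBayes() from the e1071 package and CrossTable() from gmodels are common choices for this example, so treat that selection as an assumption:

    library(e1071)
    library(gmodels)
    # Train on the categorical word indicators and the spam/ham labels
    sms_classifier <- naiveBayes(sms_train, sms_train_labels)
    sms_pred <- predict(sms_classifier, sms_test)
    # Cross-tabulate predictions against the true labels
    CrossTable(sms_pred, sms_test_labels,
               prop.chisq = FALSE, dnn = c("predicted", "actual"))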

70 Naïve Bayes algorithm  Improving the model
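A sketch of one typical improvement: retraining with a Laplace estimator of 1 (cf. slide 53):

    # laplace = 1 prevents zero-frequency words from vetoing a class
    sms_classifier2 <- naiveBayes(sms_train, sms_train_labels, laplace = 1)
    sms_pred2 <- predict(sms_classifier2, sms_test)
    CrossTable(sms_pred2, sms_test_labels, prop.chisq = FALSE,
               dnn = c("predicted", "actual"))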

71 Decision Trees Supervised Learning

72 Decision Trees

73 Decision Trees  Recursive partitioning (or divide and conquer)  Example: predicting whether a movie will be a box-office success.

74 Decision Trees  Recursive partitioning (or Divide and Conquer)

75 Decision Trees  C5.0 decision tree algorithm: the standard decision-tree algorithm.

76 Decision Trees  Choosing the best split  In principle, the data in each partition should end up with a single class value.  During splitting, we therefore measure the degree (purity) to which each partition contains a single class.  C5.0 uses entropy to measure purity.

77 Decision Trees  Computing entropy  Example: P(red) = 0.6, P(blue) = 0.4.  Entropy is maximal for a 50-50 split.
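A quick R check of the entropy values mentioned above:

    # Entropy of a discrete distribution, in bits
    entropy <- function(p) -sum(p * log2(p))
    entropy(c(0.6, 0.4))   # ~0.971
    entropy(c(0.5, 0.5))   # 1, the maximum for two classes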

78 Decision Trees  Information gain: the change in homogeneity (entropy) when a dataset D is split on an attribute Ai into partitions D1, D2, ..., Dv.  The column with the largest information gain is chosen for the split.

79 Decision Trees (C5.0) (slide source: CS583, Bing Liu, UIC)

80 Decision Trees  Overfitting: the model is fit too closely to the training data.  It is accurate on the training data, but its error grows on the test data; the resulting tree tends to become wide and deep.  Ways to avoid overfitting:  Pre-pruning (early stopping): stop splitting at an appropriate point, but knowing that point is very difficult.  Post-pruning: grow the tree as far as possible, then remove branches that do not help classification; a validation set is set aside for pruning.

81 Decision Trees  Another purity measure: the Gini index.  Example: P(red) = 0.6, P(blue) = 0.4 gives Gini = 1 - (0.6^2 + 0.4^2) = 0.48.

82 Decision Trees  Overfitting: the decision boundary fits the training data too closely.

83 Decision Trees  Example: identifying risky bank loans  Exploring and preparing the data

84 Decision Trees  Exploring and preparing the data

85 Decision Trees  Data preparation – creating random training and test datasets
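A sketch of a random train/test split with sample(); the data frame name credit and the 90/10 split are assumptions:

    set.seed(123)   # make the random split reproducible
    train_idx <- sample(nrow(credit), round(0.9 * nrow(credit)))
    credit_train <- credit[train_idx, ]
    credit_test  <- credit[-train_idx, ]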

86 Decision Trees  Training a model on the data
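A sketch of training the tree with the C50 package; the class column name default is an assumption (it must be a factor):

    library(C50)
    # Grow a decision tree predicting loan default from all other columns
    credit_model <- C5.0(default ~ ., data = credit_train)
    summary(credit_model)   # tree structure and training error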

87 Decision Trees

88  Evaluating the model

89 Decision Trees  Improving the model: C5.0 includes a boosting technique.
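A sketch of the boosting option: C5.0() accepts a trials argument that builds an ensemble of boosted trees (trials = 10 is a common choice):

    # Adaptive boosting with 10 iterations
    credit_boost10 <- C5.0(default ~ ., data = credit_train, trials = 10)
    boost_pred <- predict(credit_boost10, credit_test)
    table(boost_pred, credit_test$default)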

