R for Data Mining

1. Association rule analysis
2. Decision tree
3. Text analysis

1. Association rule analysis

1.1 Data preparation and setup

① Place "mydata_association.csv" in "C:/Rtest".
② Set Rtest as the working directory:
> setwd("c:/Rtest")
③ Install and load the arules package, which provides the association rule algorithms:
> install.packages("arules")
> library(arules)

① Read the comma-separated data and save it to result:
> result <- read.transactions("mydata_association.csv", format="basket", sep=",")
② Show the row and column structure of result (transactions are rows, items are columns):
> result
③ Show a summary of the data in result:
> summary(result)
④ Plot the data in result:
> image(result)

1.2 Apply algorithm

① Convert the data in result to a table (data frame) structure:
> as(result, "data.frame")
② Run the apriori analysis on result and save the output to rules. The parameters set the minimum support and the minimum confidence, both 0.1 here:
> rules=apriori(result, parameter=list(supp=0.1, conf=0.1))
③ Display the rules:
> inspect(rules)
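To make the two thresholds concrete, the following base R sketch computes support and confidence by hand for one rule. The baskets and the helper functions support() and confidence() are invented for illustration; they are not part of arules and do not use the course data file.

```r
# Toy baskets (made-up data, not from mydata_association.csv)
baskets <- list(
  c("bread", "milk"),
  c("bread", "butter"),
  c("bread", "milk", "butter"),
  c("milk"),
  c("bread", "milk")
)

# Hypothetical helper: fraction of baskets that contain every item in `items`
support <- function(items, baskets) {
  mean(sapply(baskets, function(b) all(items %in% b)))
}

# Hypothetical helper: confidence(lhs => rhs) = support(lhs and rhs) / support(lhs)
confidence <- function(lhs, rhs, baskets) {
  support(c(lhs, rhs), baskets) / support(lhs, baskets)
}

supp_rule <- support(c("bread", "milk"), baskets)   # 3 of 5 baskets
conf_rule <- confidence("bread", "milk", baskets)   # 3 of the 4 bread baskets
print(c(support = supp_rule, confidence = conf_rule))
```

In this toy data the rule {bread} => {milk} has support 0.6 and confidence 0.75, so it would pass thresholds of 0.1 but not a minimum support of 0.7.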

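1.3 Analysis of output

Rerun the analysis with the minimum support raised to 0.3 and inspect the rules:
> rules=apriori(result, parameter=list(supp=0.3, conf=0.1))
> inspect(rules)

Speaker notes: On this slide the data does not actually come out properly, so this is a good point to talk about troubleshooting. In a real data analysis project, a run like this serves as a pilot before the data mart is actually built: if the run shows that the approach is not working, a different approach has to be found. For example, analyze the whole population instead of splitting it into clusters, or rebuild the transactions from sequential rules that reflect the order of purchases over time. This kind of troubleshooting is an important part of analysis. When a hiking trail turns out to be impassable, or a navigation route is unexpectedly congested, taking a detour is the natural thing to do. There is no fixed answer; the process of searching for the answer is itself the answer, and there is more than one path.

Factor analysis is probably the most common analysis in graduate school. When the variables do not group well, you reason again from various theories and add or drop variables, which is the same kind of troubleshooting. It is natural that the variables do not group as theory predicts: the context of the prior studies differs from the context of the current study, so the results may well differ. This is the issue of external validity (the repeatability and reliability of research results across time and place).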
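One way to see why raising the minimum support can leave few or no rules is to count how many item pairs survive at several thresholds. The sketch below uses invented baskets and plain base R, not arules, so the numbers only illustrate the effect.

```r
# Toy baskets (made-up data for illustration only)
baskets <- list(
  c("bread", "milk"),
  c("bread", "butter"),
  c("bread", "milk", "butter"),
  c("milk"),
  c("bread", "milk")
)

# All item pairs and the fraction of baskets containing each pair
items <- sort(unique(unlist(baskets)))
pairs <- combn(items, 2, simplify = FALSE)
pair_support <- sapply(pairs, function(p) {
  mean(sapply(baskets, function(b) all(p %in% b)))
})
names(pair_support) <- sapply(pairs, paste, collapse = "+")

# Number of pairs that meet each minimum support threshold
thresholds <- c(0.1, 0.3, 0.5)
n_kept <- sapply(thresholds, function(s) sum(pair_support >= s))
print(pair_support)
print(data.frame(min_support = thresholds, pairs_kept = n_kept))
```

If a threshold leaves nothing to inspect, lowering it step by step in this way is a cheap first check before changing the data or the approach.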

Code for association rule analysis with R

##### association analysis
setwd("c:/Rtest")
install.packages("arules")
library(arules)
result <- read.transactions("mydata_association.csv", format="basket", sep=",")
result
summary(result)
image(result)
as(result, "data.frame")
rules=apriori(result, parameter=list(supp=0.1, conf=0.1))
inspect(rules)
rules=apriori(result, parameter=list(supp=0.3, conf=0.1))

2. Decision Tree

2.1 Data preparation and setup

① Place "mydata_classification.csv" in "C:/Rtest". Prepare the data with Notepad or Excel and save it under a name ending in .csv.
② Set Rtest as the working directory:
> setwd("c:/Rtest")
③ Install and load the party package, which provides the decision tree algorithm:
> install.packages("party")
> library(party)

Read data

① Read mydata_classification.csv and save it to result, then view the data. stringsAsFactors=TRUE reads response as a factor, which ctree needs for classification; since R 4.0, read.csv no longer does this by default.
> result <- read.csv("mydata_classification.csv", header=FALSE, stringsAsFactors=TRUE)
> View(result)
② Install and load the reshape package:
> install.packages("reshape")
> library(reshape)
③ Give each column an understandable name: total, price, period, variety, response.
> result <- rename(result, c(V1="total", V2="price", V3="period", V4="variety", V5="response"))

2.2 Decision tree algorithm

① Fix the random seed so that the sampling is reproducible:
> set.seed(1234)
② Divide the data into training data (trainD) and test data (testD) at a ratio of about 7:3:
> resultsplit <- sample(2, nrow(result), replace=TRUE, prob=c(0.7, 0.3))
> trainD <- result[resultsplit==1,]
> testD <- result[resultsplit==2,]
③ Specify the model formula and fit the tree. The response column of result (nr = no response, low = one, high = many) is predicted from total, price, period and variety; ctree is the model used.
> rawD <- response ~ total + price + period+ variety
> trainModel <- ctree(rawD, data=trainD)
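Because sample(2, ..., prob=c(0.7, 0.3)) assigns each row independently, the resulting split is only approximately 7:3. The sketch below shows this on R's built-in iris data (used here purely as a stand-in for the course file) and, as an assumed alternative that is not in the slides, an exact 70/30 split by sampling row indices.

```r
set.seed(1234)                 # fix the seed so the split is reproducible
d <- iris                      # built-in data set, stand-in for the course data

# Split as in the slides: each row is assigned to group 1 or 2 at random
grp <- sample(2, nrow(d), replace = TRUE, prob = c(0.7, 0.3))
trainD <- d[grp == 1, ]
testD  <- d[grp == 2, ]
split_sizes <- c(train = nrow(trainD), test = nrow(testD))
print(split_sizes)             # close to 105/45, but not guaranteed to be exact

# Assumed alternative: exact 70/30 split by sampling row indices without replacement
idx <- sample(nrow(d), size = round(0.7 * nrow(d)))
trainExact <- d[idx, ]
testExact  <- d[-idx, ]
print(c(train = nrow(trainExact), test = nrow(testExact)))
```

Either approach works for a tutorial; the exact split is mainly useful when the data set is small and the group sizes matter.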

2.3 Analysis

① Classify the values of response in the training data using trainModel, then print the fitted tree:
> table(predict(trainModel), trainD$response)
> print(trainModel)

Annotations on the output: "Test data: 112"; price and period are the important classification variables.

2.4 Visualization of decision tree

① Show the model in tree form:
> plot(trainModel)
② Show the tree in simplified form:
> plot(trainModel, type="simple")

2.5 Test model

① Test the model by predicting response for the test data:
> testModel <- predict(trainModel, newdata=testD)
② Compare the predictions in testModel with the actual response values in testD:
> table(testModel, testD$response)
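The table of predicted against actual values can be summarized as a single accuracy figure. The sketch below uses invented prediction and response vectors with the class labels nr, low and high; it does not reproduce the output of the course model.

```r
# Made-up actual and predicted classes (illustration only)
lv <- c("nr", "low", "high")
actual    <- factor(c("nr", "low", "high", "nr", "low",  "high", "nr", "nr"),  levels = lv)
predicted <- factor(c("nr", "low", "low",  "nr", "high", "high", "nr", "low"), levels = lv)

# Rows are predictions, columns are actual classes; the diagonal holds correct predictions
confusion <- table(predicted, actual)
accuracy <- sum(diag(confusion)) / sum(confusion)

print(confusion)
print(accuracy)
```

Giving both factors the same levels keeps the table square, so the diagonal lines up even when a class is never predicted.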

Code for decision tree with R

##### classification analysis
setwd("c:/Rtest")
install.packages("party")
library(party)
# stringsAsFactors=TRUE keeps response as a factor (needed by ctree in R 4.0 and later)
result <- read.csv("mydata_classification.csv", header=FALSE, stringsAsFactors=TRUE)
View(result)
install.packages("reshape")
library(reshape)
result <- rename(result, c(V1="total", V2="price", V3="period", V4="variety", V5="response"))
set.seed(1234)
resultsplit <- sample(2, nrow(result), replace=TRUE, prob=c(0.7, 0.3))
trainD <- result[resultsplit==1,]
testD <- result[resultsplit==2,]
rawD <- response ~ total + price + period+ variety
trainModel <- ctree(rawD, data=trainD)
table(predict(trainModel), trainD$response)
print(trainModel)
plot(trainModel)
plot(trainModel, type="simple")
testModel <- predict(trainModel, newdata=testD)
table(testModel, testD$response)

Text Mining (Parsing) with R

1. Introduction

A case study on "What do they think about tax?": what are their major interests regarding tax?

2. Prepare the lab for parsing

① Copy "tax.txt" into "C:/Rtest". Use Notepad to edit the data; the file must have the extension .txt.
② Open R and set the working directory:
> setwd("c:/Rtest")
③ Install and load the packages for Korean language processing (KoNLP), word clouds (wordcloud) and word coloring (RColorBrewer). KoNLP requires the Java Runtime Environment (JRE).
> install.packages("KoNLP")
> install.packages("RColorBrewer")
> install.packages("wordcloud")
> library(KoNLP)
> library(RColorBrewer)
> library(wordcloud)

3. Read data and text analysis

① Open tax.txt as a UTF-8 connection and save it to result, then read its contents line by line into result2:
> result <- file("tax.txt", encoding="UTF-8")
> result2 <- readLines(result)
> head(result2, 3)
② Extract the nouns from each line in result2 and save them to result3, then print the first 20 nouns:
> result3 <- sapply(result2, extractNoun, USE.NAMES=F)
> head(unlist(result3), 20)
③ Save the extracted nouns to tax_word.txt so that they can be reused later:
> write(unlist(result3), "tax_word.txt")

4. Text analysis

① Read the contents of tax_word.txt into myword and check the number of records:
> myword <- read.table("tax_word.txt")
> nrow(myword)
② Count the word frequencies in myword and save them to wordcount, then show the 20 most frequent words in descending order:
> wordcount <- table(myword)
> head(sort(wordcount, decreasing=T), 20)
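The counting step can be tried without KoNLP or the tax data. The sketch below applies the same table() and sort() calls to a short, invented vector of English words.

```r
# Made-up word list standing in for the extracted nouns
words <- c("tax", "income", "tax", "refund", "income", "tax")

# Count each distinct word, then keep the two most frequent
wordcount <- table(words)
top <- head(sort(wordcount, decreasing = TRUE), 2)

print(top)
```

With this input the two most frequent words are "tax" (3 occurrences) and "income" (2 occurrences).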

5. Text analysis

① With wordcloud and RColorBrewer loaded, create the color palette used to color the words:
> palete <- brewer.pal(9, "Set1")
② Draw the word cloud in a graphics window:
> wordcloud(
+ names(wordcount),
+ freq=wordcount,
+ scale=c(5, 1),
+ rot.per=0.5,
+ min.freq=4,
+ random.order=F,
+ random.color=T,
+ colors=palete
+ )
③ Adjust the appearance of the cloud through the arguments of the wordcloud function:
- scale: range of word sizes (largest, smallest)
- rot.per: proportion of words rotated by 90 degrees
- min.freq: minimum frequency a word needs in order to be plotted
④ How to enter a long line of code:
- Pushing Shift + Enter creates a (+) at the start of the next line
- The (+) is only the continuation prompt; it means the line continues the same command and is not part of the code

The same call on a single line:
> wordcloud(names(wordcount), freq=wordcount, scale=c(5, 1), rot.per=0.5, min.freq=4, random.order=F, random.color=T, colors=palete)

6. Visualizing with wordcloud

Initial word cloud: words such as ‘것’, ‘저’, ‘원’ are not important.

Modified word cloud: remove the unimportant words from the initial word cloud.
> result2 <- gsub("것", "", result2)
> result2 <- gsub("저", "", result2)
> result2 <- gsub("원", "", result2)
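One caveat with gsub is that it deletes the pattern wherever it occurs, including inside longer words. The sketch below illustrates this with invented English tokens and a hypothetical stopwords vector, and shows an alternative that drops only whole tokens. It assumes the filtering is done on the extracted words rather than on the raw lines, which differs from the slides.

```r
# Made-up tokens and a hypothetical stop-word list (illustration only)
tokens <- c("tax", "taxpayer", "it", "city", "refund")
stopwords <- c("it", "tax")

# gsub removes the pattern everywhere, so "taxpayer" and "city" are damaged too
stripped <- tokens
for (w in stopwords) {
  stripped <- gsub(w, "", stripped)
}

# Alternative: keep only the tokens that are not exactly equal to a stop word
kept <- tokens[!(tokens %in% stopwords)]

print(stripped)
print(kept)
```

Whether the substring behavior matters depends on the data; for single-syllable Korean words it is worth checking the result before redrawing the cloud.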