R for Data Mining: 1. Association Rules  2. Decision Trees  3. Text Analysis
1. Association rule analysis
1.1 Data preparation and setup
① Place "mydata_association.csv" in "C:/Rtest" and set Rtest as the working directory:
> setwd("c:/Rtest")
② Install and load the arules package, which provides the association rule algorithms:
> install.packages("arules")
> library(arules)
③ Read the comma-separated data and save it into result:
> result <- read.transactions("mydata_association.csv", format="basket", sep=",")
> result
> summary(result)
> image(result)
- result: show the transaction object
- summary(result): show the row/column structure — transactions (rows) by items (columns)
- image(result): plot the data held in result
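The read.transactions() step above can be tried without the course file. This is a minimal, self-contained sketch that writes a few made-up basket transactions (milk/bread/butter are illustrative, not from the real data) to a temporary file and reads them back, assuming the arules package is installed:

```r
library(arules)

# Write three example transactions to a temporary basket file;
# each line is one transaction, items separated by commas
tmp <- tempfile(fileext = ".csv")
writeLines(c("milk,bread,butter",
             "milk,bread",
             "bread,butter"), tmp)

trans <- read.transactions(tmp, format = "basket", sep = ",")
trans              # prints the transactions object
summary(trans)     # row/column structure: 3 transactions x 3 items
image(trans)       # visualize the transaction-by-item matrix
```

The same three calls (print, summary, image) then behave exactly as on the course data, just on a smaller matrix.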
1.2 Apply the algorithm
> as(result, "data.frame")
> rules <- apriori(result, parameter=list(supp=0.1, conf=0.1))
> inspect(rules)
① Convert the data in result to a data frame (table) structure
② Run the apriori algorithm on result and save the output into rules, with minimum support 0.1 and minimum confidence 0.1
③ Show the rules that were found
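The apriori() call can likewise be exercised end-to-end on toy data. A hedged sketch, again with made-up milk/bread/butter baskets rather than mydata_association.csv, assuming arules is installed:

```r
library(arules)

# Four toy transactions in basket format
tmp <- tempfile(fileext = ".csv")
writeLines(c("milk,bread,butter",
             "milk,bread",
             "bread,butter",
             "milk,butter"), tmp)
trans <- read.transactions(tmp, format = "basket", sep = ",")

# Mine rules with minimum support 0.1 and minimum confidence 0.1
rules <- apriori(trans, parameter = list(supp = 0.1, conf = 0.1))
inspect(rules)     # each rule with its support, confidence, and lift
quality(rules)     # just the quality measures, as a data frame
```

Raising supp (as section 1.3 does with supp=0.3) simply filters the same search to itemsets that occur in a larger share of transactions, so fewer rules survive.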
1.3 Analysis of the output
> rules <- apriori(result, parameter=list(supp=0.3, conf=0.1))
> inspect(rules)
① Rerun apriori with the minimum support raised to 0.3
② Inspect the rules that remain
③ (Instructor's note) This slide does not actually produce usable results, which makes it a good place to talk about troubleshooting. Treat a run like this as a pilot before building the real analysis mart: if the pilot shows the current approach will not work, you devise another one — for example, analyzing the whole population instead of splitting it into clusters, or rebuilding the transactions as sequence rules that reflect the temporal order of purchases. This kind of troubleshooting matters in analysis. If a hiking trail turns out to be blocked, or the navigation route is unexpectedly congested, taking a detour is the natural thing to do. There is no predetermined answer; the process of finding the answer is itself the answer, and there is more than one path to it. Probably the most common analysis in graduate school is factor analysis: when the variables do not group well, you reason again from various theories and add or drop variables — which is exactly the same thing. It is natural for variables not to group the way theory says: the context of the prior studies differs from the context of your own study, so the results may well differ. This is the issue of external validity (whether findings can be replicated reliably across time and place).
Code for association rule analysis with R
##### association analysis
setwd("c:/Rtest")
install.packages("arules")
library(arules)
result <- read.transactions("mydata_association.csv", format="basket", sep=",")
result
summary(result)
image(result)
as(result, "data.frame")
rules <- apriori(result, parameter=list(supp=0.1, conf=0.1))
inspect(rules)
rules <- apriori(result, parameter=list(supp=0.3, conf=0.1))
inspect(rules)
2. Decision Tree
2.1 Data preparation and setup
① Place "mydata_classification.csv" in "C:/Rtest" (prepare the data as a .csv file using Notepad or Excel) and set Rtest as the working directory:
> setwd("c:/Rtest")
② Install and load the party package, which provides the decision tree algorithm:
> install.packages("party")
> library(party)
③ Read the data:
> result <- read.csv("mydata_classification.csv", header=FALSE)
> View(result)
> install.packages("reshape")
> library(reshape)
> result <- rename(result, c(V1="total", V2="price", V3="period", V4="variety", V5="response"))
① Read mydata_classification.csv and save it into result
② Install and load the reshape package
③ Make each column understandable by renaming V1–V5 to total, price, period, variety, response
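If the reshape package is unavailable, the same renaming can be done with base R's `names()<-`. A sketch using a small stand-in data frame (the values below are dummies, not the real CSV contents):

```r
# Stand-in for the CSV: read.csv(..., header=FALSE) names columns V1, V2, ...
result <- data.frame(V1 = c(10, 20), V2 = c(1.5, 2.5), V3 = c(3, 6),
                     V4 = c(2, 4), V5 = c("low", "high"))

# Base-R equivalent of reshape::rename()
names(result) <- c("total", "price", "period", "variety", "response")
head(result)   # columns are now meaningfully named
```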
2.2 Decision tree algorithm
> set.seed(1234)
> resultsplit <- sample(2, nrow(result), replace=TRUE, prob=c(0.7, 0.3))
> trainD <- result[resultsplit==1,]
> testD <- result[resultsplit==2,]
> rawD <- response ~ total + price + period + variety
> trainModel <- ctree(rawD, data=trainD)
① Fix the random seed so the sampling is reproducible
② Split the data roughly 7:3 into trainD (training data) and testD (test data)
③ Specify the model formula — predict response (nr = no response, low = few, high = many) from total, price, period, and variety — and fit a conditional inference tree on the training data
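The 7:3 split works on any data frame. A runnable sketch using R's built-in iris data as a stand-in for the course CSV:

```r
set.seed(1234)
# Assign each row to group 1 (train) or group 2 (test)
# with probabilities 0.7 and 0.3
split  <- sample(2, nrow(iris), replace = TRUE, prob = c(0.7, 0.3))
trainD <- iris[split == 1, ]
testD  <- iris[split == 2, ]

nrow(trainD) + nrow(testD) == nrow(iris)  # every row lands in exactly one set
prop.table(table(split))                  # empirical proportions, roughly 0.7 / 0.3
```

Because the assignment is random per row, the realized proportions only approximate 7:3; set.seed() makes the particular split reproducible.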
2.3 Analysis
> table(predict(trainModel), trainD$response)
> print(trainModel)
① Cross-tabulate the values predicted by trainModel against the actual response values in the training data
② Print the fitted tree; the output reports 112 training observations, and price and period turn out to be the important classification variables
2.4 Visualization of the decision tree
> plot(trainModel)
> plot(trainModel, type="simple")
① Show the model in tree form
② Show the tree in simplified form
2.5 Test the model
> testModel <- predict(trainModel, newdata=testD)
> table(testModel, testD$response)
① Apply the trained model to the test data and save the predictions into testModel
② Cross-tabulate the predictions against the actual responses in the test data
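The table() call above produces a confusion matrix. A minimal sketch with made-up predicted and actual labels, showing how to read it:

```r
# Hypothetical predicted and actual class labels (not the course data)
predicted <- factor(c("high", "low", "low",  "high", "low"))
actual    <- factor(c("high", "low", "high", "high", "low"))

cm <- table(predicted, actual)
cm                         # rows = predicted class, columns = actual class
sum(diag(cm)) / sum(cm)    # accuracy: share of counts on the diagonal
```

Off-diagonal cells are misclassifications; here one "high" case was predicted "low", so 4 of 5 cases are correct (accuracy 0.8).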
Code for decision tree with R
##### classification analysis
setwd("c:/Rtest")
install.packages("party")
library(party)
result <- read.csv("mydata_classification.csv", header=FALSE)
View(result)
install.packages("reshape")
library(reshape)
result <- rename(result, c(V1="total", V2="price", V3="period", V4="variety", V5="response"))
set.seed(1234)
resultsplit <- sample(2, nrow(result), replace=TRUE, prob=c(0.7, 0.3))
trainD <- result[resultsplit==1,]
testD <- result[resultsplit==2,]
rawD <- response ~ total + price + period + variety
trainModel <- ctree(rawD, data=trainD)
table(predict(trainModel), trainD$response)
print(trainModel)
plot(trainModel)
plot(trainModel, type="simple")
testModel <- predict(trainModel, newdata=testD)
table(testModel, testD$response)
Text Mining (Parsing) with R
1. Introduction A case study on "What do they think about tax?": what are people's major interests regarding tax?
2. Prepare the lab for parsing
① Copy "tax.txt" into "C:/Rtest" — use Notepad to edit the data, and the file must have the .txt extension
② Open R and set the working directory:
> setwd("c:/Rtest")
③ Install the packages for handling Korean text (KoNLP), word clouds (wordcloud), and coloring words (RColorBrewer); KoNLP requires the Java Runtime Environment (JRE)
> install.packages("KoNLP")
> install.packages("RColorBrewer")
> install.packages("wordcloud")
> library(KoNLP)
> library(RColorBrewer)
> library(wordcloud)
3. Read data and text analysis
> result <- file("tax.txt", encoding="UTF-8")
> result2 <- readLines(result)
> head(result2, 3)
> result3 <- sapply(result2, extractNoun, USE.NAMES=F)
> head(unlist(result3), 20)
> write(unlist(result3), "tax_word.txt")
① Open tax.txt and read its contents line by line into result2; head(result2, 3) shows the first three lines
② Extract the nouns from each line of result2 and save them into result3; head(unlist(result3), 20) prints the first 20 nouns
③ Write the extracted nouns to tax_word.txt so they can be reused later for other purposes
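extractNoun() needs KoNLP and its Korean dictionary, but the sapply-over-lines pattern itself can be sketched with a plain whitespace tokenizer standing in for the noun extractor (the English lines below are dummies, not tax.txt):

```r
# Stand-in for readLines(result) on tax.txt
lines <- c("tax reform debate", "income tax rate")

# Stand-in tokenizer; with KoNLP you would pass extractNoun here instead
tokenize <- function(x) strsplit(x, " ")[[1]]

words <- sapply(lines, tokenize, USE.NAMES = FALSE)
head(unlist(words), 20)         # first words across all lines

tmp <- tempfile(fileext = ".txt")
write(unlist(words), tmp)       # one word per line, like tax_word.txt
```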
4. Text analysis
> myword <- read.table("tax_word.txt")
> nrow(myword)
> wordcount <- table(myword)
> head(sort(wordcount, decreasing=T), 20)
① Read the contents of tax_word.txt into myword; nrow(myword) checks the number of records inside myword
② Count the word frequencies in myword and save them into wordcount; show the top 20 words in wordcount, sorted in descending order of frequency
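The table-then-sort idiom above is plain base R and can be checked on a toy word vector (the words here are made up):

```r
# Stand-in for the one-word-per-line contents of tax_word.txt
words <- c("tax", "reform", "tax", "rate", "tax", "reform")

wordcount <- table(words)                       # frequency of each word
head(sort(wordcount, decreasing = TRUE), 20)    # most frequent words first
```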
5. Word cloud
> palete <- brewer.pal(9, "Set1")
> wordcloud(
+ names(wordcount),
+ freq=wordcount,
+ scale=c(5, 1),
+ rot.per=0.5,
+ min.freq=4,
+ random.order=F,
+ random.color=T,
+ colors=palete
+ )
① Pick a color palette from RColorBrewer to put color on the words
② Draw the word cloud in a graphics window
③ Control the weighting of words by adjusting the arguments of the wordcloud() function:
- scale: controls the range of word sizes
- rot.per: controls the proportion of words drawn rotated
- min.freq: the minimum frequency a word needs to appear in the cloud
④ How to enter a long line of code: when a command is incomplete at the end of a line, R continues it on the next line with a (+) prompt; the (+) lines are parts of a single command:
wordcloud(names(wordcount), freq=wordcount, scale=c(5, 1), rot.per=0.5, min.freq=4, random.order=F, random.color=T, colors=palete)
6. Visualizing with wordcloud
Initial word cloud: '것', '저', '원', etc. are not meaningful words.
Modified word cloud: remove the unimportant words before rebuilding the cloud:
> result2 <- gsub("것", "", result2)
> result2 <- gsub("저", "", result2)
> result2 <- gsub("원", "", result2)
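The gsub() cleanup step can be verified on a couple of stand-in lines (the text below is illustrative, not from tax.txt), looping over the filler words instead of repeating the call:

```r
# Stand-in lines containing the filler words 것, 저, 원
lines <- c("세금 것 인상", "저 세금 원")

# Remove each filler word everywhere it occurs
for (w in c("것", "저", "원")) {
  lines <- gsub(w, "", lines)
}
lines   # filler words are gone; meaningful words like 세금 remain
```

Note gsub() replaces every match of the pattern, so a filler string would also be stripped out of longer words that happen to contain it; for real data a whole-word pattern may be safer.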