Simpel Regresi Linier, Logistik dan K-Means Clustering Menggunakan R

Portal-Statistik | Hari ini kita akan berbagi bagaimana untuk melakukan analisis data dengan beberapa metode sekaligus, Analisis Regresi Linear Sederhana, Analisis Regresi Logistik dan K-Means Clustering dengan menggunakan aplikasi R.


Fungsi untuk visualisasi

add_title <- function(vis, ..., x_lab = "X units", title = "Plot Title") 
{
  add_axis(vis, "x", title = x_lab) %>% 
    add_axis("x", orient = "top", ticks = 0, title = title,
             properties = axis_props(
               axis = list(stroke = "white"),
               labels = list(fontSize = 0)
             ), ...)
}
Library yang diperlukan

library(ggplot2)
library(ggvis)
library(NbClust)#cluster
library(fpc)#analisis diskriminan
library(lmtest)
library(rattle)
1. Analisis Regresi

Pemodelan dengan dataset Trees
Langkah pertama cek distribusi data untuk setiap variabel Girth terhadap Volume:
trees %>% ggvis (~Girth,~Volume) %>% 
  layer_boxplots() %>% 
  add_title(title = "Indentifying Anomaly data", x_lab = "#Girth session")
dan Height terhadap Volume:
trees %>% ggvis (~Height,~Volume) %>% 
  layer_boxplots() %>% 
  add_title(title = "Indentifying Anomaly data", x_lab = "#Height")
Dari uji diagnostik terlihat ada pola linier (positif) untuk setiap variabel. Tampak juga ada beberapa variansi data yang besar dari “height” terhadap Volume.
Untuk melanjutkan analisis regresi agar menghasilkan model yang baik, maka beberapa yang terindentifikasi outlier (atau anomali) dapat dihilangkan dalam perhitungan model.
Metode yang digunakan untuk hal ini dengan menggunakan jarak Cook.
mod <- lm(data=trees, Volume~Girth+Height )
cooksd <- cooks.distance(mod)
#ploting cooksd
plot(cooksd, pch="*", cex=2, main="Influential Obs by Cooks distance") 
abline(h = 4*mean(cooksd, na.rm=T), col="red")
text(x=1:length(cooksd)+1, y=cooksd, labels=ifelse(cooksd>4*mean(cooksd, na.rm=T),names(cooksd),""), col="red") 
Kemudian, row data yang teridentifikasi sebagai outlier dapat dieliminasi dalam model
influential <- as.numeric(names(cooksd)[(cooksd > 4*mean(cooksd, na.rm=T))])  
clean.data <- data.frame(trees[-influential, ])
summary(model <-lm(data=clean.data, Volume~Girth+Height))
## 
## Call:
## lm(formula = Volume ~ Girth + Height, data = clean.data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.6396 -1.2786 -0.4938  2.0262  6.4599 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -52.2362     8.0390  -6.498 5.76e-07 ***
## Girth         4.4773     0.2518  17.781  < 2e-16 ***
## Height        0.2992     0.1179   2.538   0.0172 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.49 on 27 degrees of freedom
## Multiple R-squared:  0.9437, Adjusted R-squared:  0.9395 
## F-statistic: 226.3 on 2 and 27 DF,  p-value: < 2.2e-16

Kesimpulan yang didapat, bahwa Girth dan Height memiliki pengaruh yang kuat terhadap Volume pohon. Sehingga dapat dikatakan juga untuk menghitung Volume sebuah pohon 93% dapat digambarkan menggunakan nilai Girth dan Height nya.
Sebagai tambahan, dapat juga dilakukan uji kebaikan model dengan uji heteroskedastik error, model dikatakan baik jika H0 gagal ditolak (homoskedastik). Error yang homogen dapat mengidentifikasikan hasil estimasi stabil (selisih antara estimasi dan nilai asli tidak terlalu bervariasi)
bptest(model)
## 
##  studentized Breusch-Pagan test
## 
## data:  model
## BP = 1.8397, df = 2, p-value = 0.3986

Uji autokorelasi dapat diabaikan untuk kasus ini karena termasuk data cross section, bukan time series.
Normalitas error, jika error yang dihasilkan model berdistirbusi normal maka dapat dikatakan model baik
res=residuals(model,type="response")
shapiro.test(res)
## 
##  Shapiro-Wilk normality test
## 
## data:  res
## W = 0.96008, p-value = 0.3113

Kemudian uji multikoliniearitas. Di anggap baik Jika nilai VIF dibawah 5.
library(fmsb)
VIF(model)
## [1] 17.76253

Karena model teridentifikasi multikolinieritas, maka model dapat disusun ulang sebagai berikut dengan mempertimbangkan nilai t statistik terbesar untuk variabel independen.
summary(model1 <-lm(data=clean.data, Volume~Girth))
## 
## Call:
## lm(formula = Volume ~ Girth, data = clean.data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.5036 -2.3834 -0.0489  2.3951  6.3726 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -33.3104     3.2784  -10.16 6.76e-11 ***
## Girth         4.7619     0.2464   19.33  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.813 on 28 degrees of freedom
## Multiple R-squared:  0.9303, Adjusted R-squared:  0.9278 
## F-statistic: 373.6 on 1 and 28 DF,  p-value: < 2.2e-16

Kesimpulan terkahir didapat, untuk memperkirakan Volume (atau bahkan height) dapat digambarkan 93% menggunakan Girth.

2. Logistic Regression
library(datasets)
 dim(Titanic)
## [1] 4 2 2 2

Mempersiapkan data
 classes <- dimnames(Titanic)[[1]]
 genders <- dimnames(Titanic)[[2]]
 ages    <- dimnames(Titanic)[[3]]

 class <- gender <- age <- N <- Y <- rep(0,4*2*2)
 count<-0
 for(i1 in 1:4){
   for(i2 in 1:2){
     for(i3 in 1:2){
       count <- count + 1
       class[count]  <- classes[i1]
       gender[count] <- genders[i2]
       age[count]    <- ages[i3]
       N[count]      <- sum(Titanic[i1,i2,i3,])
       Y[count]      <- Titanic[i1,i2,i3,2]
     }
   }
  }

   Xc1 <- ifelse(class=="1st",1,0)
   Xc2 <- ifelse(class=="2nd",1,0)
   Xc3 <- ifelse(class=="3rd",1,0)
   Xg  <- ifelse(gender=="Female",1,0)
   Xa  <- ifelse(age=="Adult",1,0)
   n   <- 16

   cbind(class,Xc1,Xc2,Xc3,gender,Xg,age,Xa,N,Y)
##       class  Xc1 Xc2 Xc3 gender   Xg  age     Xa  N     Y    
##  [1,] "1st"  "1" "0" "0" "Male"   "0" "Child" "0" "5"   "5"  
##  [2,] "1st"  "1" "0" "0" "Male"   "0" "Adult" "1" "175" "57" 
##  [3,] "1st"  "1" "0" "0" "Female" "1" "Child" "0" "1"   "1"  
##  [4,] "1st"  "1" "0" "0" "Female" "1" "Adult" "1" "144" "140"
##  [5,] "2nd"  "0" "1" "0" "Male"   "0" "Child" "0" "11"  "11" 
##  [6,] "2nd"  "0" "1" "0" "Male"   "0" "Adult" "1" "168" "14" 
##  [7,] "2nd"  "0" "1" "0" "Female" "1" "Child" "0" "13"  "13" 
##  [8,] "2nd"  "0" "1" "0" "Female" "1" "Adult" "1" "93"  "80" 
##  [9,] "3rd"  "0" "0" "1" "Male"   "0" "Child" "0" "48"  "13" 
## [10,] "3rd"  "0" "0" "1" "Male"   "0" "Adult" "1" "462" "75" 
## [11,] "3rd"  "0" "0" "1" "Female" "1" "Child" "0" "31"  "14" 
## [12,] "3rd"  "0" "0" "1" "Female" "1" "Adult" "1" "165" "76" 
## [13,] "Crew" "0" "0" "0" "Male"   "0" "Child" "0" "0"   "0"  
## [14,] "Crew" "0" "0" "0" "Male"   "0" "Adult" "1" "862" "192"
## [15,] "Crew" "0" "0" "0" "Female" "1" "Child" "0" "0"   "0"  
## [16,] "Crew" "0" "0" "0" "Female" "1" "Adult" "1" "23"  "20"

Membuat fungsi logistik dengan metode estimasi Likelihood
logistic_model <- "model{
   # Likelihood
   for(i in 1:n){
    Y[i] ~ dbin(q[i],N[i])
    logit(q[i]) <- beta[1] + beta[2]*Xc1[i] + beta[3]*Xc2[i] + 
                   beta[4]*Xc3[i] + beta[5]*Xg[i] + beta[6]*Xa[i]
   }

   #Priors
   for(j in 1:6){
    beta[j] ~ dnorm(0,0.01)
   }
  }"
Modelling menggunakan logistik model
library(rjags)
  dat    <- list(Y=Y,n=n,N=N,Xc1=Xc1,Xc2=Xc2,Xc3=Xc3,Xg=Xg,Xa=Xa)
  model1 <- jags.model(textConnection(logistic_model),data = dat,n.chains=3, quiet=TRUE) 
Menghitung nilai DIC
  dic1  <- dic.samples(model1, 
           variable.names=c("beta"), 
           n.iter=20000, progress.bar="none")
Konvergensi Diagnosis
samp <- coda.samples(model1, 
        variable.names=c("beta"), 
        n.iter=20000, progress.bar="none")
autocorr.plot(samp)
gelman.plot(samp)
summary(samp)
## 
## Iterations = 21001:41000
## Thinning interval = 1 
## Number of chains = 3 
## Sample size per chain = 20000 
## 
## 1. Empirical mean and standard deviation for each variable,
##    plus standard error of the mean:
## 
##            Mean     SD  Naive SE Time-series SE
## beta[1] -0.1643 0.2613 0.0010670       0.008718
## beta[2]  0.8569 0.1583 0.0006461       0.001258
## beta[3] -0.1651 0.1739 0.0007099       0.001575
## beta[4] -0.9280 0.1486 0.0006068       0.001882
## beta[5]  2.4314 0.1401 0.0005722       0.001060
## beta[6] -1.0711 0.2485 0.0010145       0.008102
## 
## 2. Quantiles for each variable:
## 
##            2.5%     25%     50%       75%   97.5%
## beta[1] -0.6739 -0.3353 -0.1651  0.008272  0.3573
## beta[2]  0.5465  0.7511  0.8572  0.963007  1.1664
## beta[3] -0.5081 -0.2825 -0.1631 -0.045948  0.1722
## beta[4] -1.2229 -1.0272 -0.9267 -0.827322 -0.6376
## beta[5]  2.1599  2.3363  2.4310  2.525828  2.7076
## beta[6] -1.5624 -1.2349 -1.0703 -0.907697 -0.5834

Kesimpulan:
  1. Perempuan (mean Beta 5 >0) bertahan hidup lebih tinggi (peluang) dibandingkan laki-laki (mean Beta 4 <0)
  2.  Crew Kelas kedua (Mean Beta 2 > 0) bertahan hidup lebih baik dibandingkan Crew kelas Pertama (Mean Beta 1 <0) dan Crew kelas ketiga (mean beta 3 <0)
  3. Orang dewasa (adults) memiliki kemungkinan bertahan lebih buruk (mean beta 6 <0) dibandingkan anak-anak (child) (mean beta 5> 0)
3. Clusteing Analisis
Langkah pertama melihat struktur data
head(wine,3)
##   Type Alcohol Malic  Ash Alcalinity Magnesium Phenols Flavanoids
## 1    1   14.23  1.71 2.43       15.6       127    2.80       3.06
## 2    1   13.20  1.78 2.14       11.2       100    2.65       2.76
## 3    1   13.16  2.36 2.67       18.6       101    2.80       3.24
##   Nonflavanoids Proanthocyanins Color  Hue Dilution Proline
## 1          0.28            2.29  5.64 1.04     3.92    1065
## 2          0.26            1.28  4.38 1.05     3.40    1050
## 3          0.30            2.81  5.68 1.03     3.17    1185

str(wine)
## 'data.frame':    178 obs. of  14 variables:
##  $ Type           : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Alcohol        : num  14.2 13.2 13.2 14.4 13.2 ...
##  $ Malic          : num  1.71 1.78 2.36 1.95 2.59 1.76 1.87 2.15 1.64 1.35 ...
##  $ Ash            : num  2.43 2.14 2.67 2.5 2.87 2.45 2.45 2.61 2.17 2.27 ...
##  $ Alcalinity     : num  15.6 11.2 18.6 16.8 21 15.2 14.6 17.6 14 16 ...
##  $ Magnesium      : int  127 100 101 113 118 112 96 121 97 98 ...
##  $ Phenols        : num  2.8 2.65 2.8 3.85 2.8 3.27 2.5 2.6 2.8 2.98 ...
##  $ Flavanoids     : num  3.06 2.76 3.24 3.49 2.69 3.39 2.52 2.51 2.98 3.15 ...
##  $ Nonflavanoids  : num  0.28 0.26 0.3 0.24 0.39 0.34 0.3 0.31 0.29 0.22 ...
##  $ Proanthocyanins: num  2.29 1.28 2.81 2.18 1.82 1.97 1.98 1.25 1.98 1.85 ...
##  $ Color          : num  5.64 4.38 5.68 7.8 4.32 6.75 5.25 5.05 5.2 7.22 ...
##  $ Hue            : num  1.04 1.05 1.03 0.86 1.04 1.05 1.02 1.06 1.08 1.01 ...
##  $ Dilution       : num  3.92 3.4 3.17 3.45 2.93 2.85 3.58 3.58 2.85 3.55 ...
##  $ Proline        : int  1065 1050 1185 1480 735 1450 1290 1295 1045 1045 ...

Melakukan standarisasi data, dan membuat kolom pertama (factor fields)
dataLatih <- scale(wine[-1])
Kemudian menentukan berapa jumlah cluster yang harus terbentuk hingga hasil optimal
nc <- NbClust(dataLatih,
              min.nc=2, max.nc=15,
              method="kmeans")
## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 4 proposed 2 as the best number of clusters 
## * 15 proposed 3 as the best number of clusters 
## * 1 proposed 10 as the best number of clusters 
## * 1 proposed 12 as the best number of clusters 
## * 1 proposed 14 as the best number of clusters 
## * 1 proposed 15 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
## * According to the majority rule, the best number of clusters is  3 

## *******************************************************************

barplot(table(nc$Best.n[1,]),
        xlab="Jumlah Cluster",
        ylab="Jumlah Variabel",
        main="Jumlah Cluster berdasarkan 26 Variabel")
wss <- 0
for (i in 1:15){
  wss[i] <-
    sum(kmeans(dataLatih, centers=i)$withinss)
}
plot(1:15,
  wss,
  type="b",
  xlab="Jumlah Cluster",
  ylab="WSS")
Dari 2 metode diatas,ditetapkan 3 sebagai nilai Jumlah cluster yang akan dibentuk.
Kemudian membentuk cluster menggunakan kmeans

modelCluster <- kmeans(dataLatih,3)
kemudian divisualisasi menggunakan metode diskriminan untuk mereduksi dimensi, (menggambarkan banyak variabel menjadi 2 dimensi)
plotcluster(dataLatih, modelCluster$cluster)
Terakhir, mengevaluasi model dengan menggunakan tabel, dengan mencocokan variabel type pada dataset wine dengan hasil cluster yang terbentuk
 table(wine$Type,modelCluster$cluster)

##      1  2  3
##   1 59  0  0
##   2  3  3 65
##   3  0 48  0

 Dari Hasil didapatkan dengan model clustering yang digunakan, kemudian dibandingkan dengan data asli maka ditemukan terdapat 6 unit yang missclassfied. Sehingga didapat akurasi kebaikan hasil cluster dibandingkan dengan data asli maka:
 (178-6)/178
 ## [1] 0.9662921


Thank You,
Have Fun.
Tag : DataMining, R
10 Komentar untuk "Simpel Regresi Linier, Logistik dan K-Means Clustering Menggunakan R"

Stress when used as a term in vocabulary does not seem dangerous but when we take into account the long term consequences of ‘stress’ you can see it takes a toll on your mental health and overall well-being. Gurgaon escorts have done a great deal in not only entertaining guests who come to this city but also in taking the stress factor out of their daily life. Escorts in Gurgaon

You are only required to let me recognize about your requirements, place, date and time. I shall assemble each and every other thing for you and would make confident you have a great and acceptable time with me.
Delhi Escorts

Hi Friends, I am Riya  Sharma, an Independent Escort in Bangalore. Call me on OOOOOOOOOOO and get instant reserving for VIP Escort services in Bangalore
Escorts Bangalore
Koramangala Call Girls
Kr Puram Call Girls
Call girls BTM Layout Bangalore
Call girls Electronic City 

Welcome to VIP hifi Model Escorts in Hebbal agency, we are the top charted agency for gentlemen elite hilarity services to provide industry top ranking model Escorts in Hebbal.
Call Girls Hebbal
Call Girls Hsr Layout
Call girls Indiranagar
Call girls JP Nagar
Call girls Malleswaram

They will make you annoyed to have more and more of the enjoyment that you seem for in your lone times.
Call girls Marathahalli
Call girls MG Road
Call girls Mysore Road
Call girls Rajajinagar
Call girls Whitefield

Calangute escorts agency offer service in Calangute, Colva and Condolim our hottest decision women for enjoyable and offer full satisfaction with final freelance feminine escorts in Calangute
Goa Escorts
Escorts Calangute Beach
Escorts Candolim Beach
Escorts Colva Beach

There is no one without dream of spending mind blowing time with a very hot escort girl in the town. There are hundreds of obstacles are previously standing in front of this dream and those factors made it an uneasy aim
Escorts Panaji
Escorts Margao
Escorts Vasco Da Gama

Ahmedabad is a standout between the most international cities in the India and explorers from around the world visit for business, delight and for shopping. Ahmedabad is only the place to welcome
Escorts Ahmedabad

Our extreme Companions are obvious for their longing to make this their own specific chosen calling, their  require
Escorts Pune
Escorts Hinjewadi

you can hire self-governing Escorts in Bangalore. These astonishing beauty know what your body stress. Their plea,loveliness, category, and approach are literally without equal.Independent Escorts Bangalore | Escorts Bangalore | Bangalore Escorts | Female Escorts Bangalore

Silahkan tinggalkan komentar, kritik, maupun saran dari sobat blogger tentang apa yang sobat rasakan setelah mengunjungi blog ini.

Back To Top