Fourth Blog Post

Li Wang 2022-07-20

rmarkdown::render("../_Rmd/2022-07-20-blog-post-Module11.Rmd",
                  output_format = "github_document",
                  output_dir = "../_posts",
                  output_options = list(html_preview = FALSE))

Fourth Blog Post - Machine Learning

We just finished our section on machine learning and covered a lot of methods. I think KNN is the most interesting!

KNN stands for “K-Nearest Neighbours”. It is a supervised machine learning algorithm that can be used for both classification and regression problems. ‘K’ denotes the number of nearest neighbours consulted when predicting a new, unseen observation: the prediction is the majority class among those neighbours (classification) or the average of their values (regression).
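To make the idea concrete, here is a minimal from-scratch sketch of KNN classification in base R. The helper `knn_predict()` and the example measurements are hypothetical, written only for illustration; this is not how caret implements the method.

```r
# Illustrative kNN classifier (hypothetical helper, not part of caret)
knn_predict <- function(train_x, train_y, new_x, k = 3) {
  train_x <- as.matrix(train_x)
  new_x <- as.numeric(new_x)
  # Euclidean distance from the new point to every training row
  dists <- sqrt(rowSums(sweep(train_x, 2, new_x)^2))
  # Labels of the k closest training rows
  nearest <- train_y[order(dists)[seq_len(k)]]
  # Majority vote; on a tie, the first level in table() order wins
  names(which.max(table(nearest)))
}

data(iris)
new_flower <- c(5.0, 3.4, 1.5, 0.2)  # made-up measurements for illustration
pred <- knn_predict(iris[, 1:4], iris$Species, new_flower, k = 5)
print(pred)
```

For regression, the vote would be replaced by the mean of the neighbours' response values.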

For example:

1. Load the libraries

library(caret)
library(gbm)
library(foreach)
library(magrittr)
library(plyr)

2. Load the ‘iris’ data

data(iris)

3. Split the data: 70% train, 30% test.

set.seed(111)
split <- createDataPartition(y = iris$Species, p = 0.7, list = FALSE)
train <- iris[split, ]
test <- iris[-split, ]
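A quick sanity check on the split can be sketched as follows. It repeats the split above so that it runs on its own, then looks at the row counts and the class balance. `createDataPartition()` samples within each level of `Species`, so the three classes should stay balanced in both sets.

```r
library(caret)

data(iris)
set.seed(111)
split <- createDataPartition(y = iris$Species, p = 0.7, list = FALSE)
train <- iris[split, ]
test <- iris[-split, ]

# Row counts of the two sets
print(c(train = nrow(train), test = nrow(test)))

# Class balance within each set
print(table(train$Species))
print(table(test$Species))
```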

4. Train the kNN model. Use repeated 10-fold cross-validation with 3 repeats, and preprocess the data by centering and scaling. Lastly, set tuneGrid so that values of k from 1 to 20 are considered.

kNNFit <- train(Species ~ ., data = train,
                method = "knn",
                trControl = trainControl(method = "repeatedcv", number = 10, repeats = 3),
                preProcess = c("center", "scale"),
                tuneGrid = data.frame(k = 1:20))
kNNFit

## k-Nearest Neighbors 
## 
## 105 samples
##   4 predictor
##   3 classes: 'setosa', 'versicolor', 'virginica' 
## 
## Pre-processing: centered (4), scaled (4) 
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 93, 95, 95, 94, 95, 95, ... 
## Resampling results across tuning parameters:
## 
##   k   Accuracy   Kappa    
##    1  0.9581145  0.9368661
##    2  0.9496296  0.9237386
##    3  0.9456902  0.9178531
##    4  0.9550842  0.9320740
##    5  0.9712121  0.9562007
##    6  0.9614478  0.9414771
##    7  0.9715152  0.9567442
##    8  0.9671380  0.9502955
##    9  0.9748485  0.9618158
##   10  0.9684848  0.9522573
##   11  0.9687879  0.9527057
##   12  0.9654545  0.9477886
##   13  0.9718182  0.9573078
##   14  0.9654545  0.9479191
##   15  0.9650842  0.9474895
##   16  0.9590236  0.9383794
##   17  0.9556902  0.9332512
##   18  0.9526599  0.9288348
##   19  0.9432660  0.9148197
##   20  0.9317003  0.8975602
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 9.

The cross-validated accuracy is highest at k = 9 (about 0.975), so caret selects k = 9 for the final model.
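Centering and scaling matter for KNN because the method relies on distances, so a predictor measured on a larger scale would otherwise dominate. The sketch below shows the idea in base R: the means and standard deviations are estimated on the training rows only and then applied to both sets. It uses a simple random split and the names `idx`, `mu` and `sigma` purely for illustration; it is not the code caret runs internally.

```r
data(iris)
set.seed(111)
idx <- sample(nrow(iris), 105)  # simple random split, for illustration only
train_x <- iris[idx, 1:4]
test_x <- iris[-idx, 1:4]

# Estimate centering and scaling constants on the training data only
mu <- colMeans(train_x)
sigma <- apply(train_x, 2, sd)

# Apply the same constants to both sets
train_scaled <- scale(train_x, center = mu, scale = sigma)
test_scaled <- scale(test_x, center = mu, scale = sigma)

# Training columns now have mean 0 and standard deviation 1
print(round(colMeans(train_scaled), 10))
print(apply(train_scaled, 2, sd))
```

The test columns will be close to, but not exactly, mean 0 and standard deviation 1, since their constants come from the training rows.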

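5. Check how well the model does using the confusionMatrix() function. Note that when confusionMatrix() is called on a train object, as it is here, caret reports the confusion matrix averaged over the cross-validation resamples rather than one computed on the test set.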

confusionMatrix(kNNFit, newdata = test)

## Cross-Validated (10 fold, repeated 3 times) Confusion Matrix 
## 
## (entries are percentual average cell counts across resamples)
##  
##             Reference
## Prediction   setosa versicolor virginica
##   setosa       33.3        0.0       0.0
##   versicolor    0.0       31.7       1.0
##   virginica     0.0        1.6      32.4
##                             
##  Accuracy (average) : 0.9746

The average cross-validated accuracy is 0.9746.
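Because confusionMatrix() applied to a train object summarises the cross-validation resamples, a held-out evaluation needs predictions on the test rows first. The following is a sketch of one way to do that, repeating the split and the fit so that it runs on its own. The names `test_pred` and `test_cm` are introduced here for illustration, and no output is shown because the figures were not part of the original post.

```r
library(caret)

data(iris)
set.seed(111)
split <- createDataPartition(y = iris$Species, p = 0.7, list = FALSE)
train <- iris[split, ]
test <- iris[-split, ]

# Same model specification as in step 4
kNNFit <- train(Species ~ ., data = train,
                method = "knn",
                trControl = trainControl(method = "repeatedcv", number = 10, repeats = 3),
                preProcess = c("center", "scale"),
                tuneGrid = data.frame(k = 1:20))

# Predict the held-out rows, then compare predictions with the true labels
test_pred <- predict(kNNFit, newdata = test)
test_cm <- confusionMatrix(data = test_pred, reference = test$Species)

print(test_cm$table)
print(test_cm$overall["Accuracy"])
```

The preprocessing learned during training is applied to the test rows automatically by predict(), so the test data should not be scaled by hand first.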