4. Day 3: Machine Learning and Data Mining in Bioinformatics

4.1. Overview

4.1.1. Lead Instructor

4.1.2. Co-Instructor(s)

4.1.3. Helper(s)

4.1.4. General Topics

  • Introduction to Machine Learning and Data Mining
    • Machine Learning basic concepts
    • Taxonomy of ML algorithms
      • Supervised classification
      • Unsupervised classification
      • Semi-supervised classification
    • Clustering validation and cross validation
  • Applications of ML in Bioinformatics
    • Examples of different ML/DM techniques that can be applied to different data analysis pipelines (DNASeq, RNASeq, Metagenomics)
    • How to choose the right ML technique?
  • Deep learning quick overview
  • Exploring some examples with R ML packages

4.2. Schedule

  • 09:00 - 10:00 Introduction to DM and ML, Machine Learning basic concepts

  • 10:00 - 11:00 Taxonomy of ML and examples of methods

    • Supervised classification
    • Unsupervised classification
    • Semi-supervised classification
  • 11:00 - 11:30 Coffee break

  • 11:30 - 12:30 Applications of ML in Bioinformatics

    • Examples of different ML/DM techniques that can be applied to different NGS data analysis pipelines
    • How to choose the right ML technique?
    • Deep learning overview
  • 12:30 - 14:00 Lunch break

  • 14:00 - 15:00 We will start practicing with the built-in R iris data set

  • 15:00 - 16:00 RNASeq analysis using clustering in R

  • 16:00 - 16:15 Coffee break

  • 16:15 - 18:00 RNASeq analysis in R (continued)

4.3. Learning Objectives

  • Learn Machine Learning basic concepts and jargon
  • Understand the taxonomy of Machine Learning algorithms and the differences between the basic algorithm categories
  • Get familiar with the basic Machine Learning algorithms in supervised and unsupervised learning categories
  • Understand different parameters to take into consideration to choose the right Machine Learning technique for a given problem
  • Understand how to evaluate Machine Learning results in supervised and unsupervised classification
  • Learn about some applications of Machine Learning in Bioinformatics
  • Explore and apply some basic R packages to perform supervised and unsupervised classification

4.4. Introduction to Machine Learning

The slides of the talk are available here.

4.5. Let’s start doing some clustering on the IRIS dataset

To list all of R’s built-in data sets, you can use the command:

data() # displays all the available built-in datasets
iris # displays the iris dataset

Preprocess the data: exclude the “Species” column at position 5, then standardize the remaining variables.

F_iris <- iris[,-5] # Remove column 5
F_iris <- scale(F_iris) # Standardize: Scale is a generic function whose default method centers and/or scales the columns of a numeric matrix.
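
As a quick sanity check, after scale() every column should have a mean of (numerically) zero and a standard deviation of 1:

round(colMeans(F_iris), 10) # means should all be (numerically) zero
apply(F_iris, 2, sd) # standard deviations should all be 1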

Let’s cluster the data using the function eclust() (enhanced clustering, from the factoextra package). You will need to install the factoextra package first.

factoextra is distributed on CRAN, so it can be installed directly with install.packages(); note that the old Bioconductor biocLite.R installation script has been retired in favour of BiocManager. More details here

install.packages("factoextra")

Load factoextra

library(factoextra)

For more info about the eclust() function, check this or type

help(eclust)
eclust(x, FUNcluster = "kmeans", hc_metric = "euclidean", ...)

Now let’s cluster the data with k-means and then with hierarchical clustering:

eclust(F_iris, FUNcluster = "kmeans", hc_metric = "euclidean") # number of clusters not specified; it will be estimated automatically
# K-means clustering with the number of clusters set to 3
km.res <- eclust(F_iris, "kmeans", k = 3, nstart = 25, graph = FALSE)
# Visualize the k-means clusters
fviz_cluster(km.res, geom = "point", ellipse.type = "norm", palette = "jco", ggtheme = theme_minimal())
# Hierarchical clustering with 3 clusters, Euclidean distance and Ward linkage
hc.res <- eclust(F_iris, "hclust", k = 3, hc_metric = "euclidean", hc_method = "ward.D2", graph = FALSE)
# Visualize the dendrogram
fviz_dend(hc.res, show_labels = FALSE, palette = "jco", as.ggplot = TRUE)

Let’s check the quality of the clusters. Remember, we talked this morning about internal clustering validation and the use of coefficients and indexes to measure cluster separability and intra-cluster homogeneity. We will compute the silhouette coefficient for the resulting k-means clustering:

fviz_silhouette(km.res, palette = "jco", ggtheme = theme_classic())

  • A value close to 1 indicates good quality clustering (the element is assigned to the right cluster)
  • A value close to -1 indicates poor quality clustering (the element would fit better in another cluster)
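
eclust() stores this silhouette information in the result object, so you can also inspect individual observations; a minimal sketch using the km.res object from above:

# per-observation silhouette widths computed by eclust()
sil <- km.res$silinfo$widths
head(sil)
# observations with a negative silhouette width (likely assigned to the wrong cluster)
sil[sil[, "sil_width"] < 0, ]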

Now try to vary the number of clusters, run k-means, and compute the silhouette coefficient each time, as in the sketch below. What do you notice?
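
A minimal sketch of this exercise (assuming the F_iris matrix from above; eclust() computes the silhouette information automatically):

# compare the average silhouette width for several numbers of clusters
for (k in 2:5) {
  km <- eclust(F_iris, "kmeans", k = k, nstart = 25, graph = FALSE)
  cat("k =", k, "-> average silhouette width:", km$silinfo$avg.width, "\n")
}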

Many other coefficients can be computed to assess clustering quality; some are available through the cluster.stats() function of the fpc package. You might need to install the fpc package (available from CRAN) first.

install.packages("fpc")
library(fpc)
kmIris_stats <- cluster.stats(dist(F_iris), km.res$cluster)
names(kmIris_stats) # list all the indices that were computed
kmIris_stats$dunn # if you want to see only the Dunn index value

Remember, the iris dataset is labelled and we removed the labels before performing clustering. You can cross-tabulate the k-means clusters against the reference Species labels using table().

table(iris$Species, km.res$cluster)

The fviz_nbclust() function of the factoextra package gives an estimate of the optimal number of clusters. Let’s estimate the number of clusters for the iris dataset based on k-means clustering:

fviz_nbclust(F_iris, kmeans,method = "gap_stat")
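
fviz_nbclust() also supports other selection criteria; for comparison, you could try the average silhouette and the elbow (within-cluster sum of squares) methods:

fviz_nbclust(F_iris, kmeans, method = "silhouette") # average silhouette width
fviz_nbclust(F_iris, kmeans, method = "wss") # elbow method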

Take a few minutes to check the different functions of the cluster package.
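
For example, pam() (Partitioning Around Medoids, a k-medoids method that is more robust to outliers than k-means) is one of them; a minimal sketch:

library(cluster)
# k-medoids clustering with 3 clusters
pam.res <- pam(F_iris, k = 3)
# cross-tabulate the medoid clusters against the known species
table(iris$Species, pam.res$clustering)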

4.6. Let’s try some supervised classification on the IRIS dataset

Remember, the iris dataset is a labelled one (the classes are known), but we removed the labels in the previous exercise to perform clustering.
First, we will need to install the caret package:

install.packages("caret", dependencies=c("Depends", "Suggests")) #installation of the caret packages and all packages that you might need
library(caret)

Let’s load the dataset from a CSV file (useful when working with datasets that are not built into R). First download the file, then read it into R:

# In the shell (not in R): download iris.data into your current working directory
wget https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data
# Back in R: either give read.csv() the path to the file or copy it to your R working directory
getwd() # check the current working directory
setwd(path) # change the working directory if needed (path is a placeholder)
irisfile <- "iris.data"
# load the CSV file from the local directory (stringsAsFactors ensures Species is read as a factor in R >= 4.0)
irisData <- read.csv(irisfile, header=FALSE, stringsAsFactors=TRUE)
# set the column names in the dataset
colnames(irisData) <- c("Sepal.Length","Sepal.Width","Petal.Length","Petal.Width","Species")

All of the iris data is labelled. To allow testing the model, let’s create a validation (test) subset: we will split the loaded dataset in two, using 80% to train our models and holding back 20% as a validation dataset.

# create a list of 80% of the rows in the original dataset we can use for training
validation_index <- createDataPartition(irisData$Species, p=0.80, list=FALSE)
# select 20% of the data for validation
validation <- irisData[-validation_index,]
# use the remaining 80% of the data for training and testing the models
irisData <- irisData[validation_index,]

You now have training data in the irisData variable and, in the validation variable, a validation set that we will use later.
Let’s have a look at the data before we start building models.

# you can check the dimensions of dataset
dim(irisData)
# list types for each attribute
sapply(irisData, class) # all the variables are numeric and the class value is a factor that has multiple class labels or levels
# list the levels for the class
levels(irisData$Species)
head(irisData) # check the first six rows of the dataset (head() shows 6 by default)
tail(irisData) # check the last six rows of the dataset
# summarize the class distribution
percentage <- prop.table(table(irisData$Species)) * 100
cbind(freq=table(irisData$Species), percentage=percentage)
# summarize attribute distributions
summary(irisData)

4.7. Build models

Let’s evaluate 5 different algorithms:

  • Linear Discriminant Analysis (LDA)
  • Classification and Regression Trees (CART)
  • k-Nearest Neighbors (kNN)
  • Support Vector Machines (SVM) with a radial kernel
  • Random Forest (RF)

This is a good mixture of simple linear (LDA), nonlinear (CART, kNN), and complex nonlinear methods (SVM, RF). We reset the random number seed before each run to ensure that the evaluation of each algorithm is performed using exactly the same data splits, which makes the results directly comparable.

Let’s build the five models on the training set.
The train() function of the caret package has many parameters that must be set to perform the training and build a model; the relevant parameters vary from one algorithm to another. Check this link for more details about the different parameters that can be used.
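
The code below references a control object and a metric string that are not defined elsewhere in this section; a reasonable setup (assuming 10-fold cross-validation with Accuracy as the metric) would be:

# assumed setup: 10-fold cross-validation, evaluated with Accuracy
control <- trainControl(method="cv", number=10)
metric <- "Accuracy"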

# a) linear algorithms
set.seed(7)
fit.lda <- train(Species~., data=irisData, method="lda", metric=metric, trControl=control)
# b) nonlinear algorithms
# CART
set.seed(7)
fit.cart <- train(Species~., data=irisData, method="rpart", metric=metric, trControl=control)
# kNN
set.seed(7)
fit.knn <- train(Species~., data=irisData, method="knn", metric=metric, trControl=control)
# For the kNN training, you can define a grid with the expand.grid() function and pass it via the tuneGrid parameter
# example: grid <- expand.grid(k=c(5,7,10,15,19,21))
# fit.knn <- train(Species~., data=irisData, method="knn", metric=metric, trControl=control, tuneGrid=grid)
# c) advanced algorithms
# SVM (radial kernel)
set.seed(7)
fit.svm <- train(Species~., data=irisData, method="svmRadial", metric=metric, trControl=control)
# Random Forest
set.seed(7)
fit.rf <- train(Species~., data=irisData, method="rf", metric=metric, trControl=control)

After building the 5 different models, it’s time to assess their accuracy:

results <- resamples(list(lda=fit.lda, cart=fit.cart, knn=fit.knn, svm=fit.svm, rf=fit.rf))
summary(results)
# compare accuracy of models by plotting results
dotplot(results)

Note that the Kappa statistic (or value) is a metric that compares the observed accuracy with the accuracy expected by random chance, as illustrated in the sketch below. As you can see from the plots, LDA was the best model in this run.
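
To make that definition concrete, here is a tiny illustration (with toy vectors invented for this sketch; caret computes Kappa for you) of how Kappa is derived from a confusion matrix:

# toy predictions and reference labels (hypothetical, for illustration only)
tab <- table(predicted = c("a","a","b","b","b"), actual = c("a","b","b","b","a"))
observed <- sum(diag(tab)) / sum(tab) # observed accuracy
expected <- sum(rowSums(tab) * colSums(tab)) / sum(tab)^2 # accuracy expected by chance
kappa <- (observed - expected) / (1 - expected)
kappa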

# summarize Best Model
print(fit.lda)

After training the different models on the training set, the held-out test set (validation) can be used for a second round of checks of the model’s accuracy:

# estimate skill of LDA on the validation dataset
predictions <- predict(fit.lda, validation)
confusionMatrix(predictions, validation$Species)

Very good accuracy is obtained on this validation set (which represents 20% of the original dataset).
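
As a further exercise, you could run the same check for all five models (a sketch reusing the objects built above):

# accuracy of every model on the held-out validation set
models <- list(lda=fit.lda, cart=fit.cart, knn=fit.knn, svm=fit.svm, rf=fit.rf)
sapply(models, function(m) {
  p <- predict(m, validation)
  confusionMatrix(p, validation$Species)$overall["Accuracy"]
})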