Supervised Learning: Classification and Regression in R Programming
Introduction
Supervised learning is a type of machine learning where the model is trained on labeled data. It involves using input-output pairs to learn the relationship between features and outcomes. In this tutorial, we will cover both classification and regression in R, with examples of logistic regression, decision trees, random forests, linear regression, and polynomial regression.
1. Classification
Classification is the task of predicting a categorical label based on input features. We will cover three popular classification techniques: logistic regression, decision trees, and random forests.
Step-by-Step Example of Logistic Regression:
In logistic regression, the output is a probability that the input belongs to a particular class. We will predict whether a person has a disease based on features like age and blood pressure.
# Sample data for logistic regression
age <- c(25, 30, 35, 40, 45, 50, 55, 60)
blood_pressure <- c(120, 130, 125, 140, 135, 145, 150, 155)
disease <- c(0, 0, 0, 1, 1, 1, 1, 1)

# Create a data frame
data <- data.frame(age, blood_pressure, disease)

# Build a logistic regression model
# (this tiny data set is perfectly separable by age, so glm() may warn that
#  fitted probabilities numerically 0 or 1 occurred; that is fine for a demo)
model_logistic <- glm(disease ~ age + blood_pressure, data = data, family = binomial)

# Summary of the model
summary(model_logistic)

# Prediction on new data
new_data <- data.frame(age = c(28, 38), blood_pressure = c(130, 140))
predictions <- predict(model_logistic, new_data, type = "response")
predictions
Explanation: We use the glm() function to build the logistic regression model, specifying the family = binomial argument for binary classification. The predict() function with type = "response" returns, for each new observation, the probability that the class is 1 (disease present).
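Because these are probabilities rather than class labels, a common follow-up step is to apply a decision threshold. Here is a minimal sketch; the 0.5 cutoff is a conventional choice, not something glm() imposes:

# Convert predicted probabilities to 0/1 class labels using a 0.5 cutoff
predicted_class <- ifelse(predictions > 0.5, 1, 0)
predicted_class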
Step-by-Step Example of Decision Trees:
Decision trees split the data into smaller subsets based on the input features to classify data points into categories. We will use the rpart package to build a decision tree.
# Install the rpart package (if not already installed)
# install.packages("rpart")

# Load the rpart package
library(rpart)

# Build a decision tree model
# (with only 8 rows, rpart's default minsplit of 20 would leave the tree as a
#  single root node and plot() would fail, so we relax it for this demo)
model_tree <- rpart(disease ~ age + blood_pressure, data = data,
                    method = "class",
                    control = rpart.control(minsplit = 2))

# Plot the decision tree
plot(model_tree)
text(model_tree, use.n = TRUE)

# Prediction on new data
predictions_tree <- predict(model_tree, new_data, type = "class")
predictions_tree
Explanation: We use the rpart() function to build the decision tree, specifying method = "class" for classification. Because this sample data set is far smaller than rpart's defaults expect, we also lower minsplit so the tree can actually split; otherwise the model would be a single root node. The plot() and text() functions visualize the tree, and the predict() function is used to classify new data.
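Beyond hard class labels, an rpart classification tree can also report the estimated class probabilities at the leaf each observation falls into; requesting type = "prob" returns one column per class:

# Class probabilities instead of hard labels
prob_tree <- predict(model_tree, new_data, type = "prob")
prob_tree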
Step-by-Step Example of Random Forests:
Random forests build multiple decision trees and aggregate their predictions. We will use the randomForest package to create a random forest model.
# Install the randomForest package (if not already installed)
# install.packages("randomForest")

# Load the randomForest package
library(randomForest)

# Convert the outcome to a factor so randomForest() performs classification
# (a numeric 0/1 outcome would be modeled as a regression forest instead)
data$disease <- as.factor(data$disease)

# Build a random forest model
model_rf <- randomForest(disease ~ age + blood_pressure, data = data)

# Print the model (shows the out-of-bag error estimate and confusion matrix)
print(model_rf)

# Prediction on new data
predictions_rf <- predict(model_rf, new_data)
predictions_rf
Explanation: The randomForest() function builds the random forest model. Note that the outcome must be a factor for randomForest() to perform classification; left as a numeric 0/1 variable, it would fit a regression forest instead. Printing the model reports the out-of-bag (OOB) error estimate, and the predict() function makes predictions on new data by aggregating the votes of the individual trees.
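A useful by-product of a random forest is a measure of how much each feature contributes to the model. The randomForest package exposes this through importance() and varImpPlot():

# Mean decrease in Gini impurity for each predictor
importance(model_rf)

# Plot the variable importance scores
varImpPlot(model_rf)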
2. Regression
Regression is the task of predicting a continuous output variable. We will cover two types of regression: linear regression and polynomial regression.
Step-by-Step Example of Linear Regression:
In linear regression, the output is a continuous variable that is predicted as a linear combination of input features. We will predict a person’s income based on their years of experience.
# Sample data for linear regression
years_experience <- c(1, 2, 3, 4, 5, 6, 7, 8)
income <- c(40000, 45000, 50000, 55000, 60000, 65000, 70000, 75000)

# Create a data frame
data_reg <- data.frame(years_experience, income)

# Build a linear regression model
model_linear <- lm(income ~ years_experience, data = data_reg)

# Summary of the model
summary(model_linear)

# Prediction on new data
new_data_reg <- data.frame(years_experience = c(9, 10))
predictions_linear <- predict(model_linear, new_data_reg)
predictions_linear
Explanation: We use the lm() function to build the linear regression model. The summary() function gives the details of the model, including coefficients and p-values (because this toy data lies exactly on a line, R may warn about an "essentially perfect fit"). The predict() function is used to make predictions on new data.
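Point predictions from an lm() model can also be accompanied by uncertainty estimates: passing interval = "prediction" (or interval = "confidence") to predict() returns lower and upper bounds alongside the fitted values:

# 95% prediction intervals for the new observations
predict(model_linear, new_data_reg, interval = "prediction", level = 0.95)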
Step-by-Step Example of Polynomial Regression:
Polynomial regression is an extension of linear regression that can model non-linear relationships by including polynomial terms of the input features.
# Build a polynomial regression model (degree = 2)
model_poly <- lm(income ~ poly(years_experience, 2), data = data_reg)

# Summary of the model
summary(model_poly)

# Prediction on new data
predictions_poly <- predict(model_poly, new_data_reg)
predictions_poly
Explanation: The poly() function adds polynomial terms to the regression model; by default it uses orthogonal polynomials, which keep the terms numerically well-behaved. Here we specify degree 2 to include the square of the input variable, allowing the model to fit a non-linear relationship between years of experience and income.
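Because the linear model is nested inside the polynomial one, anova() can test whether the quadratic term improves the fit. Keep in mind that this toy data is exactly linear, so the residuals are essentially zero and the comparison is degenerate here; on real data, the F-test indicates whether the extra term is worthwhile:

# Compare the nested models with an F-test
anova(model_linear, model_poly)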
Conclusion
In this tutorial, we covered both classification and regression techniques in supervised learning, using R programming. We explored logistic regression, decision trees, and random forests for classification, and linear and polynomial regression for predicting continuous outcomes. These techniques are widely used in machine learning to build models that can make predictions based on labeled data.