Supervised Learning: Classification and Regression in R Programming
Introduction
Supervised learning is a type of machine learning where the model is trained on labeled data. It involves using input-output pairs to learn the relationship between features and outcomes. In this tutorial, we will cover both classification and regression in R, with examples of logistic regression, decision trees, random forests, linear regression, and polynomial regression.
1. Classification
Classification is the task of predicting a categorical label based on input features. We will cover three popular classification techniques: logistic regression, decision trees, and random forests.
Step-by-Step Example of Logistic Regression:
In logistic regression, the output is a probability that the input belongs to a particular class. We will predict whether a person has a disease based on features like age and blood pressure.
# Sample data for logistic regression
age <- c(25, 30, 35, 40, 45, 50, 55, 60)
blood_pressure <- c(120, 130, 125, 140, 135, 145, 150, 155)
disease <- c(0, 0, 0, 1, 1, 1, 1, 1)

# Create a data frame
data <- data.frame(age, blood_pressure, disease)

# Build a logistic regression model
# (this tiny data set is perfectly separable by age, so glm() may warn that
#  fitted probabilities numerically 0 or 1 occurred; that is fine for a demo)
model_logistic <- glm(disease ~ age + blood_pressure, data = data, family = binomial)

# Summary of the model
summary(model_logistic)

# Prediction on new data
new_data <- data.frame(age = c(28, 38), blood_pressure = c(130, 140))
predictions <- predict(model_logistic, new_data, type = "response")
predictions
Explanation: We use the glm() function to build the logistic regression model, specifying the family = binomial argument for binary classification. The predict() function with type = "response" returns, for each new observation, the probability that the class is 1 (disease present).
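Because these are probabilities rather than class labels, a common follow-up step is to apply a decision threshold. Here is a minimal sketch; the 0.5 cutoff is a conventional choice, not something glm() imposes:

# Convert predicted probabilities to 0/1 class labels using a 0.5 cutoff
predicted_class <- ifelse(predictions > 0.5, 1, 0)
predicted_class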
Step-by-Step Example of Decision Trees:
Decision trees split the data into smaller subsets based on the input features to classify data points into categories. We will use the rpart package to build a decision tree.
# Install the rpart package (if not already installed)
# install.packages("rpart")

# Load the rpart package
library(rpart)

# Build a decision tree model
# (with only 8 rows, rpart's default minsplit of 20 would leave the tree as a
#  single root node and plot() would fail, so we relax it for this demo)
model_tree <- rpart(disease ~ age + blood_pressure, data = data,
                    method = "class",
                    control = rpart.control(minsplit = 2))

# Plot the decision tree
plot(model_tree)
text(model_tree, use.n = TRUE)

# Prediction on new data
predictions_tree <- predict(model_tree, new_data, type = "class")
predictions_tree
Explanation: We use the rpart() function to build the decision tree, specifying method = "class" for classification. Because this sample data set is far smaller than rpart's defaults expect, we also lower minsplit so the tree can actually split; otherwise the model would be a single root node. The plot() and text() functions visualize the tree, and the predict() function is used to classify new data.
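Beyond hard class labels, an rpart classification tree can also report the estimated class probabilities at the leaf each observation falls into; requesting type = "prob" returns one column per class:

# Class probabilities instead of hard labels
prob_tree <- predict(model_tree, new_data, type = "prob")
prob_tree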
Step-by-Step Example of Random Forests:
Random forests build multiple decision trees and aggregate their predictions. We will use the randomForest package to create a random forest model.
# Install the randomForest package (if not already installed)
# install.packages("randomForest")

# Load the randomForest package
library(randomForest)

# Convert the outcome to a factor so randomForest() performs classification
# (a numeric 0/1 outcome would be modeled as a regression forest instead)
data$disease <- as.factor(data$disease)

# Build a random forest model
model_rf <- randomForest(disease ~ age + blood_pressure, data = data)

# Print the model (shows the out-of-bag error estimate and confusion matrix)
print(model_rf)

# Prediction on new data
predictions_rf <- predict(model_rf, new_data)
predictions_rf
Explanation: The randomForest() function builds the random forest model. Note that the outcome must be a factor for randomForest() to perform classification; left as a numeric 0/1 variable, it would fit a regression forest instead. Printing the model reports the out-of-bag (OOB) error estimate, and the predict() function makes predictions on new data by aggregating the votes of the individual trees.
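A useful by-product of a random forest is a measure of how much each feature contributes to the model. The randomForest package exposes this through importance() and varImpPlot():

# Mean decrease in Gini impurity for each predictor
importance(model_rf)

# Plot the variable importance scores
varImpPlot(model_rf)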
2. Regression
Regression is the task of predicting a continuous output variable. We will cover two types of regression: linear regression and polynomial regression.
Step-by-Step Example of Linear Regression:
In linear regression, the output is a continuous variable that is predicted as a linear combination of input features. We will predict a person’s income based on their years of experience.
# Sample data for linear regression
years_experience <- c(1, 2, 3, 4, 5, 6, 7, 8)
income <- c(40000, 45000, 50000, 55000, 60000, 65000, 70000, 75000)

# Create a data frame
data_reg <- data.frame(years_experience, income)

# Build a linear regression model
model_linear <- lm(income ~ years_experience, data = data_reg)

# Summary of the model
summary(model_linear)

# Prediction on new data
new_data_reg <- data.frame(years_experience = c(9, 10))
predictions_linear <- predict(model_linear, new_data_reg)
predictions_linear
Explanation: We use the lm() function to build the linear regression model. The summary() function gives the details of the model, including coefficients and p-values (because this toy data lies exactly on a line, R may warn about an "essentially perfect fit"). The predict() function is used to make predictions on new data.
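Point predictions from an lm() model can also be accompanied by uncertainty estimates: passing interval = "prediction" (or interval = "confidence") to predict() returns lower and upper bounds alongside the fitted values:

# 95% prediction intervals for the new observations
predict(model_linear, new_data_reg, interval = "prediction", level = 0.95)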
Step-by-Step Example of Polynomial Regression:
Polynomial regression is an extension of linear regression that can model non-linear relationships by including polynomial terms of the input features.
# Build a polynomial regression model (degree = 2)
model_poly <- lm(income ~ poly(years_experience, 2), data = data_reg)

# Summary of the model
summary(model_poly)

# Prediction on new data
predictions_poly <- predict(model_poly, new_data_reg)
predictions_poly
Explanation: The poly() function adds polynomial terms to the regression model; by default it uses orthogonal polynomials, which keep the terms numerically well-behaved. Here we specify degree 2 to include the square of the input variable, allowing the model to fit a non-linear relationship between years of experience and income.
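Because the linear model is nested inside the polynomial one, anova() can test whether the quadratic term improves the fit. Keep in mind that this toy data is exactly linear, so the residuals are essentially zero and the comparison is degenerate here; on real data, the F-test indicates whether the extra term is worthwhile:

# Compare the nested models with an F-test
anova(model_linear, model_poly)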
Conclusion
In this tutorial, we covered both classification and regression techniques in supervised learning, using R programming. We explored logistic regression, decision trees, and random forests for classification, and linear and polynomial regression for predicting continuous outcomes. These techniques are widely used in machine learning to build models that can make predictions based on labeled data.