Unsupervised Learning: Clustering in R Programming
Introduction
Unsupervised learning is a type of machine learning where the model is trained on data that has no labels. The goal is to find hidden patterns or intrinsic structures in the input data. Clustering is one of the most common techniques in unsupervised learning, where the data points are grouped based on similarity. In this tutorial, we will cover two popular clustering methods: K-means clustering and hierarchical clustering.
1. K-means Clustering
K-means clustering is a method that partitions data into k clusters, where each data point belongs to the cluster with the nearest mean. The algorithm works iteratively, assigning each point to a cluster and adjusting the centroids until convergence.
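Before calling any packaged function, it can help to see those two steps spelled out. Below is a minimal sketch of the assign-and-update loop on toy data; the names x, centers, and assignment are ours, purely for illustration, and in practice you would simply call kmeans() as in the example that follows.

# Toy data: 10 points in 2 dimensions, k = 2
set.seed(42)
x <- matrix(rnorm(20), ncol = 2)
centers <- x[sample(nrow(x), 2), ]  # initial centroids drawn from the data

for (i in 1:10) {  # a handful of iterations suffices on toy data
  # Assignment step: each point joins the cluster of its nearest centroid
  d <- as.matrix(dist(rbind(centers, x)))[-(1:2), 1:2]
  assignment <- apply(d, 1, which.min)
  # Update step: each centroid moves to the mean of its assigned points
  centers <- apply(x, 2, function(col) tapply(col, assignment, mean))
}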
Step-by-Step Example of K-means Clustering:
We will use the iris dataset to apply K-means clustering. The goal is to group the flowers into clusters based on their features.
# Load the iris dataset
data(iris)

# Apply K-means clustering with 3 clusters
set.seed(123)  # Set seed for reproducibility
kmeans_result <- kmeans(iris[, 1:4], centers = 3)

# View the cluster centers
kmeans_result$centers

# View the cluster assignments
kmeans_result$cluster

# Add the cluster assignments to the iris dataset
iris$cluster <- as.factor(kmeans_result$cluster)

# Visualize the clusters
plot(iris$Sepal.Length, iris$Sepal.Width,
     col = iris$cluster, pch = 16,
     main = "K-means Clustering",
     xlab = "Sepal Length", ylab = "Sepal Width")
Explanation:
- We use the built-in iris dataset, which contains measurements of sepals and petals for different species of iris flowers.
- We apply the kmeans() function to perform K-means clustering, specifying centers = 3 to form 3 clusters.
- kmeans_result$centers gives the coordinates of the cluster centers, and kmeans_result$cluster provides the cluster assignments for each data point.
- We add the cluster assignments to the original dataset and visualize the clusters using a scatter plot with a different color for each cluster.
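Because the iris data also carries the true species labels, an optional sanity check (possible here only because this dataset happens to be labeled) is to cross-tabulate the cluster assignments against the species:

# Compare cluster assignments with the known species labels
table(iris$Species, iris$cluster)

With three clusters, setosa typically falls into a cluster of its own, while versicolor and virginica overlap to some degree.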
2. Hierarchical Clustering
Hierarchical clustering creates a hierarchy of clusters, starting with each data point as a separate cluster and progressively merging the closest clusters. It can be visualized using a dendrogram, which shows how clusters are merged.
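To see the merging order explicitly, a quick toy run makes the agglomerative process concrete; the merge and height components shown here are standard parts of the object hclust() returns.

# Five one-dimensional points: the closest pairs ({1, 2} and {6, 7}) merge first
y <- c(1, 2, 6, 7, 20)
h <- hclust(dist(y), method = "complete")
h$merge   # which clusters were joined at each step (negative = single point)
h$height  # the distance at which each merge happened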
Step-by-Step Example of Hierarchical Clustering:
We will also use the iris dataset to apply hierarchical clustering. The goal is to create a hierarchy of clusters based on the features of the flowers.
# Compute the distance matrix
dist_matrix <- dist(iris[, 1:4])

# Perform hierarchical clustering using the "complete" linkage method
hclust_result <- hclust(dist_matrix, method = "complete")

# Plot the dendrogram
plot(hclust_result, main = "Hierarchical Clustering",
     xlab = "Data Points", ylab = "Height")

# Cut the dendrogram to form 3 clusters
clusters_hierarchical <- cutree(hclust_result, k = 3)

# Add the cluster assignments to the iris dataset
iris$cluster_hierarchical <- as.factor(clusters_hierarchical)

# Visualize the clusters
plot(iris$Sepal.Length, iris$Sepal.Width,
     col = iris$cluster_hierarchical, pch = 16,
     main = "Hierarchical Clustering",
     xlab = "Sepal Length", ylab = "Sepal Width")
Explanation:
- We compute the distance matrix with the dist() function to measure the dissimilarity between data points.
- The hclust() function performs hierarchical clustering with the "complete" linkage method; at each step the two closest clusters are merged, where "complete" linkage measures the distance between two clusters as the largest pairwise distance between their points.
- We visualize the hierarchy as a dendrogram with the plot() function.
- We use cutree() to cut the dendrogram into 3 clusters and add the cluster assignments to the dataset.
- We then visualize the clusters with a scatter plot, similar to the K-means example, using a different color for each cluster.
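The choice of linkage method changes how the distance between clusters is measured, and therefore the shape of the hierarchy. As a small sketch, here is one way to compare the 3-cluster solution under "complete" linkage with the one produced by "average" linkage (the object name hclust_average is ours):

# Hierarchical clustering with an alternative linkage method
hclust_average <- hclust(dist_matrix, method = "average")

# Cross-tabulate the two 3-cluster solutions; cluster numbers are
# arbitrary labels, so read the table as showing how points regroup
table(complete = cutree(hclust_result, k = 3),
      average  = cutree(hclust_average, k = 3))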
Comparison of K-means and Hierarchical Clustering
Both K-means and hierarchical clustering are popular techniques, but they have different characteristics:
- K-means is efficient for large datasets and works well when the number of clusters is known beforehand. However, k must be specified in advance, and the results can be sensitive to the initial cluster centers (a common mitigation is shown in the sketch after this list).
- Hierarchical Clustering does not require the number of clusters to be fixed in advance; you choose it afterwards by cutting the dendrogram, which also shows the full hierarchy of merges. However, it can be computationally expensive for large datasets, since the distance matrix alone grows quadratically with the number of points.
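A common way to reduce the sensitivity to initial centers noted above is to run K-means from several random starting configurations and keep the best result; kmeans() supports this directly through its nstart argument (the object name kmeans_multi is ours):

# Run K-means from 25 random starts; kmeans() keeps the run with the
# lowest total within-cluster sum of squares
set.seed(123)
kmeans_multi <- kmeans(iris[, 1:4], centers = 3, nstart = 25)
kmeans_multi$tot.withinss  # objective value of the selected run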
Conclusion
In this tutorial, we explored two common clustering techniques in unsupervised learning: K-means and hierarchical clustering, using the iris dataset in R to demonstrate both. K-means clustering partitions data into a fixed number of clusters, while hierarchical clustering builds a hierarchy of clusters. Both methods are useful for discovering patterns and structure in unlabeled data.