Unsupervised Learning: Clustering in R Programming
Introduction
Unsupervised learning is a type of machine learning where the model is trained on data that has no labels. The goal is to find hidden patterns or intrinsic structures in the input data. Clustering is one of the most common techniques in unsupervised learning, where the data points are grouped based on similarity. In this tutorial, we will cover two popular clustering methods: K-means clustering and hierarchical clustering.
1. K-means Clustering
K-means clustering is a method that partitions data into k clusters, where each data point belongs to the cluster with the nearest mean. The algorithm works iteratively, assigning each point to a cluster and adjusting the centroids until convergence.
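Before calling any packaged function, it can help to see those two steps spelled out. Below is a minimal sketch of the assign-and-update loop on toy data; the names x, centers, and assignment are ours, purely for illustration, and in practice you would simply call kmeans() as in the example that follows.

# Toy data: 10 points in 2 dimensions, k = 2
set.seed(42)
x <- matrix(rnorm(20), ncol = 2)
centers <- x[sample(nrow(x), 2), ]  # initial centroids drawn from the data

for (i in 1:10) {  # a handful of iterations suffices on toy data
  # Assignment step: each point joins the cluster of its nearest centroid
  d <- as.matrix(dist(rbind(centers, x)))[-(1:2), 1:2]
  assignment <- apply(d, 1, which.min)
  # Update step: each centroid moves to the mean of its assigned points
  centers <- apply(x, 2, function(col) tapply(col, assignment, mean))
}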
Step-by-Step Example of K-means Clustering:
We will use the iris dataset to apply K-means clustering. The goal is to group the flowers into clusters based on their features.
# Load the iris dataset
data(iris)

# Apply K-means clustering with 3 clusters
set.seed(123)  # Set seed for reproducibility
kmeans_result <- kmeans(iris[, 1:4], centers = 3)

# View the cluster centers
kmeans_result$centers

# View the cluster assignments
kmeans_result$cluster

# Add the cluster assignments to the iris dataset
iris$cluster <- as.factor(kmeans_result$cluster)

# Visualize the clusters
plot(iris$Sepal.Length, iris$Sepal.Width,
     col = iris$cluster, pch = 16,
     main = "K-means Clustering",
     xlab = "Sepal Length", ylab = "Sepal Width")
Explanation:
- We use the built-in iris dataset, which contains measurements of sepals and petals for different species of iris flowers.
- We apply the kmeans() function to perform K-means clustering, specifying centers = 3 to form 3 clusters.
- kmeans_result$centers gives the coordinates of the cluster centers, and kmeans_result$cluster provides the cluster assignments for each data point.
- We add the cluster assignments to the original dataset and visualize the clusters using a scatter plot with a different color for each cluster.
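Because the iris data also carries the true species labels, an optional sanity check (possible here only because this dataset happens to be labeled) is to cross-tabulate the cluster assignments against the species:

# Compare cluster assignments with the known species labels
table(iris$Species, iris$cluster)

With three clusters, setosa typically falls into a cluster of its own, while versicolor and virginica overlap to some degree.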
2. Hierarchical Clustering
Hierarchical clustering creates a hierarchy of clusters, starting with each data point as a separate cluster and progressively merging the closest clusters. It can be visualized using a dendrogram, which shows how clusters are merged.
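To see the merging order explicitly, a quick toy run makes the agglomerative process concrete; the merge and height components shown here are standard parts of the object hclust() returns.

# Five one-dimensional points: the closest pairs ({1, 2} and {6, 7}) merge first
y <- c(1, 2, 6, 7, 20)
h <- hclust(dist(y), method = "complete")
h$merge   # which clusters were joined at each step (negative = single point)
h$height  # the distance at which each merge happened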
Step-by-Step Example of Hierarchical Clustering:
We will also use the iris dataset to apply hierarchical clustering. The goal is to create a hierarchy of clusters based on the features of the flowers.
# Compute the distance matrix
dist_matrix <- dist(iris[, 1:4])

# Perform hierarchical clustering using the "complete" linkage method
hclust_result <- hclust(dist_matrix, method = "complete")

# Plot the dendrogram
plot(hclust_result, main = "Hierarchical Clustering",
     xlab = "Data Points", ylab = "Height")

# Cut the dendrogram to form 3 clusters
clusters_hierarchical <- cutree(hclust_result, k = 3)

# Add the cluster assignments to the iris dataset
iris$cluster_hierarchical <- as.factor(clusters_hierarchical)

# Visualize the clusters
plot(iris$Sepal.Length, iris$Sepal.Width,
     col = iris$cluster_hierarchical, pch = 16,
     main = "Hierarchical Clustering",
     xlab = "Sepal Length", ylab = "Sepal Width")
Explanation:
- We compute the distance matrix with the dist() function to measure the dissimilarity between data points.
- The hclust() function performs hierarchical clustering with the "complete" linkage method; at each step the two closest clusters are merged, where "complete" linkage measures the distance between two clusters as the largest pairwise distance between their points.
- We visualize the hierarchy as a dendrogram with the plot() function.
- We use cutree() to cut the dendrogram into 3 clusters and add the cluster assignments to the dataset.
- We then visualize the clusters with a scatter plot, similar to the K-means example, using a different color for each cluster.
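The choice of linkage method changes how the distance between clusters is measured, and therefore the shape of the hierarchy. As a small sketch, here is one way to compare the 3-cluster solution under "complete" linkage with the one produced by "average" linkage (the object name hclust_average is ours):

# Hierarchical clustering with an alternative linkage method
hclust_average <- hclust(dist_matrix, method = "average")

# Cross-tabulate the two 3-cluster solutions; cluster numbers are
# arbitrary labels, so read the table as showing how points regroup
table(complete = cutree(hclust_result, k = 3),
      average  = cutree(hclust_average, k = 3))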
Comparison of K-means and Hierarchical Clustering
Both K-means and hierarchical clustering are popular techniques, but they have different characteristics:
- K-means is efficient for large datasets and works well when the number of clusters is known beforehand. However, k must be specified in advance, and the results can be sensitive to the initial cluster centers (a common mitigation is shown in the sketch after this list).
- Hierarchical Clustering does not require the number of clusters to be fixed in advance; you choose it afterwards by cutting the dendrogram, which also shows the full hierarchy of merges. However, it can be computationally expensive for large datasets, since the distance matrix alone grows quadratically with the number of points.
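A common way to reduce the sensitivity to initial centers noted above is to run K-means from several random starting configurations and keep the best result; kmeans() supports this directly through its nstart argument (the object name kmeans_multi is ours):

# Run K-means from 25 random starts; kmeans() keeps the run with the
# lowest total within-cluster sum of squares
set.seed(123)
kmeans_multi <- kmeans(iris[, 1:4], centers = 3, nstart = 25)
kmeans_multi$tot.withinss  # objective value of the selected run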
Conclusion
In this tutorial, we explored two common clustering techniques in unsupervised learning: K-means and hierarchical clustering, using the iris dataset in R to demonstrate both. K-means clustering partitions data into a fixed number of clusters, while hierarchical clustering builds a hierarchy of clusters. Both methods are useful for discovering patterns and structure in unlabeled data.