I was reading about the BIRCH clustering algorithm which is used to cluster large data sets. Its quite popular in the database community. I will try to summarize the paper here.

The algorithm tries to address the problem of clustering large data sets, where the entire dataset cannot fit into memory. The popular clustering algorithms and some flaws with them in the context of large datasets are:

Probability based approaches: These assume that probability distributions on separate attributes are statistically independent of each other.

Distance based approaches: These assume that all data points are available at once and can be scanned frequently.

The primary problem is that algorithms dont consider the fact that the dataset can be too large to fit into the main memory. BIRCH is specially suited for large datasets.  BIRCH can typically find a good cluster with a single scan of the data and improve the cluster quality with additional scans. Its I/O is linear in the size of the data.

The keystones of BIRCH are the concepts of Clustering Features(CF vector) to describe a cluster and the CF Tree. A triplet is used to summarize each cluster as a CF vector, comprising of a) Number of points in cluster b) Linear sum of vectors in cluster c) Square sum of vectors in cluster. The CF vecor is efficient because it stores much less than all the datapoints in the cluster. The representation is optimal because it allows for all the calculations needed to make clustering decisions in BIRCH.

A CF Tree is a balanced tree used to store the clusters in a hierachial manner. The algorithm for insertion of points into the tree ensure that we end up with a good cluster at the end of a first scan of the data. I refer the reader to the paper for details of this data structure.

Clustering is now performed on the CF tree, which being a much reduced representation of the dataset, does fit into the main memory. There might be flaws in the clustering because of skews in the input order of the data. For example, the same data point, if inserted twice at different times might end up in differnt leaves of the tree. These anomalies are taken care of in the post processing steps.

An overview of the clustering is found in the figure below:

Advertisements