Clustering: the process of partitioning a data set into a set of meaningful subclasses (clusters). Every data item in a subclass shares a common trait. It helps a user understand the natural grouping or structure in a data set.
Clustering can be done using approaches such as:
Scatter/Gather: This approach is used for text clustering; the user scatters documents into clusters, gathers the contents of one or more clusters, and re-scatters them to form new clusters; In text clustering, documents are represented as vectors in which each entry corresponds to a weighted feature (for example, a term and its weight); The similarity between two documents is a measure of the word overlap between them.
Some applications: web snippet clustering, document retrieval, data mining.
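The vector representation and overlap-based similarity described above can be sketched as follows. This is a minimal illustration, not a specific system's method: using raw term counts as the feature weights and cosine similarity as the overlap measure are both assumptions, and the function names `to_vector` and `cosine_similarity` are hypothetical.

```python
import math
from collections import Counter


def to_vector(text):
    """Represent a document as a sparse vector: term -> weight.

    Assumption: the weight is the raw term count; a real system
    might use TF-IDF or another weighting scheme instead.
    """
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity of two sparse term vectors (0.0 if either is empty)."""
    # Only terms present in both documents contribute to the dot product,
    # so the score grows with word overlap.
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


doc1 = to_vector("the cat sat on the mat")
doc2 = to_vector("the cat lay on the rug")
doc3 = to_vector("stock prices fell sharply today")

print(cosine_similarity(doc1, doc2))  # high: several shared words
print(cosine_similarity(doc1, doc3))  # 0.0: no shared words
```

Documents that share many words score close to 1, and documents with no words in common score 0, which is the property a text clustering step relies on when grouping documents.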
K-Means Clustering: Here, K seeds are chosen to represent the centers of the K resulting clusters; Each document is assigned to the cluster with the most similar seed; It is an iterative process: once every document has been assigned to a cluster, new seeds are computed (typically as the centroid of the documents in each cluster); The assignment process is repeated with these new seeds until the assignments stop changing.
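The assign-and-recompute loop described above can be sketched as follows. This is a minimal sketch under stated assumptions: points are plain numeric vectors (for documents these would be the term-weight vectors), "most similar" is taken to mean smallest squared Euclidean distance, and the first K points are used as the initial seeds. Practical implementations usually choose seeds randomly or with a scheme such as k-means++.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]


def k_means(points, k, max_iterations=100):
    # Assumption: the first k points serve as the initial seeds.
    seeds = [list(p) for p in points[:k]]
    assignment = None
    for _ in range(max_iterations):
        # Assignment step: each point joins the cluster with the closest seed.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, seeds[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the process has converged
        assignment = new_assignment
        # Update step: recompute each seed as the centroid of its cluster.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous seed
                seeds[c] = centroid(members)
    return assignment, seeds


points = [[1.0, 1.0], [8.0, 8.0], [1.2, 0.8], [8.2, 7.9], [0.9, 1.1], [7.8, 8.1]]
labels, centers = k_means(points, k=2)
print(labels)   # cluster index for each point
print(centers)  # final seed (centroid) of each cluster
```

The result depends on the initial seeds, so running the procedure with different starting seeds can produce different clusters.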
Classification: a technique used to predict group membership for data instances. For example, you may wish to use classification to predict whether the weather on a particular day will be “sunny”, “rainy” or “cloudy”.
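As an illustration of predicting group membership, the sketch below classifies a day's weather with a one-nearest-neighbor rule. Everything specific here is an assumption made for the example: the choice of nearest-neighbor as the classifier, the two features (humidity and cloud cover, in percent), the function name `nearest_neighbor_predict`, and the training values, which are made-up toy data rather than real measurements.

```python
def nearest_neighbor_predict(training_data, features):
    """Return the label of the training example closest to `features`."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    _, label = min(
        training_data,
        key=lambda example: squared_distance(example[0], features),
    )
    return label


# Hypothetical toy data: (humidity %, cloud cover %) -> known label.
training_data = [
    ((30.0, 10.0), "sunny"),
    ((35.0, 20.0), "sunny"),
    ((90.0, 95.0), "rainy"),
    ((85.0, 90.0), "rainy"),
    ((60.0, 80.0), "cloudy"),
    ((55.0, 70.0), "cloudy"),
]

prediction = nearest_neighbor_predict(training_data, (32.0, 15.0))
print(prediction)
```

Unlike clustering, the groups here are known in advance: the classifier learns from labeled examples and assigns one of those existing labels to each new instance.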