..MindWrite..

Posts Tagged ‘web search engines’

Clustering and Classification

Posted by guptaradhesh on May 7, 2011

Clustering: the process of partitioning a data set into meaningful subclasses (clusters). Every item in a cluster shares a common trait. It helps a user understand the natural grouping or structure in a data set.

Two ways to do it:

Scatter/Gather: This approach is used for text clustering. The user scatters documents into clusters, gathers the contents of one or more clusters, and re-scatters them to form new clusters. In text clustering, documents are represented as vectors in which each entry corresponds to a weighted feature (typically a word). The similarity between two documents is a measure of the word overlap between them.
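To make the vector idea concrete, here is a minimal sketch in Python. It assumes plain term frequency as the feature weight and cosine similarity as the overlap measure; the post does not name either, and the helper names to_vector and cosine_similarity are made up for illustration.

```python
import math
from collections import Counter


def to_vector(text):
    """Represent a document as a sparse vector: word -> term frequency."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors (0.0 if either is empty)."""
    # Only words present in both documents contribute to the dot product.
    dot = sum(weight * b[word] for word, weight in a.items() if word in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


doc1 = to_vector("search engines index web pages")
doc2 = to_vector("web search engines rank pages")
doc3 = to_vector("sunny weather today")

print(cosine_similarity(doc1, doc2))  # high: four of the five words are shared
print(cosine_similarity(doc1, doc3))  # 0.0: no words in common
```

A real system would more likely use TF-IDF weights, so that rare words count for more than common ones.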

Some Applications: Web Snippet, Document Retrieval, Data Mining

K-Means Clustering: K seeds are chosen to represent the centers of the K resulting clusters. Each document is assigned to the cluster with the most similar seed. It is an iterative process: once every document has been assigned to a cluster, new seeds are computed from the members of each cluster, and the assignment step is repeated with these new seeds until the assignments stop changing.
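The loop described above can be sketched in a few lines of Python. This is only an illustration under some assumptions: documents are stood in for by 2-D points, "most similar" is taken to mean smallest Euclidean distance, and the seeds are passed in by hand instead of being chosen at random. The names k_means, squared_distance and mean are mine.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, seeds, max_iterations=100):
    """Return (centers, clusters) after iterating assignment and update steps."""
    centers = list(seeds)
    clusters = [[] for _ in centers]
    for _ in range(max_iterations):
        # Assignment step: each point goes to the cluster with the nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: squared_distance(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: recompute each center; an empty cluster keeps its old center.
        new_centers = [mean(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:
            break  # centers did not move, so assignments are stable
        centers = new_centers
    return centers, clusters


points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.0), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
seeds = [(1.0, 1.0), (8.0, 8.0)]
centers, clusters = k_means(points, seeds)
print(centers)
print(clusters)
```

With these two seeds the first three points end up in one cluster and the last three in the other. The result of k-means in general depends on the initial seeds, so different seeds can give different clusters.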

Classification: a technique used to predict group membership for data instances. Unlike clustering, the groups are defined in advance. For example, you may wish to use classification to predict whether the weather on a particular day will be “sunny”, “rainy” or “cloudy”.
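As a rough illustration of the weather example, here is a tiny nearest-neighbor classifier in Python. Nearest neighbor is just one of many classification methods and is my choice, not something the definition above prescribes. The two features (humidity and cloud cover, in percent) and all the training values are made up for the sketch.

```python
def nearest_neighbor(training, sample):
    """Return the label of the training example closest to the sample."""
    def squared_distance(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    _, label = min(training, key=lambda example: squared_distance(example[0], sample))
    return label


# Hypothetical labeled data: (humidity %, cloud cover %) -> weather label.
training = [
    ((30, 10), "sunny"),
    ((35, 20), "sunny"),
    ((90, 95), "rainy"),
    ((85, 90), "rainy"),
    ((60, 70), "cloudy"),
    ((55, 75), "cloudy"),
]

print(nearest_neighbor(training, (88, 92)))  # rainy
print(nearest_neighbor(training, (32, 15)))  # sunny
```

The key difference from clustering is visible in the data: every training example already carries a label.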

Stopwords

Posted by guptaradhesh on May 7, 2011

Most search engines, including Google, use stemming. That is, when appropriate, they search not only for your search terms but also for words that share a stem with some or all of those terms (a query for “running” can also match “run” and “runs”).

Most probably, Google uses a lexicon (a dictionary that maps word forms to their stems) for stemming, rather than a rule-based stemming algorithm. A stopword list is a different kind of word list.
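The difference between the two approaches can be sketched as follows. Both functions are toy illustrations: the LEXICON entries and the suffix rules are invented here, and neither is meant to show how Google or any real stemmer actually works.

```python
# Hypothetical lexicon: word form -> stem. A real one would hold many thousands of entries.
LEXICON = {
    "running": "run",
    "ran": "run",
    "runs": "run",
    "mice": "mouse",
    "searches": "search",
}


def lexicon_stem(word):
    """Look the word up in the lexicon; unknown words are returned unchanged."""
    word = word.lower()
    return LEXICON.get(word, word)


def suffix_stem(word):
    """Naive rule-based stemmer: strip the first matching suffix."""
    word = word.lower()
    for suffix in ("ing", "es", "s"):
        # Keep at least three characters so short words are not destroyed.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for w in ("running", "ran", "mice", "searches"):
    print(w, lexicon_stem(w), suffix_stem(w))
```

The lexicon handles irregular forms such as “ran” and “mice” but only for words it contains; the rules work on any word but produce non-words like “runn” and miss the irregular forms.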

“Stopwords” are common words (such as “the”, “of” and “and”) that are generally ignored in queries and are sometimes not indexed at all.
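Dropping stopwords from a query is simple to sketch in Python. The STOPWORDS set below is a tiny made-up sample, not one of the published lists, and remove_stopwords is a hypothetical helper name.

```python
# A small illustrative sample; real stopword lists run to hundreds of words.
STOPWORDS = {"a", "an", "the", "of", "in", "is", "to", "and", "for"}


def remove_stopwords(query):
    """Split a query on whitespace and drop any word found in STOPWORDS."""
    return [word for word in query.lower().split() if word not in STOPWORDS]


print(remove_stopwords("the history of the web"))  # ['history', 'web']
```

An indexer could apply the same filter to documents, which is what "not indexed" amounts to.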

Some of the stopword lists found are:

1. stopwords1
2. stopwords2
