Hi Vikas! For categorizing any kind of data you can try clustering algorithms, see this link: http://mahout.apache.org/users/clustering/clusteringyourdata.html. The simplest algorithm in my opinion is k-means: http://mahout.apache.org/users/clustering/k-means-clustering.html.
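In case k-means itself is new to you, here is a toy sketch of the idea in plain Java (nothing Mahout-specific, the points are made up): each point is assigned to its nearest centroid, then each centroid is moved to the mean of its points, and this repeats until the clusters settle. Mahout's k-means does the same thing, only distributed over Hadoop and on sparse text vectors.

import java.util.Arrays;

// Toy k-means on 2-D points, just to illustrate the idea behind the algorithm.
public class ToyKMeans {
    public static void main(String[] args) {
        double[][] points = { {1, 1}, {1.5, 2}, {8, 8}, {9, 8.5}, {0.5, 1.2}, {8.5, 9} };
        int k = 2, iterations = 10;
        double[][] centroids = { points[0], points[2] };   // naive seeding
        int[] assignment = new int[points.length];

        for (int iter = 0; iter < iterations; iter++) {
            // 1) assign every point to its nearest centroid
            for (int p = 0; p < points.length; p++) {
                double best = Double.MAX_VALUE;
                for (int c = 0; c < k; c++) {
                    double d = dist(points[p], centroids[c]);
                    if (d < best) { best = d; assignment[p] = c; }
                }
            }
            // 2) move every centroid to the mean of the points assigned to it
            for (int c = 0; c < k; c++) {
                double[] sum = new double[2];
                int count = 0;
                for (int p = 0; p < points.length; p++) {
                    if (assignment[p] == c) { sum[0] += points[p][0]; sum[1] += points[p][1]; count++; }
                }
                if (count > 0) centroids[c] = new double[] { sum[0] / count, sum[1] / count };
            }
        }
        System.out.println("cluster assignments: " + Arrays.toString(assignment));
    }

    static double dist(double[] a, double[] b) {
        double dx = a[0] - b[0], dy = a[1] - b[1];
        return Math.sqrt(dx * dx + dy * dy);
    }
}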
What kind of data do you have? If it is text data, you should first extract the text and then do some preprocessing to improve quality: remove stop words (is, are, the, ...), switch words to lower case, and apply a Porter stem filter (http://lucene.apache.org/core/3_0_3/api/core/org/apache/lucene/analysis/PorterStemFilter.html). This can be done with a custom Lucene Analyzer (rough sketch below the quoted mail). The result should be in Mahout's sequence file format. Then you need to vectorize the data (http://mahout.apache.org/users/basics/creating-vectors-from-text.html), run the clustering algorithm and interpret the results (see the pipeline sketch below as well). You can have a look at my experiments here: https://github.com/kslisenko/big-data-research/tree/master/Developments/stackexchange-analyses/stackexchange-analyses-hadoop-mahout

2014/1/13 Vikas Parashar <[email protected]>

> Hi folks,
>
> Have anyone tried to do categorization on crawl data. If yes then how can i
> achieve this? Which algorithm will help me?
>
> --
> Thanks & Regards:-
> Vikas Parashar
> Sr. Linux administrator Cum Developer
> Mobile: +91 958 208 8852
> Email: [email protected]
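P.S. Here is roughly what the custom Analyzer mentioned above could look like against the Lucene 3.0.x API (an untested sketch, the class name is mine): tokenize, lower-case, drop English stop words, then apply the Porter stem filter.

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.PorterStemFilter;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

// Custom analyzer: tokenize -> lower case -> remove stop words -> Porter stemming.
public class PreprocessingAnalyzer extends Analyzer {
    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream stream = new StandardTokenizer(Version.LUCENE_30, reader);
        stream = new LowerCaseFilter(stream);
        stream = new StopFilter(true, stream, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
        return new PorterStemFilter(stream);
    }
}

If I remember correctly, seq2sparse accepts the fully-qualified class name of such an analyzer via its --analyzerName option, but please double-check that against the Mahout docs.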
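And a rough sketch of the whole pipeline driven from Java instead of the bin/mahout shell commands (again untested; the input/output paths, the cluster count and the flag names are just from my memory of recent Mahout versions, so check them against the docs): raw text -> sequence files -> TF-IDF vectors -> k-means.

import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.clustering.kmeans.KMeansDriver;
import org.apache.mahout.text.SequenceFilesFromDirectory;
import org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles;

// Rough end-to-end pipeline, same steps as "bin/mahout seqdirectory / seq2sparse / kmeans".
public class ClusterCrawledText {
    public static void main(String[] args) throws Exception {
        // 1) directory of plain-text files -> Mahout sequence files
        ToolRunner.run(new SequenceFilesFromDirectory(), new String[] {
                "-i", "crawl-text", "-o", "crawl-seq"});

        // 2) sequence files -> sparse TF-IDF vectors, using the custom analyzer sketched above
        //    (replace my.pkg.PreprocessingAnalyzer with the real fully-qualified class name)
        ToolRunner.run(new SparseVectorsFromSequenceFiles(), new String[] {
                "-i", "crawl-seq", "-o", "crawl-vectors",
                "-a", "my.pkg.PreprocessingAnalyzer", "-wt", "tfidf"});

        // 3) k-means over the TF-IDF vectors: 20 random initial clusters, cosine distance,
        //    at most 10 iterations, then assign every document to a cluster (-cl)
        ToolRunner.run(new KMeansDriver(), new String[] {
                "-i", "crawl-vectors/tfidf-vectors",
                "-c", "initial-clusters", "-o", "crawl-clusters",
                "-k", "20", "-x", "10", "-cl",
                "-dm", "org.apache.mahout.common.distance.CosineDistanceMeasure"});
    }
}

After that you can inspect the clusters with clusterdump and see which categories show up in your crawl.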
