Hi Vikas! For categorizing any kind of data you can try clustering algorithms, see this link: http://mahout.apache.org/users/clustering/clusteringyourdata.html. The simplest algorithm in my opinion is k-means: http://mahout.apache.org/users/clustering/k-means-clustering.html.
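In case k-means itself is new to you, here is a toy sketch of the idea in plain Java (nothing Mahout-specific, the points are made up): each point is assigned to its nearest centroid, then each centroid is moved to the mean of its points, and this repeats until the clusters settle. Mahout's k-means does the same thing, only distributed over Hadoop and on sparse text vectors.

import java.util.Arrays;

// Toy k-means on 2-D points, just to illustrate the idea behind the algorithm.
public class ToyKMeans {
    public static void main(String[] args) {
        double[][] points = { {1, 1}, {1.5, 2}, {8, 8}, {9, 8.5}, {0.5, 1.2}, {8.5, 9} };
        int k = 2, iterations = 10;
        double[][] centroids = { points[0], points[2] };   // naive seeding
        int[] assignment = new int[points.length];

        for (int iter = 0; iter < iterations; iter++) {
            // 1) assign every point to its nearest centroid
            for (int p = 0; p < points.length; p++) {
                double best = Double.MAX_VALUE;
                for (int c = 0; c < k; c++) {
                    double d = dist(points[p], centroids[c]);
                    if (d < best) { best = d; assignment[p] = c; }
                }
            }
            // 2) move every centroid to the mean of the points assigned to it
            for (int c = 0; c < k; c++) {
                double[] sum = new double[2];
                int count = 0;
                for (int p = 0; p < points.length; p++) {
                    if (assignment[p] == c) { sum[0] += points[p][0]; sum[1] += points[p][1]; count++; }
                }
                if (count > 0) centroids[c] = new double[] { sum[0] / count, sum[1] / count };
            }
        }
        System.out.println("cluster assignments: " + Arrays.toString(assignment));
    }

    static double dist(double[] a, double[] b) {
        double dx = a[0] - b[0], dy = a[1] - b[1];
        return Math.sqrt(dx * dx + dy * dy);
    }
}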
What kind of data do you have? If it is text data, you should first extract the text and then do some preprocessing to improve quality: remove stop words (is, are, the, ...), switch words to lower case, and apply a Porter stem filter (http://lucene.apache.org/core/3_0_3/api/core/org/apache/lucene/analysis/PorterStemFilter.html). This can be done with a custom Lucene Analyzer (rough sketch below the quoted mail). The result should be in Mahout's sequence file format. Then you need to vectorize the data (http://mahout.apache.org/users/basics/creating-vectors-from-text.html), run the clustering algorithm and interpret the results (see the pipeline sketch below as well). You can have a look at my experiments here: https://github.com/kslisenko/big-data-research/tree/master/Developments/stackexchange-analyses/stackexchange-analyses-hadoop-mahout

2014/1/13 Vikas Parashar <[email protected]>

> Hi folks,
>
> Have anyone tried to do categorization on crawl data. If yes then how can i
> achieve this? Which algorithm will help me?
>
> --
> Thanks & Regards:-
> Vikas Parashar
> Sr. Linux administrator Cum Developer
> Mobile: +91 958 208 8852
> Email: [email protected]
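P.S. Here is roughly what the custom Analyzer mentioned above could look like against the Lucene 3.0.x API (an untested sketch, the class name is mine): tokenize, lower-case, drop English stop words, then apply the Porter stem filter.

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.PorterStemFilter;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

// Custom analyzer: tokenize -> lower case -> remove stop words -> Porter stemming.
public class PreprocessingAnalyzer extends Analyzer {
    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream stream = new StandardTokenizer(Version.LUCENE_30, reader);
        stream = new LowerCaseFilter(stream);
        stream = new StopFilter(true, stream, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
        return new PorterStemFilter(stream);
    }
}

If I remember correctly, seq2sparse accepts the fully-qualified class name of such an analyzer via its --analyzerName option, but please double-check that against the Mahout docs.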
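And a rough sketch of the whole pipeline driven from Java instead of the bin/mahout shell commands (again untested; the input/output paths, the cluster count and the flag names are just from my memory of recent Mahout versions, so check them against the docs): raw text -> sequence files -> TF-IDF vectors -> k-means.

import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.clustering.kmeans.KMeansDriver;
import org.apache.mahout.text.SequenceFilesFromDirectory;
import org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles;

// Rough end-to-end pipeline, same steps as "bin/mahout seqdirectory / seq2sparse / kmeans".
public class ClusterCrawledText {
    public static void main(String[] args) throws Exception {
        // 1) directory of plain-text files -> Mahout sequence files
        ToolRunner.run(new SequenceFilesFromDirectory(), new String[] {
                "-i", "crawl-text", "-o", "crawl-seq"});

        // 2) sequence files -> sparse TF-IDF vectors, using the custom analyzer sketched above
        //    (replace my.pkg.PreprocessingAnalyzer with the real fully-qualified class name)
        ToolRunner.run(new SparseVectorsFromSequenceFiles(), new String[] {
                "-i", "crawl-seq", "-o", "crawl-vectors",
                "-a", "my.pkg.PreprocessingAnalyzer", "-wt", "tfidf"});

        // 3) k-means over the TF-IDF vectors: 20 random initial clusters, cosine distance,
        //    at most 10 iterations, then assign every document to a cluster (-cl)
        ToolRunner.run(new KMeansDriver(), new String[] {
                "-i", "crawl-vectors/tfidf-vectors",
                "-c", "initial-clusters", "-o", "crawl-clusters",
                "-k", "20", "-x", "10", "-cl",
                "-dm", "org.apache.mahout.common.distance.CosineDistanceMeasure"});
    }
}

After that you can inspect the clusters with clusterdump and see which categories show up in your crawl.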
