Thanks, buddy. Actually, I have crawled data on my system, say, data related to all firewall, switch, and router domains. With Nutch I have crawled all of that data into my segments (according to depth).
Luckily, I have Lucene/Solr on top of HDFS. With it I can easily search my data (like a Google search). My pain point begins when my client needs attribute-based search. For example, I need to get all switches that have 24 ports. For that type of search, I assume Mahout comes into play. I don't know whether I am going in the right direction. What I am thinking is that I could train the machine so that it gives us the desired results. We all know that a machine takes some time to give positive results, because every machine needs time to become an expert, and that is fine with me. But for that I need to categorize my crawled data into at least 3 parts (according to the above example). Any idea how I can achieve this?

On Tue, Jan 14, 2014 at 12:21 PM, Константин Слисенко <[email protected]> wrote:

> Hi Vikas!
>
> For categorizing any data you can try clustering algorithms; see this
> link: http://mahout.apache.org/users/clustering/clusteringyourdata.html.
> In my opinion, a simple algorithm is k-means:
> http://mahout.apache.org/users/clustering/k-means-clustering.html.
>
> Which data do you have?
>
> If it is text data, you should first extract the text, then do some
> preprocessing for better quality: remove stop words (is, are, the, ...),
> convert words to lower case, and also use the Porter stem filter
> (http://lucene.apache.org/core/3_0_3/api/core/org/apache/lucene/analysis/PorterStemFilter.html).
> This can be done with a custom Lucene Analyzer. The result should be in Mahout
> sequence file format. Then you need to vectorize the data
> (http://mahout.apache.org/users/basics/creating-vectors-from-text.html).
> Then run the clustering algorithm and interpret the results.
>
> You can look at my experiments here:
> https://github.com/kslisenko/big-data-research/tree/master/Developments/stackexchange-analyses/stackexchange-analyses-hadoop-mahout
>
>
> 2014/1/13 Vikas Parashar <[email protected]>
>
> > Hi folks,
> >
> > Has anyone tried to do categorization on crawled data? If yes, then how can I
> > achieve this? Which algorithm will help me?
> >
> > --
> > Thanks & Regards:-
> > Vikas Parashar
> > Sr. Linux administrator Cum Developer
> > Mobile: +91 958 208 8852
> > Email: [email protected]
>

--
Thanks & Regards:-
Vikas Parashar
Sr. Linux administrator Cum Developer
Mobile: +91 958 208 8852
Email: [email protected]
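
For illustration, the preprocessing and vectorization steps mentioned in the quoted reply (lowercasing, stop-word removal, then turning text into vectors) might look roughly like the sketch below. It is plain Python, not Lucene or Mahout code: the stop-word list, the sample documents, and the function names preprocess and vectorize are all made up for this example, Porter stemming is left out to keep it short, and a real Mahout job would read sequence files rather than in-memory strings.

```python
import re
from collections import Counter

# Assumed, deliberately tiny stop-word list; a real one would be much longer.
STOP_WORDS = {"is", "are", "the", "a", "an", "and", "of", "with", "has", "have"}


def preprocess(text):
    """Lowercase the text, split it into alphanumeric tokens, drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def vectorize(docs):
    """Build one term-frequency vector per document over a shared vocabulary."""
    tokenized = [preprocess(d) for d in docs]
    vocabulary = sorted({t for tokens in tokenized for t in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[term] for term in vocabulary])
    return vocabulary, vectors


# Hypothetical sample documents standing in for extracted page text.
docs = [
    "The switch has 24 ports",
    "The firewall is a stateful firewall",
]
vocabulary, vectors = vectorize(docs)
print(vocabulary)
print(vectors)
```

Vectors of this kind are what a clustering algorithm such as k-means would take as input. Note that the token "24" survives preprocessing, so numeric attributes like port counts stay visible to whatever runs afterwards.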
