It might seem like you would want to do entity extraction, but that's not trivial and Mahout won't directly help in that area.
Bertrand

On Tue, Jan 14, 2014 at 10:05 AM, Константин Слисенко <[email protected]> wrote:

> Hi Vikas!
>
> As I understand it, you need to improve the indexing of your data for
> exact search. You can look at classification algorithms (
> http://mahout.apache.org/users/classification/classifyingyourdata.html).
> You can define topics and train a classifier. The classifier will then
> split your data into several groups, and you can index each group.
>
> But I'm not sure Mahout is a good fit for exact search, if what you want
> is to find switches with exactly 24 ports. I think it would be better to
> process your data another way (using Hadoop), extract the exact
> parameters of every switch in the network, and import that data into a
> database with indexes. You can also integrate Lucene to store the
> database IDs.
>
>
> 2014/1/14 Vikas Parashar <[email protected]>
>
> > Thanks, buddy.
> >
> > Actually, I have crawled data in my system. Let's say "data related to
> > all firewall, switch, and router domains". With Nutch I have crawled
> > all the data into my segments (according to depth).
> >
> > Luckily, I have Lucene/Solr on top of HDFS. With its help, I can
> > easily search my data (like a Google search).
> >
> > Now my pain point begins when my client needs attribute-type search,
> > e.g. "get all switches that have 24 ports". For that type of search, I
> > supposed Mahout would come into play. I don't know whether I'm going
> > in the right direction, but I'm thinking that I should be able to
> > train my machine in such a way that it gives us the desired results.
> > We all know a machine will take some time to give us positive results,
> > because every machine needs some time to become an expert. That is
> > fine with me.
> >
> > But again, for that I need to categorize my crawled data into at least
> > three parts (according to the example above).
> >
> > Any guess how I can achieve this?
> >
> >
> > On Tue, Jan 14, 2014 at 12:21 PM, Константин Слисенко
> > <[email protected]> wrote:
> >
> > > Hi Vikas!
> > >
> > > To categorize any data you can try clustering algorithms; see this
> > > link: http://mahout.apache.org/users/clustering/clusteringyourdata.html.
> > > The simplest algorithm, in my opinion, is k-means:
> > > http://mahout.apache.org/users/clustering/k-means-clustering.html.
> > >
> > > Which data do you have?
> > >
> > > If it is text data, you should first extract the text, then do some
> > > preprocessing for better quality: remove stop words (is, are, the,
> > > ...), switch words to lower case, and apply the Porter stem filter (
> > > http://lucene.apache.org/core/3_0_3/api/core/org/apache/lucene/analysis/PorterStemFilter.html
> > > ). This can be done with a custom Lucene Analyzer. The result should
> > > be in Mahout sequence file format. Then you need to vectorize the
> > > data (
> > > http://mahout.apache.org/users/basics/creating-vectors-from-text.html
> > > ), run the clustering algorithm, and interpret the results.
> > >
> > > You can look at my experiments here:
> > > https://github.com/kslisenko/big-data-research/tree/master/Developments/stackexchange-analyses/stackexchange-analyses-hadoop-mahout
> > >
> > >
> > > 2014/1/13 Vikas Parashar <[email protected]>
> > >
> > > > Hi folks,
> > > >
> > > > Has anyone tried to do categorization on crawled data? If yes, how
> > > > can I achieve this? Which algorithm will help me?
> > > >
> > > > --
> > > > Thanks & Regards:-
> > > > Vikas Parashar
> > > > Sr. Linux Administrator cum Developer
> > > > Mobile: +91 958 208 8852
> > > > Email: [email protected]
> >
> > --
> > Thanks & Regards:-
> > Vikas Parashar
> > Sr. Linux Administrator cum Developer
> > Mobile: +91 958 208 8852
> > Email: [email protected]
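
On the exact-search side of the thread: once per-device attributes have
been extracted (with Hadoop or otherwise), "all switches with exactly 24
ports" needs no machine learning at all; it is a structured lookup. A
minimal sketch against the Lucene 3.0 API referenced above, where the
field names "type" and "ports" are made up for illustration:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.NumericField;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.NumericRangeQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    public class DeviceIndex {

        // One document per device, with the extracted attributes stored
        // as structured fields rather than free text.
        public static Document deviceDoc(String type, int ports) {
            Document doc = new Document();
            doc.add(new Field("type", type,
                    Field.Store.YES, Field.Index.NOT_ANALYZED));
            doc.add(new NumericField("ports", Field.Store.YES, true)
                    .setIntValue(ports));
            return doc;
        }

        // "All switches with exactly 24 ports" then becomes a plain
        // boolean query: an exact term match plus a [24, 24] range.
        public static Query switchesWithPorts(int ports) {
            BooleanQuery q = new BooleanQuery();
            q.add(new TermQuery(new Term("type", "switch")),
                  BooleanClause.Occur.MUST);
            q.add(NumericRangeQuery.newIntRange("ports", ports, ports,
                    true, true), BooleanClause.Occur.MUST);
            return q;
        }
    }

The same idea carries over to Solr: index the extracted attributes as
typed fields and query ports:24 directly, keeping Mahout for the
categorization step only.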
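The preprocessing Константин describes (lower-casing, stop-word removal,
Porter stemming) fits in a single custom Analyzer. A sketch against the
Lucene 3.0 API linked above; the class name is made up:

    import java.io.Reader;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.LowerCaseTokenizer;
    import org.apache.lucene.analysis.PorterStemFilter;
    import org.apache.lucene.analysis.StopAnalyzer;
    import org.apache.lucene.analysis.StopFilter;
    import org.apache.lucene.analysis.TokenStream;

    public class PreprocessingAnalyzer extends Analyzer {

        @Override
        public TokenStream tokenStream(String fieldName, Reader reader) {
            // LowerCaseTokenizer splits on non-letters and lower-cases
            // in one pass.
            TokenStream stream = new LowerCaseTokenizer(reader);
            // Remove common English stop words (is, are, the, ...).
            stream = new StopFilter(true, stream,
                    StopAnalyzer.ENGLISH_STOP_WORDS_SET);
            // Reduce words to their stems ("switches" -> "switch").
            return new PorterStemFilter(stream);
        }
    }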
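After the text has been written out as Mahout sequence files (e.g. with
mahout seqdirectory), the vectorization and k-means steps can be driven
from Java as well as from the command line. A rough sketch assuming
Mahout 0.x driver classes and made-up HDFS paths; check the flags
against your Mahout version:

    import org.apache.mahout.clustering.kmeans.KMeansDriver;
    import org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles;

    public class ClusterCrawl {
        public static void main(String[] args) throws Exception {
            // Sequence files of raw text become TF-IDF vectors,
            // tokenized by the custom analyzer above.
            SparseVectorsFromSequenceFiles.main(new String[] {
                "-i", "crawl-seqfiles",        // hypothetical input path
                "-o", "crawl-vectors",
                "-a", "PreprocessingAnalyzer", // use the fully qualified name
                "-wt", "tfidf"
            });
            // k-means with k=3 for the switch/router/firewall split;
            // -c is seeded with k random points because -k is given.
            KMeansDriver.main(new String[] {
                "-i", "crawl-vectors/tfidf-vectors",
                "-c", "initial-clusters",
                "-o", "crawl-clusters",
                "-k", "3",
                "-x", "20",                    // max iterations
                "-ow",                         // overwrite previous output
                "-cl"                          // write point-to-cluster assignments
            });
        }
    }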
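And if the classification route from the first reply turns out to fit
better (once some crawled pages have been hand-labelled as switch,
router, or firewall), Mahout's naive Bayes trainer takes the same kind
of vectors. A hedged sketch, again with made-up paths; the trainnb
options vary slightly between Mahout releases:

    import org.apache.mahout.classifier.naivebayes.training.TrainNaiveBayesJob;

    public class TrainDeviceClassifier {
        public static void main(String[] args) throws Exception {
            // Assumes TF-IDF vectors keyed as /<label>/<doc-id>, so
            // that -el can extract the labels from the keys.
            TrainNaiveBayesJob.main(new String[] {
                "-i", "labelled-vectors",  // hypothetical input path
                "-o", "nb-model",
                "-li", "label-index",      // where the label index is written
                "-el",                     // extract labels from vector keys
                "-ow"                      // overwrite previous output
            });
        }
    }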
