It might seem like you would want to do entity extraction, but that's not
trivial and Mahout won't directly help in that area.

Bertrand

On Tue, Jan 14, 2014 at 10:05 AM, Константин Слисенко
<[email protected]>wrote:

> Hi Vikas!
>
> As I understand it, you need to improve the indexing of your data for exact
> search. You can look at classification algorithms (
> http://mahout.apache.org/users/classification/classifyingyourdata.html).
> You can define topics and train a classifier. The classifier will then split
> your data into several groups, and you can index each group separately.
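>
> For example, once a model is trained (say with Mahout's naive Bayes
> trainer), scoring a document vector against your topics could look roughly
> like this (a sketch against the Mahout 0.9 Java API; the model path and
> label names are made up):
>
>   import org.apache.hadoop.conf.Configuration;
>   import org.apache.hadoop.fs.Path;
>   import org.apache.mahout.classifier.naivebayes.NaiveBayesModel;
>   import org.apache.mahout.classifier.naivebayes.StandardNaiveBayesClassifier;
>   import org.apache.mahout.math.Vector;
>
>   public class TopicClassifier {
>
>     private final StandardNaiveBayesClassifier classifier;
>     // Hypothetical topics; use whatever labels you trained with.
>     private final String[] labels = {"firewall", "switch", "router"};
>
>     public TopicClassifier(Configuration conf) throws Exception {
>       // "model" is an example path to a model trained beforehand.
>       NaiveBayesModel model = NaiveBayesModel.materialize(new Path("model"), conf);
>       classifier = new StandardNaiveBayesClassifier(model);
>     }
>
>     public String classify(Vector tfidf) {
>       Vector scores = classifier.classifyFull(tfidf); // one score per label
>       return labels[scores.maxValueIndex()];          // best-scoring topic
>     }
>   }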
>
> But I'm not sure that Mahout is a good fit for exact search, if you want to
> find switches with exactly 24 ports. I think it would be better to index
> your data another way: extract the exact parameters of every switch in the
> network (using Hadoop), then import this data into a database with indexes.
> You can also integrate Lucene and store the database IDs in the index.
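>
> For example, if the extraction job has already produced the port count of
> every switch, the Lucene document could carry the database ID plus an exact
> numeric field (a sketch against the Lucene 3.0.x API; the field names and
> ID format are just examples):
>
>   import org.apache.lucene.document.Document;
>   import org.apache.lucene.document.Field;
>   import org.apache.lucene.document.NumericField;
>   import org.apache.lucene.search.NumericRangeQuery;
>   import org.apache.lucene.search.Query;
>
>   public class SwitchIndexExample {
>
>     // One Lucene document per switch; the DB id is stored so a hit can
>     // be joined back to the database record.
>     public static Document switchDoc(String dbId, int ports) {
>       Document doc = new Document();
>       doc.add(new Field("dbId", dbId, Field.Store.YES, Field.Index.NOT_ANALYZED));
>       doc.add(new Field("type", "switch", Field.Store.YES, Field.Index.NOT_ANALYZED));
>       doc.add(new NumericField("ports", Field.Store.YES, true).setIntValue(ports));
>       return doc;
>     }
>
>     // "Exactly 24 ports" is a numeric range query with 24 as both endpoints.
>     public static Query exactly24Ports() {
>       return NumericRangeQuery.newIntRange("ports", 24, 24, true, true);
>     }
>   }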
>
>
> 2014/1/14 Vikas Parashar <[email protected]>
>
> > Thanks buddy,
> >
> > Actually, I have crawled data in my system, let's say "data related to the
> > firewall, switch, and router domains". With Nutch I have crawled all the
> > data into my segments (according to depth).
> >
> > Luckily, I have Lucene/Solr on top of HDFS. With the help of this, I can
> > easily search my data (like a Google search).
> >
> > Now my pain points begin when my client needs attribute-type search.
> > For example, I need to get all switches that have 24 ports. For that type
> > of search, I supposed Mahout would come into action. I don't know whether
> > I am going in the right direction or not, but what I am thinking is that
> > I should be able to train my machine in such a way that it gives us the
> > desired results. We all know that the machine will take some time to give
> > us positive results, because every machine needs some time to become an
> > expert. But that is fine with me.
> >
> > But again, for that we need to categorize my crawled data into at least
> > three parts (according to the above example).
> >
> > Any guesses on how I can achieve this?
> >
> > On Tue, Jan 14, 2014 at 12:21 PM, Константин Слисенко
> > <[email protected]>wrote:
> >
> > > Hi Vikas!
> > >
> > > For categorizing any kind of data you can try clustering algorithms; see
> > > http://mahout.apache.org/users/clustering/clusteringyourdata.html.
> > > The simplest algorithm, in my opinion, is k-means:
> > > http://mahout.apache.org/users/clustering/k-means-clustering.html.
> > >
> > > Which data do you have?
> > >
> > > If it is text data, you should first extract the text, then do some
> > > preprocessing for better quality: remove stop-words (is, are, the, ...),
> > > lower-case the words, and apply a Porter stem filter (
> > > http://lucene.apache.org/core/3_0_3/api/core/org/apache/lucene/analysis/PorterStemFilter.html
> > > ). This can be done with a custom Lucene Analyzer.
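> > >
> > > A minimal analyzer could look like this (a sketch against the Lucene
> > > 3.0.x API linked above; untested):
> > >
> > >   import java.io.Reader;
> > >   import org.apache.lucene.analysis.Analyzer;
> > >   import org.apache.lucene.analysis.LowerCaseFilter;
> > >   import org.apache.lucene.analysis.PorterStemFilter;
> > >   import org.apache.lucene.analysis.StopAnalyzer;
> > >   import org.apache.lucene.analysis.StopFilter;
> > >   import org.apache.lucene.analysis.TokenStream;
> > >   import org.apache.lucene.analysis.standard.StandardTokenizer;
> > >   import org.apache.lucene.util.Version;
> > >
> > >   public class PreprocessingAnalyzer extends Analyzer {
> > >     @Override
> > >     public TokenStream tokenStream(String fieldName, Reader reader) {
> > >       // Tokenize, lower-case, drop English stop-words, then Porter-stem.
> > >       TokenStream stream = new StandardTokenizer(Version.LUCENE_30, reader);
> > >       stream = new LowerCaseFilter(stream);
> > >       stream = new StopFilter(true, stream, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
> > >       stream = new PorterStemFilter(stream);
> > >       return stream;
> > >     }
> > >   }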
> > > The result should be in Mahout sequence-file format. Then you need to
> > > vectorize the data (
> > > http://mahout.apache.org/users/basics/creating-vectors-from-text.html
> > > ), run the clustering algorithm, and interpret the results.
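> > >
> > > Put together, seeding and running k-means on the TF-IDF vectors could
> > > look roughly like this (a sketch assuming the Mahout 0.9 Java API, whose
> > > driver signatures changed between releases; the paths and k=3 are
> > > made up):
> > >
> > >   import org.apache.hadoop.conf.Configuration;
> > >   import org.apache.hadoop.fs.Path;
> > >   import org.apache.mahout.clustering.kmeans.KMeansDriver;
> > >   import org.apache.mahout.clustering.kmeans.RandomSeedGenerator;
> > >   import org.apache.mahout.common.distance.CosineDistanceMeasure;
> > >
> > >   public class ClusterCrawl {
> > >     public static void main(String[] args) throws Exception {
> > >       Configuration conf = new Configuration();
> > >       Path vectors = new Path("crawl-vectors/tfidf-vectors"); // seq2sparse output
> > >       Path seeds   = new Path("kmeans-seeds");
> > >       Path output  = new Path("kmeans-clusters");
> > >
> > >       // Pick k random vectors as the initial cluster centers.
> > >       RandomSeedGenerator.buildRandom(conf, vectors, seeds, 3,
> > >           new CosineDistanceMeasure());
> > >
> > >       // Up to 10 iterations, 0.01 convergence delta; the "true" flag
> > >       // also assigns every vector to its final cluster.
> > >       KMeansDriver.run(conf, vectors, seeds, output, 0.01, 10, true, 0.0, false);
> > >     }
> > >   }
> > >
> > > You can then inspect the top terms per cluster with Mahout's clusterdump
> > > utility.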
> > >
> > > You can look at my experiments here:
> > > https://github.com/kslisenko/big-data-research/tree/master/Developments/stackexchange-analyses/stackexchange-analyses-hadoop-mahout
> > >
> > >
> > > 2014/1/13 Vikas Parashar <[email protected]>
> > >
> > > > Hi folks,
> > > >
> > > > Has anyone tried to do categorization on crawled data? If yes, how
> > > > can I achieve this? Which algorithm will help me?
> > > >
> > > > --
> > > > Thanks & Regards:-
> > > > Vikas Parashar
> > > > Sr. Linux administrator Cum Developer
> > > > Mobile: +91 958 208 8852
> > > > Email: [email protected]
> > > >
> > >
> >
> >
> >
> > --
> > Thanks & Regards:-
> > Vikas Parashar
> > Sr. Linux administrator Cum Developer
> > Mobile: +91 958 208 8852
> > Email: [email protected]
> >
>
