Thanks buddy,

Actually, I already have the crawled data on my system, say, "data related to
all firewall, switch and router domains". With Nutch I have crawled all of
that data into my segments (according to depth).

Luckily, I have Lucene/Solr on top of HDFS. With its help I can easily run
a full-text search (like a Google search) over my data.

Now my pain point begins: my client needs attribute-based search. For
example, I need to get all switches that have 24 ports. For that type of
search I assumed Mahout would come into play. I don't know whether I am going
in the right direction or not. What I am thinking is that I could train the
machine in such a way that it gives us the desired results. We all know that
the machine will take some time before it gives positive results, because
every machine needs some time to become an expert. That is fine with me.
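
To make the "24 ports" case concrete, here is a rough rule-based sketch in
Python of the kind of attribute extraction I have in mind. It is not Mahout
and not a trained model; the sample records, the regular expression and the
function names are all made up for illustration, and real input would be the
text parsed from the crawl.

```python
import re

# Hypothetical sample records; real input would be text parsed from the crawl.
DOCS = [
    {"id": "sw-1", "text": "Managed Gigabit switch with 24 ports and 2 SFP uplinks"},
    {"id": "sw-2", "text": "Compact 8-port desktop switch"},
    {"id": "rt-1", "text": "Dual-WAN router, 4 ports, VPN support"},
    {"id": "sw-3", "text": "Stackable 24-port PoE switch"},
]

# Matches forms such as "24 ports", "24-port" and "24 port".
PORT_PATTERN = re.compile(r"\b(\d+)[\s-]*ports?\b", re.IGNORECASE)


def extract_port_count(text):
    """Return the first port count mentioned in the text, or None."""
    match = PORT_PATTERN.search(text)
    return int(match.group(1)) if match else None


def find_by_ports(docs, wanted_ports, keyword="switch"):
    """Return ids of documents that mention the keyword and the wanted port count."""
    return [
        doc["id"]
        for doc in docs
        if keyword in doc["text"].lower()
        and extract_port_count(doc["text"]) == wanted_ports
    ]


if __name__ == "__main__":
    print(find_by_ports(DOCS, 24))
```

If something like this worked well enough, the extracted number could
presumably be stored as its own numeric field at index time, so that the
attribute search becomes an ordinary field filter.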

But again, for that I need to categorize my crawled data into at least 3
parts (according to the example above).
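
As a baseline for that split, a plain keyword count might already go some
way before any learning is involved. The sketch below is only an assumption
of how it could look in Python; the keyword lists and the function name are
invented for illustration and would need tuning against the real pages.

```python
import re
from collections import Counter

# Hypothetical keyword lists, one per category; real lists would need tuning.
CATEGORY_KEYWORDS = {
    "firewall": {"firewall", "utm", "intrusion"},
    "switch": {"switch", "vlan", "poe"},
    "router": {"router", "wan", "routing"},
}


def categorize(text):
    """Return the category whose keywords occur most often, or 'unknown'."""
    tokens = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    scores = {
        category: sum(tokens[word] for word in words)
        for category, words in CATEGORY_KEYWORDS.items()
    }
    # On a tie, max() keeps the category listed first in CATEGORY_KEYWORDS.
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


if __name__ == "__main__":
    for sample in ("24-port PoE switch with VLAN support", "Dual-WAN router"):
        print(sample, "->", categorize(sample))
```

Pages that come back as "unknown" would be the ones where a clustering or
classification algorithm has to do the real work.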

Any suggestions on how I can achieve this?






On Tue, Jan 14, 2014 at 12:21 PM, Константин Слисенко
<[email protected]> wrote:

> Hi Vikas!
>
> For categorizing any data you can try clustering algorithms, see this
> link http://mahout.apache.org/users/clustering/clusteringyourdata.html.
> In my opinion the simplest algorithm is k-means
> http://mahout.apache.org/users/clustering/k-means-clustering.html.
>
> Which data do you have?
>
> If it is text data, you should first extract the text, then do some
> preprocessing for better quality: remove stop words (is, are, the, ...),
> convert words to lower case, and apply the Porter stem filter
> (http://lucene.apache.org/core/3_0_3/api/core/org/apache/lucene/analysis/PorterStemFilter.html).
> This can be done with a custom Lucene Analyzer. The result should be in
> Mahout sequence file format. Then you need to vectorize the data
> (http://mahout.apache.org/users/basics/creating-vectors-from-text.html).
> Then run the clustering algorithm and interpret the results.
>
> You can look at my experiments here:
> https://github.com/kslisenko/big-data-research/tree/master/Developments/stackexchange-analyses/stackexchange-analyses-hadoop-mahout
>
>
> 2014/1/13 Vikas Parashar <[email protected]>
>
> > Hi folks,
> >
> > Has anyone tried to do categorization on crawled data? If yes, how can I
> > achieve this? Which algorithm will help me?
> >
> > --
> > Thanks & Regards:-
> > Vikas Parashar
> > Sr. Linux administrator Cum Developer
> > Mobile: +91 958 208 8852
> > Email: [email protected]
> >
>
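
To check my understanding of the preprocessing steps quoted above (lower
case, stop-word removal, stemming, then vectorizing), here is a small
stand-alone Python sketch. It does not use Lucene or Mahout: the stop-word
list is a tiny sample, `crude_stem` is only a stand-in for a real Porter
stemmer, and the vectors are plain term counts. All names are my own.

```python
import re
from collections import Counter

# Small illustrative stop-word list; a real list would be much longer.
STOP_WORDS = {"is", "are", "the", "a", "an", "and", "of", "with", "for"}


def crude_stem(word):
    """Very rough suffix stripping; a stand-in for a real Porter stemmer."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lower-case, tokenize, drop stop words and stem what remains."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [crude_stem(token) for token in tokens if token not in STOP_WORDS]


def vectorize(docs):
    """Build a shared vocabulary and one term-frequency vector per document."""
    tokenized = [preprocess(doc) for doc in docs]
    vocab = sorted({token for tokens in tokenized for token in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors


if __name__ == "__main__":
    vocab, vectors = vectorize(["The switches are managed", "Routing with the router"])
    print(vocab)
    print(vectors)
```

Vectors of this kind are what a clustering algorithm such as k-means would
then take as input.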



-- 
Thanks & Regards:-
Vikas Parashar
Sr. Linux administrator Cum Developer
Mobile: +91 958 208 8852
Email: [email protected]
