Hi,
   As this topic concerns both Nutch and Mahout, I am forwarding the request for 
direction that I posted on the Nutch user mailing list to this list.
-Arijit


----- Forwarded Message -----
From: arijit <[email protected]>
To: "[email protected]" <[email protected]> 
Sent: Saturday, October 27, 2012 5:51 PM
Subject: Injecting Mahout in the nutch-solr mix
 

Hi,
   I have been using Nutch to crawl some wiki sites, with the following in 
my plugin:
   o a subclass of HtmlParseFilter that learns patterns from the crawled data, 
and
   o a subclass of IndexingFilter that uses the learning from the earlier step to 
add extra fields when sending the index information to Solr (see the sketch 
below).
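   For context, the IndexingFilter side currently looks roughly like this 
(written against the Nutch 1.x plugin API; the class name, the "clusterLabel" 
metadata key and the "cluster" Solr field are simplified placeholders for my 
actual code):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.parse.Parse;

public class ClusterLabelIndexingFilter implements IndexingFilter {

  private Configuration conf;

  @Override
  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
      CrawlDatum datum, Inlinks inlinks) throws IndexingException {
    // "clusterLabel" stands in for the metadata key that my
    // HtmlParseFilter subclass sets while parsing; "cluster" is the
    // extra field, which has to exist in the Solr schema.
    String label = parse.getData().getMeta("clusterLabel");
    if (label != null) {
      doc.add("cluster", label);
    }
    return doc;
  }

  @Override
  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  @Override
  public Configuration getConf() {
    return conf;
  }
}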

   It works. However, it means I have to spend time writing custom code to 
recognize the various classes of documents. I am looking at Mahout to help with 
this intermediate job - the clustering functionality seems well suited to 
grouping the crawled pages so that I can add the resulting dimensions as fields 
in Solr.
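   What I have in mind is something like the untested sketch below, run after 
vectorizing the crawled text with Mahout's seq2sparse job. It is written 
against the Mahout 0.6 KMeansDriver API (the signatures changed in later 
releases), and the paths and k=20 are made up:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.mahout.clustering.kmeans.KMeansDriver;
import org.apache.mahout.clustering.kmeans.RandomSeedGenerator;
import org.apache.mahout.common.distance.CosineDistanceMeasure;
import org.apache.mahout.common.distance.DistanceMeasure;

public class ClusterCrawledPages {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // TF-IDF vectors produced from the crawled text by seq2sparse.
    Path vectors = new Path("crawl/vectors/tfidf-vectors");
    Path seeds = new Path("crawl/clusters-seed");
    Path output = new Path("crawl/clusters-out");
    DistanceMeasure measure = new CosineDistanceMeasure();

    // Pick 20 random documents as the initial centroids.
    RandomSeedGenerator.buildRandom(conf, vectors, seeds, 20, measure);

    // Run k-means (convergence delta 0.01, at most 10 iterations) and
    // classify each page into its final cluster (runClustering=true).
    KMeansDriver.run(conf, vectors, seeds, output, measure,
        0.01, 10, true, false);
  }
}

   The idea would be that the cluster assignments written under the output's 
clusteredPoints directory feed the IndexingFilter above.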

   Do you think this is a good way forward? Should I try to use Mahout as a 
library to help me do the plugin work that I described earlier? Or is there a 
better way to achieve the clustering before I add the indexes into Solr?

   Any help or direction on this is much appreciated.
-Arijit
