Hi,
   I have been using Nutch to crawl some wiki sites, with the following in 
my plugin:
   o a subclass of HtmlParseFilter that does some learning over the crawled 
data to find patterns, and
   o a subclass of IndexingFilter that uses the learning from the earlier step 
to add extra fields when the documents are indexed into Solr (a rough sketch 
follows).

   It works. However, it means I have to spend time writing class-specific 
code by hand to understand each kind of document. I am looking at Mahout to 
take over this intermediate job: its clustering functionality seems well 
suited to grouping the crawled pages, so that the cluster labels can be added 
as extra fields in Solr (a rough sketch of what I have in mind follows).
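
   Concretely, something along these lines. This assumes the pages have 
already been turned into TF-IDF vectors (e.g. with Mahout's seq2sparse tool); 
the paths and the value of k are made up, and the exact KMeansDriver.run() 
argument list has shifted between Mahout releases, so treat the call as 
approximate:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.mahout.clustering.kmeans.KMeansDriver;
import org.apache.mahout.clustering.kmeans.RandomSeedGenerator;
import org.apache.mahout.common.distance.CosineDistanceMeasure;

public class PageClusterJob {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path vectors = new Path("crawl/tfidf-vectors");   // hypothetical paths
    Path seeds   = new Path("crawl/kmeans-seeds");
    Path output  = new Path("crawl/kmeans-clusters");

    // pick k random documents as the initial centroids
    RandomSeedGenerator.buildRandom(conf, vectors, seeds, 20,
        new CosineDistanceMeasure());

    // iterate k-means; runClustering=true asks the driver to also classify
    // each vector into its final cluster, which is what the IndexingFilter
    // would later look up per URL
    KMeansDriver.run(conf, vectors, seeds, output,
        new CosineDistanceMeasure(), 0.01, 10, true, false);
  }
}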

   Do you think this is a good way forward? Should I try to use Mahout as a 
library to help me do the plugin work I described earlier? Or is there a 
better way to achieve the clustering before I index into Solr?

   Any help or direction on this is much appreciated.
-Arijit
