Hi, I have been using Nutch to crawl some wiki sites, using the following in my plugin:

o a subclass of HtmlParseFilter that learns patterns from the crawled data, and
o a subclass of IndexingFilter that applies that learning to add extra fields when pushing the index info into Solr.
It works. However, it means I need to spend time writing specific code to understand these various classes of documents. I am looking at Mahout to help me with this intermediate job, and its clustering functionality seems well suited to clustering the crawled pages so that I can add the specific dimensions into Solr. Do you think this is a good way forward? Should I try to use Mahout as a library to help do the plugin work I described above? Or is there a better way to achieve the clustering before I add the indexes into Solr? Any help or direction on this is much appreciated. -Arijit
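To make the intent concrete, here is a minimal, self-contained sketch of the "cluster, then tag" step in plain Java. It stands in for the Mahout part: pages are assigned to the nearest cluster centroid by cosine similarity over term-frequency vectors, and the winning cluster's label is the value a custom IndexingFilter would write into an extra Solr field. All class and field names here are illustrative assumptions, not actual Nutch or Mahout API.

```java
import java.util.HashMap;
import java.util.Map;

// Hedged sketch: assigns a page to the nearest centroid and returns the
// cluster label that would be added as an extra index field (e.g. "cluster").
public class ClusterTagger {

    // Build a simple term-frequency vector from page text.
    static Map<String, Integer> termFreq(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String tok : text.toLowerCase().split("\\W+")) {
            if (!tok.isEmpty()) tf.merge(tok, 1, Integer::sum);
        }
        return tf;
    }

    // Cosine similarity between two sparse term-frequency vectors.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            na += e.getValue() * e.getValue();
            Integer v = b.get(e.getKey());
            if (v != null) dot += e.getValue() * v;
        }
        for (int v : b.values()) nb += v * v;
        return (na == 0 || nb == 0) ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    // Pick the label of the nearest centroid; with Mahout, the centroids
    // would come from a clustering run over the crawled corpus instead.
    static String nearestCluster(String pageText,
                                 Map<String, Map<String, Integer>> centroids) {
        Map<String, Integer> page = termFreq(pageText);
        String best = null;
        double bestSim = -1;
        for (Map.Entry<String, Map<String, Integer>> c : centroids.entrySet()) {
            double sim = cosine(page, c.getValue());
            if (sim > bestSim) {
                bestSim = sim;
                best = c.getKey();
            }
        }
        return best;
    }
}
```

In a real setup the centroids would be produced offline by Mahout's clustering over the whole crawl, and the lookup above would run inside the IndexingFilter so each NutchDocument gets its cluster label before being sent to Solr.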

