Hi Arijit,

Depending on the size of your data set(s), you may want to take a quick look at
Behemoth [0] to facilitate some of this work. What you are attempting seems
entirely reasonable and, with a tool such as Behemoth, should be easily
achievable at large scale... ideal for web documents.
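For what it's worth, below is a rough sketch of the IndexingFilter half of what
you describe in your mail, assuming the Nutch 1.x plugin API (the exact
interface differs a little between versions). The cluster lookup is just a
placeholder for whatever model your parse/clustering step produces, and the
"clusterLabel" field name is only an example; it would need a matching field in
your Solr schema and in solrindex-mapping.xml.

    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.Text;
    import org.apache.nutch.crawl.CrawlDatum;
    import org.apache.nutch.crawl.Inlinks;
    import org.apache.nutch.indexer.IndexingException;
    import org.apache.nutch.indexer.IndexingFilter;
    import org.apache.nutch.indexer.NutchDocument;
    import org.apache.nutch.parse.Parse;

    /**
     * Sketch of an IndexingFilter that attaches a cluster label, learned in an
     * earlier clustering step, as an extra field on each document sent to Solr.
     */
    public class ClusterLabelIndexingFilter implements IndexingFilter {

      private Configuration conf;

      // Placeholder: in practice this would be populated from the output of
      // the clustering step (e.g. URL -> cluster label), loaded in setConf().
      private final Map<String, String> urlToCluster =
          new HashMap<String, String>();

      @Override
      public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
                                  CrawlDatum datum, Inlinks inlinks)
          throws IndexingException {
        String label = urlToCluster.get(url.toString());
        if (label != null) {
          // Extra field for Solr; needs a matching field in the Solr schema.
          doc.add("clusterLabel", label);
        }
        return doc;
      }

      @Override
      public void setConf(Configuration conf) {
        this.conf = conf;
        // Load the learned cluster assignments here, e.g. from a path you
        // configure in nutch-site.xml.
      }

      @Override
      public Configuration getConf() {
        return conf;
      }
    }

You would register the class as an indexing-filter extension in the plugin's
plugin.xml and list the plugin id in plugin.includes, as you are presumably
already doing for your existing filters.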
hth
Lewis

[0] https://github.com/DigitalPebble/behemoth

On Sat, Oct 27, 2012 at 1:21 PM, arijit <[email protected]> wrote:
> Hi,
> I have been using Nutch to crawl some wiki sites, using the following in my
> plugin:
> o a subclass of HtmlParseFilter to learn patterns from the crawled data, and
> o a subclass of IndexingFilter that uses the learning from the earlier step
>   to add additional fields when adding the index info into Solr.
>
> It works. However, it means I need to spend time writing specific code to
> understand these various classes of documents. I am looking at Mahout to
> help me with this intermediate job - the clustering functionality seems well
> suited to clustering the crawled pages so I can add the specific dimensions
> into Solr.
>
> Do you think this is a good way forward? Should I try to use Mahout as a
> library to help me do the plugin work I described earlier? Or is there a
> better way to achieve the clustering before I add indexes into Solr?
>
> Any help or direction on this is much appreciated.
> -Arijit

--
Lewis

