Hi Arijit,

Depending on the size of your data set(s), you may wish to take a quick
look at Behemoth [0] to facilitate some of this work. What you are
attempting seems entirely reasonable, and with a tool such as Behemoth
it should be easily achievable at large scale... it is ideal for web
documents.
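
For what it's worth, below is a rough, untested sketch of the
IndexingFilter side of what you describe. It simply looks up a cluster
label for each URL (however you end up producing that mapping, e.g.
from a Mahout clustering run) and adds it as an extra field before the
document goes to Solr. The method signatures are from the Nutch 1.x
plugin API as I recall them, and the class name, the "cluster" field
and the url-to-label map are all made up for illustration, so treat it
as a starting point rather than working code.

import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.parse.Parse;

/**
 * Adds a "cluster" field to each document before it is sent to Solr.
 * The url -> label map is a stand-in for whatever your clustering
 * step produces (e.g. a file on HDFS that is read in setConf()).
 */
public class ClusterLabelIndexingFilter implements IndexingFilter {

  private Configuration conf;
  private Map<String, String> clusterLabels = new HashMap<String, String>();

  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
      CrawlDatum datum, Inlinks inlinks) throws IndexingException {
    String label = clusterLabels.get(url.toString());
    if (label != null) {
      // The extra field also needs to be declared in your Solr schema.
      doc.add("cluster", label);
    }
    return doc;
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
    // TODO: populate clusterLabels from the output of your clustering
    // run, e.g. a sequence file of (url, clusterId) pairs.
  }

  public Configuration getConf() {
    return conf;
  }
}

As with your existing filters, you would still need the usual plugin
wiring (a plugin.xml descriptor and an entry in plugin.includes in
nutch-site.xml) plus the extra field in the Solr schema.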

hth

Lewis

[0] https://github.com/DigitalPebble/behemoth

On Sat, Oct 27, 2012 at 1:21 PM, arijit <[email protected]> wrote:
> Hi,
>    I have been using Nutch to crawl some wiki sites, and my plugin uses the 
> following:
>    o a subclass of HtmlParseFilter to do some learning of patterns in the 
> crawled data, and
>    o a subclass of IndexingFilter that uses the learning from the earlier step 
> to add additional index fields when pushing the index info into Solr.
>
>    It works. However, it means I need to spend time writing specific code to 
> understand these various classes of documents. I am looking at Mahout to help 
> me with this intermediate job - the clustering functionality seems well 
> suited to grouping the crawled pages so that I can add the specific 
> dimensions into Solr.
>
>    Do you think this is a good way forward? Should I try to use Mahout as a 
> library to help me do the plugin work I described earlier? Or is there a 
> better way to achieve the clustering before I add the indexes into Solr?
>
>    Any help or direction on this is much appreciated.
> -Arijit



-- 
Lewis
