Re: Focused Crawling with Nutch (IndexingFilter:filter)

Michael Erickson Wed, 09 May 2012 11:55:46 -0700

On May 9, 2012, at 1:18 PM, Markus Jelsma wrote:

> Hi,
> 
> On Wed, 9 May 2012 13:07:11 -0500, Michael Erickson 
> <[email protected]> wrote:
>> Hello all,
>> 
>> I'd like to try to do a focused crawl [1][2] using Nutch.  I have a
>> classifier trained on a large corpus of hand-curated data.  My goal is
>> to have Nutch run a crawl, but for each page it finds, run the
>> contents of the page through my classifier to see if that page is
>> interesting to me.  If it is, I'll have Nutch proceed as normal.
>> However, if the page is not interesting to me, I want to avoid
>> indexing the page and prevent its outbound links from being added to
>> the frontier.
>> 
>> After reviewing the documentation [3], it appears that writing an
>> `IndexingFilter` plugin might help: specifically, having the `filter`
>> method return `null` when I'm not interested in a page.  What I
>> can't tell is whether returning `null` from `filter` will only
>> stop that page from being inserted into the index, or whether it will
>> also prevent that page's outbound links from being added to the frontier.
>> Can anyone clarify this for me?
> 
> An indexing filter is one step too late. Implement a parse filter instead and 
> you're good to go.
>


Thanks Markus!
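
For the archive, here is a rough sketch of the parse-time gate as I understand the suggestion. It deliberately does not use Nutch's classes: `FocusedCrawlGate`, `ParsedPage`, and the `indexable` flag are all hypothetical names made up for illustration, and the classifier is just a `Predicate<String>` stand-in. In a real plugin I'd expect this logic to sit behind Nutch's parse filter extension point (`HtmlParseFilter`, as far as I can tell), which I have not wired up here.

```java
import java.util.Collections;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical, Nutch-independent model of a parse-time relevance gate.
final class FocusedCrawlGate {

    // Stand-in for a parser's output for one page (not a Nutch class).
    static final class ParsedPage {
        final String url;
        final String text;
        final List<String> outlinks;  // links that would feed the frontier
        final boolean indexable;      // assumed marker a later indexing step could check

        ParsedPage(String url, String text, List<String> outlinks, boolean indexable) {
            this.url = url;
            this.text = text;
            this.outlinks = outlinks;
            this.indexable = indexable;
        }
    }

    // Returns true when the page text is "interesting"; stands in for the trained classifier.
    private final Predicate<String> classifier;

    FocusedCrawlGate(Predicate<String> classifier) {
        this.classifier = classifier;
    }

    // Interesting pages pass through untouched. Uninteresting pages lose
    // their outlinks and are marked as not indexable.
    ParsedPage filter(ParsedPage page) {
        if (classifier.test(page.text)) {
            return page;
        }
        return new ParsedPage(page.url, page.text, Collections.<String>emptyList(), false);
    }
}
```

The point of doing this at parse time rather than at indexing time is that the outlinks are dropped before anything downstream gets to see them, so one decision covers both the index and the frontier.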

> cheers
> 
>> 
>> Best regards,
>> --mike
>> 
>> Michael Erickson
>> [email protected]
>> 
>> 
>> [1] http://www8.org/w8-papers/5a-search-query/crawling/
>> [2] http://www.cse.iitb.ac.in/~soumen/focus/
>> [3] http://nutch.apache.org/apidocs-1.3/org/apache/nutch/indexer/IndexingFilter.html
> 
> -- 
> Markus Jelsma - CTO - Openindex

Michael Erickson
[email protected]
