Re: Focused Crawling with Nutch (IndexingFilter:filter)

Markus Jelsma Wed, 09 May 2012 11:17:19 -0700

Hi,

On Wed, 9 May 2012 13:07:11 -0500, Michael Erickson <[email protected]> wrote:

Hello all,


I'd like to try to do a focused crawl [1][2] using Nutch.  I have a
classifier trained on a large corpus of hand-curated data. My goal is
to have Nutch run a crawl, but for each page it finds, run the
contents of the page through my classifier to see if that page is
interesting to me.  If it is, I'll have Nutch proceed as normal.
However, if the page is not interesting to me, I want to avoid
indexing the page and prevent its outbound links from being added to
the frontier.

After reviewing the documentation, it appears that writing an
`IndexingFilter` plugin might help.  Specifically, using the `filter`
method to return null if I'm not interested in this page.  What I
can't tell is if returning null from the `filter` method will just
stop that page from being inserted into the index, or if it will also
prevent that page's outbound links from being added to the frontier.
Can anyone clarify this for me?

An indexing filter is one step too late. Implement a parse filter instead and you're good to go.
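
To illustrate why the parse stage is the right place for this check, here is a minimal sketch in plain Java. It does not use the Nutch plugin API: `FocusedCrawlSketch`, `ParsedPage`, `filterAtParseTime` and `process` are hypothetical names made up for this example, and the classifier is assumed to be a simple `Predicate<String>` over the page text. The point it tries to show is that outlinks are a product of parsing, so a decision taken there can control both indexing and the frontier, while a decision taken at indexing time can only affect the index.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical stand-in types for illustration; none of these are Nutch classes.
class FocusedCrawlSketch {

    /** Minimal model of a parsed page: its URL, its text, and the links found in it. */
    static final class ParsedPage {
        final String url;
        final String text;
        final List<String> outlinks;

        ParsedPage(String url, String text, List<String> outlinks) {
            this.url = url;
            this.text = text;
            this.outlinks = Collections.unmodifiableList(new ArrayList<>(outlinks));
        }
    }

    /**
     * Parse-stage gate: returns the page unchanged if the classifier accepts
     * its text, otherwise returns null so the caller drops the page entirely.
     */
    static ParsedPage filterAtParseTime(ParsedPage page, Predicate<String> isRelevant) {
        return isRelevant.test(page.text) ? page : null;
    }

    /**
     * Runs every page through the gate. Only accepted pages contribute a URL
     * to the index list and outlinks to the frontier list.
     */
    static void process(List<ParsedPage> pages, Predicate<String> isRelevant,
                        List<String> toIndex, List<String> frontier) {
        for (ParsedPage page : pages) {
            ParsedPage kept = filterAtParseTime(page, isRelevant);
            if (kept == null) {
                continue; // rejected: not indexed, links not followed
            }
            toIndex.add(kept.url);
            frontier.addAll(kept.outlinks);
        }
    }
}
```

In a real plugin the same decision would live inside whatever parse filter extension point your Nutch version provides, with your trained classifier in place of the predicate. How a rejected page should be signalled there depends on that version, so check its API docs rather than relying on this sketch.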


cheers


Best regards,
--mike

Michael Erickson
[email protected]


[1] http://www8.org/w8-papers/5a-search-query/crawling/
[2] http://www.cse.iitb.ac.in/~soumen/focus/
[3] http://nutch.apache.org/apidocs-1.3/org/apache/nutch/indexer/IndexingFilter.html


--
Markus Jelsma - CTO - Openindex
