Hi,
On Wed, 9 May 2012 13:07:11 -0500, Michael Erickson
<[email protected]> wrote:
Hello all,
I'd like to try to do a focused crawl [1][2] using Nutch. I have a
classifier trained on a large corpus of hand-curated data. My goal is
to have Nutch run a crawl, but for each page it finds, run the
contents of the page through my classifier to see if that page is
interesting to me. If it is, I'll have Nutch proceed as normal.
However, if the page is not interesting to me, I want to avoid
indexing the page and prevent its outbound links from being added to
the frontier.
After reviewing the documentation, it appears that writing an
`IndexingFilter` plugin might help. Specifically, I could have the
`filter` method return `null` when I'm not interested in a page. What
I can't tell is whether returning `null` from the `filter` method will
just stop that page from being inserted into the index, or whether it
will also prevent that page's outbound links from being added to the
frontier.
Can anyone clarify this for me?
An indexing filter is one step too late. Implement a parse filter
instead and you're good to go.
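To illustrate the parse-filter suggestion: the parse step is where the
page text and its outlinks are available together, so a classifier
consulted there can drop the outlinks before they reach the frontier.
Below is a minimal sketch of that decision in plain Java. It
deliberately does not use any Nutch classes; the names
`FocusedCrawlSketch`, `Decision`, and `decide`, as well as the
keyword-based stand-in classifier, are hypothetical. In a real Nutch 1.x
plugin the same logic would presumably sit inside an implementation of
the parse filter extension point (`HtmlParseFilter`), whose exact API
should be checked against the Nutch version in use.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Predicate;

// Library-independent sketch (NOT Nutch code): decide at parse time
// whether a page is kept and which of its outlinks may be followed.
public class FocusedCrawlSketch {

    // Result of the parse-time decision for a single page.
    public static final class Decision {
        public final boolean keepPage;       // false: do not index this page
        public final List<String> outlinks;  // links allowed into the frontier

        Decision(boolean keepPage, List<String> outlinks) {
            this.keepPage = keepPage;
            this.outlinks = outlinks;
        }
    }

    // If the classifier accepts the page text, keep the page and all of
    // its outlinks; otherwise drop the page and return no outlinks.
    public static Decision decide(String pageText,
                                  List<String> outlinks,
                                  Predicate<String> isInteresting) {
        if (isInteresting.test(pageText)) {
            return new Decision(true, outlinks);
        }
        return new Decision(false, Collections.<String>emptyList());
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for a trained classifier.
        Predicate<String> classifier =
            text -> text.toLowerCase().contains("crawl");

        List<String> links =
            Arrays.asList("http://example.org/a", "http://example.org/b");

        Decision kept = decide("Notes on focused crawl strategies", links, classifier);
        Decision dropped = decide("A recipe for soup", links, classifier);

        System.out.println(kept.keepPage + " " + kept.outlinks.size());
        System.out.println(dropped.keepPage + " " + dropped.outlinks.size());
    }
}
```

The point of the sketch is only the ordering: the decision is made
while the outlinks are still in hand, which is not the case by the time
an indexing filter runs.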
cheers
Best regards,
--mike
Michael Erickson
[email protected]
[1] http://www8.org/w8-papers/5a-search-query/crawling/
[2] http://www.cse.iitb.ac.in/~soumen/focus/
[3] http://nutch.apache.org/apidocs-1.3/org/apache/nutch/indexer/IndexingFilter.html
--
Markus Jelsma - CTO - Openindex