Focused Crawling with Nutch (IndexingFilter:filter)

Michael Erickson Wed, 09 May 2012 11:07:42 -0700

Hello all,

I'd like to try to do a focused crawl [1][2] using Nutch.  I have a classifier 
trained on a large corpus of hand-curated data.  My goal is to have Nutch run a 
crawl, but for each page it finds, run the contents of the page through my 
classifier to see if that page is interesting to me.  If it is, I'll have Nutch 
proceed as normal.  However, if the page is not interesting to me, I want to 
avoid indexing the page and prevent its outbound links from being added to the 
frontier.
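
To make the intended behavior concrete, here is a minimal sketch in plain Java.
It is not Nutch code: the `Page` class, the `crawl` method, the tiny in-memory
"web", and the keyword test standing in for my classifier are all hypothetical,
made up only to illustrate the gating I am after.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;

public class FocusedCrawlSketch {

    /** Hypothetical stand-in for a fetched page: its text and its outbound links. */
    public static final class Page {
        final String content;
        final List<String> outlinks;

        public Page(String content, List<String> outlinks) {
            this.content = content;
            this.outlinks = outlinks;
        }
    }

    /**
     * Breadth-first crawl over an in-memory "web". A page the classifier rejects
     * is neither indexed nor expanded, so its outlinks never reach the frontier.
     */
    public static List<String> crawl(Map<String, Page> web, String seed,
                                     Predicate<String> isInteresting) {
        List<String> indexed = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        Deque<String> frontier = new ArrayDeque<>();
        frontier.add(seed);
        seen.add(seed);

        while (!frontier.isEmpty()) {
            String url = frontier.poll();
            Page page = web.get(url);
            if (page == null) {
                continue; // nothing fetched for this URL
            }
            if (!isInteresting.test(page.content)) {
                continue; // rejected: skip indexing AND skip its outlinks
            }
            indexed.add(url);
            for (String link : page.outlinks) {
                if (seen.add(link)) { // add() returns false for already-seen URLs
                    frontier.add(link);
                }
            }
        }
        return indexed;
    }

    public static void main(String[] args) {
        Map<String, Page> web = Map.of(
            "a", new Page("nutch crawler", List.of("b", "c")),
            "b", new Page("cooking recipes", List.of("d")),
            "c", new Page("nutch plugins", List.of()),
            "d", new Page("nutch indexing", List.of()));

        // Keyword match is a placeholder for the trained classifier.
        List<String> indexed = crawl(web, "a", text -> text.contains("nutch"));
        System.out.println(indexed); // prints [a, c]
    }
}
```

Page "d" would pass the classifier, but it is only reachable through the
rejected page "b", so it is never visited.  That pruning is the part I need.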


After reviewing the documentation [3], it appears that writing an 
`IndexingFilter` plugin might help: specifically, having the `filter` method 
return `null` when I'm not interested in a page.  What I can't tell is whether 
returning `null` from `filter` will only keep that page out of the index, or 
whether it will also prevent that page's outbound links from being added to 
the frontier.  Can anyone clarify this for me?

Best regards,
--mike

Michael Erickson


[1] http://www8.org/w8-papers/5a-search-query/crawling/ 
[2] http://www.cse.iitb.ac.in/~soumen/focus/ 
[3] http://nutch.apache.org/apidocs-1.3/org/apache/nutch/indexer/IndexingFilter.html
