On May 9, 2012, at 1:18 PM, Markus Jelsma wrote:

> Hi,
>
> On Wed, 9 May 2012 13:07:11 -0500, Michael Erickson
> <[email protected]> wrote:
>> Hello all,
>>
>> I'd like to try to do a focused crawl [1][2] using Nutch. I have a
>> classifier trained on a large corpus of hand-curated data. My goal is
>> to have Nutch run a crawl, but for each page it finds, run the
>> contents of the page through my classifier to see if that page is
>> interesting to me. If it is, I'll have Nutch proceed as normal.
>> However, if the page is not interesting to me, I want to avoid
>> indexing the page and prevent its outbound links from being added to
>> the frontier.
>>
>> After reviewing the documentation, it appears that writing an
>> `IndexingFilter` plugin might help. Specifically, using the `filter`
>> method to return NULL if I'm not interested in this page. What I
>> can't tell is if returning NULL from the `filter` method will just
>> stop that page from being inserted into the index, or if it will also
>> prevent that page's outbound links from being added to the frontier.
>> Can anyone clarify this for me?
>
> An indexing filter is one step too late. Implement a parse filter instead and
> you're good to go.
>

Thanks Markus!

> cheers
>
>>
>> Best regards,
>> --mike
>>
>> Michael Erickson
>> [email protected]
>>
>>
>> [1] http://www8.org/w8-papers/5a-search-query/crawling/
>> [2] http://www.cse.iitb.ac.in/~soumen/focus/
>> [3] http://nutch.apache.org/apidocs-1.3/org/apache/nutch/indexer/IndexingFilter.html
>
> --
> Markus Jelsma - CTO - Openindex

Michael Erickson
[email protected]
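
For anyone finding this thread later, here is a rough sketch of the decision a parse-time filter would make, as suggested above. It deliberately does not use the Nutch API: `ParsedPage`, `FocusedParseFilterSketch`, the `indexable` flag, and the example URLs are all hypothetical stand-ins, and the classifier is modeled as a plain `Predicate<String>`. In a real plugin the same logic would presumably live behind Nutch's parse filter extension point, and how a page is then kept out of the index is not covered here.

```java
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical stand-in for a parsed page; NOT a Nutch class. */
final class ParsedPage {
    final String url;
    final String text;
    final List<String> outlinks;
    final boolean indexable; // hypothetical flag, not a Nutch field

    ParsedPage(String url, String text, List<String> outlinks, boolean indexable) {
        this.url = url;
        this.text = text;
        this.outlinks = outlinks;
        this.indexable = indexable;
    }
}

/** Sketch of a focused-crawl filter applied at parse time (assumed design). */
final class FocusedParseFilterSketch {
    // The trained classifier, reduced to "is this text interesting?".
    private final Predicate<String> isInteresting;

    FocusedParseFilterSketch(Predicate<String> isInteresting) {
        this.isInteresting = isInteresting;
    }

    /**
     * Interesting pages pass through unchanged. Uninteresting pages lose
     * their outlinks (so nothing new reaches the frontier) and are marked
     * as not indexable.
     */
    ParsedPage filter(ParsedPage page) {
        if (isInteresting.test(page.text)) {
            return page;
        }
        return new ParsedPage(page.url, page.text, List.of(), false);
    }
}
```

The point of doing this at parse time rather than at indexing time is that the outlinks are still in hand and can be dropped before they are carried forward, which an indexing filter runs too late to do.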

