I'm pretty new to Nutch and have run into a problem. Googling led me to a few posts on this mailing list, but I didn't fully understand them and they seemed only loosely related to my problem. My use case sounds like a fairly common one, though, so I'd like to know what I'm missing.
My use case involves crawling the web to build a collection of documents related to health and healthcare. We already have a classifier that labels a document as health-related or not, and I would like to plug it into a Nutch workflow. I imagine it flowing something like this:

1. Perform a crawl.
2. Outside of Nutch, find all documents that have been fetched since the last classification job and classify them. Write a new key (in the WebDB?) specifying whether each document is health-related or not.
3. On the next crawl, don't follow links generated from non-health documents.

My questions are:

- Where should the classification step live? Should it be a separate job run against the parsed content, or a Nutch plugin?
- Where should I record the classification result: on all the outbound links in the LinkDB, or on the URL in the WebDB?
- Where is the best place to hook in a plugin so the crawler doesn't follow links from non-health-related content (or at least limits the crawl depth for it)?

Sorry if any of this has already been answered before.
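To make step 2 concrete, here is a rough standalone sketch of the "classify and mark" pass I have in mind, run outside of Nutch over dumped parse text. Everything here is a placeholder: the keyword check stands in for our real classifier, and `isHealthRelated` is just a key name I made up, not an existing Nutch metadata field.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the external classification job (step 2 above).
// Input: a map of URL -> extracted page text (e.g. from a segment dump).
// Output: a map of URL -> "true"/"false" that would then be written back
// as a metadata key (isHealthRelated) for the crawler to consult.
public class HealthMarker {
    static final String HEALTH_KEY = "isHealthRelated";

    // Stand-in for the real classifier: a trivial keyword check.
    static boolean classify(String text) {
        String t = text.toLowerCase();
        return t.contains("health") || t.contains("medicine");
    }

    // Classify every fetched document and record the result per URL.
    static Map<String, String> mark(Map<String, String> docs) {
        Map<String, String> marks = new HashMap<>();
        for (Map.Entry<String, String> e : docs.entrySet()) {
            marks.put(e.getKey(), String.valueOf(classify(e.getValue())));
        }
        return marks;
    }
}
```

The idea would then be that the next generate/fetch cycle consults this key and skips (or depth-limits) outlinks whose parent document was marked `false`.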

