I'm pretty new to Nutch and have run into a problem. Googling led me to a
few posts on this mailing list, but I didn't fully understand them and they
seemed only loosely related to my problem. My use case sounds fairly common,
though, so I'd like to know what I'm missing.

My use case involves crawling the internet and building a collection of
documents related to health and healthcare. We already have a classifier
created to classify a document as being health-related or not. I would like
to inject this classifier into a Nutch workflow. I think it would flow
something like this:

   1. Perform a crawl.
   2. Outside of Nutch, find all documents that have been fetched since the
   last classification job and classify them, then store a new key (in the
   WebDB?) specifying whether each document is health-related.
   3. When starting the next crawl, don't follow links generated from
   non-health documents.
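To make steps 2 and 3 concrete, here is the kind of logic I have in mind,
sketched as plain Java outside of Nutch. This is only a sketch:
`isHealthRelated` is a stand-in for our real classifier, and the flag map
stands in for whatever per-URL metadata key the WebDB supports.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Rough sketch of the post-crawl classification pass (step 2) and the
// link-following decision (step 3). Nothing here is Nutch API.
public class ClassifyPass {

    // Stand-in classifier: in reality this would call our trained model.
    static boolean isHealthRelated(String parsedText) {
        return parsedText.toLowerCase().contains("health");
    }

    // Step 2: classify each newly fetched document and record a flag
    // keyed by URL (the stand-in for a new key in the WebDB).
    static Map<String, Boolean> classifyBatch(Map<String, String> urlToText) {
        Map<String, Boolean> flags = new HashMap<>();
        for (Map.Entry<String, String> e : urlToText.entrySet()) {
            flags.put(e.getKey(), isHealthRelated(e.getValue()));
        }
        return flags;
    }

    // Step 3: only follow outlinks whose source page was flagged
    // health-related; unknown or non-health sources yield no links.
    static List<String> followableOutlinks(String sourceUrl,
                                           List<String> outlinks,
                                           Map<String, Boolean> flags) {
        if (Boolean.TRUE.equals(flags.get(sourceUrl))) {
            return outlinks;
        }
        return List.of();
    }

    public static void main(String[] args) {
        Map<String, String> fetched = new HashMap<>();
        fetched.put("http://example.org/a", "A page about health insurance.");
        fetched.put("http://example.org/b", "A page about car engines.");
        Map<String, Boolean> flags = classifyBatch(fetched);
        System.out.println(followableOutlinks("http://example.org/a",
                List.of("http://example.org/a/1"), flags));
    }
}
```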

My questions are:

   - Where should the classification process be embedded? Should it be a
   separate job run against the parsed content, or a Nutch plugin?
   - Where should I record the classification result: on all the outbound
   links in the LinkDB, or on the URL in the WebDB?
   - What is the best place to create a plugin that stops the crawler from
   following links from non-health-related content (or at least limits the
   crawl depth)?
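For the last question, the shape I imagine is something like Nutch's
URLFilter contract (a `String filter(String url)` method that returns the
URL to keep it, or null to reject it). Below is a self-contained mock of
that idea, where the set of links collected from non-health pages is assumed
to have been built by the earlier classification pass:

```java
import java.util.Set;

// Mock of a URL-filter style plugin. Nutch's real extension point
// (org.apache.nutch.net.URLFilter) exposes String filter(String url):
// return the URL to keep it, or null to drop it. This class only
// imitates that shape; it is not wired into Nutch.
public class HealthUrlFilter {

    // Assumed input: links harvested from pages our classifier
    // marked as non-health-related.
    private final Set<String> linksFromNonHealthPages;

    public HealthUrlFilter(Set<String> linksFromNonHealthPages) {
        this.linksFromNonHealthPages = linksFromNonHealthPages;
    }

    // Keep the URL unless it came from a non-health page.
    public String filter(String url) {
        return linksFromNonHealthPages.contains(url) ? null : url;
    }

    public static void main(String[] args) {
        HealthUrlFilter f =
                new HealthUrlFilter(Set.of("http://example.org/cars"));
        System.out.println(f.filter("http://example.org/health"));
        System.out.println(f.filter("http://example.org/cars"));
    }
}
```

One thing I'm unsure about: a plain URL filter only sees the URL string, not
the source document, so the non-health link set would have to be built ahead
of time, which is part of why I'm asking where this logic best belongs.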

Sorry if any of this has been answered before.
