Hi, I completely agree with Markus that classifying as part of the parsing step is the best approach. We have used our text classification library (https://github.com/DigitalPebble/TextClassification) within a custom parse filter on several Nutch projects for our clients, and it worked a treat, e.g. for classifying adult content.
Julien

On 19 August 2013 21:20, Markus Jelsma <[email protected]> wrote:

> Hi,
>
> We use SVM for some classification and made a parse plugin for the job.
> This is the best place because it allows you to write your float or
> boolean back to the parse metadata. This parse metadata field holding
> your classification result can be passed to the CrawlDB via
> db.parsemeta.to.crawldb. When it is also in the CrawlDB you can do many
> things with it, such as prefer it when generating a new fetch list. If
> you don't want to follow the outlinks of non-positive URLs, you can do
> this as well via a scoring filter.
>
> Cheers,
> Markus
>
> > -----Original message-----
> > From: Tristan Lohman <[email protected]>
> > Sent: Monday 19th August 2013 22:08
> > To: [email protected]
> > Subject: Crawling documents based on classification.
> >
> > I'm pretty new to Nutch and have encountered a problem. I have tried
> > googling, which led me to a few posts on this mailing list which I
> > didn't understand and which seemed only somewhat related to my
> > problem. But my use case sounds like a fairly common one, so I would
> > like to know what I'm missing.
> >
> > My use case involves crawling the internet and building a collection
> > of documents related to health and healthcare. We already have a
> > classifier created to classify a document as health-related or not. I
> > would like to inject this classifier into a Nutch workflow. I think it
> > would flow something like this:
> >
> > 1. Perform a crawl.
> > 2. Outside of Nutch, find all documents that have been pulled since
> >    the last classification job, and classify them. Put a new key (in
> >    the WebDB?) specifying whether the document is health-related or
> >    not.
> > 3. When starting the next crawl, don't follow links generated from
> >    non-health documents.
> >
> > My questions are:
> >
> > - I'm curious where to embed the classification process. Should it be
> >   a separate job run against the parsed content? Should it be a plugin
> >   to Nutch?
> > - Where should I mark the classification result: on all the outbound
> >   links in the LinkDB, or on the URL in the WebDB?
> > - Where is the best place to create a plugin to ensure it doesn't
> >   follow links from non-health-related content (or at least limit the
> >   crawl depth)?
> >
> > Sorry if any of this has already been answered before.

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
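For reference, the `db.parsemeta.to.crawldb` property Markus mentions is a comma-separated list of parse metadata keys to copy into the CrawlDB, set in `conf/nutch-site.xml`. A minimal sketch, assuming your custom parse filter writes its result under a key named `health.score` (that key name is hypothetical, not part of Nutch):

```xml
<!-- conf/nutch-site.xml: copy selected parse metadata fields into the CrawlDB -->
<property>
  <name>db.parsemeta.to.crawldb</name>
  <!-- "health.score" is a hypothetical key written by a custom parse filter -->
  <value>health.score</value>
</property>
```

Once the value is in the CrawlDB, a scoring filter can read it when generating the next fetch list or when deciding whether to follow outlinks, as described above.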

