Hi, I completely agree with Markus that classifying as part of the parsing step is the best approach. We have used our text classification library (https://github.com/DigitalPebble/TextClassification) within a custom parse filter on several Nutch projects for our clients, and it worked a treat, e.g. for classifying adult content.
Julien

On 19 August 2013 21:20, Markus Jelsma <[email protected]> wrote:

> Hi,
>
> We use SVM for some classification and made a parse plugin for the job.
> This is the best place because it allows you to write your float or
> boolean back to the parse metadata. This parse metadata field holding
> your classification result can be passed to the CrawlDB via
> db.parsemeta.to.crawldb. When it is also in the CrawlDB you can do many
> things with it, such as prefer it when generating a new fetch list. If
> you don't want to follow the outlinks of non-positive URLs, you can do
> this as well via a scoring filter.
>
> Cheers,
> Markus
>
> > -----Original message-----
> > From: Tristan Lohman <[email protected]>
> > Sent: Monday 19th August 2013 22:08
> > To: [email protected]
> > Subject: Crawling documents based on classification.
> >
> > I'm pretty new to Nutch and have encountered a problem. I have tried
> > googling, which led me to a few posts on this mailing list which I
> > didn't understand and which seemed only somewhat related to my
> > problem. But my use case sounds like a fairly common one, so I would
> > like to know what I'm missing.
> >
> > My use case involves crawling the internet and building a collection
> > of documents related to health and healthcare. We already have a
> > classifier created to classify a document as health-related or not. I
> > would like to inject this classifier into a Nutch workflow. I think it
> > would flow something like this:
> >
> > 1. Perform a crawl.
> > 2. Outside of Nutch, find all documents that have been pulled since
> >    the last classification job, and classify them. Put a new key (in
> >    the WebDB?) specifying whether the document is health-related or
> >    not.
> > 3. When starting the next crawl, don't follow links generated from
> >    non-health documents.
> >
> > My questions are:
> >
> > - I'm curious where to embed the classification process. Should it be
> >   a separate job run against the parsed content? Should it be a plugin
> >   to Nutch?
> > - Where should I mark the classification result: on all the outbound
> >   links in the LinkDB, or on the URL in the WebDB?
> > - Where is the best place to create a plugin to ensure it doesn't
> >   follow links from non-health-related content (or at least limit the
> >   crawl depth)?
> >
> > Sorry if any of this has already been answered before.

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
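For reference, the `db.parsemeta.to.crawldb` property Markus mentions is a comma-separated list of parse metadata keys to copy into the CrawlDB, set in `conf/nutch-site.xml`. A minimal sketch, assuming your custom parse filter writes its result under a key named `health.score` (that key name is hypothetical, not part of Nutch):

```xml
<!-- conf/nutch-site.xml: copy selected parse metadata fields into the CrawlDB -->
<property>
  <name>db.parsemeta.to.crawldb</name>
  <!-- "health.score" is a hypothetical key written by a custom parse filter -->
  <value>health.score</value>
</property>
```

Once the value is in the CrawlDB, a scoring filter can read it when generating the next fetch list or when deciding whether to follow outlinks, as described above.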

