Hi,

We use an SVM for some classification and wrote a parse plugin for the job. A parse plugin is the best place because it lets you write your float or boolean result back to the parse metadata. The parse metadata field holding your classification result can then be passed to the CrawlDB via the db.parsemeta.to.crawldb property. Once it is in the CrawlDB you can do many things with it, such as preferring positively classified URLs when generating a new fetch list. If you don't want to follow the outlinks of non-positive URLs, you can do that as well via a scoring filter.
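To illustrate the property mentioned above: assuming your parse plugin writes its result under a metadata key such as classification.score (a hypothetical name; use whatever key your plugin actually sets), a nutch-site.xml fragment along these lines would carry that field into the CrawlDB:

```xml
<!-- Sketch of a nutch-site.xml override; "classification.score" is a
     hypothetical key that your parse plugin is assumed to write into
     the parse metadata. -->
<property>
  <name>db.parsemeta.to.crawldb</name>
  <value>classification.score</value>
  <description>Comma-separated list of parse metadata fields to copy
  into the CrawlDB, where a generator or scoring filter can then read
  them when building the next fetch list.</description>
</property>
```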
Cheers,
Markus

-----Original message-----
> From: Tristan Lohman <[email protected]>
> Sent: Monday 19th August 2013 22:08
> To: [email protected]
> Subject: Crawling documents based on classification.
>
> I'm pretty new to Nutch and have encountered a problem. I have tried
> googling which led me to a few posts on this mailing list which I didn't
> understand and seemed only somewhat related to my problem. But my use case
> sounds like a fairly common one so I would like to know what I'm missing.
>
> My use case involves crawling the internet and building a collection of
> documents related to health and healthcare. We already have a classifier
> created to classify a document as being health-related or not. I would like
> to inject this classifier into a Nutch workflow. I think it would flow
> something like this:
>
> 1. Perform a crawl.
> 2. Outside of Nutch, find all documents that have been pulled since the
> last classification job, and classify them. Put a new key (in the WebDB?)
> specifying whether the document is health related or not.
> 3. When starting the next crawl, don't follow links generated from
> non-health documents
>
> My questions are:
>
> - I'm curious where to embed the classification process. Should it be a
> separate job run against the parsed content? Should it be a plugin to
> Nutch?
> - Where should I mark the classification result, mark all the outbound
> links in the linkDB, or the URL in the webDB?
> - Where is the best place to create a plugin to ensure it doesn't follow
> links from non-health related content (or at least limit the crawl depth)?
>
> Sorry if any of this information has already been answered before.
>

