On Friday 02 December 2011 16:23:42 [email protected] wrote:
> Hello everyone,
> 
> 
>    We have a set of URLs to crawl, but we're interested in crawling only
> pages whose language is in our white list (e.g. English, Italian, French),
> and want to reject all the others.
> 
> 
>    I don't know if Nutch has built-in support for this; language-detector
> seems to be dedicated to a different task.
> 
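The language detector adds a field to each document, and you can use its
value to keep unwanted pages out of the index. Write a custom indexing filter
that skips every document whose language is not on your white list. Note that
this filters at indexing time, so the pages are still fetched and parsed.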
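
A minimal sketch of the check such a filter would perform, in plain Java so
it runs without Nutch on the classpath. Everything named here is an
assumption for illustration: the class name, the Map used as a stand-in for
the document, and the field name "lang" (check which field your
language-identifier version actually writes). In a real plug-in the same
logic would go into the filter method of a class implementing Nutch's
IndexingFilter interface, where, as far as I know, returning null drops the
document.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for a Nutch indexing filter; not Nutch API.
final class LanguageWhitelistSketch {

    // Assumed white list of ISO 639-1 codes: English, Italian, French.
    static final Set<String> ALLOWED = Set.of("en", "it", "fr");

    // Assumed name of the field written by the language detector.
    static final String LANG_FIELD = "lang";

    // Returns the document unchanged if its language is allowed,
    // or null to signal that it should be skipped.
    static Map<String, String> filter(Map<String, String> doc) {
        String lang = doc.get(LANG_FIELD);
        if (lang == null) {
            return null; // no detected language: skip (a policy choice)
        }
        String code = lang.trim().toLowerCase(Locale.ROOT);
        return ALLOWED.contains(code) ? doc : null;
    }

    public static void main(String[] args) {
        Map<String, String> italian = new HashMap<>();
        italian.put(LANG_FIELD, "it");
        Map<String, String> german = new HashMap<>();
        german.put(LANG_FIELD, "de");

        System.out.println(filter(italian) != null); // kept
        System.out.println(filter(german) != null);  // skipped
    }
}
```

Skipping documents with no detected language is a choice made in this
sketch; return the document instead if you would rather index them.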

> 
>    What is the best way to achieve this with Nutch? Are there configuration
> options, or do we need to write a new plug-in (one that, for example,
> downloads the page, detects the content language, and proceeds if the
> language is OK, otherwise skips the page)?
> 
> 
> Thanks,
> Alessio

-- 
Markus Jelsma - CTO - Openindex
