Hello everyone,

   We have a set of URLs to crawl, but we are only interested in pages
whose language is on our whitelist (e.g. English, Italian, French);
all other pages should be rejected.


   I don't know whether Nutch has built-in support for this;
language-detector seems to be dedicated to a different task.


   What is the best way to achieve this with Nutch? Is there a
configuration option, or do we need to write a new plug-in (one that,
for example, downloads the page, detects the content language, and
proceeds if the language is acceptable, otherwise skips the page)?
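
   To make the idea concrete, the accept/skip decision I have in mind
would look roughly like the sketch below. It is plain Java and does not
use any Nutch API: the class name LanguageWhitelist, the method accept,
and the assumption that the detector hands back an ISO 639-1 style code
(such as "en" or "it-IT") are all mine, for illustration only.

```java
import java.util.Locale;
import java.util.Set;

public class LanguageWhitelist {
    // Hypothetical whitelist of language codes: English, Italian, French.
    private static final Set<String> ALLOWED = Set.of("en", "it", "fr");

    /**
     * Returns true if the detected language is on the whitelist.
     * The detectedLang argument is assumed to come from whatever
     * language detector runs on the downloaded page content.
     */
    public static boolean accept(String detectedLang) {
        if (detectedLang == null || detectedLang.isBlank()) {
            return false; // language unknown: skip the page
        }
        // Reduce "en-US" or "EN_us" to the primary subtag "en".
        String primary = detectedLang.trim()
                                     .toLowerCase(Locale.ROOT)
                                     .split("[-_]")[0];
        return ALLOWED.contains(primary);
    }
}
```

   A page would be kept when accept(...) returns true and skipped
otherwise; where such a check would hook into Nutch is exactly what I
am unsure about.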


Thanks,
Alessio
