On Friday 02 December 2011 16:23:42 [email protected] wrote:
> Hello everyone,
>
> We've a set of urls to crawl, but we're interested in crawling only pages
> whose language is in our white list (e.g.: English, Italian, French),
> and reject all the others.
>
> I don't know if Nutch has a built-in support for this, language-detector
> seems to be dedicated only to another task.

You can use the field value added by the language detector to reject the page from being indexed. Create a custom indexing filter, skipping all documents you don't need.

> Which is the best way to achieve this with Nutch? Some configuration
> options, or it's needed to write a new plug-in? (That for example, download
> the page, detect the content language, and if the language is ok, proceed,
> otherwise the page is skipped).
>
> Thanks,
> Alessio

--
Markus Jelsma - CTO - Openindex
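A rough sketch of the whitelist check such a custom indexing filter might perform. It is kept free of Nutch dependencies so it can be compiled on its own: a plain Map stands in for the document, and the class name, the "lang" field name and the normalization of codes like "en-US" are assumptions for illustration, not taken from the message above. In a real plugin this logic would presumably sit in the filter method of a class implementing Nutch's IndexingFilter extension point, where (as far as I know) returning null discards the document.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/**
 * Stand-in for the decision logic of a custom indexing filter.
 * A plain Map plays the role of the document; all names are hypothetical.
 */
class LanguageWhitelistFilter {
    // Assumed name of the field written by the language detector.
    static final String LANG_FIELD = "lang";

    private final Set<String> allowed;

    LanguageWhitelistFilter(Set<String> allowed) {
        this.allowed = allowed;
    }

    /** Returns the document unchanged if its language is allowed, otherwise null. */
    Map<String, String> filter(Map<String, String> doc) {
        String lang = doc.get(LANG_FIELD);
        if (lang == null) {
            return null; // no detected language: treat as not whitelisted
        }
        // Normalize values such as "EN" or "en-US" to "en" before comparing.
        String code = lang.trim().toLowerCase(Locale.ROOT);
        int dash = code.indexOf('-');
        if (dash > 0) {
            code = code.substring(0, dash);
        }
        return allowed.contains(code) ? doc : null;
    }

    public static void main(String[] args) {
        LanguageWhitelistFilter filter =
                new LanguageWhitelistFilter(Set.of("en", "it", "fr"));

        Map<String, String> doc = new HashMap<>();
        doc.put("url", "http://example.com/");
        doc.put(LANG_FIELD, "de");

        // A null result means the document would be skipped.
        System.out.println(filter.filter(doc) == null ? "skipped" : "kept");
    }
}
```

Note that this only keeps unwanted pages out of the index; they are still fetched and parsed, since the language is only known after parsing.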

