The indexer part of this plugin will help you on your way:
http://wiki.apache.org/nutch/WritingPluginExample-1.2
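
To make the idea concrete, here is a minimal sketch of the whitelist check such an indexing filter could call. It is only an illustration, not code from Nutch or from the wiki example: the class name LanguageWhitelist and the method keep() are hypothetical, and the language list is simply the one from the question. The comment at the bottom shows how it might plug into a filter. The filter signature, the "lang" field name and the "return null to skip the document" behaviour are assumptions about Nutch 1.x that should be checked against the version you run.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/**
 * Whitelist check intended to be called from a custom indexing filter.
 * Class and method names are hypothetical; they are not part of Nutch.
 */
public class LanguageWhitelist {

    // Languages to keep, taken from the question (EN, IT, DE, FR).
    private static final Set<String> ALLOWED =
        new HashSet<String>(Arrays.asList("en", "it", "de", "fr"));

    /** Returns true when the detected language code is on the whitelist. */
    public static boolean keep(String lang) {
        if (lang == null) {
            return false; // no language detected: treat as not whitelisted
        }
        return ALLOWED.contains(lang.trim().toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        // Small demonstration of the decision for a few sample codes.
        for (String lang : new String[] {"en", "IT", "es", null}) {
            System.out.println(lang + " -> " + keep(lang));
        }
    }

    // Inside the filter itself, the call would look roughly like this
    // (assumed Nutch 1.x signature; verify it against your version):
    //
    //   public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
    //                               CrawlDatum datum, Inlinks inlinks) {
    //       Object lang = doc.getFieldValue("lang"); // field name assumed
    //       return LanguageWhitelist.keep(lang == null ? null : lang.toString())
    //           ? doc    // keep: the document goes on to Solr with its field
    //           : null;  // assumed: returning null drops the document
    //   }
}
```

Note that this sketch also drops pages for which no language was detected; return true for null instead if you would rather keep those. Documents that pass still carry the language field, so on the Solr side a filter query on that field (for example fq=lang:it, assuming the field is named lang and is indexed in your schema) should cover the "only pages in XX language that contain Y" case.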
> Like I said, create an indexing filter. The example on the wiki is very
> simple and clear. Just check the field created by the langid plugin and
> decide what to do with it. The field, when the plugin is present, is
> automatically added to the NutchDocument objects, which are passed through
> the indexing filters and later transformed into SolrDocument objects.
>
> > Hello,
> >
> > After a lot of searching, I was unable to find up-to-date (Nutch 1.4)
> > info about how to use the language ID for filtering. Some of the info is
> > very outdated and doesn't work at all with Nutch 1.4.
> >
> > Basically we're testing Nutch for crawling 10M+ web pages, but we want
> > to deal only with pages that are in EN, IT, DE or FR, and skip the
> > others. In addition, when indexing with Solr, we need to store the
> > language ID field so that we can use it as a query filter (e.g. "Only
> > pages in XX language that contain Y").
> >
> > We're new to Nutch. This seems to be a very common pattern but, as
> > stated, I was unable to find any updated documentation. I think the
> > solution may be useful to many.
> >
> > Please point me to a related resource or give me a hint for solving this
> > task. I'm very happy to add the solution to the wiki if that is possible.
> >
> > Thanks,
> > Alessio
> >
> > -------- Original Message --------
> > Subject: Re: Filter by content language ID
> > From: Markus Jelsma <[email protected]>
> > Date: Fri, December 02, 2011 8:49 am
> > To: [email protected]
> >
> > On Friday 02 December 2011 16:23:42 [email protected] wrote:
> > > Hello everyone,
> > >
> > > We have a set of URLs to crawl, but we're interested in crawling only
> > > pages whose language is in our whitelist (e.g. English, Italian,
> > > French), and rejecting all the others.
> > >
> > > I don't know if Nutch has built-in support for this; the
> > > language-detector seems to be dedicated to another task.
> >
> > You can use the field value added by the language detector to reject the
> > page from being indexed. Create a custom indexing filter, skipping all
> > documents you don't need.
> >
> > > Which is the best way to achieve this with Nutch? Are there some
> > > configuration options, or is it necessary to write a new plug-in? (One
> > > that, for example, downloads the page, detects the content language,
> > > and proceeds if the language is OK; otherwise the page is skipped.)
> > >
> > > Thanks,
> > > Alessio

