Re: Filter by content language ID

Markus Jelsma Tue, 13 Dec 2011 02:17:39 -0800

The indexer part of this plugin will help you on your way.

http://wiki.apache.org/nutch/WritingPluginExample-1.2
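
For the language check itself, here is a minimal sketch of the keep/skip decision, kept free of Nutch classes so it can be compiled on its own. LanguageWhitelist and its methods are hypothetical names, not part of Nutch. The field name "lang" and the idea that returning null from the filter skips the document are assumptions to verify against your Nutch version.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/**
 * Hypothetical helper (not part of Nutch): decides whether a detected
 * language code is on a whitelist.
 *
 * Inside an indexing filter the check would be used roughly like this
 * (assumed field name and behaviour, please verify for your version):
 *
 *   Object lang = doc.getFieldValue("lang");
 *   return whitelist.accepts(lang == null ? null : lang.toString()) ? doc : null;
 */
class LanguageWhitelist {

    private final Set<String> allowed = new HashSet<String>();

    LanguageWhitelist(String... codes) {
        for (String code : codes) {
            String normalized = normalize(code);
            if (!normalized.isEmpty()) {
                allowed.add(normalized);
            }
        }
    }

    /** Trims, lower-cases, and keeps only the primary subtag ("en-US" becomes "en"). */
    static String normalize(String code) {
        if (code == null) {
            return "";
        }
        String c = code.trim().toLowerCase(Locale.ROOT);
        return c.split("[-_]", 2)[0];
    }

    /** True if the (normalized) language code is on the whitelist. */
    boolean accepts(String lang) {
        String normalized = normalize(lang);
        return !normalized.isEmpty() && allowed.contains(normalized);
    }

    public static void main(String[] args) {
        LanguageWhitelist whitelist = new LanguageWhitelist("en", "it", "de", "fr");
        for (String lang : new String[] {"en", "DE", "fr-CA", "es", null}) {
            String decision = whitelist.accepts(lang) ? "index" : "skip";
            System.out.println(lang + " -> " + decision);
        }
    }
}
```

Documents with a missing or empty language value are skipped here; whether that is the right default depends on how often detection fails on your pages.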


> Like I said, create an indexing filter. The example on the wiki is very
> simple and clear. Just check the field created by the langid plugin and
> decide what to do with it. When the plugin is present, the field is
> automatically added to the NutchDocument objects, which are passed through
> the indexing filters and later transformed into SolrDocument objects.
> 
> > Hello,
> > 
> >    After a lot of searching, I was unable to find updated (Nutch 1.4) info
> > about how to use the language ID for filtering. Some info is very outdated
> > and doesn't work at all with Nutch 1.4.
> > 
> >    Basically, we're testing Nutch for crawling 10M+ web pages, but we want
> > to deal only with pages that are in EN, IT, DE, or FR, and skip the
> > others. In addition, when indexing with Solr, we need to store the
> > language ID field so we can use it as a query filter (e.g.: "Only
> > pages in XX language that contain Y").
> > 
> >    We're new to Nutch, and this seems to be a very common pattern, but as
> > stated, I was unable to find any updated documentation. I think the
> > solution may be useful to many.
> > 
> >    Please point me to a related resource or a hint to solve this task. I'm
> > very happy to add this solution to the Wiki if possible.
> > 
> > Thanks,
> > Alessio
> > 
> >  -------- Original Message --------
> >  Subject: Re: Filter by content language ID
> >  From: Markus Jelsma <[email protected]>
> >  Date: Fri, December 02, 2011 8:49 am
> >  To: [email protected]
> >  
> >  On Friday 02 December 2011 16:23:42 [email protected] wrote:
> >  > Hello everyone,
> >  > 
> >  > 
> >  > We've a set of URLs to crawl, but we're interested in crawling only
> >  > pages whose language is in our white list (e.g.: English, Italian,
> >  > French), and reject all the others.
> >  > 
> >  > 
> >  > I don't know if Nutch has built-in support for this; the
> >  > language-detector seems to be dedicated only to another task.
> >  
> >  You can use the field value added by the language detector to reject the
> >  page from being indexed. Create a custom indexing filter, skipping all
> >  documents you don't need.
> > 
> >  > Which is the best way to achieve this with Nutch? Some configuration
> >  > options, or is it necessary to write a new plug-in? (One that, for
> >  > example, downloads the page, detects the content language, and, if the
> >  > language is OK, proceeds; otherwise the page is skipped.)
> >  > 
> >  > 
> >  > Thanks,
> >  > Alessio
