RE: Filter by content language ID

contacts Mon, 02 Jan 2012 04:14:48 -0800

Hello everyone,

   I've just finished testing my plug-in 'language-id-filter' that
is used to filter the indexing of documents by language id.


   I've two questions:

   1) The plug-in works like a charm as an indexing filter. BUT I guess
      that, even after indexing, the content of the filtered documents
      remains in the crawler segments, wasting a lot of disk space.

      How can I optimize this behaviour? I mean: I have to crawl and
      index only documents in languages X, Y and Z. Of course, I don't
      know the language of a document in advance, so I have to fetch it,
      check the language and, if it is OK, store the content (and index
      it later). Otherwise, I want to store only minimal information
      about the skipped document, or none at all. I'm new to Nutch, so I
      don't know how to do that.

   2) I would like to make the language-id-filter plug-in available in
      the standard Nutch distribution. Is that possible?
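
For reference, the per-document decision described in question 1 can be
sketched as below. This is only a minimal, Nutch-independent illustration:
the class name LanguageWhitelist and the method accepts() are hypothetical,
and the wiring into an actual Nutch indexing filter (reading the language
field from the document and dropping rejected documents) is assumed and
not shown.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/**
 * Hypothetical sketch of a language whitelist check.
 * It has no Nutch dependencies; a real plug-in would call accepts()
 * from inside its indexing filter with the detected language code.
 */
public class LanguageWhitelist {

    private final Set<String> allowed = new HashSet<String>();

    /** Builds the whitelist from language codes such as "en" or "it". */
    public LanguageWhitelist(String... codes) {
        for (String code : codes) {
            allowed.add(normalize(code));
        }
    }

    /** Lower-cases the code and keeps only the part before a '-' (en-US -> en). */
    private static String normalize(String code) {
        String c = code.trim().toLowerCase(Locale.ROOT);
        int dash = c.indexOf('-');
        return dash > 0 ? c.substring(0, dash) : c;
    }

    /** Returns true if the detected language is whitelisted; null is rejected. */
    public boolean accepts(String detectedLang) {
        if (detectedLang == null) {
            return false;
        }
        return allowed.contains(normalize(detectedLang));
    }

    public static void main(String[] args) {
        LanguageWhitelist whitelist = new LanguageWhitelist("en", "it", "de", "fr");
        String[] samples = { "EN", "en-US", "es" };
        for (String lang : samples) {
            System.out.println(lang + " accepted: " + whitelist.accepts(lang));
        }
    }
}
```

Documents without a detected language are rejected in this sketch; whether
that is the right default depends on how reliable the language detection
is for the crawled pages.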

Best Regards,
Alessio


-------- Original Message --------
Subject: Re: Filter by content language ID
From: Markus Jelsma <[email protected]>
Date: Tue, December 13, 2011 3:15 am
To: [email protected]

The indexer part of this plugin will help you on your way.

http://wiki.apache.org/nutch/WritingPluginExample-1.2

> Like I said, create an indexing filter. The example on the wiki is very
> simple and clear. Just check the field created by the langid plugin and
> decide what to do with it. The field, when the plugin is present, is
> automatically added to the NutchDocument objects, which are passed through
> the indexing filters and later transformed into SolrDocument objects.
> 
> > Hello,
> > 
> > After a lot of searching, I was unable to find updated (Nutch 1.4) info
> > about how to use the language id for filtering. Some of the info is very
> > outdated and doesn't work at all with Nutch 1.4.
> > 
> > Basically we're testing Nutch for crawling 10M+ web pages, but we want
> > to deal only with pages that are in EN, IT, DE or FR, and skip the
> > others. In addition, when indexing with Solr, we need to store the
> > language id field, to use it as a query filter (e.g.: "Only
> > pages in XX language that contain Y").
> > 
> > We're new to Nutch, but this seems to be a very common pattern; as
> > stated, I was unable to find any updated documentation. I think the
> > solution may be useful to many.
> > 
> > Please point me to a related resource or give me a hint to solve this
> > task. I'm very happy to add this solution to the Wiki if that is possible.
> > 
> > Thanks,
> > Alessio
> > 
> > -------- Original Message --------
> > Subject: Re: Filter by content language ID
> > From: Markus Jelsma <[email protected]>
> > Date: Fri, December 02, 2011 8:49 am
> > To: [email protected]
> > 
> > On Friday 02 December 2011 16:23:42 [email protected] wrote:
> > > Hello everyone,
> > > 
> > > 
> > > We've a set of URLs to crawl, but we're interested in crawling only
> > > pages whose language is in our white list (e.g.: English, Italian,
> > > French), and rejecting all the others.
> > > 
> > > 
> > > I don't know if Nutch has built-in support for this;
> > > language-detector seems to be dedicated only to another task.
> > 
> > You can use the field value added by the language detector to reject the
> > page from being indexed. Create a custom indexing filter, skipping all
> > documents you don't need.
> > 
> > > Which is the best way to achieve this with Nutch? Some configuration
> > > options, or is it necessary to write a new plug-in (one that, for
> > > example, downloads the page, detects the content language and, if the
> > > language is OK, proceeds; otherwise the page is skipped)?
> > > 
> > > 
> > > Thanks,
> > > Alessio
