RE: Filter by content language ID

contacts Tue, 03 Jan 2012 10:38:08 -0800

Hello,

   I spoke too soon. The filter works when I use our beta API for
language detection (used for quick testing), but I want to use the
default Nutch infrastructure, and therefore the default
language-identifier plug-in.
   
   Basically, using (in the filter):


      String langID = (String) doc.getFieldValue("lang");

   I always get 'null' back, even though the field is correctly added
to the index in Solr. Adding:

            <import plugin="language-identifier"/>      

   in the plugin.xml manifest didn't help.

   It seems that my plug-in is unable to retrieve the value of the
"lang" field added by the 'language-identifier' plug-in. Or maybe it is
executed before the language-identifier plug-in.
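
   The ordering hypothesis can be illustrated with a minimal sketch in
plain Java. It is only a model, not Nutch code: the class
FilterOrderSketch, the ADD_LANG and recordLang names, and the use of a
Map as the document are all hypothetical stand-ins. It shows that a
filter running before the step that adds "lang" sees null, while the
final document still ends up with the field in both orderings.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

// Plain-Java model of a chain of indexing filters.
// Assumption: none of these names are Nutch classes; a Map stands in for the document.
final class FilterOrderSketch {

    // Hypothetical stand-in for the language-identifier step: it adds the "lang" field.
    static final UnaryOperator<Map<String, Object>> ADD_LANG = doc -> {
        doc.put("lang", "en");
        return doc;
    };

    // Hypothetical stand-in for the custom filter: it records the "lang" value it sees.
    static UnaryOperator<Map<String, Object>> recordLang(List<Object> seen) {
        return doc -> {
            seen.add(doc.get("lang"));
            return doc;
        };
    }

    // Applies the filters one after another to a fresh, empty document.
    static Map<String, Object> run(List<UnaryOperator<Map<String, Object>>> chain) {
        Map<String, Object> doc = new HashMap<>();
        for (UnaryOperator<Map<String, Object>> filter : chain) {
            doc = filter.apply(doc);
        }
        return doc;
    }

    public static void main(String[] args) {
        List<Object> seen = new ArrayList<>();
        run(List.of(recordLang(seen), ADD_LANG)); // custom filter runs first: sees null
        run(List.of(ADD_LANG, recordLang(seen))); // custom filter runs last: sees "en"
        System.out.println(seen);
    }
}
```

   If ordering is the cause, this would match the symptom described
here: 'null' inside the filter, but a populated field in the index.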

   How can I read and use the 'lang' value inside my plug-in?

   Just a note: the language-identifier plug-in makes a lot of errors.
I've only done a small test, but there were a lot of errors. I've read
about:
   
      
http://shuyo.wordpress.com/2011/01/13/language-detection-plugin-for-apache-nutch/

   It seems to have 99% accuracy; is its use recommended?

Thanks,
Alessio



On Monday 02 January 2012 13:14:15 [email protected] wrote:
> Hello everyone,
> 
> I've just finished testing my plug-in 'language-id-filter' that
> is used to filter the indexing of documents by language id.
> 
> I've two questions:
> 
> 1) The plug-in works like a charm, it is an indexing filter. BUT
> I guess that even after indexing the content of filtered documents
> remains in the crawler segments, wasting a lot of disk space.

Not possible. Deleting the whole segment is the only way to go.
Rebuilding the segment is a waste of resources.

> 
> How can I optimize this behaviour? I mean: I have to crawl and index
> only documents in languages X, Y and Z. Of course, I don't know
> the language of a document, so I have to fetch it, check the
> language, and if it is OK, store the content (and later index it);
> otherwise I only want to store minimum information about skipped
> documents, or none at all. I'm new to Nutch so I don't know about that.

One possibility is to enable parsing during fetch time and use a parse
filter. When the document comes back you can get rid of the document by
not storing it. It won't end up in a segment.
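
The keep-or-skip decision itself can be sketched in plain Java. This is
a minimal sketch under stated assumptions, not Nutch code: the class
LanguageWhitelistSketch and the method keep are hypothetical names, the
language list is the EN, IT, DE, FR set mentioned in this thread, and
rejecting documents with no detected language is an assumed policy. How
a filter tells Nutch to drop a document depends on the extension point
and version, and is not shown.

```java
import java.util.Locale;
import java.util.Set;

// Assumption: a standalone helper; it does not implement any Nutch interface.
final class LanguageWhitelistSketch {

    // Languages to keep, taken from the thread (EN, IT, DE, FR).
    static final Set<String> ALLOWED = Set.of("en", "it", "de", "fr");

    // Returns true when the detected language code is on the whitelist.
    // Assumed policy: a null or blank code (language not detected) is rejected.
    static boolean keep(String langId) {
        if (langId == null || langId.trim().isEmpty()) {
            return false;
        }
        return ALLOWED.contains(langId.trim().toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        for (String lang : new String[] {"en", "IT", "ja", null}) {
            System.out.println(lang + " -> " + (keep(lang) ? "keep" : "skip"));
        }
    }
}
```

A filter would call keep() with whatever language value it can read for
the document and skip the document when the result is false.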

> 
> 2) I would like to make the language-id-filter plug-in available in
> the standard Nutch distribution. Is that possible?

Open a ticket at our Nutch Jira. 
https://issues.apache.org/jira/browse/NUTCH

> 
> Best Regards,
> Alessio
> 
> 
> -------- Original Message --------
> Subject: Re: Filter by content language ID
> From: Markus Jelsma <[email protected]>
> Date: Tue, December 13, 2011 3:15 am
> To: [email protected]
> 
> The indexer part of this plugin will help you on your way.
> 
> http://wiki.apache.org/nutch/WritingPluginExample-1.2
> 
> > Like I said, create an indexing filter. The example on the wiki is very
> > simple and clear. Just check the field created by the langid plugin and
> > decide what to do with it. The field, when the plugin is present, is
> > automatically added to the NutchDocument objects, which are passed
> > through indexing filters and later transformed into SolrDocument objects.
> > 
> > > Hello,
> > > 
> > > After a lot of searching, I was unable to find up-to-date (Nutch 1.4)
> > > info about how to use the language id for filtering. Some info is very
> > > outdated and doesn't work at all with Nutch 1.4.
> > > 
> > > Basically we're testing Nutch for crawling 10M+ web pages, but we want
> > > to deal only with pages that are in EN, IT, DE or FR, and skip the
> > > others. In addition, when indexing with Solr, we need to store the
> > > field regarding the language id, to use it as a query filter (e.g.:
> > > "Only pages in XX language that contain Y").
> > > 
> > > We're new to Nutch. This seems to be a very common pattern, but as
> > > stated, I was unable to find any up-to-date documentation. I think the
> > > solution may be useful to many.
> > > 
> > > Please point me to a related resource or a hint to solve this task.
> > > I'm very happy to add this solution to the Wiki if that is possible.
> > > 
> > > Thanks,
> > > Alessio
> > > 
> > > -------- Original Message --------
> > > Subject: Re: Filter by content language ID
> > > From: Markus Jelsma <[email protected]>
> > > Date: Fri, December 02, 2011 8:49 am
> > > To: [email protected]
> > > 
> > > On Friday 02 December 2011 16:23:42 [email protected] wrote:
> > > > Hello everyone,
> > > > 
> > > > 
> > > > We have a set of URLs to crawl, but we're interested in crawling only
> > > > pages whose language is in our white list (e.g.: English, Italian,
> > > > French), and in rejecting all the others.
> > > > 
> > > > 
> > > > I don't know if Nutch has built-in support for this;
> > > > language-detector seems to be dedicated only to a different task.
> > > 
> > > You can use the field value added by the language detector to reject
> > > the page from being indexed. Create a custom indexing filter, skipping
> > > all documents you don't need.
> > > 
> > > > Which is the best way to achieve this with Nutch? Are there some
> > > > configuration options, or is it necessary to write a new plug-in?
> > > > (One that, for example, downloads the page, detects the content
> > > > language, and proceeds if the language is OK; otherwise the page
> > > > is skipped.)
> > > > 
> > > > 
> > > > Thanks,
> > > > Alessio

-- 
Markus Jelsma - CTO - Openindex
