Re: Filter by content language ID

Sebastian Nagel Tue, 03 Jan 2012 13:06:41 -0800

Hello Alessio,

> Basically, using (in the filter):
>
>        String langID = (String) doc.getFieldValue("lang");
>
> I always have a 'null' returned, while the field is correctly added
> to the index in Solr.

It looks like the language-identifier indexing filter is applied after your
plug-in. Try setting the order so that
org.apache.nutch.analysis.lang.LanguageIndexingFilter
is called before your indexing filter class; see:

<property>
  <name>indexingfilter.order</name>
  <value></value>
  <description>The order by which index filters are applied.
  If empty, all available index filters (as dictated by properties
  plugin-includes and plugin-excludes above) are loaded and applied in system
  defined order. If not empty, only named filters are loaded and applied
  in given order. For example, if this property has value:
  org.apache.nutch.indexer.basic.BasicIndexingFilter org.apache.nutch.indexer.more.MoreIndexingFilter
  then BasicIndexingFilter is applied first, and MoreIndexingFilter second.

  Filter ordering might have impact on result if one filter depends on output of
  another filter.
  </description>
</property>
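
To illustrate why the order matters, here is a minimal toy model in Java. It
is only a sketch: the "document" is a plain map and the "filters" are lambdas,
so none of these names are Nutch classes, and the hard-coded "en" value is an
assumption made for the example. It shows that a filter which runs before the
language filter reads null from "lang", even though the field is present in
the finished document in both cases.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Toy model only: hypothetical stand-ins, not the Nutch plug-in API.
final class FilterOrderSketch {

    // Stand-in for the language identifier: adds the "lang" field.
    static final Consumer<Map<String, Object>> LANGUAGE_FILTER =
            doc -> doc.put("lang", "en");

    // Stand-in for a custom filter: records what it reads from "lang".
    static Consumer<Map<String, Object>> recordingFilter(List<Object> seen) {
        return doc -> seen.add(doc.get("lang"));
    }

    // Applies the filters strictly in list order and returns the document.
    static Map<String, Object> run(List<Consumer<Map<String, Object>>> chain) {
        Map<String, Object> doc = new HashMap<>();
        for (Consumer<Map<String, Object>> filter : chain) {
            filter.accept(doc);
        }
        return doc;
    }

    public static void main(String[] args) {
        List<Object> seen = new ArrayList<>();
        // Custom filter first: "lang" has not been added yet.
        run(List.of(recordingFilter(seen), LANGUAGE_FILTER));
        // Language filter first: the custom filter can read the value.
        run(List.of(LANGUAGE_FILTER, recordingFilter(seen)));
        System.out.println(seen); // prints [null, en]
    }
}
```

In the real setup the equivalent of the second call would be to list
org.apache.nutch.analysis.lang.LanguageIndexingFilter before your own filter
class in indexingfilter.order.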

> http://shuyo.wordpress.com/2011/01/13/language-detection-plugin-for-apache-nutch/
> It seems to have a 99% accuracy, is its usage recommended?
I haven't tested it so I can't give any recommendations.
The language detection of Nutch 1.4 has moved to Tika:
 http://tika.apache.org/1.0/detection.html#Language_Detection
It works but is not perfect, see the discussion in
 https://issues.apache.org/jira/browse/TIKA-369
Finally, Nutch's language-identifier plugin by default prefers
the language given in the HTTP or the HTML header; see the property
lang.extraction.policy and
http://lucene.472066.n3.nabble.com/Garbage-with-languageidentifier-td3176733.html

Sebastian

On 01/03/2012 07:37 PM, [email protected] wrote:


Hello,

    I spoke too fast. The filter works when I use our beta API for language
detection (used for fast testing), but I want to use default Nutch
infrastructure, and so, the default language identifier plug-in.

    Basically, using (in the filter):

       String langID = (String) doc.getFieldValue("lang");

    I always have a 'null' returned, while the field is correctly added to
the index in Solr. Adding:

             <import plugin="language-identifier"/>     

    in the plugin.xml manifest didn't help.

    It seems that my plug-in is unable to retrieve the value of the "lang"
field added by the 'language-identifier' plugin. Or maybe it is executed
before the language-identifier plug-in.

    How can I read and use the 'lang' value inside my plug-in ?

    Just a note: the language-identifier plug-in makes a lot of errors. I've
only done a small test, but there were many errors. I've read about:

http://shuyo.wordpress.com/2011/01/13/language-detection-plugin-for-apache-nutch/

    It seems to have a 99% accuracy, is its usage recommended?

Thanks,
Alessio



On Monday 02 January 2012 13:14:15 [email protected] wrote:

Hello everyone,

I've just finished testing my plug-in 'language-id-filter' that
is used to filter the indexing of documents by language id.

I've two questions:

1) The plug-in works like a charm, it is an indexing filter. BUT
I guess that even after indexing the content of filtered documents
remains in the crawler segments, wasting a lot of disk space.


Not possible. Deleting the whole segment is the only way to go. Rebuilding
the segment is a waste of resources.


How can I optimize this behaviour? I mean: I have to crawl and index
only documents in languages X, Y and Z. Of course, I don't know
the language of a document in advance, so I have to fetch it, check the
language, and if it is OK, store the content (and later index it);
otherwise I only want to store minimal information about skipped
documents, or none at all. I'm new to Nutch so I don't know about that.


One possibility is to enable parsing during fetch time and use a parse
filter. When the document comes back you can get rid of the document by not
storing it. It won't end up in a segment.


2) I would like to make the language-id-filter plug-in available in
the standard Nutch distribution. Is it possible ?


Open a ticket at our Nutch Jira.
https://issues.apache.org/jira/browse/NUTCH


Best Regards,
Alessio


-------- Original Message --------
Subject: Re: Filter by content language ID
From: Markus Jelsma<[email protected]>
Date: Tue, December 13, 2011 3:15 am
To: [email protected]

The indexer part of this plugin will help you on your way.

http://wiki.apache.org/nutch/WritingPluginExample-1.2

Like I said, create an indexing filter. The example on the wiki is very
simple and clear. Just check the field created by the langid plugin and
decide what to do with it. The field, when the plugin is present, is
automatically added to the NutchDocument objects which are passed through
indexing filters and later on transformed into SolrDocument objects.
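
The "check the field and decide" step could look roughly like the sketch
below. It is deliberately independent of the Nutch API: the class and method
names are hypothetical, the whitelist uses the EN, IT, DE, FR set mentioned
in this thread, and it assumes the field holds a plain two-letter code such as
"en" (values like "en-US" are not handled). Rejecting a missing value is a
design choice made here, not something Nutch prescribes.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Nutch: decides whether a document
// should be kept, based on the value read from the "lang" field.
final class LanguageWhitelistSketch {

    // Assumption: two-letter lower-case codes for the wanted languages.
    static final Set<String> ALLOWED = Set.of("en", "it", "de", "fr");

    // Takes Object because a field value lookup may return null
    // or a non-String value.
    static boolean keep(Object langField) {
        if (!(langField instanceof String)) {
            return false; // missing or unexpected value: skip the document
        }
        String lang = ((String) langField).trim().toLowerCase(Locale.ROOT);
        return ALLOWED.contains(lang);
    }

    public static void main(String[] args) {
        System.out.println(keep("en"));  // prints true
        System.out.println(keep("ja"));  // prints false
        System.out.println(keep(null));  // prints false
    }
}
```

Inside a real indexing filter the argument would be the result of
doc.getFieldValue("lang"), and the filter would skip the document whenever
keep(...) returns false.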

Hello,

After a lot of searching, I was unable to find updated (Nutch 1.4) info
about how to use the language ID for filtering. Some info is very
outdated and doesn't work at all with Nutch 1.4.

Basically we're testing Nutch for crawling 10M+ web pages, but we want
to deal only with pages that are in EN, IT, DE or FR, and skip the
others. In addition, when indexing with Solr, we need to store the
language ID field, to use it as a query filter (e.g.:
"Only pages in XX language that contain Y").

We're new to Nutch. This seems to be a very common pattern, but as
stated, I was unable to find any updated documentation. I think the
solution may be useful to many.

Please point me to a related resource or a hint to solve this task. I'd be
very happy to add the solution to the Wiki if possible.

Thanks,
Alessio

-------- Original Message --------
Subject: Re: Filter by content language ID
From: Markus Jelsma<[email protected]>
Date: Fri, December 02, 2011 8:49 am
To: [email protected]

On Friday 02 December 2011 16:23:42 [email protected] wrote:

Hello everyone,


We have a set of URLs to crawl, but we're interested in crawling only
pages whose language is in our white list (e.g.: English, Italian, French),
and rejecting all the others.


I don't know if Nutch has built-in support for this; the language-detector
seems to be dedicated to another task.


You can use the field value added by the language detector to reject the
page from being indexed. Create a custom indexing filter, skipping all
documents you don't need.

What is the best way to achieve this with Nutch? Are there configuration
options, or is it necessary to write a new plug-in (one that, for example,
downloads the page, detects the content language, and proceeds if the
language is OK, otherwise skips the page)?


Thanks,
Alessio
