Re: Filter by content language ID

Markus Jelsma Mon, 02 Jan 2012 04:25:41 -0800


On Monday 02 January 2012 13:14:15 [email protected] wrote:
> Hello everyone,
> 
>    I've just finished testing my plug-in 'language-id-filter' that
> is used to filter the indexing of documents by language id.
> 
>    I've two questions:
> 
>    1) The plug-in works like a charm, it is an indexing filter. BUT
>       I guess that even after indexing the content of filtered documents
>       remains in the crawler segments, wasting a lot of disk space.


Not possible. Deleting the whole segment is the only way to go. Rebuilding the 
segment is a waste of resources.

> 
>       How to optimize this behaviour? I mean: I have to crawl and index
>       only documents in languages X, Y and Z. Of course, I don't know
>       the language of a document in advance, so I have to fetch it, check
>       the language, and if it is OK, store the content (and index it
>       later); otherwise I only want to store minimal information about
>       skipped documents, or none at all. I'm new to Nutch so I don't know
>       about that.

One possibility is to enable parsing during fetch time and use a parse filter. 
When the document comes back you can get rid of the document by not storing 
it. It won't end up in a segment.
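
A minimal sketch of the keep-or-skip decision such a parse filter would make,
assuming the detected language arrives as a short code such as "en" or
"en-US". LanguageWhitelist is a hypothetical helper, not a Nutch class, and
the wiring into Nutch's parse filter extension point is deliberately left out:

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper (not part of Nutch): decides whether a document
// should be kept, given the language code a detector reported for it.
class LanguageWhitelist {
    private final Set<String> allowed = new HashSet<>();

    LanguageWhitelist(String... codes) {
        for (String code : codes) {
            allowed.add(normalize(code));
        }
    }

    // Returns true only when a language was detected and it is whitelisted.
    boolean accepts(String detected) {
        return detected != null && allowed.contains(normalize(detected));
    }

    // Lower-cases the code and drops any region suffix ("en-US" -> "en").
    private static String normalize(String code) {
        String c = code.trim().toLowerCase(Locale.ROOT);
        for (int i = 0; i < c.length(); i++) {
            char ch = c.charAt(i);
            if (ch == '-' || ch == '_') {
                return c.substring(0, i);
            }
        }
        return c;
    }
}
```

With new LanguageWhitelist("en", "it", "de", "fr"), a page detected as
"en-US" would be kept and one detected as "es" would be skipped. Documents
with no detected language are rejected here; that is a choice made for this
sketch and you may prefer to keep them.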

> 
>    2) I would like to make the language-id-filter plug-in available in
>       the standard Nutch distribution. Is it possible ?

Open a ticket at our Nutch Jira. 
https://issues.apache.org/jira/browse/NUTCH

> 
> Best Regards,
> Alessio
> 
> 
> -------- Original Message --------
> Subject: Re: Filter by content language ID
> From: Markus Jelsma <[email protected]>
> Date: Tue, December 13, 2011 3:15 am
> To: [email protected]
> 
> The indexer part of this plugin will help you on your way.
> 
> http://wiki.apache.org/nutch/WritingPluginExample-1.2
> 
> > Like I said, create an indexing filter. The example on the wiki is very
> > simple and clear. Just check the field created by the langid plugin and
> > decide what to do with it. The field, when the plugin is present, is
> > automatically added to the NutchDocument objects, which are passed through
> > indexing filters and later transformed into SolrDocument objects.
> > 
> > > Hello,
> > > 
> > > After a lot of searching, I was unable to find up-to-date (Nutch 1.4)
> > > info about how to use the language ID for filtering. Some of the info is
> > > very outdated and doesn't work at all with Nutch 1.4.
> > > 
> > > Basically we're testing Nutch for crawling 10M+ web pages, but we want
> > > to deal only with pages that are in EN, IT, DE or FR, and skip the
> > > others. In addition, when indexing with Solr, we need to store the
> > > language ID field, to use it as a query filter (e.g.:
> > > "Only pages in XX language that contain Y").
> > > 
> > > We're new to Nutch, and this seems to be a very common pattern, but as
> > > stated, I was unable to find any up-to-date documentation. I think the
> > > solution may be useful to many.
> > > 
> > > Please point me to a related resource or a hint to solve this task. I'd
> > > be very happy to add this solution to the Wiki if possible.
> > > 
> > > Thanks,
> > > Alessio
> > > 
> > > -------- Original Message --------
> > > Subject: Re: Filter by content language ID
> > > From: Markus Jelsma <[email protected]>
> > > Date: Fri, December 02, 2011 8:49 am
> > > To: [email protected]
> > > 
> > > On Friday 02 December 2011 16:23:42 [email protected] wrote:
> > > > Hello everyone,
> > > > 
> > > > 
> > > > We have a set of URLs to crawl, but we're interested in crawling only
> > > > pages whose language is in our whitelist (e.g.: English, Italian,
> > > > French), and rejecting all the others.
> > > > 
> > > > 
> > > > I don't know if Nutch has built-in support for this;
> > > > language-detector seems to be dedicated only to another task.
> > > 
> > > You can use the field value added by the language detector to reject
> > > the page from being indexed. Create a custom indexing filter, skipping
> > > all documents you don't need.
> > > 
> > > > Which is the best way to achieve this with Nutch? Some configuration
> > > > options, or is it necessary to write a new plug-in? (One that, for
> > > > example, downloads the page, detects the content language, and if the
> > > > language is OK, proceeds; otherwise the page is skipped.)
> > > > 
> > > > 
> > > > Thanks,
> > > > Alessio

-- 
Markus Jelsma - CTO - Openindex
