RE: Filter by content language ID

contacts Tue, 13 Dec 2011 02:11:34 -0800

Hello,

   After a lot of searching, I was unable to find up-to-date (Nutch 1.4)
information about how to use the language ID for filtering. Some of what I
found is very outdated and doesn't work at all with Nutch 1.4.

   Basically, we're testing Nutch for crawling 10M+ web pages, but we want
to deal only with pages that are in English, Italian, German, or French
(EN, IT, DE, FR) and skip the others. In addition, when indexing with Solr,
we need to store the language ID as a field so that we can use it as a
query filter (e.g.: "Only pages in XX language that contain Y").

   We're new to Nutch. This seems to be a very common pattern, but, as
stated, I was unable to find any up-to-date documentation. I think the
solution may be useful to many.

   Please point me to a related resource or give me a hint for solving this
task. I'd be very happy to add the solution to the Wiki if possible.

Thanks,
Alessio

  -------- Original Message --------
 Subject: Re: Filter by content language ID
 From: Markus Jelsma <[email protected]>
 Date: Fri, December 02, 2011 8:49 am
 To: [email protected]

 On Friday 02 December 2011 16:23:42 [email protected] wrote:
 > Hello everyone,
 > 
 > 
 > We have a set of URLs to crawl, but we're interested in crawling only
 > pages whose language is in our whitelist (e.g. English, Italian, French),
 > and rejecting all the others.
 > 
 > 
 > I don't know if Nutch has built-in support for this; the
 > language-detector seems to be dedicated to a different task.
 > 
 You can use the field value added by the language detector to reject the
 page from being indexed. Create a custom indexing filter that skips all
 documents you don't need.

 > 
 > What is the best way to achieve this with Nutch? Are there configuration
 > options, or do we need to write a new plug-in (one that, for example,
 > downloads the page, detects the content language, and proceeds if the
 > language is OK, skipping the page otherwise)?
 > 
 > 
 > Thanks,
 > Alessio

 -- 
 Markus Jelsma - CTO - Openindex
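
A minimal sketch of the whitelist check that a custom indexing filter, as
suggested in the reply above, might rely on. Everything here is an assumption
for illustration: LanguageWhitelist and accept() are hypothetical names, not
part of Nutch, and the allowed codes are simply the four languages mentioned
in this thread. The snippet is plain Java with no Nutch dependency, so it only
shows the decision logic, not the plug-in wiring.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper (not a Nutch class): decides whether a detected
// language code belongs to the set of languages we want to index.
final class LanguageWhitelist {

    // Assumed whitelist, taken from the languages named in this thread.
    private static final Set<String> ALLOWED =
            new HashSet<String>(Arrays.asList("en", "it", "de", "fr"));

    private LanguageWhitelist() {
        // Static utility, not meant to be instantiated.
    }

    // Returns true when a document with this language code should be kept.
    // A missing (null) or unrecognized language is rejected.
    static boolean accept(String lang) {
        if (lang == null) {
            return false;
        }
        String code = lang.trim().toLowerCase(Locale.ROOT);
        // Reduce region-qualified codes such as "en-US" or "fr_CA"
        // to their two-letter prefix before the lookup.
        int sep = code.indexOf('-');
        if (sep < 0) {
            sep = code.indexOf('_');
        }
        if (sep > 0) {
            code = code.substring(0, sep);
        }
        return ALLOWED.contains(code);
    }
}
```

Inside a custom indexing filter, the idea would be to read the language value
that the language detector attached to the document, pass it to accept(), and
skip the document when the result is false. Which field carries that value,
and how a filter signals "skip this document", should be checked against the
Nutch 1.4 sources; neither is confirmed in this thread.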
