On Monday 02 January 2012 13:14:15 [email protected] wrote:
> Hello everyone,
>
> I've just finished testing my plug-in 'language-id-filter' that
> is used to filter the indexing of documents by language id.
>
> I've two questions:
>
> 1) The plug-in works like a charm, it is an indexing filter. BUT
> I guess that even after indexing the content of filtered documents
> remains in the crawler segments, wasting a lot of disk space.

Not possible. Deleting the whole segment is the only way to go. Rebuilding the
segment is a waste of resources.

> How to optimize this behaviour? I mean: I have to crawl and index
> only documents in X, Y and Z languages. Of course, I don't know
> the language of a document, so I have to fetch it, check the
> language, and if it is ok, store the content (and later index it),
> otherwise I only want to store minimum information about skipped
> documents, or none at all. I'm new to Nutch so I don't know about that.

One possibility is to enable parsing during fetch time and use a parse filter.
When the document comes back you can get rid of it by not storing it. It won't
end up in a segment.

> 2) I would like to make the language-id-filter plug-in available in
> the standard Nutch distribution. Is it possible?

Open a ticket at our Nutch Jira.
https://issues.apache.org/jira/browse/NUTCH

> Best Regards,
> Alessio
>
>
> -------- Original Message --------
> Subject: Re: Filter by content language ID
> From: Markus Jelsma <[email protected]>
> Date: Tue, December 13, 2011 3:15 am
> To: [email protected]
>
> The indexer part of this plugin will help you on your way.
>
> http://wiki.apache.org/nutch/WritingPluginExample-1.2
>
> > Like I said, create an indexing filter. The example on the wiki is very
> > simple and clear. Just check the field created by the langid plugin and
> > decide what to do with it. The field, when the plugin is present, is
> > automatically added to the NutchDocument objects, which are passed through
> > the indexing filters and later transformed into SolrDocument objects.
> >
> > > Hello,
> > >
> > > After a lot of searching, I was unable to find updated (Nutch 1.4) info
> > > about how to use language id for filtering. Some info is very
> > > outdated and doesn't work at all with Nutch 1.4.
> > >
> > > Basically we're testing Nutch for crawling 10M+ web pages, but we want
> > > to deal only with pages that are in EN, IT, DE or FR, and skip the
> > > others. In addition, when indexing with Solr, we need to store the
> > > field regarding the language id, to use it as a query filter (e.g.:
> > > "Only pages in XX language that contain Y").
> > >
> > > We're new to Nutch. This seems to be a very common pattern but, as
> > > stated, I was unable to find any updated documentation. I think the
> > > solution may be useful to many.
> > >
> > > Please point me to a related resource or a hint to solve this task. I'm
> > > very happy to add this solution to the Wiki if it is possible.
> > >
> > > Thanks,
> > > Alessio
> > >
> > > -------- Original Message --------
> > > Subject: Re: Filter by content language ID
> > > From: Markus Jelsma <[email protected]>
> > > Date: Fri, December 02, 2011 8:49 am
> > > To: [email protected]
> > >
> > > On Friday 02 December 2011 16:23:42 [email protected] wrote:
> > > > Hello everyone,
> > > >
> > > > We've a set of urls to crawl, but we're interested in crawling only
> > > > pages whose language is in our white list (e.g.: English, Italian,
> > > > French), and rejecting all the others.
> > > >
> > > > I don't know if Nutch has built-in support for this; language-detector
> > > > seems to be dedicated only to another task.
> > >
> > > You can use the field value added by the language detector to reject the
> > > page from being indexed. Create a custom indexing filter, skipping all
> > > documents you don't need.
> > >
> > > > Which is the best way to achieve this with Nutch? Some configuration
> > > > options, or is it necessary to write a new plug-in? (One that, for
> > > > example, downloads the page, detects the content language and, if the
> > > > language is ok, proceeds; otherwise the page is skipped.)
> > > >
> > > > Thanks,
> > > > Alessio

--
Markus Jelsma - CTO - Openindex
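
For readers following the thread, the whitelist decision discussed above (check the language field, keep or skip the document) can be sketched in plain Java. This is only an illustration under stated assumptions, not the plug-in from the thread and not Nutch API: a `Map` stands in for the document, the class and method names are hypothetical, the field name `lang` is assumed to be the one written by the language identifier, and returning `null` mirrors the convention, as I understand it, by which an indexing filter signals that a document should be skipped.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/**
 * Stand-in for the decision logic of a language whitelist filter.
 * A plain Map plays the role of the document; nothing here is Nutch API.
 */
class LanguageWhitelistSketch {

    // Assumption: the language identifier stores its result under "lang".
    static final String LANG_FIELD = "lang";

    // The whitelist mentioned in the thread: EN, IT, DE, FR.
    static final Set<String> ALLOWED = Set.of("en", "it", "de", "fr");

    /** Returns the document unchanged if its language is allowed, otherwise null. */
    static Map<String, String> filter(Map<String, String> doc) {
        String lang = doc.get(LANG_FIELD);
        if (lang == null) {
            return null; // unknown language: skipped here, which is a policy choice
        }
        String normalized = lang.trim().toLowerCase(Locale.ROOT);
        return ALLOWED.contains(normalized) ? doc : null;
    }

    public static void main(String[] args) {
        Map<String, String> italian = new HashMap<>();
        italian.put("url", "http://example.com/it");
        italian.put(LANG_FIELD, "it");

        Map<String, String> spanish = new HashMap<>();
        spanish.put("url", "http://example.com/es");
        spanish.put(LANG_FIELD, "es");

        System.out.println(filter(italian) != null ? "index" : "skip"); // prints "index"
        System.out.println(filter(spanish) != null ? "index" : "skip"); // prints "skip"
    }
}
```

As noted in the reply, a check like this at indexing time only keeps unwanted pages out of the index; the fetched content still sits in the segment. Dropping it earlier would mean applying the same decision in a parse filter with parsing enabled during fetching.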

