We are talking about this plug-in, correct?
http://wiki.apache.org/nutch/LanguageIdentifierPlugin

-----Original Message-----
From: ilhami Kalkan [mailto:[email protected]]
Sent: Thursday, November 07, 2013 10:29 AM
To: [email protected]
Subject: Re: Language identification

Hi Ralf,

Short answer is no. This plugin runs after the language-identifier
plugin, because the language-identifier plugin sets the language in the
metadata and this plugin reads that value to filter or accept pages by
language during the parse phase.

The language-identifier plugin gets the lang value from the header, or
decides it from the n-grams of the page content. The language-filter
plugin reads the "language.filter.languages" entries, which must be
ISO-639 language codes, and matches them against the metadata lang.
Page languages like en-us were rejected. Thanks for the heads-up. I
added the necessary check to the patch to prevent this case.

On 06-11-2013 23:52, Ralf R. Kotowski wrote:
> Hi,
>
> I have run several passes. I no longer get the bulk of foreign language
> sites I used to, but some others which are supposed to I don't get either.
>
> Does this plug-in work through the HTML header? Because I got one of the ones
> that are not supposed to be there with this header:
>
> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
> "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
> <html xmlns="http://www.w3.org/1999/xhtml" lang="en-us">
>
> -----Original Message-----
> From: ilhami Kalkan [mailto:[email protected]]
> Sent: Wednesday, November 06, 2013 9:08 AM
> To: [email protected]
> Subject: Re: Language identification
>
> Hi Ralf,
>
> language-identifier-agmlab is my test plugin name. I fixed the patch.
>
> NUTCH-1663 <https://issues.apache.org/jira/browse/NUTCH-1663>
>
> On 06-11-2013 00:50, Ralf R. Kotowski wrote:
>> I get the following error in the logs:
>>
>> WARN plugin.PluginRepository - Missing dependency
>> language-identifier-agmlab for plugin language-filter
>>
>> -----Original Message-----
>> From: ilhami Kalkan [mailto:[email protected]]
>> Sent: Tuesday, November 05, 2013 10:36 AM
>> To: [email protected]
>> Subject: Re: Language identification
>>
>> Hi Ralf,
>>
>> I patched the language-filter plugin to filter or accept pages in
>> specified languages during the parse phase.
>>
>> NUTCH-1663 <https://issues.apache.org/jira/browse/NUTCH-1663>
>>
>> On 02-11-2013 22:05, Julien Nioche wrote:
>>> Ralf,
>>>
>>> The parameter http.accept.language tells the servers you are hitting that
>>> they should provide you the content in the languages you specified, but that
>>> does not give you any guarantees nor allow you to filter the content. Look
>>> at the languageidentifier plugin as a starting point, then you could add a
>>> custom mapreduce job to remove the pages which are not in the languages of
>>> interest.
>>>
>>> HTH
>>>
>>> Julien
>>>
>>> On 2 November 2013 17:15, Ralf R. Kotowski <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> What is the correct process to only store documents in a desired language?
>>>>
>>>> I'm currently doing this:
>>>>
>>>> <property>
>>>>   <name>http.accept.language</name>
>>>>   <value>ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3</value>
>>>>   <description>Value of the "Accept-Language" request header field.
>>>>   This allows selecting non-English language as default one to retrieve.
>>>>   It is a useful setting for search engines build for certain national group.
>>>>   </description>
>>>> </property>
>>>>
>>>> Using a seed.txt with URLs I know are in the language I want, but as the
>>>> crawl grows it seems I'm starting to get more and more docs in other
>>>> languages.
>>>>
>>>> Thnx in advance
>>>>
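
For illustration, the en-us case discussed in this thread (a page tagged
lang="en-us" being rejected by a filter configured with plain ISO-639
codes) could be handled along the lines below. This is only a sketch and
is not taken from the NUTCH-1663 patch: the class name LanguageTagFilter
and its methods are hypothetical, and it uses no Nutch APIs. It assumes
the accepted codes arrive as a comma-separated string, as a
"language.filter.languages" value might, and that reducing a tag to its
primary subtag is an acceptable way to compare.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Nutch or of the NUTCH-1663 patch.
public class LanguageTagFilter {
    private final Set<String> accepted = new HashSet<>();

    // commaSeparatedCodes: e.g. "en, ja" (assumed format of the config value).
    public LanguageTagFilter(String commaSeparatedCodes) {
        for (String code : commaSeparatedCodes.split(",")) {
            String trimmed = code.trim().toLowerCase(Locale.ROOT);
            if (!trimmed.isEmpty()) {
                accepted.add(trimmed);
            }
        }
    }

    // Reduce a tag such as "en-us" or "EN_GB" to its primary subtag "en".
    static String primarySubtag(String tag) {
        if (tag == null) {
            return "";
        }
        String t = tag.trim().toLowerCase(Locale.ROOT);
        for (int i = 0; i < t.length(); i++) {
            char c = t.charAt(i);
            if (c == '-' || c == '_') {
                return t.substring(0, i);
            }
        }
        return t;
    }

    // True when the page's language, reduced to its primary subtag,
    // is one of the accepted codes.
    public boolean accepts(String metadataLang) {
        return accepted.contains(primarySubtag(metadataLang));
    }

    public static void main(String[] args) {
        LanguageTagFilter filter = new LanguageTagFilter("en, ja");
        System.out.println(filter.accepts("en-us")); // true
        System.out.println(filter.accepts("de"));    // false
    }
}
```

A page with a missing language value is rejected here; whether that is
the right default depends on how strict the crawl should be.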

