I get the following error in the logs:

WARN plugin.PluginRepository - Missing dependency language-identifier-agmlab for plugin language-filter

-----Original Message-----
From: ilhami Kalkan [mailto:[email protected]]
Sent: Tuesday, November 05, 2013 10:36 AM
To: [email protected]
Subject: Re: Language identification

Hi Ralf,

I patched the language-filter plugin to filter or accept pages in the specified languages during the parse phase. See NUTCH-1663 <https://issues.apache.org/jira/browse/NUTCH-1663>

On 02-11-2013 22:05, Julien Nioche wrote:
> Ralf,
>
> The parameter http.accept.language tells the servers you are hitting that
> they should provide you the content in the languages you specified, but that
> does not give you any guarantees, nor does it allow you to filter the content.
> Look at the languageidentifier plugin as a starting point, then you could add a
> custom mapreduce job to remove the pages which are not in the languages of
> interest.
>
> HTH
>
> Julien
>
> On 2 November 2013 17:15, Ralf R. Kotowski <[email protected]> wrote:
>
>> Hi,
>>
>> What is the correct process to only store documents in a desired language?
>>
>> I'm currently doing this:
>>
>> <property>
>>   <name>http.accept.language</name>
>>   <value>ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3</value>
>>   <description>Value of the "Accept-Language" request header field.
>>   This allows selecting non-English language as default one to retrieve.
>>   It is a useful setting for search engines build for certain national group.
>>   </description>
>> </property>
>>
>> I'm using a seed.txt with URLs I know are in the language I want, but as the
>> crawl grows it seems I'm starting to get more and more docs in other
>> languages.
>>
>> Thanks in advance

