Hi Ralf,
language-identifier-agmlab is my test plugin name. I fixed the patch.
NUTCH-1663 <https://issues.apache.org/jira/browse/NUTCH-1663>
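For reference, the warning quoted below is what Nutch logs when a
plugin's plugin.xml imports a plugin id that cannot be resolved. A rough
sketch of what a corrected dependency declaration might look like,
assuming language-filter is meant to depend on the stock
language-identifier plugin (the actual patch on NUTCH-1663 may differ):

```xml
<!-- Hypothetical excerpt from the language-filter plugin.xml -->
<requires>
  <import plugin="nutch-extensionpoints"/>
  <!-- Assumption: import the stock plugin id, not the test name language-identifier-agmlab -->
  <import plugin="language-identifier"/>
</requires>
```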
On 06-11-2013 00:50, Ralf R. Kotowski wrote:
I get the following error in the logs:
WARN plugin.PluginRepository - Missing dependency language-identifier-agmlab for plugin language-filter
-----Original Message-----
From: ilhami Kalkan [mailto:[email protected]]
Sent: Tuesday, November 05, 2013 10:36 AM
To: [email protected]
Subject: Re: Language identification
Hi Ralf,
I patched the language-filter plugin to filter out or accept pages in
specified languages during the parse phase.
NUTCH-1663 <https://issues.apache.org/jira/browse/NUTCH-1663>
On 02-11-2013 22:05, Julien Nioche wrote:
Ralf,
The parameter http.accept.language tells the servers you are hitting that
they should provide the content in the languages you specified, but that
gives you no guarantees, nor does it allow you to filter the content.
Look at the languageidentifier plugin as a starting point; you could then
add a custom mapreduce job to remove the pages which are not in the
languages of interest.
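A minimal sketch of that first step, assuming a Nutch 1.x nutch-site.xml:
add language-identifier to plugin.includes so that parsed pages should
carry a detected language code. The other plugin names in the value are
placeholders for whatever your configuration already lists, not a
recommendation.

```xml
<property>
  <name>plugin.includes</name>
  <!-- Assumption: keep your existing plugin list and append language-identifier -->
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|language-identifier</value>
</property>
```

The detected code can then be compared against the languages of interest
in a later filtering step.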
HTH
Julien
On 2 November 2013 17:15, Ralf R. Kotowski <[email protected]> wrote:
Hi,
What is the correct process to only store documents in a desired
language?
I'm currently doing this:
<property>
<name>http.accept.language</name>
<value>ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3</value>
<description>Value of the "Accept-Language" request header field.
This allows selecting non-English language as default one to retrieve.
It is a useful setting for search engines build for certain national
group.
</description>
</property>
I'm using a seed.txt with URLs I know are in the language I want, but as
the crawl grows I seem to be getting more and more docs in other
languages.
Thanks in advance