Re: Language identification

ilhami Kalkan Wed, 06 Nov 2013 00:09:12 -0800

Hi Ralf,

language-identifier-agmlab is my test plugin name. I fixed the patch.


NUTCH-1663 <https://issues.apache.org/jira/browse/NUTCH-1663>

On 06-11-2013 00:50, Ralf R. Kotowski wrote:

I get the following error in the logs:

WARN  plugin.PluginRepository - Missing dependency language-identifier-agmlab for plugin language-filter

-----Original Message-----
From: ilhami Kalkan [mailto:[email protected]]
Sent: Tuesday, November 05, 2013 10:36 AM
To: [email protected]
Subject: Re: Language identification

Hi Ralf,

I patched the language-filter plugin so that it filters out or accepts pages
in the specified languages during the parse phase.

NUTCH-1663 <https://issues.apache.org/jira/browse/NUTCH-1663>


On 02-11-2013 22:05, Julien Nioche wrote:

Ralf,

The parameter http.accept.language tells the servers you are hitting that
they should provide you the content in the languages you specified, but that
does not give you any guarantees, nor does it allow you to filter the content.
Look at the languageidentifier plugin as a starting point, then you could add a
custom mapreduce job to remove the pages which are not in the languages of
interest.

HTH

Julien
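
The filtering step suggested above can be sketched roughly as follows. This is only an illustration in plain Python, not Nutch or Hadoop code: the record layout, the "lang" field name, the sample URLs and the ALLOWED_LANGUAGES set are all assumptions made up for the example. It assumes a language identifier has already attached a language code to each page.

```python
# Hypothetical sketch: keep only pages whose detected language is wanted.
# Nothing here is a Nutch API; records are plain dicts invented for the example.

ALLOWED_LANGUAGES = {"ja", "en"}  # assumption: the languages of interest


def primary_subtag(code):
    """Reduce a code such as 'en-US' or 'ja_JP' to its primary subtag ('en', 'ja')."""
    return code.strip().lower().replace("_", "-").split("-")[0]


def keep_page(record, allowed=ALLOWED_LANGUAGES):
    """Return True if the record's detected language is in the allowed set."""
    lang = record.get("lang")
    if not lang:
        # Assumption: pages with no detected language are dropped.
        return False
    return primary_subtag(lang) in allowed


def filter_pages(records, allowed=ALLOWED_LANGUAGES):
    """Return the records that pass keep_page, preserving their order."""
    return [r for r in records if keep_page(r, allowed)]


# Made-up sample data standing in for crawled pages.
pages = [
    {"url": "http://example.com/a", "lang": "ja"},
    {"url": "http://example.com/b", "lang": "en-US"},
    {"url": "http://example.com/c", "lang": "de"},
    {"url": "http://example.com/d", "lang": None},
]
kept = filter_pages(pages)
```

In a real crawl the same predicate would sit inside whatever job walks the stored pages. Whether to drop or keep pages with no detected language is a policy choice that depends on how reliable the detection is for short documents.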



On 2 November 2013 17:15, Ralf R. Kotowski <[email protected]> wrote:

Hi,



What is the correct process to only store documents in a desired language?



I'm currently doing this:



<property>
<name>http.accept.language</name>
<value>ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3</value>
<description>Value of the "Accept-Language" request header field.
This allows selecting non-English language as default one to retrieve.
It is a useful setting for search engines build for certain national group.
</description>
</property>
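
For reference, the value above is a weighted preference list and not a filter. A small sketch of how such a header value breaks down into language ranges and q weights follows. The function name parse_accept_language is made up for this example, and the parsing is a simplified reading of the header syntax, not the code of any particular server or of Nutch.

```python
def parse_accept_language(value):
    """Split an Accept-Language value into (language_range, q) pairs,
    ordered from most to least preferred."""
    prefs = []
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        parts = [p.strip() for p in item.split(";")]
        lang = parts[0].lower()
        q = 1.0  # a range without an explicit q parameter has weight 1.0
        for param in parts[1:]:
            if param.lower().startswith("q="):
                try:
                    q = float(param[2:])
                except ValueError:
                    q = 0.0  # assumption: treat a malformed weight as not acceptable
        prefs.append((lang, q))
    # sorted() is stable, so ranges with equal weight keep their original order
    return sorted(prefs, key=lambda pair: pair[1], reverse=True)


prefs = parse_accept_language("ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3")
for lang, q in prefs:
    print(lang, q)
```

The trailing "*" entry with q=0.3 means any other language is still acceptable to the client, only less preferred, so a server is free to answer in a language that is not listed.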



I'm using a seed.txt with URLs I know are in the language I want, but as the
crawl grows I seem to be getting more and more docs in other
languages.





Thanks in advance
