RE: Language identification

Ralf R. Kotowski Tue, 05 Nov 2013 14:51:50 -0800

I get the following error in the logs:

WARN  plugin.PluginRepository - Missing dependency language-identifier-agmlab for plugin language-filter


-----Original Message-----
From: ilhami Kalkan [mailto:[email protected]] 
Sent: Tuesday, November 05, 2013 10:36 AM
To: [email protected]
Subject: Re: Language identification

Hi Ralf,

I patched the language-filter plugin so that it can filter out or accept
pages in specified languages during the parse phase.

NUTCH-1663 <https://issues.apache.org/jira/browse/NUTCH-1663>


On 02-11-2013 22:05, Julien Nioche wrote:
> Ralf,
>
> The parameter http.accept.language tells the servers you are hitting that
> they should provide you the content in the languages you specified but that
> does not give you any guarantees nor allows you to filter the content.
> Look at the languageidentifier plugin as a starting point, then you could
> add a custom mapreduce job to remove the pages which are not in the
> languages of interest.
>
> HTH
>
> Julien
>
>
>
> On 2 November 2013 17:15, Ralf R. Kotowski <[email protected]> wrote:
>
>> Hi,
>>
>>
>>
>> What is the correct process to only store documents in a desired language?
>>
>>
>>
>> I'm currently doing this:
>>
>>
>>
>> <property>
>> <name>http.accept.language</name>
>> <value>ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3</value>
>> <description>Value of the "Accept-Language" request header field.
>> This allows selecting non-English language as default one to retrieve.
>> It is a useful setting for search engines build for certain national group.
>> </description>
>> </property>
>>
>>
>>
>> I'm using a seed.txt with URLs I know are in the language I want, but as
>> the crawl grows I seem to be getting more and more docs in other
>> languages.
>>
>>
>>
>>
>>
>> Thnx in advance
>>
>>
>
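
Julien's suggestion in the quoted reply (detect each page's language, then
drop the pages that are not in the languages of interest) can be sketched
roughly as below. This is only an illustration of the filtering step, not
Nutch code and not the NUTCH-1663 patch: the (url, lang) record layout, the
WANTED_LANGUAGES set, and all function names are hypothetical, and the
language code is assumed to have been produced already by some language
identifier.

```python
# Hypothetical sketch: keep only pages whose detected language is wanted.
# The (url, lang) record layout is an assumption, not Nutch's data model.

WANTED_LANGUAGES = {"ja", "en"}  # assumed languages of interest


def primary_subtag(lang_code):
    """Reduce a code such as 'en-US' or 'ja_JP' to its primary subtag."""
    if not lang_code:
        return None
    # Accept both '-' and '_' separators, then keep the first component.
    return lang_code.replace("_", "-").split("-")[0].strip().lower() or None


def keep_page(detected_lang, wanted=WANTED_LANGUAGES):
    """Return True when the detected language is one of the wanted ones.

    Pages with no detected language are dropped; that is a choice made
    for this sketch, not something the thread prescribes.
    """
    return primary_subtag(detected_lang) in wanted


def filter_pages(pages, wanted=WANTED_LANGUAGES):
    """Yield the (url, lang) pairs whose language passes keep_page."""
    for url, lang in pages:
        if keep_page(lang, wanted):
            yield url, lang
```

In a real crawl the same predicate would sit inside whatever job walks the
parsed segments; whether pages with an unknown language should be kept or
dropped depends on how much noise is acceptable.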
