Thank you very much,

I'm testing it right now. So far, when trying with only this URL:
http://www.todalaprensa.com/ as a seed, Nutch only retrieves this page and
nothing else. With a larger seed list it seems to work. I'm currently on the
3rd pass and will let you know how it goes, as it is still running.

-----Original Message-----
From: ilhami Kalkan [mailto:[email protected]] 
Sent: Wednesday, November 06, 2013 9:08 AM
To: [email protected]
Subject: Re: Language identification

Hi Ralf,

language-identifier-agmlab is my test plugin name. I fixed the patch.

NUTCH-1663 <https://issues.apache.org/jira/browse/NUTCH-1663>

On 06-11-2013 00:50, Ralf R. Kotowski wrote:
> I get the following error in the logs:
>
> WARN  plugin.PluginRepository - Missing dependency
> language-identifier-agmlab for plugin language-filter
>
> -----Original Message-----
> From: ilhami Kalkan [mailto:[email protected]]
> Sent: Tuesday, November 05, 2013 10:36 AM
> To: [email protected]
> Subject: Re: Language identification
>
> Hi Ralf,
>
> I patched the language-filter plugin to filter out or accept pages in the
> specified languages during the parse phase.
>
> NUTCH-1663 <https://issues.apache.org/jira/browse/NUTCH-1663>
>
>
> On 02-11-2013 22:05, Julien Nioche wrote:
>> Ralf,
>>
>> The parameter http.accept.language tells the servers you are hitting that
>> they should provide you the content in the languages you specified but that
>> does not give you any guarantees nor allows you to filter the content. Look
>> at the languageidentifier plugin as a starting point, then you could add a
>> custom mapreduce job to remove the pages which are not in the languages of
>> interest.
>>
>> HTH
>>
>> Julien
>>
>>
>>
>> On 2 November 2013 17:15, Ralf R. Kotowski <[email protected]> wrote:
>>
>>> Hi,
>>>
>>>
>>>
>>> What is the correct process to only store documents in a desired language?
>>>
>>>
>>> I'm currently doing this:
>>>
>>>
>>>
>>> <property>
>>> <name>http.accept.language</name>
>>> <value>ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3</value>
>>> <description>Value of the "Accept-Language" request header field.
>>> This allows selecting non-English language as default one to retrieve.
>>> It is a useful setting for search engines build for certain national group.
>>> </description>
>>> </property>
>>>
>>>
>>>
>>> Using a seed.txt with URLs I know are in the language I want, but as the
>>> crawl grows it seems I'm starting to get more and more docs in other
>>> languages.
>>>
>>>
>>>
>>>
>>>
>>> Thanks in advance
>>>
>>>
>
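
Julien's point above, that http.accept.language is only a preference sent to
the server and that filtering has to happen after fetching, can be illustrated
with a minimal Python sketch. It is not Nutch code: the function names and the
'url' and 'lang' record keys are hypothetical, and it assumes each fetched
page already carries a language code from some detector.

```python
def parse_accept_language(header):
    """Return (language_range, q) pairs sorted by descending weight."""
    prefs = []
    for item in header.split(","):
        item = item.strip()
        if not item:
            continue
        parts = [p.strip() for p in item.split(";")]
        lang = parts[0].lower()
        q = 1.0  # default weight when no q parameter is given
        for param in parts[1:]:
            if param.startswith("q="):
                q = float(param[2:])
        prefs.append((lang, q))
    # sorted() is stable, so equal weights keep their header order
    return sorted(prefs, key=lambda p: p[1], reverse=True)


def keep_by_language(docs, wanted):
    """Keep only records whose detected language is in `wanted`.

    `docs` is a list of dicts with hypothetical 'url' and 'lang' keys;
    records with no detected language are dropped.
    """
    wanted = {w.lower() for w in wanted}
    return [d for d in docs if (d.get("lang") or "").lower() in wanted]


# The header value from the original question.
header = "ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3"
prefs = parse_accept_language(header)

# Made-up records standing in for fetched pages with a detected language.
docs = [
    {"url": "http://example.com/a", "lang": "ja"},
    {"url": "http://example.com/b", "lang": "es"},
    {"url": "http://example.com/c", "lang": None},
]
kept = keep_by_language(docs, ["ja"])
```

The header still contains the wildcard entry `*` with weight 0.3, so a server
may legitimately answer in any language. Only the second step, dropping
records by detected language, actually restricts what gets stored.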

