RE: Language identification

Ralf R. Kotowski Fri, 08 Nov 2013 07:42:46 -0800

We are talking about this plug-in, correct?


http://wiki.apache.org/nutch/LanguageIdentifierPlugin



-----Original Message-----
From: ilhami Kalkan [mailto:[email protected]] 
Sent: Thursday, November 07, 2013 10:29 AM
To: [email protected]
Subject: Re: Language identification

Hi Ralf,

The short answer is no.
This plugin runs after the language-identifier plugin, because the
language-identifier plugin sets the language in the metadata and this
plugin reads that value to filter or accept pages during the parse phase.
The language-identifier plugin takes the lang value from the header or
decides it from the n-grams of the page content.
The language-filter plugin reads the "language.filter.languages" entries,
which must be ISO-639 language codes, and matches them against the
metadata lang. Page languages like en-us were rejected. Thanks for the
heads-up. I added the necessary check to the patch to prevent this case.
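
As a rough illustration of the matching described above (this is not the
code from the NUTCH-1663 patch), a minimal Java sketch could reduce a tag
such as en-us to its primary subtag before comparing it with the configured
ISO-639 codes. The class name LanguageTagFilter and its methods are
hypothetical and exist only for this example; the constructor argument
stands in for the value of "language.filter.languages".

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Nutch: accepts a page when the primary
// subtag of its detected language matches one of the configured codes.
final class LanguageTagFilter {
    private final Set<String> accepted = new HashSet<>();

    // 'configured' is assumed to be a comma-separated list such as "en, ja".
    LanguageTagFilter(String configured) {
        for (String code : configured.split(",")) {
            String c = code.trim().toLowerCase(Locale.ROOT);
            if (!c.isEmpty()) {
                accepted.add(c);
            }
        }
    }

    // Returns the part before the first '-' or '_', lower-cased.
    // "en-us" and "en_US" both become "en"; "en" stays "en".
    static String primarySubtag(String tag) {
        if (tag == null) {
            return "";
        }
        String t = tag.trim().toLowerCase(Locale.ROOT);
        int cut = t.indexOf('-');
        int underscore = t.indexOf('_');
        if (cut < 0 || (underscore >= 0 && underscore < cut)) {
            cut = underscore;
        }
        return cut < 0 ? t : t.substring(0, cut);
    }

    boolean accepts(String pageLang) {
        return accepted.contains(primarySubtag(pageLang));
    }

    public static void main(String[] args) {
        LanguageTagFilter filter = new LanguageTagFilter("en, ja");
        System.out.println("en-us accepted: " + filter.accepts("en-us"));
        System.out.println("de accepted: " + filter.accepts("de"));
    }
}
```

With this kind of normalisation, a page declaring lang="en-us" would be
treated the same as one declaring lang="en".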


On 06-11-2013 23:52, Ralf R. Kotowski wrote:
> Hi,
>
> I have run several passes. I no longer get the bulk of the foreign-language
> sites I used to, but I also don't get some others that I am supposed to.
>
> Does this plug-in work through the HTML header? Because I got one of the
> ones that are not supposed to be there with this header:
>
> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
> "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
> <html xmlns="http://www.w3.org/1999/xhtml" lang="en-us">
>
> -----Original Message-----
> From: ilhami Kalkan [mailto:[email protected]]
> Sent: Wednesday, November 06, 2013 9:08 AM
> To: [email protected]
> Subject: Re: Language identification
>
> Hi Ralf,
>
> language-identifier-agmlab is my test plugin name. I fixed the patch.
>
> NUTCH-1663 <https://issues.apache.org/jira/browse/NUTCH-1663>
>
> On 06-11-2013 00:50, Ralf R. Kotowski wrote:
>> I get the following error in the logs:
>>
>> WARN  plugin.PluginRepository - Missing dependency
>> language-identifier-agmlab for plugin language-filter
>>
>> -----Original Message-----
>> From: ilhami Kalkan [mailto:[email protected]]
>> Sent: Tuesday, November 05, 2013 10:36 AM
>> To: [email protected]
>> Subject: Re: Language identification
>>
>> Hi Ralf,
>>
>> I patched the language-filter plugin to filter or accept pages in the
>> specified languages during the parse phase.
>>
>> NUTCH-1663 <https://issues.apache.org/jira/browse/NUTCH-1663>
>>
>>
>> On 02-11-2013 22:05, Julien Nioche wrote:
>>> Ralf,
>>>
>>> The parameter http.accept.language tells the servers you are hitting that
>>> they should provide you the content in the languages you specified but that
>>> does not give you any guarantees nor allows you to filter the content. Look
>>> at the languageidentifier plugin as a starting point, then you could add a
>>> custom mapreduce job to remove the pages which are not in the languages of
>>> interest.
>>>
>>> HTH
>>>
>>> Julien
>>>
>>>
>>>
>>> On 2 November 2013 17:15, Ralf R. Kotowski <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>>
>>>>
>>>> What is the correct process to only store documents in a desired language?
>>>>
>>>> I'm currently doing this:
>>>>
>>>>
>>>>
>>>> <property>
>>>> <name>http.accept.language</name>
>>>> <value>ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3</value>
>>>> <description>Value of the "Accept-Language" request header field.
>>>> This allows selecting non-English language as default one to retrieve.
>>>> It is a useful setting for search engines built for a certain national
>>>> group.
>>>> </description>
>>>> </property>
>>>>
>>>>
>>>>
>>>> I'm using a seed.txt with URLs I know are in the language I want, but as
>>>> the crawl grows it seems I'm starting to get more and more docs in other
>>>> languages.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Thnx in advance
>>>>
>>>>
>
