Hi Markus,

> The proposal is to configure the order of detection: meta,header,identifier
> (which is the current order).
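A configurable detection order like that is essentially an ordered fallback chain. Here's a minimal sketch of the idea; the class and method names are illustrative only, not actual Nutch or Tika APIs, and the map lookup stands in for real source parsers (meta element, response header, n-gram identifier):

```java
import java.util.Map;

public class LanguageDetectOrder {

    // Try each configured source in order; fall back to the next when a
    // source yields nothing usable. In a real plugin each "source" would
    // parse the page/headers rather than read a precomputed map.
    static String detect(Map<String, String> hints, String... order) {
        for (String source : order) {
            String lang = hints.get(source);
            if (lang != null && !lang.isEmpty()) {
                return lang;
            }
        }
        return null; // no source produced a language
    }

    public static void main(String[] args) {
        // Current default order: meta, header, identifier.
        Map<String, String> hints =
            Map.of("header", "nl", "identifier", "nl");
        // No meta hint present, so the header value wins here.
        System.out.println(detect(hints, "meta", "header", "identifier"));
    }
}
```

Swapping the order (e.g. putting "identifier" first for crawls where headers are untrustworthy) then needs no code change, only configuration.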
This issue of precedence also comes up when detecting charset information. From an earlier post I'd made to the Nutch list:

> See https://issues.apache.org/jira/browse/TIKA-539 for a Tika issue I'm
> currently working on, which has to do with the charset detection algorithm.
>
> There's the HTML5 proposal, where the priority is:
>
> - charset from the Content-Type response header
> - charset from the HTML <meta http-equiv content-type> element
> - charset detected from the page contents
>
> Reinhard Schwab proposed a variation on the HTML5 approach, which makes sense
> to me; in my web crawling experience, too many servers lie to just blindly
> trust the response header contents.
>
> I've got a slight modification to Reinhard's approach, as described in a
> comment on the above issue:
>
> https://issues.apache.org/jira/browse/TIKA-539?focusedCommentId=12928832&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12928832
>
> I'm interested in comments.

See http://tools.ietf.org/html/draft-abarth-mime-sniff-03 for a writeup on how to extract charset info, which seems relevant to how to detect language as well.

-- Ken

On Jul 17, 2011, at 6:04am, Markus Jelsma wrote:

>> Hi,
>>
>> I've found a lot of garbage produced by the language identifier, most
>> likely caused by it relying on the HTTP header as the first hint for the
>> language.
>>
>> Instead of a nice tight list of ISO codes I've got an index full of garbage,
>> making me unable to select a language. The lang field now contains a mess
>> including ISO codes of various types (nl | ned, nl-NL | nederlands |
>> Nederlands | dutch | Dutch etc.) and even comma-separated combinations.
>> It's impossible to do a simple fq=lang:nl due to this indeterminate set of
>> language identifiers.
>> Apart from language identifiers that we as humans
>> understand, the headers also contain values such as {$plugin.meta.language}
>> | Weerribben zuivel | Array, or complete sentences, and even MIME types and
>> more nonsense you can laugh about.
>>
>> Why do we rely on the HTTP header at all? Isn't it well known that only very
>> few developers and content management systems actually care about
>> returning proper information in HTTP headers? This also goes for
>> finding out the content type, which is a similar problem in the index.
>>
>> I know work is going on in Tika for improving MIME-type detection; I'm not
>> sure if this is true for language identification. We still have to rely on
>> the Nutch plugin to do this work, right? If so, I propose to make it
>> configurable so we can choose whether we want to rely on the current
>> behaviour or do n-gram detection straight away.
>>
>> Comments?
>>
>> Thanks

--------------------------------------------
http://about.me/kkrugler
+1 530-210-6378
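One cheap way to keep the lang field tight, regardless of detection order, would be to validate any header-supplied value against the ISO 639-1 code list before trusting it, and fall back to n-gram detection otherwise. A hedged sketch (the class name and the normalization rules are my own assumptions, not existing Nutch code), using the JDK's built-in ISO language list:

```java
import java.util.Locale;
import java.util.Set;

public class LangFieldSanitizer {

    // All two-letter ISO 639-1 codes known to the JDK.
    private static final Set<String> ISO_639_1 =
        Set.of(Locale.getISOLanguages());

    // Returns a bare ISO 639-1 code, or null for junk like
    // "{$plugin.meta.language}", "Weerribben zuivel", or "Nederlands".
    static String normalize(String raw) {
        if (raw == null) return null;
        // Take the first value of a comma-separated list, strip any
        // region subtag ("nl-NL" -> "nl"), and lowercase.
        String code = raw.split(",")[0].trim()
                         .split("[-_]")[0]
                         .toLowerCase(Locale.ROOT);
        return ISO_639_1.contains(code) ? code : null;
    }

    public static void main(String[] args) {
        System.out.println(normalize("nl-NL"));                   // nl
        System.out.println(normalize("nl,en"));                   // nl
        System.out.println(normalize("{$plugin.meta.language}")); // null
        System.out.println(normalize("Nederlands"));              // null
    }
}
```

A null result would be the trigger to run the n-gram identifier instead of indexing the raw header value.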

