Garbage with languageidentifier

Markus Jelsma Sun, 17 Jul 2011 06:00:57 -0700

Hi,

I've found a lot of garbage produced by the language identifier, most likely 
caused by it relying on the HTTP header as the first hint for the language.


Instead of a nice tight list of ISO codes, I've got an index full of garbage, 
which makes it impossible to select a language. The lang field now contains a 
mess of identifiers of various types (nl | ned | nl-NL | nederlands | 
Nederlands | dutch | Dutch, etc.) and even comma-separated combinations. 
A simple fq=lang:nl is impossible with this unpredictable set of language 
identifiers. Apart from language identifiers that we as humans can 
understand, the headers also contain values such as {$plugin.meta.language} | 
Weerribben zuivel | Array, or complete sentences and even MIME types and more 
nonsense you can laugh about.
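
To illustrate the kind of filtering that would keep such values out of the 
index, here is a minimal sketch. It is not what the plugin does today. The 
class and method names (LanguageHintFilter, normalize) are made up for this 
example. It accepts only values shaped like a plain language tag, reduces them 
to the two-letter code, and rejects everything else.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Optional;
import java.util.Set;

// Hypothetical helper, not part of Nutch or Tika.
public class LanguageHintFilter {

    // Two-letter ISO 639 codes known to the JDK.
    private static final Set<String> ISO_639_1 =
        new HashSet<>(Arrays.asList(Locale.getISOLanguages()));

    /**
     * Returns the two-letter code if the raw header value looks like a plain
     * language tag ("nl", "nl-NL", "nl_NL"); otherwise returns empty.
     */
    public static Optional<String> normalize(String raw) {
        if (raw == null) {
            return Optional.empty();
        }
        String value = raw.trim().toLowerCase(Locale.ROOT);
        // Language names, sentences, comma-separated lists, template
        // placeholders and MIME types all fail this shape check.
        if (!value.matches("[a-z]{2}([-_][a-z]{2})?")) {
            return Optional.empty();
        }
        String code = value.substring(0, 2);
        return ISO_639_1.contains(code) ? Optional.of(code) : Optional.empty();
    }
}
```

With a filter like this, "nl" and "nl-NL" would both end up as nl, while 
"Nederlands", "Array" or "{$plugin.meta.language}" would be dropped, leaving 
the decision to content-based detection.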

Why do we rely on the HTTP header at all? Isn't it well known that only very 
few developers and content management systems actually care about returning 
proper information in HTTP headers? The same goes for determining the content 
type, which is a similar problem in the index.

I know work is going on in Tika to improve MIME-type detection, but I'm not 
sure whether the same is true for language identification. We still have to 
rely on the Nutch plugin to do this work, right? If so, I propose making it 
configurable so we can choose whether to rely on the current behaviour or do 
N-gram detection straight away.
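
Roughly, the switch I have in mind could look like the sketch below. All 
names here (LanguagePolicy, Mode, resolve) are hypothetical and only meant to 
show the idea. The detector argument stands in for whatever N-gram 
implementation the plugin uses.

```java
import java.util.Optional;
import java.util.function.Function;

// Hypothetical sketch of a configurable policy, not existing Nutch code.
public class LanguagePolicy {

    public enum Mode {
        HEADER_FIRST, // current behaviour: trust the header hint when present
        DETECT_ONLY   // ignore the header and always run N-gram detection
    }

    /**
     * Picks the language for a document.
     *
     * @param mode       policy read from configuration (assumed)
     * @param headerHint language taken from the HTTP header, if any
     * @param text       extracted document text
     * @param detector   content-based detector, e.g. an N-gram identifier
     */
    public static String resolve(Mode mode,
                                 Optional<String> headerHint,
                                 String text,
                                 Function<String, String> detector) {
        if (mode == Mode.HEADER_FIRST && headerHint.isPresent()) {
            return headerHint.get();
        }
        // Either the policy says to skip the header or no hint was given.
        return detector.apply(text);
    }
}
```

With DETECT_ONLY, a bogus header value never reaches the lang field. 
HEADER_FIRST keeps the existing behaviour for those who want it.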

Comments?

Thanks
