Re: Garbage with languageidentifier

Markus Jelsma Sun, 17 Jul 2011 06:06:52 -0700

The proposal is to configure the order of detection: meta,header,identifier 
(which is the current order).
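
A minimal sketch of what a configurable order could look like, not the plugin's actual code. The class name, the Source enum and the comma-separated policy string are hypothetical, invented here for illustration; in practice the policy string would come from some configuration property.

```java
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch: none of these names exist in the Nutch plugin.
final class DetectionOrderSketch {

    /** The three sources named in this thread. */
    enum Source { META, HEADER, IDENTIFIER }

    /** Parses a comma-separated policy such as "identifier,meta,header". */
    static List<Source> parseOrder(String policy) {
        List<Source> order = new ArrayList<>();
        for (String token : policy.split(",")) {
            String name = token.trim().toUpperCase(Locale.ROOT);
            if (name.isEmpty()) {
                continue; // tolerate stray commas
            }
            // valueOf throws IllegalArgumentException for an unknown source name.
            Source source = Source.valueOf(name);
            if (!order.contains(source)) {
                order.add(source); // ignore repeated entries
            }
        }
        return order;
    }

    /** Returns the first non-blank candidate, trying the sources in the configured order. */
    static Optional<String> detect(List<Source> order, Map<Source, String> candidates) {
        for (Source source : order) {
            String value = candidates.get(source);
            if (value != null && !value.trim().isEmpty()) {
                return Optional.of(value.trim());
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Map<Source, String> candidates = new EnumMap<>(Source.class);
        candidates.put(Source.HEADER, "Nederlands"); // unreliable header value
        candidates.put(Source.IDENTIFIER, "nl");     // result of N-gram detection

        // Putting the identifier first lets N-gram detection win over the header.
        List<Source> order = parseOrder("identifier,meta,header");
        System.out.println(detect(order, candidates).orElse("unknown"));
    }
}
```

With a policy of "meta,header,identifier" the same code reproduces the current behaviour, so the default would not have to change.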


> Hi,
> 
> I've found a lot of garbage produced by the language identifier, most
> likely because it relies on the HTTP header as the first hint for the
> language.
> 
> Instead of a nice tight list of ISO codes I've got an index full of garbage,
> which makes it impossible to select a language. The lang field now contains
> a mess of ISO codes of various types (nl | ned, nl-NL | nederlands |
> Nederlands | dutch | Dutch, etc.) and even comma-separated combinations.
> It's impossible to do a simple fq=lang:nl with such an indeterminate set of
> language identifiers. Apart from language identifiers that we as humans
> understand, the headers also contain values such as {$plugin.meta.language}
> | Weerribben zuivel | Array, or complete sentences, and even MIME types and
> more nonsense you can laugh about.
> 
> Why do we rely on HTTP headers at all? Isn't it well known that only very
> few developers and content management systems actually care about
> returning proper information in HTTP headers? The same goes for
> determining the content type, which is a similar problem in the index.
> 
> I know work is going on in Tika to improve MIME-type detection, but I'm not
> sure whether the same is true for language identification. We still have to
> rely on the Nutch plugin to do this work, right? If so, I propose to make it
> configurable so we can choose whether we want to rely on the current
> behaviour or do N-gram detection straight away.
> 
> Comments?
> 
> Thanks
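
To illustrate the kind of values described above, here is a rough sketch of a filter that accepts a header or meta hint only when its primary part is a two-letter ISO 639 code. This is an assumption about one possible clean-up step, not something the plugin does today; the class and method names are made up. It uses the JDK's Locale.getISOLanguages() as the list of known codes.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Optional;
import java.util.Set;

// Hypothetical helper: not part of Nutch or Tika.
final class LanguageHintFilter {

    // Two-letter ISO 639 language codes known to the JDK.
    private static final Set<String> ISO_CODES =
            new HashSet<>(Arrays.asList(Locale.getISOLanguages()));

    /** Returns a two-letter code if the hint starts with one, otherwise empty. */
    static Optional<String> normalize(String raw) {
        if (raw == null) {
            return Optional.empty();
        }
        String value = raw.trim().toLowerCase(Locale.ROOT);
        // Keep only the part before a region suffix: "nl-NL" and "nl_NL" become "nl".
        String primary = value.split("[-_]", 2)[0];
        return ISO_CODES.contains(primary) ? Optional.of(primary) : Optional.empty();
    }

    public static void main(String[] args) {
        String[] hints = {"nl-NL", "Nederlands", "{$plugin.meta.language}", "Array", "text/html"};
        for (String hint : hints) {
            System.out.println(hint + " -> " + normalize(hint).orElse("(rejected)"));
        }
    }
}
```

This is deliberately coarse: names such as "nederlands" or "dutch" are rejected rather than mapped, so a rejected hint would simply fall through to the next source in the detection order.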
