Hi,
I've found a lot of garbage produced by the language identifier, most likely
caused by it relying on the HTTP header as the first hint for the language.
Instead of a nice tight list of ISO codes I've got an index full of garbage,
which makes it impossible to select a language. The lang field now contains a
mess of ISO codes of various types (nl | ned | nl-NL | nederlands |
Nederlands | dutch | Dutch etc.) and even comma-separated combinations.
A simple fq=lang:nl is impossible with this unpredictable set of language
identifiers. Apart from language identifiers that we as humans can
understand, the headers also contain values such as {$plugin.meta.language} |
Weerribben zuivel | Array, complete sentences, and even MIME types and more
nonsense you can laugh about.
Why do we rely on the HTTP header at all? Isn't it well known that only very few
developers and content management systems actually care about returning proper
information in HTTP headers? The same goes for determining the content type,
which is a similar problem in the index.
I know work is going on in Tika to improve MIME-type detection, but I'm not sure
if the same is true for language identification. We still have to rely on the
Nutch plugin to do this work, right? If so, I propose to make it configurable so
we can choose whether we want to rely on the current behaviour or do N-gram
detection straight away.
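
To make the proposal a bit more concrete, here is a rough Python sketch of what I
have in mind. It is not the plugin's actual code: the policy names, the function
names and the detect callable (a stand-in for whatever N-gram detector is used)
are all made up for illustration. The header check only tests the shape of the
value (two letters plus an optional region), so it would still let through a
well-formed but wrong code.

```python
import re

# Hypothetical policy names for illustration; not existing Nutch settings.
HEADER_FIRST = "header-first"
DETECT_ONLY = "detect-only"

# Accept only a two-letter code with an optional two-letter region,
# e.g. "nl" or "nl-NL". This checks the shape only, not whether the
# code is actually assigned in ISO 639-1.
_LANG_TAG = re.compile(r"^([A-Za-z]{2})(?:[-_][A-Za-z]{2})?$")


def header_language(value):
    """Return a lowercase two-letter code, or None if the header is unusable."""
    if not value:
        return None
    match = _LANG_TAG.match(value.strip())
    return match.group(1).lower() if match else None


def pick_language(header_value, text, detect, policy=HEADER_FIRST):
    """Choose a language code according to the configured policy.

    detect is an assumed callable that maps text to a language code,
    standing in for an N-gram based detector.
    """
    if policy == HEADER_FIRST:
        code = header_language(header_value)
        if code is not None:
            return code
    # Either the policy skips the header or the header value was rejected.
    return detect(text)
```

With something like this, values such as "Nederlands", "Array" or
"{$plugin.meta.language}" are rejected and fall through to detection, while
"nl-NL" is reduced to "nl". Setting the policy to detect-only ignores the header
entirely.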
Comments?
Thanks