Hi Markus,

> The proposal is to configure the order of detection: meta,header,identifier
> (which is the current order).
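A configurable detection order like that is essentially an ordered fallback chain. Here's a minimal sketch of the idea; the class and method names are illustrative only, not actual Nutch or Tika APIs, and the map lookup stands in for real source parsers (meta element, response header, n-gram identifier):

```java
import java.util.Map;

public class LanguageDetectOrder {

    // Try each configured source in order; fall back to the next when a
    // source yields nothing usable. In a real plugin each "source" would
    // parse the page/headers rather than read a precomputed map.
    static String detect(Map<String, String> hints, String... order) {
        for (String source : order) {
            String lang = hints.get(source);
            if (lang != null && !lang.isEmpty()) {
                return lang;
            }
        }
        return null; // no source produced a language
    }

    public static void main(String[] args) {
        // Current default order: meta, header, identifier.
        Map<String, String> hints =
            Map.of("header", "nl", "identifier", "nl");
        // No meta hint present, so the header value wins here.
        System.out.println(detect(hints, "meta", "header", "identifier"));
    }
}
```

Swapping the order (e.g. putting "identifier" first for crawls where headers are untrustworthy) then needs no code change, only configuration.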
This issue of precedence also comes up when detecting charset information. From an earlier post I'd made to the Nutch list:

> See https://issues.apache.org/jira/browse/TIKA-539 for a Tika issue I'm
> currently working on, which has to do with the charset detection algorithm.
>
> There's the HTML5 proposal, where the priority is:
>
> - charset from the Content-Type response header
> - charset from the HTML <meta http-equiv content-type> element
> - charset detected from the page contents
>
> Reinhard Schwab proposed a variation on the HTML5 approach, which makes sense
> to me; in my web crawling experience, too many servers lie to just blindly
> trust the response header contents.
>
> I've got a slight modification to Reinhard's approach, as described in a
> comment on the above issue:
>
> https://issues.apache.org/jira/browse/TIKA-539?focusedCommentId=12928832&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12928832
>
> I'm interested in comments.

See http://tools.ietf.org/html/draft-abarth-mime-sniff-03 for a writeup on how to extract charset info, which seems relevant to how to detect language as well.

-- Ken

On Jul 17, 2011, at 6:04am, Markus Jelsma wrote:

>> Hi,
>>
>> I've found a lot of garbage produced by the language identifier, most
>> likely caused by it relying on the HTTP header as the first hint for the
>> language.
>>
>> Instead of a nice tight list of ISO codes I've got an index full of garbage,
>> making me unable to select a language. The lang field now contains a mess
>> including ISO codes of various types (nl | ned, nl-NL | nederlands |
>> Nederlands | dutch | Dutch etc.) and even comma-separated combinations.
>> It's impossible to do a simple fq=lang:nl due to this indeterminate set of
>> language identifiers.
>> Apart from language identifiers that we as humans
>> understand, the headers also contain values such as {$plugin.meta.language}
>> | Weerribben zuivel | Array, or complete sentences, and even MIME types and
>> more nonsense you can laugh about.
>>
>> Why do we rely on the HTTP header at all? Isn't it well known that only very
>> few developers and content management systems actually care about
>> returning proper information in HTTP headers? This also goes for
>> finding out the content type, which is a similar problem in the index.
>>
>> I know work is going on in Tika for improving MIME-type detection; I'm not
>> sure if this is true for language identification. We still have to rely on
>> the Nutch plugin to do this work, right? If so, I propose to make it
>> configurable so we can choose whether we want to rely on the current
>> behaviour or do n-gram detection straight away.
>>
>> Comments?
>>
>> Thanks

--------------------------------------------
http://about.me/kkrugler
+1 530-210-6378
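One cheap way to keep the lang field tight, regardless of detection order, would be to validate any header-supplied value against the ISO 639-1 code list before trusting it, and fall back to n-gram detection otherwise. A hedged sketch (the class name and the normalization rules are my own assumptions, not existing Nutch code), using the JDK's built-in ISO language list:

```java
import java.util.Locale;
import java.util.Set;

public class LangFieldSanitizer {

    // All two-letter ISO 639-1 codes known to the JDK.
    private static final Set<String> ISO_639_1 =
        Set.of(Locale.getISOLanguages());

    // Returns a bare ISO 639-1 code, or null for junk like
    // "{$plugin.meta.language}", "Weerribben zuivel", or "Nederlands".
    static String normalize(String raw) {
        if (raw == null) return null;
        // Take the first value of a comma-separated list, strip any
        // region subtag ("nl-NL" -> "nl"), and lowercase.
        String code = raw.split(",")[0].trim()
                         .split("[-_]")[0]
                         .toLowerCase(Locale.ROOT);
        return ISO_639_1.contains(code) ? code : null;
    }

    public static void main(String[] args) {
        System.out.println(normalize("nl-NL"));                   // nl
        System.out.println(normalize("nl,en"));                   // nl
        System.out.println(normalize("{$plugin.meta.language}")); // null
        System.out.println(normalize("Nederlands"));              // null
    }
}
```

A null result would be the trigger to run the n-gram identifier instead of indexing the raw header value.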

