Hi,

[Oops, my reply first went only to gene...@lucene. Here's a copy to tika-...@.]

On Wed, Jul 15, 2009 at 3:00 PM, Grant Ingersoll <gsing...@apache.org> wrote:
> 3. Did something change such that CONTENT_LANGUAGE is now not being set for
> HTML? We have a test in Solr that looks for that attribute, and it was
> passing with 0.3 but is now not passing in 0.4.

This is because of TIKA-208. We used to use the ICU4J charset detection mechanism to automatically detect the encoding of HTML files. ICU4J would also guess the content language based on the detected encoding (e.g. a document encoded in KOI8-R is most likely written in Russian).

However, this mechanism wasn't as accurate as the encoding detection already present in NekoHtml, and language detection based on just the encoding is often incorrect. See TIKA-209 for some ideas on how to make the language detection more generic and accurate.

For now I think it's better to ship Tika 0.4 without the earlier flawed CONTENT_LANGUAGE implementation for HTML.

BR,

Jukka Zitting