OK, I change my vote to +1.  I'll update Solr as needed.

On Jul 15, 2009, at 9:30 AM, Jukka Zitting wrote:

Hi,

On Wed, Jul 15, 2009 at 3:00 PM, Grant Ingersoll<gsing...@apache.org> wrote:
3. Did something change such that CONTENT_LANGUAGE is no longer being
set for HTML? We have a test in Solr that looks for that attribute; it
passed with 0.3 but fails with 0.4.

This is because of TIKA-208.

We used to use the ICU4J charset detection mechanism to automatically
detect the encoding of HTML files. ICU4J would also guess the content
language based on the detected encoding (e.g. a document encoded in
KOI8-R is most likely written in Russian).

However, this mechanism wasn't as accurate as the encoding detection
already present in NekoHtml, and language detection based on just the
encoding is often incorrect.
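
To illustrate why guessing the language from the encoding alone is
fragile, here is a minimal sketch in plain Java. It is not the ICU4J or
Tika code: the class name, the method and the lookup table are all
hypothetical and only stand in for the idea of mapping a detected
charset to a likely language.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper for illustration only; not part of Tika or ICU4J.
final class EncodingLanguageGuess {

    // Assumed, illustrative table: a few legacy encodings that are
    // closely tied to a single language.
    private static final Map<String, String> LANGUAGE_BY_CHARSET = Map.of(
            "KOI8-R", "ru",
            "SHIFT_JIS", "ja",
            "EUC-KR", "ko",
            "BIG5", "zh");

    // Returns a language code if the charset name is in the table,
    // otherwise an empty Optional (no guess is possible).
    static Optional<String> guess(String charsetName) {
        return Optional.ofNullable(
                LANGUAGE_BY_CHARSET.get(charsetName.toUpperCase(Locale.ROOT)));
    }

    public static void main(String[] args) {
        System.out.println(guess("KOI8-R"));       // a guess is available
        System.out.println(guess("UTF-8"));        // no guess: any language
        System.out.println(guess("windows-1251")); // no guess: several languages
    }
}
```

The table can only answer for encodings that imply one language. UTF-8
can carry any language, and an encoding such as windows-1251 is shared
by several Cyrillic-script languages, so a guess made this way is
either missing or unreliable for a large share of documents.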

See TIKA-209 for some ideas on how to make the language detection more
generic and accurate. For now I think it's better to ship Tika 0.4
without the earlier flawed CONTENT_LANGUAGE implementation for HTML.

BR,

Jukka Zitting

