Re: [VOTE] Apache Tika 0.4 Release Candidate 2

Grant Ingersoll Wed, 15 Jul 2009 11:36:34 -0700

OK, I change my vote to +1.  I'll update Solr as needed.


On Jul 15, 2009, at 9:30 AM, Jukka Zitting wrote:

Hi,

On Wed, Jul 15, 2009 at 3:00 PM, Grant Ingersoll <gsing...@apache.org> wrote:

3. Did something change such that CONTENT_LANGUAGE is now not being set
for HTML? We have a test in Solr that looks for that attribute, and it
was passing with 0.3 but is now not passing in 0.4.


This is because of TIKA-208.

We used to use the ICU4J charset detection mechanism to automatically
detect the encoding of HTML files. ICU4J would also guess the content
language based on the detected encoding (e.g. a document encoded in
KOI8-R is most likely written in Russian).

However, this mechanism wasn't as accurate as the encoding detection
already present in NekoHtml, and language detection based on just the
encoding is often incorrect.
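
To illustrate why guessing the language from the encoding alone is
fragile, here is a minimal sketch of that kind of heuristic. It is not
the ICU4J or Tika code; the class name, the method name and the lookup
table are all made up for this example.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical illustration only: not part of Tika or ICU4J.
final class EncodingLanguageGuess {

    // Assumed table of encodings that are closely tied to one language.
    private static final Map<String, String> LANGUAGE_BY_ENCODING = Map.of(
            "KOI8-R", "ru",
            "SHIFT_JIS", "ja",
            "EUC-KR", "ko",
            "BIG5", "zh");

    // Returns a language code when the encoding suggests one,
    // or an empty Optional when the encoding says nothing useful.
    static Optional<String> guess(String encoding) {
        String key = encoding.toUpperCase(Locale.ROOT);
        return Optional.ofNullable(LANGUAGE_BY_ENCODING.get(key));
    }

    public static void main(String[] args) {
        System.out.println(guess("KOI8-R"));     // Optional[ru]
        System.out.println(guess("UTF-8"));      // Optional.empty
        System.out.println(guess("ISO-8859-1")); // Optional.empty
    }
}
```

Encodings such as UTF-8 or ISO-8859-1 are shared by many languages, so
a lookup like this either returns nothing or has to pick one language
arbitrarily.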

See TIKA-209 for some ideas on how to make the language detection more
generic and accurate. For now I think it's better to ship Tika 0.4
without the earlier flawed CONTENT_LANGUAGE implementation for HTML.

BR,

Jukka Zitting
