Re: [VOTE] Apache Tika 0.4 Release Candidate 2

Jukka Zitting Wed, 15 Jul 2009 06:33:50 -0700

Hi,

[Oops, my reply went first just to gene...@lucene. Here's a copy to tika-...@.]

On Wed, Jul 15, 2009 at 3:00 PM, Grant Ingersoll<gsing...@apache.org> wrote:
> 3. Did something change such that CONTENT_LANGUAGE is now not being set for
> HTML?  We have a test in Solr that looks for that attribute, and it was
> passing with 0.3 but is now not passing in 0.4.

This is because of TIKA-208.

We used to use the ICU4J charset detection mechanism to automatically
detect the encoding of HTML files. ICU4J would also guess the content
language based on the detected encoding (e.g. a document encoded in
KOI8-R is most likely written in Russian).

However, this mechanism was less accurate than the encoding detection
already present in NekoHtml, and language detection based on the
encoding alone is often incorrect.
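
To illustrate why an encoding-only guess is fragile, here is a minimal
sketch of the heuristic. It is not ICU4J's actual code: the class name,
the method and the charset table are all hypothetical, made up for this
example.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical illustration only; not part of Tika, ICU4J or NekoHtml.
class EncodingLanguageGuess {

    // Assumed table of encodings that are closely tied to one language.
    private static final Map<String, String> LANGUAGE_BY_CHARSET = Map.of(
            "KOI8-R", "ru",
            "Shift_JIS", "ja",
            "EUC-KR", "ko");

    // Returns a language guess, or empty when the encoding says nothing
    // useful about the language.
    static Optional<String> guessLanguage(String charset) {
        return Optional.ofNullable(LANGUAGE_BY_CHARSET.get(charset));
    }

    public static void main(String[] args) {
        // KOI8-R maps to a single likely language in the table.
        System.out.println(guessLanguage("KOI8-R"));
        // UTF-8 can carry any language, so there is nothing to guess.
        System.out.println(guessLanguage("UTF-8"));
        // ISO-8859-1 is shared by many Western European languages.
        System.out.println(guessLanguage("ISO-8859-1"));
    }
}
```

A guess like this only works for the few encodings that imply a single
language; for anything in UTF-8 or ISO-8859-1 it has no basis at all.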

See TIKA-209 for some ideas on how to make the language detection more
generic and accurate. For now I think it's better to ship Tika 0.4
without the earlier flawed CONTENT_LANGUAGE implementation for HTML.

BR,

Jukka Zitting
