Re: Fails to detect language for UTF-8 file, but it works for ISO-latin

Jukka Zitting Tue, 24 Aug 2010 08:01:21 -0700

Hi,

On Sat, Aug 21, 2010 at 5:55 PM, Jan Høydahl / Cominvent
<[email protected]> wrote:
> Detected as english. The same is true for the other test language files.
> It does not detect language for UTF-8 encoded files.


The tika-app jar doesn't do language detection by default. The
language metadata you're seeing is a result of the encoding-based
language estimate that we get from the ICU4J code we're using.
Apparently that data set categorizes ISO-8859-1 as an English-specific
character encoding.
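
To make the distinction concrete, here is a toy sketch of what an encoding-based estimate amounts to: the language is looked up from the detected charset alone and the text itself is never inspected. This is not the ICU4J or Tika API. The class name, the method and the one-entry table are all hypothetical and only mirror the behaviour reported in this thread, and the Norwegian sample sentence is made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Map;

// Toy model only: the charset-to-language table below is invented for
// illustration. It is not ICU4J's data set, and none of these names exist
// in ICU4J or Tika.
final class EncodingLanguageDemo {

    // Hypothetical table mirroring the behaviour described in this thread:
    // ISO-8859-1 is treated as English, UTF-8 carries no language hint.
    private static final Map<String, String> LANGUAGE_BY_CHARSET =
            Map.of("ISO-8859-1", "en");

    // Returns a language code derived from the charset alone, or null when
    // the charset gives no hint. The document text is never looked at.
    static String estimateLanguage(Charset charset) {
        return LANGUAGE_BY_CHARSET.get(charset.name());
    }

    public static void main(String[] args) {
        // The same Norwegian sentence, encoded two ways.
        String norwegian = "Dette er en norsk tekst med \u00e6, \u00f8 og \u00e5.";
        Charset[] charsets = {StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8};
        for (Charset cs : charsets) {
            byte[] bytes = norwegian.getBytes(cs);
            System.out.println(cs.name() + ": " + bytes.length
                    + " bytes, estimated language = " + estimateLanguage(cs));
        }
    }
}
```

In this model the identical text gets a language when stored as ISO-8859-1 and none when stored as UTF-8, which is the symptom reported above and has nothing to do with the actual content.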

We have already dropped encoding-based language estimates from the
HTML parser, and I think we should do the same for plain text
documents.

BR,

Jukka Zitting
