Hi,

On Sat, Aug 21, 2010 at 5:55 PM, Jan Høydahl / Cominvent <[email protected]> wrote:
> Detected as english. The same is true for the other test language files.
> It does not detect language for UTF-8 encoded files.

The tika-app jar doesn't do language detection by default. The language
metadata you're seeing comes from the encoding-based language estimate
produced by the ICU4J code we're using. Apparently its data set treats
ISO-8859-1 as an English-specific character encoding.

We already dropped encoding-based language estimates from the HTML parser,
and I think we should do the same for plain text documents.

BR,

Jukka Zitting
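
To illustrate why a character encoding is a weak signal for language, here is a minimal sketch that uses only the JDK. It is not Tika or ICU4J code, and the class name, method name and sample strings are made up for illustration. It checks which sample sentences can be represented in ISO-8859-1; several Western European languages fit, so the encoding on its own cannot single out English.

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical demo class, not part of Tika or ICU4J.
class EncodingIsNotLanguage {

    /** Returns true if every character of the text can be encoded as ISO-8859-1. */
    public static boolean fitsLatin1(String text) {
        // A fresh encoder per call keeps the method stateless and thread-safe.
        CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder();
        return encoder.canEncode(text);
    }

    public static void main(String[] args) {
        // Non-ASCII characters are written as Unicode escapes so the source
        // compiles the same way regardless of the compiler's file encoding.
        Map<String, String> samples = new LinkedHashMap<>();
        samples.put("en", "The quick brown fox");
        samples.put("no", "Bl\u00e5b\u00e6rsyltet\u00f8y");        // a-ring, ae, o-slash
        samples.put("de", "Gr\u00fc\u00dfe aus M\u00fcnchen");      // u-umlaut, sharp s
        samples.put("fr", "Voil\u00e0 une fa\u00e7on");             // a-grave, c-cedilla
        samples.put("pl", "Za\u017c\u00f3\u0142\u0107");            // Polish letters outside Latin-1

        for (Map.Entry<String, String> sample : samples.entrySet()) {
            System.out.println(sample.getKey() + " fits ISO-8859-1: "
                    + fitsLatin1(sample.getValue()));
        }
    }
}
```

In this sketch the English, Norwegian, German and French samples all fit ISO-8859-1, while the Polish one does not. A file being ISO-8859-1 therefore only narrows the candidates to a broad group of languages, which is consistent with dropping the encoding-based estimate.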
