Hi,

I have been playing around with the CLI tool and tried to detect the language
of the test files, e.g. the en.test located at 
tika/tika-core/src/test/resources/org/apache/tika/language/en.test

lap:tika janhoy$ java -jar tika-app/target/tika-app-0.8-SNAPSHOT.jar -m tika-core/src/test/resources/org/apache/tika/language/en.test
Content-Encoding: UTF-8
Content-Length: 22427
Content-Type: text/plain
resourceName: en.test

As you can see, no language was detected.

Now I make a copy, convert it from UTF-8 to ISO-8859-1, and try again:
lap:tika janhoy$ iconv -f UTF-8 -t ISO-8859-1 <tika-core/src/test/resources/org/apache/tika/language/en.test >en-iso.txt
lap:tika janhoy$ java -jar tika-app/target/tika-app-0.8-SNAPSHOT.jar -m en-iso.txt
Content-Encoding: ISO-8859-1
Content-Language: en
Content-Length: 22417
Content-Type: text/plain
language: en
resourceName: en-iso.txt

This time the file is detected as English. The other language test files
behave the same way: no language is detected while they are UTF-8 encoded.
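
For what it's worth, the two Content-Length values differ by 10 bytes (22427
vs. 22417). That would fit the UTF-8 file containing a few non-ASCII
characters, which take two bytes each in UTF-8 but one byte in ISO-8859-1.
Below is a minimal sketch of that effect. It uses a made-up four-character
sample ("café") and hypothetical file names, not the real test file, and it
says nothing about why the language detection fails:

```shell
# Hypothetical sample: "café" in UTF-8, where é is the two bytes 0xC3 0xA9.
tmpdir=$(mktemp -d)
printf 'caf\303\251' > "$tmpdir/sample-utf8.txt"

# Same conversion as above, applied to the sample instead of en.test.
iconv -f UTF-8 -t ISO-8859-1 < "$tmpdir/sample-utf8.txt" > "$tmpdir/sample-iso.txt"

# Byte counts; tr strips the padding that some wc implementations add.
utf8_bytes=$(wc -c < "$tmpdir/sample-utf8.txt" | tr -d ' ')
iso_bytes=$(wc -c < "$tmpdir/sample-iso.txt" | tr -d ' ')
echo "UTF-8: $utf8_bytes bytes, ISO-8859-1: $iso_bytes bytes"

rm -r "$tmpdir"
```

The sample shrinks by one byte for its single non-ASCII character, so a
10-byte difference would correspond to about ten such characters in en.test.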

Does anyone see what's wrong?

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Training in Europe - www.solrtraining.com
