Dear list,

I am running Tika server 1.14 on Debian jessie. I start the server with this 
command:

java -jar tika-server-1.14-SNAPSHOT.jar

When I send a file for metadata extraction like this

curl -T email.txt http://localhost:9998/meta

the umlauts in the response come back garbled.

The environment variables for the shell from which I start the server as well 
as execute the curl command are as follows:

LANG=en_US.UTF-8
LANGUAGE=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=en_US.UTF-8

I followed this page 
(https://perlgeek.de/en/article/set-up-a-clean-utf8-environment) to set up a 
clean Unicode environment. The test case mentioned on that page works fine.

I also tried to use tika-app, since I saw in --help that I can pass the 
--encoding parameter. So I ran:

(1) java -jar tika-app-1.14-SNAPSHOT.jar --encoding=unicode -m email.txt

and

(2) java -jar tika-app-1.14-SNAPSHOT.jar --encoding=UTF-8 -m email.txt

The output of the umlauts does change, but in neither case is it right. For (1) 
the umlauts are represented by '??'; for (2) they are represented by 'Ã¼' (that 
is a capital A with a tilde on top, followed by the one-quarter sign ¼).
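
My guess is that case (2) is UTF-8 output being read as ISO-8859-1 somewhere 
along the way, though I have not confirmed where that happens. A minimal Python 
sketch of that effect, using a literal 'ü' as a stand-in for the text in my 
file:

```python
# Assumption: somewhere in the chain, UTF-8 bytes are decoded as ISO-8859-1.
# The literal below is a stand-in; it is not taken from email.txt.
original = "ü"

# UTF-8 encodes 'ü' as two bytes: 0xC3 0xBC.
utf8_bytes = original.encode("utf-8")

# Reading those two bytes as ISO-8859-1 yields two separate characters.
garbled = utf8_bytes.decode("latin-1")
print(garbled)

# Undoing the wrong decoding step recovers the original character.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired)
```

If that is what happens, the bytes Tika emits would be fine and only the 
decoding step would be wrong, but that is an assumption on my part.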

How can I fix this problem? Ultimately, I want to send queries to Tika from a 
Python script (with Chris Mattmann’s module). If this behaviour can be 
controlled from within Python, that would be fine for me. But since I also get 
the problem with curl and tika-app, I thought the problem is more likely to be 
found in Tika itself.

I’d be very grateful for any assistance!
Best,
Philipp


