> On 16 May 2016, at 13:04, Allison, Timothy B. <[email protected]> wrote:
>
> >>I also tried to use tika-app, since I saw in --help that I can pass the
> >>--encoding parameter. So I ran:
>
> To clarify (you may already understand this, sorry)…the encoding parameter
> specifies the output encoding; it is not a hint to Tika in encoding detection.

Hi Tim,

yes, I do understand this. I guess the issues have become a little conflated. But judging from the response to the bug report, the issues have been taken apart and are dealt with separately, so I guess there is nothing for me to do at the moment. If you need further testing, let me know…

Philipp

> With trunk and 1.12 in Tika app’s gui, I’m getting proper extraction with
> “Testemail-empty-doesnotwork.eml”, but the umlauts are corrupt with
> “Test-email-empty-works.txt”. I get the same behavior when I redirect the
> output to a file:
>
> java -jar tika-app-1.12.jar Testemail-empty-doesnotwork.eml > testOut2.txt
>
> Bizarrely, it looks like both files are being parsed by the RFC822Parser, and
> when I run the “detect” commandline option -d, on both files with 1.12 and
> trunk, both say RFC822.
>
> From: Philipp Steinkrüger [mailto:[email protected]]
> Sent: Sunday, May 15, 2016 10:12 AM
> To: [email protected]
> Subject: Tika response encoding problem
>
> Dear list,
>
> I am running Tika server 1.14 on a Debian jessie. I start the server with
> this command:
>
> java -jar tika-server-1.14-SNAPSHOT.jar
>
> If I send a file for metadata extraction like this
>
> curl -T email.txt http://localhost:9998/meta
>
> The response screws up any umlauts.
>
> The environment variables for the shell from which I start the server as well
> as execute the curl command are as follows:
>
> LANG=en_US.UTF-8
> LANGUAGE=en_US.UTF-8
> LC_CTYPE="en_US.UTF-8"
> LC_NUMERIC="en_US.UTF-8"
> LC_TIME="en_US.UTF-8"
> LC_COLLATE="en_US.UTF-8"
> LC_MONETARY="en_US.UTF-8"
> LC_MESSAGES="en_US.UTF-8"
> LC_PAPER="en_US.UTF-8"
> LC_NAME="en_US.UTF-8"
> LC_ADDRESS="en_US.UTF-8"
> LC_TELEPHONE="en_US.UTF-8"
> LC_MEASUREMENT="en_US.UTF-8"
> LC_IDENTIFICATION="en_US.UTF-8"
> LC_ALL=en_US.UTF-8
>
> I followed this page
> (https://perlgeek.de/en/article/set-up-a-clean-utf8-environment) to set up a
> clean unicode environment. The test case mentioned on that page works fine.
>
> I also tried to use tika-app, since I saw in --help that I can pass the
> --encoding parameter. So I ran:
>
> (1) java -jar tika-app-1.14-SNAPSHOT.jar --encoding=unicode -m email.txt
>
> and
>
> (2) java -jar tika-app-1.14-SNAPSHOT.jar --encoding=UTF-8 -m email.txt
>
> The output of umlauts does change, but in neither case is it right. For (1)
> the umlauts are represented by ‘??’; for (2) they are represented by 'ü’
> (that is a capital A with a ~ on top, followed by the quarter sign 1/4).
>
> How can I fix this problem? Ultimately, I want to run queries to Tika from a
> python script (with Chris Mattmann’s module). If this behaviour can be
> controlled from within python, that would be fine for me. But since I got the
> problem also using curl and tika-app, I thought that the problem is more
> likely to be found in tika itself.
>
> I’d be very grateful for any assistance!
> Best,
> Philipp
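
For reference, the 'ü' output described in case (2) is what one typically sees when the UTF-8 bytes of 'ü' are decoded as Latin-1 somewhere along the way. The Python sketch below only reproduces that symptom on a single character. It does not call Tika, it does not show where in the pipeline the wrong decode happens, and the variable names are illustrative. Whether this is the actual cause in this thread is an assumption.

```python
# Illustrative only: reproduces the symptom from case (2), not Tika's behaviour.
original = "ü"

# UTF-8 encodes "ü" as two bytes: 0xC3 0xBC.
utf8_bytes = original.encode("utf-8")

# Decoding those two bytes as Latin-1 yields one character per byte:
# U+00C3 (capital A with tilde) and U+00BC (the one-quarter sign).
garbled = utf8_bytes.decode("latin-1")

# Reversing the wrong decode recovers the text, provided no bytes were lost.
repaired = garbled.encode("latin-1").decode("utf-8")

print(utf8_bytes, garbled, repaired)
```

If the garbled text round-trips back to the umlaut like this, the bytes themselves are intact and only the decoding step is wrong. Output such as ‘??’ in case (1) usually means information was lost, so the same repair may not apply there.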
