[ https://issues.apache.org/jira/browse/TIKA-324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Wolanin updated TIKA-324:
-------------------------------

    Attachment: TIKA-324.patch

Attached is a patch against Tika 0.4. It resolves the bug for me, at least for the simple test case.

{code}
$ java -jar tika-app-0.4.jar -t ./test.txt
Iñtërnâtiônàlizætiøn

$ java -jar tika-app-0.4.jar -x ./test.txt
<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title/>
</head>
<body>
<p>Iñtërnâtiônàlizætiøn
</p>
</body>
</html>
{code}

> Tika CLI mangles utf-8 content in text (-t) mode
> ------------------------------------------------
>
>                 Key: TIKA-324
>                 URL: https://issues.apache.org/jira/browse/TIKA-324
>             Project: Tika
>          Issue Type: Bug
>          Components: cli
>    Affects Versions: 0.3, 0.4
>         Environment: Mac OS 10.5, java version "1.6.0_15"
>            Reporter: Peter Wolanin
>            Priority: Critical
>             Fix For: 0.5
>
>         Attachments: test.txt, TIKA-324.patch
>
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> When using the -t flag to tika, multi-byte content is destroyed in the output.
> Example:
> {code}
> $ java -jar tika-app-0.4.jar -t ./test.txt
> I?t?rn?ti?n?liz?ti?n
> $ java -jar tika-app-0.4.jar -x ./test.txt
> <?xml version="1.0" encoding="UTF-8"?>
> <html xmlns="http://www.w3.org/1999/xhtml">
> <head>
> <title/>
> </head>
> <body>
> <p>Iñtërnâtiônàlizætiøn
> </p>
> </body>
> </html>
> {code}
> see also: http://drupal.org/node/622508#comment-2267918

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
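
For readers following the issue, here is a minimal sketch of the kind of behaviour reported above. It is not the attached TIKA-324.patch and does not use any Tika API. It assumes (this is a guess, not something the report confirms) that the "?" characters come from text being encoded with a charset that cannot represent the accented letters, such as a platform default, while the -x output is encoded as UTF-8. The class name Utf8OutputSketch and the helper encode are hypothetical names chosen for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.PrintStream;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Tika and not the attached patch.
public class Utf8OutputSketch {

    // The sample string from test.txt, written with Unicode escapes so the
    // source file compiles the same way regardless of the compiler's encoding.
    static final String SAMPLE =
            "I\u00f1t\u00ebrn\u00e2ti\u00f4n\u00e0liz\u00e6ti\u00f8n";

    // Encode text through a Writer bound to the given charset and return the bytes.
    // Characters the charset cannot represent are replaced by the encoder.
    static byte[] encode(String text, Charset charset) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(bytes, charset)) {
            writer.write(text);
        } catch (IOException e) {
            // Not expected for an in-memory stream.
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        // A charset without the accented letters degrades them to '?'.
        byte[] lossy = encode(SAMPLE, StandardCharsets.US_ASCII);
        System.out.println(new String(lossy, StandardCharsets.US_ASCII));

        // Wrapping System.out with an explicit encoding avoids depending on
        // the platform default when printing the text.
        PrintStream out = new PrintStream(System.out, true, "UTF-8");
        out.println(SAMPLE);
    }
}
```

Whether the second line displays correctly still depends on the terminal interpreting the bytes as UTF-8. On Java 6, as in the reported environment, Charset.forName("UTF-8") and Charset.forName("US-ASCII") would stand in for the StandardCharsets constants, and the try-with-resources block would need an explicit close.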