[ https://issues.apache.org/jira/browse/TIKA-324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778144#action_12778144 ]
Peter Wolanin commented on TIKA-324:
------------------------------------

The code in TikaCLI.java seems to have changed in trunk, so it is not clear whether the bug is still present. TikaGUI.java has something very similar to the code as altered in this patch, yet it correctly renders the test string in 0.4. The output seems to go via a StringWriter rather than directly to System.out, which may make the difference?

> Tika CLI mangles utf-8 content in text (-t) mode
> ------------------------------------------------
>
>                 Key: TIKA-324
>                 URL: https://issues.apache.org/jira/browse/TIKA-324
>             Project: Tika
>          Issue Type: Bug
>          Components: cli
>    Affects Versions: 0.3, 0.4
>         Environment: Mac OS 10.5, java version "1.6.0_15"
>            Reporter: Peter Wolanin
>            Priority: Critical
>             Fix For: 0.5
>
>         Attachments: test.txt, TIKA-324.patch
>
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> When using the -t flag to tika, multi-byte content is destroyed in the output.
> Example:
> $ java -jar tika-app-0.4.jar -t ./test.txt
> I?t?rn?ti?n?liz?ti?n
> $ java -jar tika-app-0.4.jar -x ./test.txt
> <?xml version="1.0" encoding="UTF-8"?>
> <html xmlns="http://www.w3.org/1999/xhtml">
> <head>
> <title/>
> </head>
> <body>
> <p>Iñtërnâtiônàlizætiøn
> </p>
> </body>
> </html>
> see also: http://drupal.org/node/622508#comment-2267918

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
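
To illustrate the StringWriter versus System.out question raised in the comment, here is a minimal Java sketch of how the charset behind a PrintStream can change what reaches the terminal. It is not Tika code: the class `EncodingDemo` and the helper `printWith` are hypothetical names, and US-ASCII is only an assumed stand-in for a platform default encoding that cannot represent the test string. Whether that is what happens in TikaCLI on the reporter's machine is not confirmed here.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Tika.
class EncodingDemo {

    // Returns the bytes a PrintStream emits for `text` when it is
    // constructed with the named charset (hypothetical helper).
    static byte[] printWith(String text, String charsetName) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try {
            PrintStream out = new PrintStream(sink, true, charsetName);
            out.print(text);
            out.flush();
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        // The test string from the issue, written with Unicode escapes so the
        // source file encoding does not matter.
        String text = "I\u00f1t\u00ebrn\u00e2ti\u00f4n\u00e0liz\u00e6ti\u00f8n";

        // Assumed stand-in for a default encoding that lacks these characters:
        // each unmappable character is replaced with '?'.
        byte[] asciiBytes = printWith(text, "US-ASCII");
        System.out.println(new String(asciiBytes, StandardCharsets.US_ASCII));

        // With an explicit UTF-8 stream the text survives a round trip.
        byte[] utf8Bytes = printWith(text, "UTF-8");
        System.out.println(new String(utf8Bytes, StandardCharsets.UTF_8).equals(text));
    }
}
```

A StringWriter holds characters rather than bytes, so no encoding happens until something writes its contents to a byte stream. That would be consistent with the GUI rendering the string correctly while the CLI does not, though this sketch alone does not prove that is the cause.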