[ https://issues.apache.org/jira/browse/TIKA-324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778133#action_12778133 ]
Peter Wolanin commented on TIKA-324:
------------------------------------

Examining the TikaCLI.java code, the xhtml and text outputs are handled very differently. I'm not sure why the text one fails, but it seems to be easily rectified by applying the transformer using "text" as the method.

> Tika CLI mangles utf-8 content in text (-t) mode
> ------------------------------------------------
>
>                 Key: TIKA-324
>                 URL: https://issues.apache.org/jira/browse/TIKA-324
>             Project: Tika
>          Issue Type: Bug
>          Components: cli
>    Affects Versions: 0.3, 0.4
>         Environment: Mac OS 10.5, java version "1.6.0_15"
>            Reporter: Peter Wolanin
>            Priority: Critical
>             Fix For: 0.5
>
>         Attachments: test.txt
>
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> When using the -t flag to tika, multi-byte content is destroyed in the output.
> Example:
> {code}
> $ java -jar tika-app-0.4.jar -t ./test.txt
> I?t?rn?ti?n?liz?ti?n
> $ java -jar tika-app-0.4.jar -x ./test.txt
> <?xml version="1.0" encoding="UTF-8"?>
> <html xmlns="http://www.w3.org/1999/xhtml">
> <head>
> <title/>
> </head>
> <body>
> <p>Iñtërnâtiônàlizætiøn
> </p>
> </body>
> </html>
> {code}
> see also: http://drupal.org/node/622508#comment-2267918
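
For illustration, the workaround suggested in the comment (sending the text output through a transformer whose output method is "text") might look roughly like the sketch below. It is not the actual TikaCLI patch. It assumes the JDK's default TrAX implementation, and the names TextOutputSketch, textHandler and render are hypothetical. Setting the encoding to UTF-8 explicitly is also an assumption about what the fix needs; the idea is that the serializer then writes bytes in a known encoding instead of relying on the platform default.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical sketch, not the TikaCLI source.
class TextOutputSketch {

    // Builds an identity transformer that serializes SAX events as plain
    // text, encoded as UTF-8, onto the given stream.
    static TransformerHandler textHandler(OutputStream out) throws Exception {
        SAXTransformerFactory factory =
                (SAXTransformerFactory) SAXTransformerFactory.newInstance();
        TransformerHandler handler = factory.newTransformerHandler();
        handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "text");
        handler.getTransformer().setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        handler.setResult(new StreamResult(out));
        return handler;
    }

    // Feeds a single <p> element containing the given text through the
    // handler and returns what was written, decoded as UTF-8.
    static String render(String text) throws Exception {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        TransformerHandler handler = textHandler(buffer);
        char[] chars = text.toCharArray();
        handler.startDocument();
        handler.startElement("", "p", "p", new AttributesImpl());
        handler.characters(chars, 0, chars.length);
        handler.endElement("", "p", "p");
        handler.endDocument();
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        // The sample string from test.txt, written with Unicode escapes so
        // the source file's own encoding does not matter.
        String sample = "I\u00f1t\u00ebrn\u00e2ti\u00f4n\u00e0liz\u00e6ti\u00f8n";
        // Write raw UTF-8 bytes rather than printing through a PrintStream
        // that would apply the platform default charset.
        System.out.write(render(sample).getBytes(StandardCharsets.UTF_8));
        System.out.println();
        System.out.flush();
    }
}
```

In a real CLI the handler would presumably be handed to the parser as its ContentHandler, with System.out as the target stream, in place of the synthetic SAX events used here.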