[ https://issues.apache.org/jira/browse/TIKA-324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jukka Zitting updated TIKA-324:
-------------------------------

    Affects Version/s: 0.5
        Fix Version/s:     (was: 0.5)
              Summary: Tika CLI mangles utf-8 content in text (-t) mode (on Mac OS X)  (was: Tika CLI mangles utf-8 content in text (-t) mode)

The problem here is that the default --text output uses the default encoding, which AFAIK for Java on Mac OS X is MacRoman even though OS X otherwise uses UTF-8. This is why all non-MacRoman characters get turned to question marks when printed to the console.

The proposed patch changes the output encoding to UTF-8 (the default for TransformerHandler) for all platforms, which can cause problems on platforms with different default encodings.

The reason why --text behaves differently from --xml and --html is that the latter are considered to output binary data that either contains explicit encoding information (<?xml version="1.0" encoding="UTF-8"?> for --xml) or works around the encoding issue in other ways (Iñtërnâtiônàlizætiøn for --html). The --text output is expected to be consumable by default text processing tools of the platform (grep "Iñtërnâtiônàlizætiøn"), so it needs to use the correct character encoding.

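For illustration, here is a minimal sketch of the substitution described above. It is not Tika code: the class name EncodingDemo and its write helper are hypothetical, and US-ASCII is used only as a stand-in for "a charset that cannot represent these characters", since the platform default differs from machine to machine. It writes the test string through an OutputStreamWriter, the same class TikaCLI uses, once with the restrictive charset and once with UTF-8.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingDemo {

    // Hypothetical helper: writes text through an OutputStreamWriter with the
    // given charset and returns the raw bytes that would reach the console.
    public static byte[] write(String text, Charset charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, charset)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // "Iñtërnâtiônàlizætiøn", escaped so the source file encoding does not matter
        String text = "I\u00f1t\u00ebrn\u00e2ti\u00f4n\u00e0liz\u00e6ti\u00f8n";

        // Stand-in for a default encoding that lacks these characters:
        // the writer silently replaces each unmappable character with '?'.
        byte[] ascii = write(text, StandardCharsets.US_ASCII);
        System.out.println(new String(ascii, StandardCharsets.US_ASCII));

        // With an explicit UTF-8 writer the text survives a round trip.
        byte[] utf8 = write(text, StandardCharsets.UTF_8);
        System.out.println(new String(utf8, StandardCharsets.UTF_8).equals(text));
    }
}
```

The point of the sketch is that OutputStreamWriter does not fail on characters the target charset cannot encode; it replaces them, which is consistent with the question marks seen in the report.
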
To avoid breaking things on other platforms, I suggest that we only override the default encoding on Mac OS X, like this:

Index: tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java
===================================================================
--- tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java (revision 880772)
+++ tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java (working copy)
@@ -228,6 +228,10 @@
             throws UnsupportedEncodingException {
         if (encoding != null) {
             return new OutputStreamWriter(System.out, encoding);
+        } else if (System.getProperty("os.name")
+                .toLowerCase().startsWith("mac os x")) {
+            // TIKA-324: Override the default encoding on Mac OS X
+            return new OutputStreamWriter(System.out, "UTF-8");
         } else {
             return new OutputStreamWriter(System.out);
         }

> Tika CLI mangles utf-8 content in text (-t) mode (on Mac OS X)
> --------------------------------------------------------------
>
>                 Key: TIKA-324
>                 URL: https://issues.apache.org/jira/browse/TIKA-324
>             Project: Tika
>          Issue Type: Bug
>          Components: cli
>    Affects Versions: 0.3, 0.4, 0.5
>         Environment: Mac OS 10.5, java version "1.6.0_15"
>            Reporter: Peter Wolanin
>            Priority: Critical
>         Attachments: test.txt, TIKA-324-0.5.patch, TIKA-324.patch, TIKA-324.patch
>
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> When using the -t flag to tika, multi-byte content is destroyed in the output.
> Example:
> $ java -jar tika-app-0.4.jar -t ./test.txt
> I?t?rn?ti?n?liz?ti?n
> $ java -jar tika-app-0.4.jar -x ./test.txt
> <?xml version="1.0" encoding="UTF-8"?>
> <html xmlns="http://www.w3.org/1999/xhtml">
> <head>
> <title/>
> </head>
> <body>
> <p>Iñtërnâtiônàlizætiøn
> </p>
> </body>
> </html>
> see also: http://drupal.org/node/622508#comment-2267918

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.