[ 
https://issues.apache.org/jira/browse/TIKA-324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jukka Zitting updated TIKA-324:
-------------------------------

    Affects Version/s: 0.5
        Fix Version/s:     (was: 0.5)
              Summary: Tika CLI mangles utf-8 content in text (-t) mode (on Mac 
OS X)  (was: Tika CLI mangles utf-8 content in text (-t) mode)

The problem here is that the default --text output uses the platform default
encoding, which AFAIK for Java on Mac OS X is MacRoman even though OS X
otherwise uses UTF-8. This is why all non-MacRoman characters get turned into
question marks when printed to the console.
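
As a rough illustration of the question-mark effect, here is a minimal sketch that writes the test string through an OutputStreamWriter bound to a charset that cannot represent the accented characters. US-ASCII is used as a stand-in because every JVM is required to ship it; the class and method names (EncodingDemo, roundTrip) are made up for this example and are not part of Tika.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingDemo {
    // Hypothetical helper: encode text with the given charset via an
    // OutputStreamWriter, then decode the bytes again with the same charset.
    static String roundTrip(String text, Charset charset) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(bytes, charset)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(bytes.toByteArray(), charset);
    }

    public static void main(String[] args) {
        // Unicode escapes keep the example independent of the source file encoding.
        String text = "I\u00f1t\u00ebrn\u00e2ti\u00f4n\u00e0liz\u00e6ti\u00f8n";

        // Characters the charset cannot encode are replaced by '?'.
        System.out.println(roundTrip(text, StandardCharsets.US_ASCII)); // I?t?rn?ti?n?liz?ti?n

        // UTF-8 can encode every character, so the text survives unchanged.
        System.out.println(roundTrip(text, StandardCharsets.UTF_8).equals(text)); // true
    }
}
```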

The proposed patch changes the output encoding to UTF-8 (the default for 
TransformerHandler) for all platforms, which can cause problems on platforms 
with different default encodings.
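
To make the cross-platform concern concrete, the following sketch shows what a console that expects ISO-8859-1 would display if it were handed UTF-8 bytes. ISO-8859-1 is only an assumed example of a "different default encoding", and MojibakeDemo is a hypothetical name.

```java
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Encode as UTF-8, then decode the same bytes as ISO-8859-1, which is
    // roughly what a Latin-1 console does with forced UTF-8 output.
    static String utf8ReadAsLatin1(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // U+00F1 (n with tilde) is two bytes in UTF-8: 0xC3 0xB1.
        String shown = utf8ReadAsLatin1("\u00f1");
        System.out.println(shown.length());                 // 2
        System.out.println(shown.equals("\u00c3\u00b1"));   // true
    }
}
```

Each non-ASCII character turns into two unrelated characters, so tools like grep would no longer match the original text on such a platform.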

The reason why --text behaves differently from --xml and --html is that the 
latter are considered to output binary data that either contains explicit 
encoding information (<?xml version="1.0" encoding="UTF-8"?> for --xml) or 
works around the encoding issue in other ways 
(I&ntilde;t&euml;rn&acirc;ti&ocirc;n&agrave;liz&aelig;ti&oslash;n for --html). 
The --text output is expected to be consumable by the platform's default text
processing tools (grep "Iñtërnâtiônàlizætiøn"), so it needs to use the correct
character encoding.
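
The --html workaround can be sketched roughly as follows: every non-ASCII character is replaced by a character reference, so the output bytes are pure ASCII and read the same under any ASCII-compatible encoding. Note that this sketch emits numeric references (&#241;) rather than the named entities (&ntilde;) shown above, and that AsciiSafeHtml is a hypothetical helper, not the serializer Tika actually uses.

```java
class AsciiSafeHtml {
    // Replace every code point outside ASCII with a decimal character
    // reference. Markup characters such as '<' and '&' are NOT escaped here.
    static String escapeNonAscii(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            if (codePoint < 0x80) {
                out.appendCodePoint(codePoint);
            } else {
                out.append("&#").append(codePoint).append(';');
            }
            i += Character.charCount(codePoint);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeNonAscii("I\u00f1t")); // I&#241;t
    }
}
```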

To avoid breaking things on other platforms, I suggest that we only override 
the default encoding on Mac OS X, like this:

Index: tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java
===================================================================
--- tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java     (revision 880772)
+++ tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java     (working copy)
@@ -228,6 +228,10 @@
             throws UnsupportedEncodingException {
         if (encoding != null) {
             return new OutputStreamWriter(System.out, encoding);
+        } else if (System.getProperty("os.name")
+                .toLowerCase().startsWith("mac os x")) {
+            // TIKA-324: Override the default encoding on Mac OS X
+            return new OutputStreamWriter(System.out, "UTF-8");
         } else {
             return new OutputStreamWriter(System.out);
         }
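
For anyone who wants to try the selection logic outside TikaCLI, here is a standalone sketch of the same decision. The os.name value is passed in as a parameter so the branches can be exercised on any machine; OutputCharsetChooser and choose are hypothetical names, and StandardCharsets assumes Java 7 or later, whereas the patch above uses the string "UTF-8".

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

class OutputCharsetChooser {
    // Mirrors the three branches of the patch: an explicit encoding wins,
    // then the Mac OS X override, then the platform default.
    static Charset choose(String explicitEncoding, String osName) {
        if (explicitEncoding != null) {
            return Charset.forName(explicitEncoding);
        } else if (osName != null
                && osName.toLowerCase(Locale.ROOT).startsWith("mac os x")) {
            // TIKA-324: Override the default encoding on Mac OS X
            return StandardCharsets.UTF_8;
        } else {
            return Charset.defaultCharset();
        }
    }

    public static void main(String[] args) {
        System.out.println(choose(null, System.getProperty("os.name")));
    }
}
```

Locale.ROOT is used for lower-casing so the comparison does not depend on the user's locale; that is an addition in this sketch, not something the patch does.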


> Tika CLI mangles utf-8 content in text (-t) mode (on Mac OS X)
> --------------------------------------------------------------
>
>                 Key: TIKA-324
>                 URL: https://issues.apache.org/jira/browse/TIKA-324
>             Project: Tika
>          Issue Type: Bug
>          Components: cli
>    Affects Versions: 0.3, 0.4, 0.5
>         Environment: Mac OS 10.5, java version "1.6.0_15"
>            Reporter: Peter Wolanin
>            Priority: Critical
>         Attachments: test.txt, TIKA-324-0.5.patch, TIKA-324.patch, 
> TIKA-324.patch
>
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> When using the -t flag to tika, multi-byte content is destroyed in the output.
> Example:
> $ java -jar tika-app-0.4.jar -t ./test.txt
> I?t?rn?ti?n?liz?ti?n
> $ java -jar tika-app-0.4.jar -x ./test.txt
> <?xml version="1.0" encoding="UTF-8"?>
> <html xmlns="http://www.w3.org/1999/xhtml">
> <head>
> <title/>
> </head>
> <body>
> <p>Iñtërnâtiônàlizætiøn
> </p>
> </body>
> </html>
> see also:  http://drupal.org/node/622508#comment-2267918

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
