[ 
https://issues.apache.org/jira/browse/TIKA-324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Wolanin updated TIKA-324:
-------------------------------

    Attachment: TIKA-324.patch


Attached is a patch against Tika 0.4.  It resolves the bug for me, at least for 
the simple test case.

{code}
$ java -jar tika-app-0.4.jar -t ./test.txt 
Iñtërnâtiônàlizætiøn

$ java -jar tika-app-0.4.jar -x ./test.txt 
<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title/>
</head>
<body>
<p>Iñtërnâtiônàlizætiøn
</p>
</body>
</html>
{code}
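
For readers following along, here is a minimal sketch of how this kind of "?" substitution can arise in Java. It is not the attached patch, and it does not use any Tika code. It assumes the text output passes through a Writer whose charset cannot represent the characters; US-ASCII stands in for such a charset here, and the class and method names (EncodingSketch, encode) are made up for illustration. Encoding the same string explicitly as UTF-8 keeps every character.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class; not part of Tika or the patch.
public class EncodingSketch {

    // Encode text through an OutputStreamWriter using the given charset.
    // OutputStreamWriter replaces unmappable characters with the
    // charset's replacement byte, which is '?' for US-ASCII.
    static byte[] encode(String text, Charset charset) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        Writer writer = new OutputStreamWriter(bytes, charset);
        writer.write(text);
        writer.flush();
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // The test string from the issue, written with Unicode escapes so
        // the result does not depend on the source file encoding.
        String text = "I\u00f1t\u00ebrn\u00e2ti\u00f4n\u00e0liz\u00e6ti\u00f8n";

        // A charset that cannot represent the accented characters.
        byte[] ascii = encode(text, StandardCharsets.US_ASCII);
        System.out.println(new String(ascii, StandardCharsets.US_ASCII));
        // prints: I?t?rn?ti?n?liz?ti?n

        // Explicit UTF-8 round-trips the original string unchanged.
        byte[] utf8 = encode(text, StandardCharsets.UTF_8);
        System.out.println(text.equals(new String(utf8, StandardCharsets.UTF_8)));
        // prints: true
    }
}
```

The general remedy for this class of bug is to name the output charset explicitly instead of relying on a default. Whether that matches the change in TIKA-324.patch should be checked against the patch itself.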



> Tika CLI mangles utf-8 content in text (-t) mode
> ------------------------------------------------
>
>                 Key: TIKA-324
>                 URL: https://issues.apache.org/jira/browse/TIKA-324
>             Project: Tika
>          Issue Type: Bug
>          Components: cli
>    Affects Versions: 0.3, 0.4
>         Environment: Mac OS 10.5, java version "1.6.0_15"
>            Reporter: Peter Wolanin
>            Priority: Critical
>             Fix For: 0.5
>
>         Attachments: test.txt, TIKA-324.patch
>
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> When the -t flag is passed to the Tika CLI, multi-byte UTF-8 characters are replaced with "?" in the output, while the -x output is correct.
> Example:
> {code}
> $ java -jar tika-app-0.4.jar -t ./test.txt
> I?t?rn?ti?n?liz?ti?n
> $ java -jar tika-app-0.4.jar -x ./test.txt
> <?xml version="1.0" encoding="UTF-8"?>
> <html xmlns="http://www.w3.org/1999/xhtml">
> <head>
> <title/>
> </head>
> <body>
> <p>Iñtërnâtiônàlizætiøn
> </p>
> </body>
> </html>
> {code}
> See also: http://drupal.org/node/622508#comment-2267918

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
