[ https://issues.apache.org/jira/browse/TIKA-290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-290.
--------------------------------

       Resolution: Fixed
    Fix Version/s: 0.5
         Assignee: Jukka Zitting

The NPE issue was already fixed in current trunk, but parsing the test file 
still threw an IOException because of an unsupported character encoding. It 
looks like the ICU4J encoding detection code we use can pick obscure encodings, 
including ones the Java runtime does not support, as the "best match" when the 
input is consistent with several different encodings.

I solved the immediate problem in revision 820956 by only accepting encodings 
that are actually supported by the Java runtime.
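
For illustration, here is a minimal sketch of that kind of filter. It is not 
the actual change from revision 820956: the class and method names are 
hypothetical, and the candidate list is assumed to be the detector's guesses 
already ordered from most to least confident. The only real API used is 
java.nio.charset.Charset.isSupported.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical helper, not part of Tika: picks the first detected
// encoding name that the running JVM can actually decode.
public class SupportedCharsetPicker {

    // True only if the name is syntactically legal and the JVM has a Charset for it.
    static boolean isUsable(String name) {
        if (name == null) {
            return false;
        }
        try {
            return Charset.isSupported(name);
        } catch (IllegalCharsetNameException e) {
            // Charset.isSupported rejects malformed names with an exception.
            return false;
        }
    }

    // Candidates are assumed to be ordered from best to worst match.
    static Optional<String> firstSupported(List<String> candidates) {
        for (String name : candidates) {
            if (isUsable(name)) {
                return Optional.of(name);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // "x-hypothetical-unsupported" stands in for an encoding the JVM lacks.
        List<String> candidates =
                Arrays.asList("x-hypothetical-unsupported", "ISO-8859-1", "UTF-8");
        System.out.println(firstSupported(candidates).orElse("(none)")); // prints ISO-8859-1
    }
}
```

A filter like this only avoids the exception. It does nothing to improve which 
of the remaining candidates is chosen.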

The solution is still not ideal as Tika now reports the test file as using the 
ISO-8859-2 encoding. I guess we need to come up with some better detection 
heuristics for cases like this. I'll follow up on tika-dev@; for now I'm 
resolving this issue as fixed.

> org.apache.tika.exception.TikaException: Unexpected RuntimeException from 
> org.apache.tika.parser.txt.txtpar...@6caf16
> ---------------------------------------------------------------------------------------------------------------------
>
>                 Key: TIKA-290
>                 URL: https://issues.apache.org/jira/browse/TIKA-290
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.4
>         Environment: Windows XP / jdk1.6.0_15
>            Reporter: MRIT64
>            Assignee: Jukka Zitting
>            Priority: Minor
>             Fix For: 0.5
>
>         Attachments: test.txt
>
>
> This is just for information (I am testing Tika).
> I am using tika-app-0.4.jar out of the box. 
> I get the following run-time error:
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from 
> org.apache.tika.parser.txt.txtpar...@6caf16
> with the ANSI text file containing :
> azerty 
> 123456789012345 6789012345678901 2345678901234567890123456789 
> 0123456789012345678901234567890123 456789012345678901234567890123456 
> 789012345678901234567890123456789012345678901 2345678901234567890123456789 
> 012345678901234567890123456 7890123456789012345 
> 678901234567890123456789012345 6789012345678901234567890
> 1234567890123456789012 345678901234567890123456789012345 
> 6789012345678901234567890123456789012345678901234 
> 567890123456789012345678901234567890123456789012345678901234 
> 56789012345678901234567890123456789012345678901234567890123456789012345 
> 78901234567890123456789012345678901234 56789012345678901234567890TOOLONGTOKEN
> qwerty.
> It works well if this file is saved in UTF-8 or if I delete some lines in the 
> ANSI file. I don't know why.
> Best regards

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
