TXTParser use of CharsetDetector has several bugs
-------------------------------------------------

                 Key: TIKA-335
                 URL: https://issues.apache.org/jira/browse/TIKA-335
             Project: Tika
          Issue Type: Bug
    Affects Versions: 0.5
            Reporter: Ken Krugler


In looking at how TXTParser uses CharsetDetector, I see the following issues:

1. The incoming charset (if any) from metadata should be passed to 
CharsetDetector.setDeclaredEncoding().
2. The first supported charset should be used, not the last. These are returned 
in confidence order, from best to worst.
3. The current code can also end up taking the language from one result and 
the charset from a different one.

So the biggest change is to bail out of the loop once a supported charset has 
been found. 
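
A minimal sketch of that loop change, under stated assumptions. It does not 
use ICU's CharsetDetector directly: the Candidate class below is a 
hypothetical stand-in for one detector result (ICU's CharsetMatch exposes 
getName(), getLanguage() and getConfidence()), and the class and method names 
are illustrative only, not taken from TXTParser. The list is assumed to be 
ordered from best to worst, as detectAll() returns it. Point 1 (passing the 
metadata charset to setDeclaredEncoding()) is not covered here.

```java
import java.nio.charset.Charset;
import java.util.List;
import java.util.Optional;

public class FirstSupportedCharset {

    /** Hypothetical stand-in for one detector result. */
    public static final class Candidate {
        public final String charsetName;
        public final String language;
        public final int confidence;

        public Candidate(String charsetName, String language, int confidence) {
            this.charsetName = charsetName;
            this.language = language;
            this.confidence = confidence;
        }
    }

    /** True if the JVM can decode this charset; illegal or null names count as unsupported. */
    static boolean isSupported(String name) {
        try {
            return name != null && Charset.isSupported(name);
        } catch (IllegalArgumentException e) {
            // Charset.isSupported throws for syntactically illegal names.
            return false;
        }
    }

    /**
     * Returns the first candidate whose charset is supported. Because the
     * input is in confidence order, this is the best usable match, and the
     * caller reads both charset and language from the same result.
     */
    public static Optional<Candidate> pick(List<Candidate> candidates) {
        for (Candidate c : candidates) {
            if (isSupported(c.charsetName)) {
                return Optional.of(c); // stop at the first supported match
            }
        }
        return Optional.empty();
    }
}
```

A caller would take both the charset name and the language from the single 
Candidate returned by pick(), which avoids mixing values from two different 
results.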

