TXTParser use of CharsetDetector has several bugs
-------------------------------------------------
Key: TIKA-335
URL: https://issues.apache.org/jira/browse/TIKA-335
Project: Tika
Issue Type: Bug
Affects Versions: 0.5
Reporter: Ken Krugler

In looking at how TXTParser uses CharsetDetector, I see the following issues:

1. The incoming charset (if any) from metadata should be passed to CharsetDetector.setDeclaredEncoding().

2. The first supported charset should be used, not the last. These are returned in confidence order, from best to worst.

3. The current code might also wind up setting a language from one result, and the charset from another.

So the biggest change is to bail out of the loop once a supported charset has been found.
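
A rough sketch of the loop change described in items 2 and 3, not the actual TXTParser patch. Candidate and CharsetPick are hypothetical names standing in for the detector's match objects and the parser's selection logic; the list is assumed to already be in confidence order, best first. Item 1 (passing the declared charset to CharsetDetector.setDeclaredEncoding()) happens before detection and is not modelled here.

```java
import java.nio.charset.Charset;
import java.util.List;

// Hypothetical stand-in for one detector result: a charset name, the
// language guessed alongside it, and a confidence score.
final class Candidate {
    final String charsetName;
    final String language;
    final int confidence;

    Candidate(String charsetName, String language, int confidence) {
        this.charsetName = charsetName;
        this.language = language;
        this.confidence = confidence;
    }
}

// Hypothetical helper showing the "first supported, then stop" selection.
final class CharsetPick {
    private CharsetPick() {}

    // Returns the first candidate whose charset this JVM supports, or null
    // if none is. Because the whole candidate is returned, charset and
    // language always come from the same result.
    static Candidate firstSupported(List<Candidate> rankedBestFirst) {
        for (Candidate c : rankedBestFirst) {
            if (isSupported(c.charsetName)) {
                return c; // bail out: later entries have lower confidence
            }
        }
        return null;
    }

    // Charset.isSupported throws IllegalArgumentException (or its subclass
    // IllegalCharsetNameException) for null or malformed names; treat
    // those as unsupported.
    static boolean isSupported(String name) {
        try {
            return Charset.isSupported(name);
        } catch (IllegalArgumentException e) {
            return false;
        }
    }
}
```

With candidates ranked as an unsupported name, then UTF-8 (language "en"), then ISO-8859-1 (language "de"), this picks UTF-8 together with "en". A loop that keeps iterating and overwrites on every supported match would end on ISO-8859-1 instead, which is the behaviour item 2 objects to.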