[ https://issues.apache.org/jira/browse/TIKA-322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778137#action_12778137 ]
Luke Nezda commented on TIKA-322:
---------------------------------

http://code.google.com/p/juniversalchardet/ has a pretty good, efficient charset detector, which is a Java port of the Mozilla universalchardet algorithms. It is licensed under the Mozilla Public License Version 1.1. I am not sure whether the MPL is ASF-compatible; it appears to be, but I am not a lawyer. As far as I know, it does not provide the detection confidence or language detection features that ICU4J does, and I think it has code/data files for fewer encodings, but it is primarily statistical, so they could be added. I am also not sure what choices were made with regard to multiple encodings. In theory, it should detect what Firefox detects for a given URL/file.

> Improve encoding detection speed and accuracy
> ---------------------------------------------
>
>                 Key: TIKA-322
>                 URL: https://issues.apache.org/jira/browse/TIKA-322
>             Project: Tika
>          Issue Type: Improvement
>          Components: mime
>            Reporter: Jukka Zitting
>            Priority: Minor
>
> The encoding detection code we took from ICU4J is not very efficient and
> sometimes produces odd results when more than one encoding matches the given
> input data. It would be good to refactor the code to be faster for
> easy-to-detect encodings and to have better heuristics in case multiple
> matches are found.
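
To make the "faster for easy-to-detect encodings" idea concrete, here is a minimal sketch of a cheap fast path that could run before any statistical detector. It is not Tika, ICU4J, or juniversalchardet code; the class and method names (FastPathDetector, detectFast) are hypothetical, and it uses only the JDK. It checks for a byte order mark, then for pure 7-bit input, then for strictly valid UTF-8, and returns null when none of these is conclusive so that the caller can fall back to a statistical detector.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only; not part of any existing library.
final class FastPathDetector {

    /**
     * Returns a charset name when a cheap check is conclusive,
     * or null when the caller should fall back to a statistical detector.
     */
    static String detectFast(byte[] data) {
        if (data.length == 0) {
            return null; // nothing to inspect
        }

        // 1. Byte order marks. UTF-32LE must be tested before UTF-16LE,
        //    because its BOM starts with the same two bytes (FF FE).
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(data, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(data, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(data, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(data, 0xFF, 0xFE)) return "UTF-16LE";

        // 2. Pure 7-bit input: no byte has the high bit set.
        boolean sevenBit = true;
        for (byte b : data) {
            if (b < 0) {
                sevenBit = false;
                break;
            }
        }
        if (sevenBit) return "US-ASCII";

        // 3. Strict UTF-8 validation: report malformed input instead of
        //    silently replacing it, so invalid sequences raise an exception.
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return "UTF-8";
        } catch (CharacterCodingException e) {
            return null; // not valid UTF-8; leave it to the statistical detector
        }
    }

    /** True if data begins with the given unsigned byte values. */
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }
}
```

Two caveats with this sketch: 7-bit input can still be an escape-based encoding such as ISO-2022-JP, and a sample cut off in the middle of a multi-byte sequence fails the strict UTF-8 check, so a real implementation would need to handle both cases before trusting the fast path.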