Hi,

Interesting graph from Google about the relative usage of different character encodings:

http://googleblog.blogspot.com/2010/01/unicode-nearing-50-of-web.html

It's interesting to see that the Unicode entry only lists the UTF-8 encoding. Are the other Unicode encodings so infrequent?

I think we can use this data as a guideline when optimizing the encoding detection code in Tika.

BR,

Jukka Zitting