Hi,

Interesting graph from Google about the relative usage of different
character encodings:

    http://googleblog.blogspot.com/2010/01/unicode-nearing-50-of-web.html

Notably, the Unicode entry lists only the UTF-8 encoding. Are the
other Unicode encodings really that infrequent?

I think we can use this data as a guideline when optimizing the
encoding detection code in Tika.

BR,

Jukka Zitting
