Today every browser implements its own encoding label matching algorithm and 
supports its own list of encodings and its own list of encoding label aliases. 
Everything sort of works, but not really.

HTML5 solves part of this problem by defining exactly how to identify an 
encoding label alias in a text/html stream. It also defines which encoding 
label matching algorithm to use, UTS22, but we found out that this is 
incompatible with existing sites that specify EUC_JP at the HTTP level and 
actually want to be decoded as UTF-8 according to a <meta> in the text/html 
stream. This works fine with a strict encoding label matching algorithm: 
EUC_JP is not recognized, so the HTTP-level label is ignored and the <meta> 
is used. With UTS22, EUC_JP and EUC-JP become the same thing, even though only 
the latter is an actual encoding label, so the HTTP-level label takes effect 
and the page is decoded with the wrong encoding.
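
To make the EUC_JP case concrete, here is a rough Python sketch of the two 
matching strategies. It assumes the UTS22 rule is "drop everything except 
ASCII letters and digits, lowercase, and drop zeros not preceded by a digit". 
The label table, the function names, and the "HTTP-level label wins if it is 
recognized, otherwise use <meta>" precedence are simplifications made up for 
illustration, not what any browser or the HTML5 draft literally does.

```python
import re

# Hypothetical, tiny label table; real clients ship far larger lists.
KNOWN_LABELS = {"euc-jp": "EUC-JP", "utf-8": "UTF-8"}


def strict_match(label):
    """Strict matching: trim whitespace and lowercase, nothing else."""
    return KNOWN_LABELS.get(label.strip().lower())


def uts22_key(label):
    """Approximation of UTS22 loose matching (assumed rules, see above)."""
    # Keep only ASCII letters and digits, then lowercase.
    key = re.sub(r"[^A-Za-z0-9]", "", label).lower()
    # Delete runs of zeros that are not preceded by a digit.
    return re.sub(r"(?<![0-9])0+", "", key)


# Index the same table by its loose key.
LOOSE_LABELS = {uts22_key(k): v for k, v in KNOWN_LABELS.items()}


def loose_match(label):
    return LOOSE_LABELS.get(uts22_key(label))


def pick_encoding(http_label, meta_label, match):
    """Simplified precedence: a recognized HTTP-level label beats <meta>."""
    return match(http_label) or match(meta_label)


# Strict matching ignores the bogus HTTP label, so the <meta> wins.
print(pick_encoding("EUC_JP", "utf-8", strict_match))
# Loose matching accepts EUC_JP as EUC-JP, so the <meta> is never consulted.
print(pick_encoding("EUC_JP", "utf-8", loose_match))
```

With the strict matcher the page ends up as UTF-8, which is what these sites 
expect; with the loose matcher it ends up as EUC-JP.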

Another problem HTML5 does not solve is the lack of a definitive list of 
encodings that clients have to implement to be compatible with a large body of 
Web content. This means new clients will have to reverse engineer that list 
from existing clients, which I think is bad.


-- 
Anne van Kesteren
http://annevankesteren.nl/
