John Cowan wrote:
> > The result is a back-to-the-principles "WCode", nicely streamlined:
> > - no compatibility or precomposed characters
> 
> But less compact.  Without precomposed characters, the overhead of
> conversion from old character sets grows considerably.

True. Compactness was not a goal of "WCode"; ease of processing was.
Anyway, this is no different from the current Unicode policy for new characters,
which are no longer added in precomposed form.
Also, I do not think that average strings would become much longer: maybe 10-20%
for Western European text.
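
A rough way to check that estimate on a given text is to count how many code
points are added when precomposed characters are decomposed. The sketch below
uses Unicode canonical decomposition (NFD) from Python's standard library as a
stand-in for what "WCode" would do; the function name `decomposition_growth`
is only illustrative, and the result depends entirely on the sample text.

```python
import unicodedata


def decomposition_growth(text: str) -> float:
    """Relative growth in code points after canonical decomposition (NFD).

    NFD is used here as an approximation of a character set without
    precomposed characters; it is an assumption, not the WCode definition.
    """
    if not text:
        return 0.0
    decomposed = unicodedata.normalize("NFD", text)
    return len(decomposed) / len(text) - 1.0


# "M\u00fcller" holds a precomposed u-diaeresis: 6 code points become 7.
name = "M\u00fcller"
print(len(name), len(unicodedata.normalize("NFD", name)))
print(f"growth: {decomposition_growth(name):.1%}")
```

Running it over a larger body of text in a given language gives a figure to
compare against the 10-20% guess.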

> >   + 8-bit form encodes the BMP uniquely and C1 controls as single bytes
> 
> WTF-8 uses 2 bytes for ASCII, C0 Latin

No. It uses single bytes for 00..9F, which covers US-ASCII and both the C0 and C1 controls.

> , Latin A and B, Modifiers, Marks,
> IPA, and part of Greek; the rest of Greek (the alphabet is actually
> split down the middle), Cyrillic, Armenian, Hebrew, Arabic, Syriac, Thaana
> are now 3 bytes.
> 
> So typical Greek or Cyrillic text now requires 3 bytes per letter in WTF-8,
> 2 bytes in WTF-16.  WTF-8 is only really useful for ASCII-compatibility.

Right. You gain something (C1 controls as single bytes, which many Unix fans
would like to see) and you lose something (compactness of encoding).
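
To make the trade-off concrete, here is a small sketch comparing per-character
byte counts. `utf8_len` uses Python's real UTF-8 encoder as a baseline.
`wtf8_len_sketch` is hypothetical: it encodes only what this thread states
(single bytes for 00..9F, two bytes for a range above that, three bytes for
the rest of the BMP). The cut-off `two_byte_end` is a placeholder, since the
exact boundary is not given here beyond "part of Greek".

```python
def utf8_len(ch: str) -> int:
    """Number of bytes standard UTF-8 uses for one character."""
    return len(ch.encode("utf-8"))


def wtf8_len_sketch(ch: str, two_byte_end: int = 0x03A0) -> int:
    """Hypothetical WTF-8 length, based only on the ranges named in this thread.

    two_byte_end is an assumed boundary somewhere inside the Greek block,
    not a documented value.
    """
    cp = ord(ch)
    if cp <= 0x9F:          # US-ASCII, C0 and C1 controls
        return 1
    if cp < two_byte_end:   # Latin-1 upper half up to part of Greek
        return 2
    if cp <= 0xFFFF:        # rest of the BMP
        return 3
    raise ValueError("code points beyond the BMP are not covered by this sketch")


samples = [
    ("LATIN CAPITAL A", "A"),
    ("C1 control NEL", "\u0085"),
    ("e with acute", "\u00e9"),
    ("CYRILLIC SMALL YA", "\u044f"),
]
for label, ch in samples:
    print(f"{label:20} UTF-8: {utf8_len(ch)}  WTF-8 (sketch): {wtf8_len_sketch(ch)}")
```

Under these assumptions the C1 control shrinks from two bytes to one, while
the Cyrillic letter grows from two bytes to three.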

markus
