Re: UTF-8 and UTF-16 issues

Edward Cherlin Sun, 25 Jun 2000 18:17:00 -0700
At 2:48 PM -0800 6/19/00, Markus Scherer wrote:
>"OLeary, Sean (NJ)" wrote:
> > UTF-16 is the 16-bit encoding of Unicode that includes the use of
> > surrogates. This is essentially a fixed width encoding.
>
>certainly not. utf-16, of course, is variable-width: 1 or 2 16-bit 
>units per character. certainly the iuc discussion did not spread 
>this under "utf-16" but possibly as "ucs-2".
[snip]

The essential distinction that Sean refers to is not that all 
characters are encoded in the same length, but that all coding 
elements are of the same length. This is in contrast not with ISO 
10646, but with "double-byte" encodings of CJK text, where escape 
sequences are used to switch between runs of 8-bit and 16-bit codes.
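
To make the contrast concrete, here is a minimal sketch of the escape-switched case. It assumes Python and its standard iso2022_jp codec; picking ISO-2022-JP as the example is my assumption, since no particular double-byte encoding is named above. The same payload bytes read differently depending on which escape sequence preceded them.

```python
# Illustration only: ISO-2022-JP is chosen here as one example of an
# escape-switched encoding; the thread does not name a specific one.
ESC_TO_JIS = b"\x1b$B"    # switch to two-byte JIS X 0208 codes
ESC_TO_ASCII = b"\x1b(B"  # switch back to one-byte ASCII codes

text = "\u65e5\u672c"  # two CJK characters
encoded = text.encode("iso2022_jp")

# Strip the escape sequences to get the bare payload bytes.
payload = encoded.replace(ESC_TO_JIS, b"").replace(ESC_TO_ASCII, b"")

# With the escapes in place, the bytes decode to the original text.
with_state = encoded.decode("iso2022_jp")

# Without them, the decoder stays in its initial one-byte state and
# reads the very same payload bytes as unrelated one-byte characters.
without_state = payload.decode("iso2022_jp")

print(encoded, payload, with_state == text, without_state == text)
```

Every byte in the encoded form is below 0x80, so nothing in a payload byte itself says whether it is a whole character or half of one.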

The point of the distinction is that in double-byte encodings the 
only way to tell the length of the current character is by parsing 
from the beginning of the file. In UTF-16, the current 16-bit value 
is explicitly a 16-bit character code (assigned, unassigned, or 
Private Use), a high surrogate code, a low surrogate code, or not 
a character code, without reference to what has gone before in the 
file.
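
As a rough sketch of that property (Python; the function names are hypothetical, made up for this illustration), a 16-bit value can be classified by range checks alone. The surrogate ranges are the standard D800-DBFF for the first unit of a pair and DC00-DFFF for the second; treating only FFFE and FFFF as "not a character code" is a simplifying assumption.

```python
def classify_code_unit(unit: int) -> str:
    """Classify one 16-bit code unit using nothing but its own value."""
    if not 0 <= unit <= 0xFFFF:
        raise ValueError("not a 16-bit value")
    if 0xD800 <= unit <= 0xDBFF:
        return "high surrogate"   # first unit of a surrogate pair
    if 0xDC00 <= unit <= 0xDFFF:
        return "low surrogate"    # second unit of a surrogate pair
    if unit in (0xFFFE, 0xFFFF):
        return "noncharacter"     # simplified: other cases ignored here
    return "character code"       # assigned, unassigned, or Private Use


def char_start(units: list[int], i: int) -> int:
    """Index where the character containing units[i] begins.

    At most one neighbouring unit is inspected; there is no need to
    rescan from the start of the text.
    """
    if (classify_code_unit(units[i]) == "low surrogate" and i > 0
            and classify_code_unit(units[i - 1]) == "high surrogate"):
        return i - 1
    return i


# 'A' followed by U+10400, which takes the pair D801 DC00.
units = [0x0041, 0xD801, 0xDC00]
labels = [classify_code_unit(u) for u in units]
```

Here `char_start(units, 2)` steps back one unit to index 1, while `char_start(units, 0)` and `char_start(units, 1)` return their own index.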


Edward Cherlin
Generalist
"A knot!" exclaimed Alice. "Oh, do let me help to undo it."
Alice in Wonderland