Kenneth Whistler wrote:
> Think of it this way. Does anyone expect the ASCII standard to tell, in detail, what a process should or should not do if it receives data which purports to be ASCII, but which contains an 0x80 byte in it? All the ASCII standard can really do is tell you that 0x80 is not defined in ASCII, and a conformant process shall not interpret 0x80 as an ASCII character. Beyond that, it is up to the software engineers to figure out who goofed up in mislabelling or corrupting the data, and what the process receiving the bad data should do about it.

That is not a good comparison. ASCII is a single-byte character code standard. When I get a 0x80 in an ASCII string, I know where the boundary is: the whole 8 bits of that 0x80 byte are bad. The scope is not the first 3 bits, nor 9 bits, but exactly those 8 bits of data. I cannot tell whether the rest of the data is good or bad, but I know ASCII is 8 bits and 8 bits only.
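As a minimal illustration of that point (a Python sketch added here, not part of any standard): in data that claims to be ASCII, every byte with the high bit set is invalid by itself, so the error boundary is always exactly one byte.

    # Scan purported ASCII data; any byte >= 0x80 is undefined in ASCII,
    # and the "bad unit" is always exactly that single byte.
    def find_ascii_errors(data: bytes):
        return [(i, b) for i, b in enumerate(data) if b >= 0x80]

    print(find_ascii_errors(b"abc\x80def"))   # [(3, 128)] -- one byte, one error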
The same is true for JIS X 0208 (a two-byte, and only two-byte, character set, not a variable-length character set). If I am processing an ISO-2022-JP message in JIS X 0208 mode and I get 0x24 0xA8, I know the boundary of that problem is 16 bits, not 8 bits nor 32 bits.
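To make the two-byte boundary concrete, here is a simplified sketch (my own illustration; escape-sequence handling and the exact table of assigned code points are omitted). Inside the JIS X 0208 mode of an ISO-2022-JP stream, bytes are consumed strictly in pairs and each byte of a pair must stay in the 7-bit range 0x21-0x7E, so a pair like 0x24 0xA8 is bad as one 16-bit unit.

    # Simplified check inside JIS X 0208 mode of an ISO-2022-JP stream
    # (escape sequences / mode switching omitted; only the byte-range
    # rule 0x21..0x7E is checked, not the full code point table).
    def find_jis0208_errors(data: bytes):
        errors = []
        for i in range(0, len(data) - 1, 2):
            hi, lo = data[i], data[i + 1]
            if not (0x21 <= hi <= 0x7E and 0x21 <= lo <= 0x7E):
                # The bad unit is the whole 16-bit pair, never 8 or 32 bits.
                errors.append((i, bytes([hi, lo])))
        return errors

    print(find_jis0208_errors(b"\x24\x22\x24\xa8\x24\x24"))
    # [(2, b'$\xa8')] -- the 0x24 0xA8 pair is the problem, as one 16-bit unit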
When you deal with encodings that need state (ISO-2022, ISO-2022-JP, etc.) or variable-length encodings (Shift_JIS, Big5, UTF-8), the situation is different.
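A small Python sketch of why (again my own illustration, not anything mandated by a standard): in UTF-8 the lead byte announces how many bytes should follow, so when a sequence goes wrong partway through, a decoder has to decide whether the bad unit is one byte, the bytes consumed so far, or the whole announced length, and different decoders make different choices.

    # 0xE3 promises a 3-byte sequence; 0x81 is a valid continuation byte,
    # but 0x41 ('A') is not -- so how big is the ill-formed unit?
    bad = b"\xe3\x81\x41"
    print(bad.decode("utf-8", errors="replace"))
    # CPython replaces the maximal subpart 0xE3 0x81 with a single U+FFFD
    # and resumes at 'A'; other decoders may emit one U+FFFD per byte or
    # simply stop, which is exactly the ambiguity fixed-width codes avoid.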

