Re: discovering code points with embedded nulls

Otto Stolz Wed, 05 Feb 2003 11:25:46 -0800

[EMAIL PROTECTED] wrote:

I'm dealing with an API that claims it doesn't support Unicode characters with embedded nulls.

...

Test all constituent bytes for 0x00.

This depends on the encoding form you are using (and the one the API expects):

- UTF-8 encodes a Unicode string into a sequence of bytes;
this sequence contains a 0x00 byte only where the string itself
contains U+0000 (NUL).
Btw., ASCII characters are encoded the same way as in ASCII.

- UTF-16 encodes a Unicode string into a sequence of 16-bit units,
hence it makes no sense to look at this encoding bytewise.
If you nevertheless treat a 16-bit unit as a sequence of two bytes
(repeat: this is a no-no), then you will most probably find
0x00 bytes therein; in particular, every ASCII character is
encoded as a sequence of the respective ASCII byte and a 0x00 byte
(both orders are possible, cf. <http://www.unicode.org/faq/utf_bom.html>).

- UTF-32 encodes a Unicode string into a sequence of 32-bit units,
hence it makes no sense to look at this encoding bytewise.
If you nevertheless treat a 32-bit unit as a sequence of four bytes
(repeat: this is a no-no), then you will certainly find
0x00 bytes therein: code points end at U+10FFFF, so the most
significant byte of every unit is 0x00. In particular, every ASCII
character is encoded as a sequence of the respective ASCII byte and
three 0x00 bytes.
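
The byte-level observations above can be checked with a few lines of Python. This is only a sketch for illustration: the helper name has_zero_byte is made up here, and the codec names ("utf-16-le" etc.) are Python's, not anything the API in question is known to use.

```python
def has_zero_byte(text: str, encoding: str) -> bool:
    """Return True if the serialized form of text contains a 0x00 byte."""
    # Hypothetical helper: serializes the string and scans it bytewise.
    return b"\x00" in text.encode(encoding)


ENCODINGS = ("utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be")

# "A" is ASCII; "\u4e2d" (U+4E2D) is a code point whose UTF-16 unit
# happens to contain no zero byte.
for sample in ("A", "\u4e2d"):
    for enc in ENCODINGS:
        data = sample.encode(enc)
        print(f"U+{ord(sample):04X} {enc:<9} {data.hex(' '):<11} "
              f"zero byte: {has_zero_byte(sample, enc)}")
```

For "A" this prints 41 for UTF-8, 41 00 or 00 41 for the two UTF-16 byte orders, and 41 00 00 00 or 00 00 00 41 for UTF-32. For U+4E2D, only the UTF-32 forms contain a zero byte.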

Best wishes,
Otto Stolz
