[EMAIL PROTECTED] wrote:
I'm dealing with an API that claims it doesn't support unicode characters with embedded nulls.
...
Whether it makes sense to test all constituent bytes for 0x00 depends on the encoding form you are using (and the API is expecting):
- UTF-8 encodes a Unicode string into a sequence of bytes;
this sequence contains a 0x00 byte only where the string itself
contains the character U+0000 (NUL).
By the way, ASCII characters are encoded the same way as in ASCII.
- UTF-16 encodes a Unicode string into a sequence of 16-bit units,
hence it makes no sense to look at this encoding bytewise.
If you nevertheless treat a 16-bit unit as a sequence of two bytes
(repeat: this is a no-no), then you will most probably find
0x00 bytes therein; in particular, every ASCII character is
encoded as a sequence of the respective ASCII byte and a 0x00 byte
(both orders are possible, cf. <http://www.unicode.org/faq/utf_bom.html>).
- UTF-32 encodes a Unicode string into a sequence of 32-bit units,
hence it makes no sense to look at this encoding bytewise.
If you nevertheless treat a 32-bit unit as a sequence of four bytes
(repeat: this is a no-no), then you will certainly find
0x00 bytes therein; in particular, every ASCII character is
encoded as a sequence of the respective ASCII byte and three
0x00 bytes.
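
If you want to see the byte-level picture for yourself, here is a rough
Python sketch. It is only an illustration, not something your API
prescribes: the helper name zero_byte_count is made up for this mail, and
the encoding names ("utf-8", "utf-16-le", ...) are Python's codec names
for the byte-serialised forms, so the byte order is spelled out explicitly.

```python
def zero_byte_count(text: str, encoding: str) -> int:
    """Count the 0x00 bytes in `text` after serialising it with `encoding`.

    Hypothetical helper for illustration only.
    """
    return text.encode(encoding).count(0)


encodings = ("utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be")

samples = {
    "ASCII only": "Az",
    "embedded U+0000": "A\x00z",
    "non-ASCII (U+20AC)": "\u20ac",
}

for label, text in samples.items():
    print(label)
    for enc in encodings:
        data = text.encode(enc)
        # Show the raw bytes in hex next to the number of 0x00 bytes found.
        print(f"  {enc:10} {data.hex():26} zero bytes: {zero_byte_count(text, enc)}")
```

For "Az" this reports no 0x00 byte in UTF-8, two in either UTF-16 byte
order and six in either UTF-32 byte order. The euro sign U+20AC gives no
0x00 byte in UTF-16 but still two in UTF-32, which is why I wrote "most
probably" for UTF-16 and "certainly" for UTF-32.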
Best wishes,
Otto Stolz

