On Wed, Jul 2, 2014 at 8:02 AM, Karl Williamson <[email protected]> wrote:
> In UTF-8, an example would be that Sun, I'm told, and for reasons I've
> forgotten or never knew, did not want raw NUL bytes to appear in text
> streams, so used the overlong sequence \xC0\x80 to represent them; overlong
> sequences generally being considered "bad" because they could be used to
> insert malicious payloads into the input.

In C, a NUL byte terminates a string. If data that may contain NUL characters has to pass through C string functions, the NULs cannot be stored as \0, because everything after the first one would be cut off. I might argue that 11111111b for 0x00 in UTF-8 would be technically legal (the standard never specifies which bit sequences correspond to which byte values), but \xC0\x80 would probably be processed more reliably by existing code.

-- 
Where there is life, there is hope.
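
To make the point about C string functions concrete, here is a minimal sketch in Python. The helper names encode_modified_utf8 and c_strlen are made up for illustration; c_strlen only imitates what C's strlen does (stop at the first 0x00 byte), and the encoder covers only the NUL substitution discussed here, not any other rule a full "modified UTF-8" scheme might have.

```python
def encode_modified_utf8(text: str) -> bytes:
    """Encode text as UTF-8, but write U+0000 as the overlong pair C0 80.

    In well-formed UTF-8 the byte 0x00 only ever encodes U+0000, so a
    plain byte replacement is enough for this illustration.
    """
    return text.encode("utf-8").replace(b"\x00", b"\xc0\x80")


def c_strlen(buf: bytes) -> int:
    """Imitate C's strlen: count bytes up to the first 0x00 byte."""
    end = buf.find(b"\x00")
    return len(buf) if end == -1 else end


sample = "ab\x00cd"
plain = sample.encode("utf-8")           # contains a raw 0x00 byte
modified = encode_modified_utf8(sample)  # 0x00 replaced by C0 80

print(c_strlen(plain))     # 2: a C function would stop after "ab"
print(c_strlen(modified))  # 6: the whole buffer survives
```

The trade-off is the one Karl mentions: a strict UTF-8 decoder is expected to reject C0 80 as overlong, so both ends have to agree on the convention.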

