Can you give a reference for null bytes not being valid UTF-8?
The way I read the Unicode standard (and ISO/IEC 10646) on UTF-8, all code points in the range U+0000 through U+007F are encoded as exactly one byte. The octet sequence \xC0\x80 is, as far as I can see, *invalid* UTF-8 (the bytes 0xC0 and 0xC1 cannot occur anywhere in valid UTF-8). It is used as a hack in some cases where people know it will be interpreted as the null code point anyway, but only because the decoder doesn't validate its input.
(See: http://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences)
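As a minimal illustration (a sketch in C++, not code from the patch under review), a strict validator can reject the \xC0\x80 hack simply by rejecting the lead bytes 0xC0 and 0xC1:

    #include <cstddef>
    #include <cstdint>

    // Lead bytes 0xC0 and 0xC1 can never start a valid UTF-8 sequence:
    // a two-byte sequence 110xxxxx 10xxxxxx must encode U+0080..U+07FF,
    // and those two lead bytes could only produce overlong encodings of
    // U+0000..U+007F.
    bool IsValidTwoByteLead(std::uint8_t lead) {
      return lead >= 0xC2 && lead <= 0xDF;
    }

    // The \xC0\x80 hack: a lenient decoder maps it to U+0000, but a
    // validating decoder must reject it as an overlong encoding.
    bool IsOverlongEncodedNull(const std::uint8_t* bytes, std::size_t length) {
      return length >= 2 && bytes[0] == 0xC0 && bytes[1] == 0x80;
    }
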
Where I see people talk about null-terminated UTF-8 strings, it's typically in contexts where a null character would not make sense in the string anyway.

I think the current function, which takes the start and length of the UTF-8 byte sequence, should be retained. If nothing else, changing the function in this way can break existing code that uses it, and we don't want that.

If null-terminated UTF-8 strings are common, it would be better to add a separate function to create JS strings from those.
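Such a function could be a thin wrapper over the existing start-plus-length one. A hedged sketch, with made-up names (neither function below is the actual V8 API):

    #include <cstring>

    // Placeholder type and names, for illustration only.
    struct JSStringHandle;

    // The existing style of factory: the caller supplies start and length.
    JSStringHandle* NewStringFromUtf8(const char* data, std::size_t length);

    // A possible additional entry point for null-terminated input: since a
    // null-terminated string cannot contain an embedded U+0000, strlen()
    // recovers exactly the length the caller would otherwise have to pass.
    JSStringHandle* NewStringFromNullTerminatedUtf8(const char* data) {
      return NewStringFromUtf8(data, std::strlen(data));
    }
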
If you are creating the UTF-8 byte sequences yourself, you most likely already know where each one ends, and it's just a matter of not throwing that information away. If you are using a library that returns null-terminated UTF-8 sequences, then it's obviously not as simple.
http://codereview.chromium.org/6524031/