Can you give a reference for null bytes not being valid UTF-8?
The way I read the Unicode standard (and ISO/IEC 10646) on UTF-8, all code points in the range U+0000 through U+007F are encoded as exactly one byte. The octet sequence \xC0\x80 is, as far as I can see, *invalid* UTF-8 (the bytes 0xC0 and 0xC1 cannot occur anywhere in valid UTF-8). It is used as a hack in some cases where people know it will be interpreted as the null code point anyway, but only because the decoder doesn't validate its input.
(See: http://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences)
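As a minimal illustration (a sketch in C++, not code from the patch under review), a strict validator can reject the \xC0\x80 hack simply by rejecting the lead bytes 0xC0 and 0xC1:

    #include <cstddef>
    #include <cstdint>

    // Lead bytes 0xC0 and 0xC1 can never start a valid UTF-8 sequence:
    // a two-byte sequence 110xxxxx 10xxxxxx must encode U+0080..U+07FF,
    // and those two lead bytes could only produce overlong encodings of
    // U+0000..U+007F.
    bool IsValidTwoByteLead(std::uint8_t lead) {
      return lead >= 0xC2 && lead <= 0xDF;
    }

    // The \xC0\x80 hack: a lenient decoder maps it to U+0000, but a
    // validating decoder must reject it as an overlong encoding.
    bool IsOverlongEncodedNull(const std::uint8_t* bytes, std::size_t length) {
      return length >= 2 && bytes[0] == 0xC0 && bytes[1] == 0x80;
    }
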
Where I see people talk about null-terminated UTF-8 strings, it's typically in contexts where a null character would not make sense in the string anyway.

I think the current function, which takes the start and length of the UTF-8 byte sequence, should be retained. If nothing else, changing the function in this way can break existing code that uses it, and we don't want that.

If null-terminated UTF-8 strings are common, it would be better to add a separate function to create JS strings from those.
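Such a function could be a thin wrapper over the existing start-plus-length one. A hedged sketch, with made-up names (neither function below is the actual V8 API):

    #include <cstring>

    // Placeholder type and names, for illustration only.
    struct JSStringHandle;

    // The existing style of factory: the caller supplies start and length.
    JSStringHandle* NewStringFromUtf8(const char* data, std::size_t length);

    // A possible additional entry point for null-terminated input: since a
    // null-terminated string cannot contain an embedded U+0000, strlen()
    // recovers exactly the length the caller would otherwise have to pass.
    JSStringHandle* NewStringFromNullTerminatedUtf8(const char* data) {
      return NewStringFromUtf8(data, std::strlen(data));
    }
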
If you are creating the UTF-8 byte sequences yourself, you most likely already know where each one ends, and it's just a matter of not throwing that information away. If you are using a library that returns null-terminated UTF-8 sequences, then it's obviously not as simple.
http://codereview.chromium.org/6524031/