Comment #9 on issue 761 by [email protected]: Incorrect UTF-8 encoding/decoding for non-BMP characters in String related functions
http://code.google.com/p/v8/issues/detail?id=761

<quote>
I'm reconsidering whether it's possible to convert all incoming UTF-8 into UTF-16 sequences instead of UCS-2 (i.e., convert a non-BMP character into a surrogate pair). This will be on input only, and won't make sense outside of comments and String and RegExp literals (since a surrogate code isn't valid anywhere else). It's likely to confuse users, since we won't ever interpret the result as UTF-16 anyway. That means that the length of a string literal containing non-BMP characters is different from the number of Unicode characters sent as UTF-8.
</quote>

I think this is perfectly fine. In particular, it's fine that 'length' will keep counting 16-bit (2-byte) code units instead of Unicode characters.
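
To make the difference concrete, here is a rough sketch in Python. It is purely illustrative and is not V8 code; the helper names to_surrogate_pair and utf16_code_units are made up for this example. It splits a non-BMP code point into a surrogate pair and compares the number of Unicode characters in a UTF-8 input with the number of 16-bit code units that 'length' would count.

```python
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a non-BMP code point (U+10000..U+10FFFF) into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # upper 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)   # lower 10 bits go into the low surrogate
    return high, low


def utf16_code_units(text: str) -> int:
    """Count 16-bit code units; this is the quantity 'length' keeps counting."""
    return len(text.encode("utf-16-le")) // 2


# "a" followed by U+1D11E (a non-BMP character), as it would arrive in UTF-8.
utf8_input = b"a\xf0\x9d\x84\x9e"
decoded = utf8_input.decode("utf-8")

high, low = to_surrogate_pair(0x1D11E)
print(hex(high), hex(low))          # the surrogate pair for U+1D11E
print(len(decoded))                 # number of Unicode characters in the input
print(utf16_code_units(decoded))    # number of 16-bit code units
```

For this input the two counts differ by one: the non-BMP character is a single Unicode character but occupies two code units once it is stored as a surrogate pair.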



--
v8-dev mailing list
[email protected]
http://groups.google.com/group/v8-dev
