Comment #33 on issue 761 by [email protected]: Incorrect UTF-8 encoding/decoding for non-BMP characters in String related functions
http://code.google.com/p/v8/issues/detail?id=761

The bleeding edge revision 11007 has fixes to handle surrogate pairs on input and output. The intended behaviour is:

* 4-byte UTF-8 sequences turn into 2 surrogates in the JS String.
* Two 3-byte UTF-8 sequences can also be used to create 2 surrogates in the JS String.
* String.fromCharCode(x) takes a single UTF-16 code unit, so you still can't give it numbers above 0xffff.
* Most places in JS (RegExp, [], charCodeAt, charAt, etc.) work on UTF-16 code units with no special treatment for surrogates.
* On output to UTF-8, unmatched surrogates map to a 3-byte UTF-8 sequence, and surrogate pairs map to a single 4-byte UTF-8 sequence.
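
As a rough illustration of the points above, here is a minimal JavaScript sketch. It is not V8's implementation: toSurrogatePair and utf8Bytes are hypothetical helpers written only for this example, and utf8Bytes simply mimics the output mapping described in the list (a pair becomes 4 bytes, an unmatched surrogate becomes 3 bytes).

```javascript
// Hypothetical helper (not a V8 API): build the two UTF-16 code units
// for a code point above 0xFFFF, since String.fromCharCode only takes
// single code units.
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;
  const high = 0xD800 + (offset >> 10);
  const low = 0xDC00 + (offset & 0x3FF);
  return String.fromCharCode(high, low);
}

// Hypothetical helper (not a V8 API): encode a JS string to UTF-8 bytes
// following the mapping described above.
function utf8Bytes(str) {
  const bytes = [];
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    // A high surrogate followed by a low surrogate forms a pair.
    if (unit >= 0xD800 && unit <= 0xDBFF && i + 1 < str.length) {
      const next = str.charCodeAt(i + 1);
      if (next >= 0xDC00 && next <= 0xDFFF) {
        const cp = 0x10000 + ((unit - 0xD800) << 10) + (next - 0xDC00);
        bytes.push(0xF0 | (cp >> 18),
                   0x80 | ((cp >> 12) & 0x3F),
                   0x80 | ((cp >> 6) & 0x3F),
                   0x80 | (cp & 0x3F));
        i++; // the low surrogate has been consumed
        continue;
      }
    }
    if (unit < 0x80) {
      bytes.push(unit);
    } else if (unit < 0x800) {
      bytes.push(0xC0 | (unit >> 6), 0x80 | (unit & 0x3F));
    } else {
      // Covers the rest of the BMP, including unmatched surrogates.
      bytes.push(0xE0 | (unit >> 12),
                 0x80 | ((unit >> 6) & 0x3F),
                 0x80 | (unit & 0x3F));
    }
  }
  return bytes;
}

const clef = toSurrogatePair(0x1D11E);      // U+1D11E, outside the BMP
console.log(clef.length);                   // 2 (two code units)
console.log(clef.charCodeAt(0).toString(16)); // d834
console.log(clef.charCodeAt(1).toString(16)); // dd1e
// fromCharCode keeps only the low 16 bits of its argument:
console.log(String.fromCharCode(0x1D11E).charCodeAt(0).toString(16)); // d11e
console.log(utf8Bytes(clef));               // [ 240, 157, 132, 158 ]
console.log(utf8Bytes(clef.charAt(0)));     // [ 237, 160, 180 ]
```

Note that charAt(0) on the pair returns just the high surrogate, which is why the last call produces a 3-byte sequence rather than a valid encoding of the whole character.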

--
v8-dev mailing list
[email protected]
http://groups.google.com/group/v8-dev