Comment #33 on issue 761 by [email protected]: Incorrect UTF-8
encoding/decoding for non-BMP characters in String related functions
http://code.google.com/p/v8/issues/detail?id=761
The bleeding edge revision 11007 has fixes to handle surrogate pairs on
input and output. The intended behaviour is:
* 4-byte UTF-8 sequences turn into 2 surrogates in the JS String
* Two 3-byte UTF-8 sequences, each encoding one surrogate, can also be used
to create 2 surrogates in the JS String
* String.fromCharCode(x) takes a single UTF-16 code unit, so you still
can't give it numbers above 0xffff (larger values are truncated to 16 bits)
* Most places in JS (RegExp, [], charCodeAt, charAt, etc.) work on UTF-16
code units with no special treatment for surrogates.
* On output to UTF-8, unmatched surrogates map to a 3-byte UTF-8 sequence,
and surrogate pairs map to a single 4-byte UTF-8 sequence.
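The behaviour listed above can be sketched in plain JavaScript. This is only
an illustration, not V8's code: encodeUtf8Sketch is a hypothetical helper
written for this example, and it assumes the output rules are exactly the
ones described in the last bullet.

```javascript
// Hypothetical helper (not a V8 API): encodes a JS string to UTF-8 bytes
// using the mapping described in the list above.
function encodeUtf8Sketch(str) {
  const bytes = [];
  for (let i = 0; i < str.length; i++) {
    let c = str.charCodeAt(i);
    // A lead surrogate followed by a trail surrogate forms one code point.
    if (c >= 0xd800 && c <= 0xdbff && i + 1 < str.length) {
      const next = str.charCodeAt(i + 1);
      if (next >= 0xdc00 && next <= 0xdfff) {
        c = 0x10000 + ((c - 0xd800) << 10) + (next - 0xdc00);
        i++;
      }
    }
    if (c < 0x80) {
      bytes.push(c);
    } else if (c < 0x800) {
      bytes.push(0xc0 | (c >> 6), 0x80 | (c & 0x3f));
    } else if (c < 0x10000) {
      // Unmatched surrogates land here and get a 3-byte sequence.
      bytes.push(0xe0 | (c >> 12), 0x80 | ((c >> 6) & 0x3f), 0x80 | (c & 0x3f));
    } else {
      // Surrogate pairs land here and get a single 4-byte sequence.
      bytes.push(0xf0 | (c >> 18), 0x80 | ((c >> 12) & 0x3f),
                 0x80 | ((c >> 6) & 0x3f), 0x80 | (c & 0x3f));
    }
  }
  return bytes;
}

const s = "\ud83d\ude00";                        // U+1F600 as a surrogate pair
const unitCount = s.length;                      // counted in UTF-16 code units
const firstUnit = s.charCodeAt(0);               // the lead surrogate only
const truncated = String.fromCharCode(0x1f600);  // only the low 16 bits are kept
const pairBytes = encodeUtf8Sketch(s);           // one 4-byte sequence
const loneBytes = encodeUtf8Sketch("\ud83d");    // one 3-byte sequence
```

Here unitCount is 2 and firstUnit is 0xd83d, since length and charCodeAt see
code units rather than code points. pairBytes is f0 9f 98 80 and loneBytes is
ed a0 bd.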
--
v8-dev mailing list
[email protected]
http://groups.google.com/group/v8-dev