Comment #4 on issue 761 by [email protected]: Incorrect UTF-8 encoding/decoding for non-BMP characters in String related functions
http://code.google.com/p/v8/issues/detail?id=761
After closer inspection, I don't see any way we can safely use the second option.
We parse the same input either as UTF-8 or as a String value, and the two must parse identically. More precisely, the pre-parser parses it as UTF-8, and the real parser parses it from a string later on. The pre-parser stores indices into the string for later use, so the number of code points in the two representations MUST be the same. That means we can't turn one code point into a surrogate pair as long as we don't parse the string as UTF-16 too.
I.e., the second option would also require changing all parsing from string values to interpret the string value as UTF-16, which is a larger change.
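To make the index mismatch concrete, here is a small illustrative sketch (not V8 code, and the sample source string is invented): a non-BMP character is one code point when the pre-parser decodes UTF-8, but two UTF-16 code units once it becomes a surrogate pair in the string value, so any stored index past that character would point at different positions in the two representations.

```python
# Illustration of why code-point indices computed over UTF-8 input
# diverge from UTF-16 indices once a non-BMP character is represented
# as a surrogate pair. The sample string is hypothetical.

s = "var x\U0001D306y"  # U+1D306 TETRAGRAM FOR CENTRE, outside the BMP

# What a UTF-8 pre-parser counts: one unit per decoded code point.
# (Python strings index by code point.)
code_points = len(s)

# What a parser reading a UTF-16 string value counts: one unit per
# 16-bit code unit, so the astral character occupies two units.
utf16_units = len(s.encode("utf-16-le")) // 2

print(code_points, utf16_units)  # 7 vs 8 -- indices after U+1D306 disagree
```

Any pre-parser index recorded after the astral character (e.g. the position of `y`) would be off by one in the UTF-16 view, which is exactly why the surrogate-pair option is unsafe without also reparsing string values as UTF-16.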
-- v8-dev mailing list [email protected] http://groups.google.com/group/v8-dev
