Comment #12 on issue 761 by [email protected]: Incorrect UTF-8 encoding/decoding for non-BMP characters in String related functions
http://code.google.com/p/v8/issues/detail?id=761
While it would be convenient to convert UTF-8 to UTF-16 and then treat the result as UCS-2, we should still be compatible with other browsers. Currently we match Safari and IE: a four-byte sequence like F0 90 80 80 (the UTF-8 encoding of U+10000) is converted to four U+FFFD characters (probably because the decoder doesn't recognize the first byte, and the following bytes aren't valid UTF-8 lead bytes). In comparison, Opera and Firefox read it as one U+FFFD: they evidently decode the UTF-8 correctly and then replace the single non-BMP character with U+FFFD.
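To make the difference concrete, here is a rough Python sketch of the two behaviours for the four-byte UTF-8 encoding of U+10000 (bytes F0 90 80 80). It is not V8's or any browser's actual decoder: the function names `decode_bmp_only` and `decode_then_replace` are made up for illustration, and the first function simply assumes a decoder that knows only one- to three-byte sequences and emits U+FFFD for every byte it cannot use.

```python
REPLACEMENT = "\uFFFD"


def decode_bmp_only(data: bytes) -> str:
    """Hypothetical decoder that understands only 1- to 3-byte sequences.

    Any byte it cannot use as a lead byte (including F0..F4 and stray
    continuation bytes) becomes one U+FFFD.
    """
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                      # ASCII
            out.append(chr(b))
            i += 1
            continue
        if 0xC2 <= b <= 0xDF:             # lead byte of a 2-byte sequence
            need, cp, minimum = 1, b & 0x1F, 0x80
        elif 0xE0 <= b <= 0xEF:           # lead byte of a 3-byte sequence
            need, cp, minimum = 2, b & 0x0F, 0x800
        else:                             # unknown lead or stray continuation
            out.append(REPLACEMENT)
            i += 1
            continue
        tail = data[i + 1:i + 1 + need]
        ok = len(tail) == need and all(0x80 <= t <= 0xBF for t in tail)
        if ok:
            for t in tail:
                cp = (cp << 6) | (t & 0x3F)
            # Reject overlong forms and UTF-16 surrogate code points.
            ok = cp >= minimum and not (0xD800 <= cp <= 0xDFFF)
        if ok:
            out.append(chr(cp))
            i += 1 + need
        else:
            out.append(REPLACEMENT)
            i += 1
    return "".join(out)


def decode_then_replace(data: bytes) -> str:
    """Decode UTF-8 fully, then map each non-BMP character to one U+FFFD."""
    text = data.decode("utf-8", errors="replace")
    return "".join(REPLACEMENT if ord(ch) > 0xFFFF else ch for ch in text)


u10000 = bytes([0xF0, 0x90, 0x80, 0x80])
print(len(decode_bmp_only(u10000)))      # four replacement characters
print(len(decode_then_replace(u10000)))  # one replacement character
```

Under these assumptions the first function reproduces the four-U+FFFD result described for Safari and IE, and the second reproduces the single-U+FFFD result described for Opera and Firefox.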
We should probably keep compatibility for now.

--
v8-dev mailing list
[email protected]
http://groups.google.com/group/v8-dev
