Comment #8 on issue 761 by [email protected]: Incorrect UTF-8
encoding/decoding for non-BMP characters in String related functions
http://code.google.com/p/v8/issues/detail?id=761
There has been some discussion in TC39 - at least, on the es-discuss
mailing list - about full Unicode support for ECMAScript strings. A
strawman proposal is at:
http://wiki.ecmascript.org/doku.php?id=strawman:support_full_unicode_in_strings
Note that it has NOT been accepted for ES6, and concerns were raised
about this particular proposal.
...
FWIW, it is possible to do text processing in JavaScript while treating
strings as UTF-16 sequences, both manually and with a little help from the
browser. For example:
// see http://ecmanaut.blogspot.com/2006/07/encoding-decoding-utf8-in-javascript.html
// encode_utf8: each char code of the result is one UTF-8 byte of the input.
function encode_utf8(s) { return unescape(encodeURIComponent(s)); }
// decode_utf8: the inverse; treats each char code of s as a UTF-8 byte.
function decode_utf8(s) { return decodeURIComponent(escape(s)); }
// codes: dump a string's char codes as space-separated hex.
function codes(s) {
  var c = [], i;
  for (i = 0; i < s.length; i += 1) { c[i] = s.charCodeAt(i); }
  return c.map(function (d) { return d.toString(16); }).join(' ');
}
// from the original poster's sample: U+10412, i.e. the surrogate pair d801 dc12
var utf8str = encode_utf8('\ud801\udc12');
codes(utf8str); // > "f0 90 90 92"
codes(decode_utf8(utf8str)); // > "d801 dc12"
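For the "manually" part, here is a minimal sketch of combining a surrogate
pair back into the code point it encodes; the helper name is made up for
this comment, it is not anything V8 or the spec defines:
// Illustration only: combines a UTF-16 surrogate pair into its scalar value.
function surrogatePairToCodePoint(hi, lo) {
  return (hi - 0xD800) * 0x400 + (lo - 0xDC00) + 0x10000;
}
surrogatePairToCodePoint(0xD801, 0xDC12).toString(16); // > "10412"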
IMHO, the suggestion to convert incoming UTF-8 to UTF-16 instead of UCS-2
matches at least part of the reality on the web. Here's another example,
this time involving DOM interop:
// from the WebKit inspector
var u = '\uD834\uDD1E'; // U+1D11E MUSICAL SYMBOL G CLEF
document.title = u; // works on my machine
The 16-bit JavaScript string is being interpreted as a UTF-16 sequence
somewhere between the script runtime and the display. Converting incoming
WebSocket UTF-8 strings to UTF-16 before handing them to JavaScript seems
like the right thing to do, so that they can later find their way back out
to the DOM for display.
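To make that concrete, here is a rough sketch of the conversion being asked
for (the function name is invented for this comment and is not V8's
implementation): a decoded non-BMP code point should come out as a
surrogate pair rather than being truncated or replaced.
// Illustration only: turns a code point into one or two UTF-16 code units.
function codePointToUtf16(cp) {
  if (cp <= 0xFFFF) { return String.fromCharCode(cp); } // BMP: one code unit
  cp -= 0x10000;
  return String.fromCharCode(0xD800 + (cp >> 10),   // high (lead) surrogate
                             0xDC00 + (cp & 0x3FF)); // low (trail) surrogate
}
// U+1D11E arrives over a WebSocket as the UTF-8 bytes f0 9d 84 9e;
// after decoding those bytes to the code point 0x1D11E it should end up
// in the JavaScript string as the two code units d834 dd1e.
codes(codePointToUtf16(0x1D11E)); // > "d834 dd1e"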