Comment #8 on issue 761 by [email protected]: Incorrect UTF-8 encoding/decoding for non-BMP characters in String related functions
http://code.google.com/p/v8/issues/detail?id=761

There has been some discussion in TC39 - at least, on the es-discuss mailing list - about full Unicode support for ECMAScript strings. A strawman proposal is at: http://wiki.ecmascript.org/doku.php?id=strawman:support_full_unicode_in_strings - note that this has NOT been accepted for ES6, and concerns were raised about this particular proposal.

...

FWIW, it is possible to do text processing in JavaScript treating strings as UTF-16 sequences, both manually and with a little help from the browser. For example:

// see http://ecmanaut.blogspot.com/2006/07/encoding-decoding-utf8-in-javascript.html
// (encodeURIComponent yields percent-encoded UTF-8; unescape maps each %xx back to a
// single character, so the result has one character per UTF-8 byte - and vice versa)
function encode_utf8( s ) { return unescape( encodeURIComponent( s ) ); }
function decode_utf8( s ) { return decodeURIComponent( escape( s ) ); }

// dump a string's 16-bit code units as hex
function codes(s) {
  var c = [], i;
  for (i = 0; i < s.length; i += 1) { c[i] = s.charCodeAt(i); }
  return c.map(function (d) { return d.toString(16); }).join(' ');
}

// from original poster's sample
var utf8str = encode_utf8('\ud801\udc12');
codes(utf8str); // > "f0 90 90 92"
codes(decode_utf8(utf8str)); // > "d801 dc12"
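
The "manual" half of that claim is just surrogate-pair arithmetic. As a rough sketch (toCodePoints is a made-up helper here, not anything V8 or the DOM provides), walking the 16-bit code units and combining surrogate pairs by hand looks like:

// combine UTF-16 surrogate pairs into code points by hand
function toCodePoints(s) {
  var points = [], i, hi, lo;
  for (i = 0; i < s.length; i += 1) {
    hi = s.charCodeAt(i);
    if (hi >= 0xD800 && hi <= 0xDBFF && i + 1 < s.length) {
      lo = s.charCodeAt(i + 1);
      if (lo >= 0xDC00 && lo <= 0xDFFF) {
        // high + low surrogate -> supplementary code point
        points.push(0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00));
        i += 1;
        continue;
      }
    }
    points.push(hi); // BMP code unit (or an unpaired surrogate, passed through)
  }
  return points;
}

toCodePoints('\ud801\udc12').map(function (d) { return d.toString(16); }).join(' '); // > "10412"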

IMHO, the suggestion to convert incoming UTF-8 to UTF-16 instead of UCS-2 matches at least part of the reality on the web. Here's another example, this time involving DOM interop:

// from the WebKit inspector
var u = '\uD834\uDD1E'; // U+1D11E MUSICAL SYMBOL G CLEF
document.title = u; // works on my machine

The 16-bit JavaScript string is being interpreted as a UTF-16 sequence somewhere between the script runtime and the display. Converting incoming WebSocket UTF-8 strings to UTF-16 before handing them to JavaScript seems like the right thing to do, so that they can later find their way back out to the DOM for display.
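
For what it's worth, the 4-byte-UTF-8-to-surrogate-pair step itself is cheap. A rough decoder sketch (not V8's or any WebSocket implementation's actual code, and with byte validation omitted for brevity):

// decode an array of UTF-8 bytes into a UTF-16 JavaScript string,
// turning 4-byte sequences into surrogate pairs instead of mangling them
function utf8BytesToUtf16(bytes) {
  var out = '', i = 0, b, cp;
  while (i < bytes.length) {
    b = bytes[i];
    if (b < 0x80) {            // 1-byte (ASCII)
      cp = b; i += 1;
    } else if (b < 0xE0) {     // 2-byte sequence
      cp = ((b & 0x1F) << 6) | (bytes[i + 1] & 0x3F); i += 2;
    } else if (b < 0xF0) {     // 3-byte sequence
      cp = ((b & 0x0F) << 12) | ((bytes[i + 1] & 0x3F) << 6) | (bytes[i + 2] & 0x3F); i += 3;
    } else {                   // 4-byte sequence (non-BMP)
      cp = ((b & 0x07) << 18) | ((bytes[i + 1] & 0x3F) << 12) |
           ((bytes[i + 2] & 0x3F) << 6) | (bytes[i + 3] & 0x3F); i += 4;
    }
    if (cp > 0xFFFF) {
      cp -= 0x10000;           // split into a surrogate pair
      out += String.fromCharCode(0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF));
    } else {
      out += String.fromCharCode(cp);
    }
  }
  return out;
}

codes(utf8BytesToUtf16([0xf0, 0x90, 0x90, 0x92])); // > "d801 dc12"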

