Comment #8 on issue 761 by [email protected]: Incorrect UTF-8 encoding/decoding for non-BMP characters in String related functions
http://code.google.com/p/v8/issues/detail?id=761

There has been some discussion in TC39 - at least, on the es-discuss mailing list - about full Unicode support for ECMAScript strings. A strawman proposal is at: http://wiki.ecmascript.org/doku.php?id=strawman:support_full_unicode_in_strings - note that this has NOT been accepted for ES6, and concerns were raised about this particular proposal.

...

FWIW, it is possible to do text processing in JavaScript treating strings as UTF-16 sequences, both manually and with a little help from the browser. For example:

// see http://ecmanaut.blogspot.com/2006/07/encoding-decoding-utf8-in-javascript.html
// (encodeURIComponent yields percent-encoded UTF-8; unescape maps each %xx back to a
// single character, so the result has one character per UTF-8 byte - and vice versa)
function encode_utf8( s ) { return unescape( encodeURIComponent( s ) ); }
function decode_utf8( s ) { return decodeURIComponent( escape( s ) ); }

// dump a string's 16-bit code units as hex
function codes(s) {
  var c = [], i;
  for (i = 0; i < s.length; i += 1) { c[i] = s.charCodeAt(i); }
  return c.map(function (d) { return d.toString(16); }).join(' ');
}

// from original poster's sample
var utf8str = encode_utf8('\ud801\udc12');
codes(utf8str); // > "f0 90 90 92"
codes(decode_utf8(utf8str)); // > "d801 dc12"
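
The "manual" half of that claim is just surrogate-pair arithmetic. As a rough sketch (toCodePoints is a made-up helper here, not anything V8 or the DOM provides), walking the 16-bit code units and combining surrogate pairs by hand looks like:

// combine UTF-16 surrogate pairs into code points by hand
function toCodePoints(s) {
  var points = [], i, hi, lo;
  for (i = 0; i < s.length; i += 1) {
    hi = s.charCodeAt(i);
    if (hi >= 0xD800 && hi <= 0xDBFF && i + 1 < s.length) {
      lo = s.charCodeAt(i + 1);
      if (lo >= 0xDC00 && lo <= 0xDFFF) {
        // high + low surrogate -> supplementary code point
        points.push(0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00));
        i += 1;
        continue;
      }
    }
    points.push(hi); // BMP code unit (or an unpaired surrogate, passed through)
  }
  return points;
}

toCodePoints('\ud801\udc12').map(function (d) { return d.toString(16); }).join(' '); // > "10412"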

IMHO, the suggestion to convert incoming UTF-8 to UTF-16 instead of UCS-2 matches at least part of the reality on the web. Here's another example, this time involving DOM interop:

// from the WebKit inspector
var u = '\uD834\uDD1E'; // U+1D11E MUSICAL SYMBOL G CLEF
document.title = u; // works on my machine

The 16-bit JavaScript string is being interpreted as a UTF-16 sequence somewhere between the script runtime and the display. Converting incoming WebSocket UTF-8 strings to UTF-16 before handing them to JavaScript seems like the right thing to do, so that they can later find their way back out to the DOM for display.
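
For what it's worth, the 4-byte-UTF-8-to-surrogate-pair step itself is cheap. A rough decoder sketch (not V8's or any WebSocket implementation's actual code, and with byte validation omitted for brevity):

// decode an array of UTF-8 bytes into a UTF-16 JavaScript string,
// turning 4-byte sequences into surrogate pairs instead of mangling them
function utf8BytesToUtf16(bytes) {
  var out = '', i = 0, b, cp;
  while (i < bytes.length) {
    b = bytes[i];
    if (b < 0x80) {            // 1-byte (ASCII)
      cp = b; i += 1;
    } else if (b < 0xE0) {     // 2-byte sequence
      cp = ((b & 0x1F) << 6) | (bytes[i + 1] & 0x3F); i += 2;
    } else if (b < 0xF0) {     // 3-byte sequence
      cp = ((b & 0x0F) << 12) | ((bytes[i + 1] & 0x3F) << 6) | (bytes[i + 2] & 0x3F); i += 3;
    } else {                   // 4-byte sequence (non-BMP)
      cp = ((b & 0x07) << 18) | ((bytes[i + 1] & 0x3F) << 12) |
           ((bytes[i + 2] & 0x3F) << 6) | (bytes[i + 3] & 0x3F); i += 4;
    }
    if (cp > 0xFFFF) {
      cp -= 0x10000;           // split into a surrogate pair
      out += String.fromCharCode(0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF));
    } else {
      out += String.fromCharCode(cp);
    }
  }
  return out;
}

codes(utf8BytesToUtf16([0xf0, 0x90, 0x90, 0x92])); // > "d801 dc12"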

