You should take a look at http://wiki.ecmascript.org/doku.php?id=harmony:unicode_supplementary_characters if you haven't, and look at the es-discuss archives https://mail.mozilla.org/listinfo/es-discuss for various discussions of improving Unicode handling in ES6.
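Since the harmony proposal is still in flux, I won't guess at its final API. To give a flavour of the surrogate arithmetic that scripts currently have to do by hand, here is a minimal sketch in plain JavaScript. The names codePointAtIndex and stringFromCodePoint are hypothetical helpers made up for this mail; they are not part of any spec or of V8.

```javascript
// Hypothetical helpers for illustration only; not part of ECMAScript or V8.

// Decode the code point that starts at code-unit index i.
function codePointAtIndex(s, i) {
  var hi = s.charCodeAt(i);
  if (hi >= 0xD800 && hi <= 0xDBFF && i + 1 < s.length) {
    var lo = s.charCodeAt(i + 1);
    if (lo >= 0xDC00 && lo <= 0xDFFF) {
      // Combine the lead and trail surrogates into one code point.
      return (hi - 0xD800) * 0x400 + (lo - 0xDC00) + 0x10000;
    }
  }
  return hi; // BMP code unit, or a lone surrogate returned as-is
}

// Encode one code point as a string of one or two 16-bit code units.
function stringFromCodePoint(cp) {
  if (cp < 0x10000) {
    return String.fromCharCode(cp);
  }
  var offset = cp - 0x10000;
  return String.fromCharCode(0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF));
}

var sign = stringFromCodePoint(0x12345);
console.log(sign.length);                             // 2
console.log(sign.charCodeAt(0).toString(16));         // d808
console.log(sign.charCodeAt(1).toString(16));         // df45
console.log(codePointAtIndex(sign, 0).toString(16));  // 12345
```

The point is only that one character (U+12345) occupies two string positions, and that anything indexing by position has to know about it.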
The short version is that the next version of ECMAScript is gaining some capabilities to handle non-BMP code points more sensibly, but these will be rather limited and provide close to the bare minimum necessary for processing strings with "astral" data. I realize that's somewhat orthogonal to your point, which is about V8 internals, but ECMAScript itself is still firmly mired in the world of 16-bit code units. FWIW, Web APIs are also sticking with DOMStrings composed of 16-bit code units.

On Fri, Dec 21, 2012 at 7:22 AM, Chris Angelico <[email protected]> wrote:
> I'm fully aware that this may not be the best place for this, as it's
> more a question for JavaScript itself than the V8 engine. But here
> goes.
>
> One of my projects at work involves a C++ program that can be
> user-scripted - untrusted scripts that need to manipulate strings, and
> a few basic aggregate types (mapping/dictionary and array/list,
> implemented in JS using object and array). The C++ code works with
> UTF-8 all the way, loading data from a PostgreSQL database, sending
> stuff across TCP sockets, etc, etc, and I use the String constructor
> and WriteUtf8 to get data into and out of JavaScript. So far, so good.
>
> Everything works fine as long as all characters are in the BMP. But if
> they're not, JavaScript's internal representation as UTF-16 starts to
> be a problem. Suppose the script has this:
>
> function first_two(s) {return s.substr(0,2);}
> function remaining(s) {return s.substr(2);}
>
> And you call each of those functions with a string constructed from
> the following UTF-8 bytes:
> "\xF0\x92\x8D\x85\x41\x41\x41"
>
> That's three copies of the letter A, following a non-BMP character
> (U+12345, which apparently is a cuneiform sign). The string has four
> characters in it, so in theory, the first function should return the
> astral character followed by a letter A, and the second function
> should return "AA".
> But that's not what happens; the astral character
> gets rendered as U+D808 U+DF45, which counts as two, so the first
> function returns just one actual character, and the second returns the
> three A's.
>
> It gets worse when a character gets split. Do the same test with this
> input byte stream: "\x41\xF0\x92\x8D\x85\x41\x41" - exactly the same,
> but with one letter A moved to the front. Now the first function
> returns U+0041 U+D808, and the second returns U+DF45 U+0041 U+0041.
> Those codepoints then get rendered into UTF-8, representing *invalid
> characters*, which any compliant parser (I was testing using the Pike
> utf8_to_string() function) will throw out.
>
> This is not an indictment of the V8 programmers. The JavaScript
> specification is what's wrong. But I'm wondering if this might be a
> place where an extension could be implemented, for the benefit of
> embedded code that will never have to be executed using any other
> interpreter, and then adoption might proceed from there.
>
> Of course, the obvious way to fix the bug is to use UTF-32 / UCS-4 for
> all strings, but that's fairly wasteful. An alternative that works
> quite efficiently has been implemented by the Pike and Python
> languages; conceptually, strings are stored in UTF-32, but in memory,
> the leading 0 bytes are omitted if unnecessary. Each string has a
> "width" of either 8, 16, or 32 (or if you prefer, 1, 2, or 4), based
> on the highest codepoint in it. Python's string benchmark results
> showed some operations slower under the new format, but others faster,
> with the overall benchmark rating significantly improving (though the
> exact improvement depends on myriad factors, of course); but more
> importantly, string handling becomes *correct*.
>
> This would be a potentially incompatible change to code. It may be
> worth requiring some sort of token at the top of the script, same as
> "use strict" - something like "use strict unicode" - to engage this
> behaviour.
> Scripts depending on this would then still function in
> other engines, but with the potential to break on non-BMP characters.
>
> Some handy info on the subject:
>
> http://unspecified.wordpress.com/2012/04/19/the-importance-of-language-level-abstract-unicode-strings/
> http://www.python.org/dev/peps/pep-0393/ - the Python Enhancement
> Proposal discussing the new string type (has lots of specifics but
> also the concept discussion)
>
> I'd love to see V8 lead the JavaScript world in true Unicode handling.
> Use of PEP-393 strings (I'm in two minds as to whether they should be
> called that or "Pike strings") would be a great step forward for the
> whole world.
>
> Chris Angelico
>
> --
> v8-users mailing list
> [email protected]
> http://groups.google.com/group/v8-users
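
As a stopgap for the first_two / remaining example in the quoted message, scripts can slice by code point rather than by code unit. This is only a sketch under my own assumptions: toCodePoints, first_two_cp and remaining_cp are hypothetical names, and the regular expression treats each well-formed surrogate pair as one element while letting lone surrogates through unchanged.

```javascript
// Hypothetical helpers for illustration; not a V8 or ECMAScript API.

// Split a string into an array in which each surrogate pair is one element.
function toCodePoints(s) {
  // Try a lead+trail surrogate pair first, otherwise take any single code unit.
  return s.match(/[\uD800-\uDBFF][\uDC00-\uDFFF]|[\s\S]/g) || [];
}

// Code-point-aware counterparts of the two functions in the quoted message.
function first_two_cp(s) { return toCodePoints(s).slice(0, 2).join(""); }
function remaining_cp(s) { return toCodePoints(s).slice(2).join(""); }

// "A", then U+12345 as a surrogate pair, then "AA" (the second test input).
var s = "A\uD808\uDF45AA";

console.log(s.substr(0, 2) === "A\uD808");          // true: substr splits the pair
console.log(first_two_cp(s) === "A\uD808\uDF45");   // true: pair kept whole
console.log(remaining_cp(s) === "AA");              // true
```

Note that every call rescans the whole string, so this trades the constant-time indexing of 16-bit code units for correctness. That cost is the one a variable-width internal representation would avoid.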
