Re: [v8-users] String representation: why UTF-16?

Joshua Bell Fri, 21 Dec 2012 14:52:46 -0800

You should take a look at
http://wiki.ecmascript.org/doku.php?id=harmony:unicode_supplementary_characters
if you haven't, and look through the es-discuss archives
(https://mail.mozilla.org/listinfo/es-discuss) for various discussions of
improving Unicode handling in ES6.


The short version is that the next version of ECMAScript is gaining some
capabilities to handle non-BMP code points more sensibly, but these will be
rather limited and provide close to the bare minimum necessary for
processing strings with "astral" data.
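
As a rough illustration of what code-point-aware processing can look like,
here is a minimal sketch. It assumes an engine that implements string
iteration by code point and the \u{...} escape as they were later
standardized in ES2015; the helper names codePoints and cpSubstr are
hypothetical and not part of any proposal.

```javascript
// Hypothetical helper: split a string into an array of whole code points.
// Array.from uses the string iterator, which walks code points rather
// than 16-bit code units.
function codePoints(s) {
  return Array.from(s);
}

// Hypothetical code-point-aware counterpart of substr(start, length).
function cpSubstr(s, start, length) {
  const cps = codePoints(s);
  const end = length === undefined ? cps.length : start + length;
  return cps.slice(start, end).join("");
}

const s = "\u{12345}AAA"; // U+12345 followed by three A's
console.log(s.length);                           // 5 code units
console.log(codePoints(s).length);               // 4 code points
console.log(cpSubstr(s, 0, 2) === "\u{12345}A"); // true
console.log(cpSubstr(s, 2) === "AA");            // true
```

Note that this costs a full pass over the string for every call, so it is
a workaround layered on top of 16-bit code units, not a change to the
underlying string model.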

I realize that's somewhat orthogonal to your point, which is about V8
internals, but ECMAScript itself is still firmly mired in the world of
16-bit code units. FWIW, Web APIs are also sticking with DOMStrings
composed of 16-bit code units.



On Fri, Dec 21, 2012 at 7:22 AM, Chris Angelico <[email protected]> wrote:

> I'm fully aware that this may not be the best place for this, as it's
> more a question for JavaScript itself than the V8 engine. But here
> goes.
>
> One of my projects at work involves a C++ program that can be
> user-scripted - untrusted scripts that need to manipulate strings, and
> a few basic aggregate types (mapping/dictionary and array/list,
> implemented in JS using object and array). The C++ code works with
> UTF-8 all the way, loading data from a PostgreSQL database, sending
> stuff across TCP sockets, etc, etc, and I use the String constructor
> and WriteUtf8 to get data into and out of JavaScript. So far, so good.
>
> Everything works fine as long as all characters are in the BMP. But if
> they're not, JavaScript's internal representation as UTF-16 starts to
> be a problem. Suppose the script has this:
>
> function first_two(s) {return s.substr(0,2);}
> function remaining(s) {return s.substr(2);}
>
> And you call each of those functions with a string constructed from
> the following UTF-8 bytes:
> "\xF0\x92\x8D\x85\x41\x41\x41"
>
> That's three copies of the letter A, following a non-BMP character
> (U+12345, which apparently is a cuneiform sign). The string has four
> characters in it, so in theory, the first function should return the
> astral character followed by a letter A, and the second function
> should return "AA". But that's not what happens; the astral character
> gets rendered as U+D808 U+DF45, which counts as two, so the first
> function returns just one actual character, and the second returns the
> three A's.
>
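
The behaviour described above can be reproduced with a short sketch. The
two quoted functions are copied as-is; toSurrogates is a hypothetical
helper added here only to show the standard UTF-16 surrogate arithmetic
for U+12345.

```javascript
// Encode a supplementary code point (above U+FFFF) as a UTF-16
// surrogate pair. Hypothetical helper, for illustration only.
function toSurrogates(cp) {
  const offset = cp - 0x10000;
  const high = 0xD800 + (offset >> 10);  // top 10 bits of the offset
  const low = 0xDC00 + (offset & 0x3FF); // bottom 10 bits of the offset
  return [high, low];
}

function first_two(s) { return s.substr(0, 2); }
function remaining(s) { return s.substr(2); }

const pair = toSurrogates(0x12345);
console.log(pair[0].toString(16), pair[1].toString(16)); // d808 df45

// Build the test string from the two code units plus three A's.
const astral = String.fromCharCode(pair[0], pair[1]);
const input = astral + "AAA";

console.log(input.length);                 // 5, not 4
console.log(first_two(input) === astral);  // true: only the astral character
console.log(remaining(input) === "AAA");   // true: all three A's
```
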
> It gets worse when a character gets split. Do the same test with this
> input byte stream: "\x41\xF0\x92\x8D\x85\x41\x41" - exactly the same,
> but with one letter A moved to the front. Now the first function
> returns U+0041 U+D808, and the second returns U+DF45 U+0041 U+0041.
> Those codepoints then get rendered into UTF-8, representing *invalid
> characters*, which any compliant parser (I was testing using the Pike
> utf8_to_string() function) will throw out.
>
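
One way to catch the split case before the string is serialized to UTF-8
is to scan for unpaired surrogates. The following is a minimal sketch;
hasLoneSurrogate is a hypothetical helper name, not an existing V8 or
ECMAScript API.

```javascript
// Hypothetical check: returns true if s contains a surrogate code unit
// that is not part of a well-formed high/low pair.
function hasLoneSurrogate(s) {
  for (let i = 0; i < s.length; i++) {
    const u = s.charCodeAt(i);
    if (u >= 0xD800 && u <= 0xDBFF) {
      const next = i + 1 < s.length ? s.charCodeAt(i + 1) : 0;
      if (next >= 0xDC00 && next <= 0xDFFF) {
        i++; // well-formed pair: skip the low half
        continue;
      }
      return true; // high surrogate with no low surrogate after it
    }
    if (u >= 0xDC00 && u <= 0xDFFF) {
      return true; // low surrogate with no high surrogate before it
    }
  }
  return false;
}

const split = "A\uD808\uDF45AA"; // A, U+12345, A, A
console.log(hasLoneSurrogate(split));              // false
console.log(hasLoneSurrogate(split.substr(0, 2))); // true: U+0041 U+D808
console.log(hasLoneSurrogate(split.substr(2)));    // true: U+DF45 U+0041 U+0041
```
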
> This is not an indictment of the V8 programmers. The JavaScript
> specification is what's wrong. But I'm wondering if this might be a
> place where an extension could be implemented, for the benefit of
> embedded code that will never have to be executed using any other
> interpreter, and then adoption might proceed from there.
>
> Of course, the obvious way to fix the bug is to use UTF-32 / UCS-4 for
> all strings, but that's fairly wasteful. An alternative that works
> quite efficiently has been implemented by the Pike and Python
> languages; conceptually, strings are stored in UTF-32, but in memory,
> the leading 0 bytes are omitted if unnecessary. Each string has a
> "width" of either 8, 16, or 32 (or if you prefer, 1, 2, or 4), based
> on the highest codepoint in it. Python's string benchmark results
> showed some operations slower under the new format, but others faster,
> with the overall benchmark rating significantly improving (though the
> exact improvement depends on myriad factors, of course); but more
> importantly, string handling becomes *correct*.
>
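
The width selection described above can be sketched in a few lines. This
is only an illustration of the idea, using typed arrays as a stand-in for
the real storage; storageWidth and pack are hypothetical names, and
neither Pike nor CPython is implemented this way in detail.

```javascript
// Pick the per-character storage width in bytes (1, 2, or 4) from the
// highest code point in the string, as the quoted paragraph describes.
function storageWidth(codePoints) {
  let max = 0;
  for (const cp of codePoints) {
    if (cp > max) max = cp;
  }
  if (max <= 0xFF) return 1;
  if (max <= 0xFFFF) return 2;
  return 4;
}

// Store the code points in the narrowest typed array that fits them all.
function pack(codePoints) {
  const width = storageWidth(codePoints);
  const Store =
    width === 1 ? Uint8Array : width === 2 ? Uint16Array : Uint32Array;
  return { width: width, data: Store.from(codePoints) };
}

const ascii = pack([0x41, 0x41, 0x41]);
const astralStr = pack([0x12345, 0x41, 0x41, 0x41]);
console.log(ascii.width);           // 1
console.log(astralStr.width);       // 4
console.log(astralStr.data.length); // 4 characters, indexable directly
```

With this layout, indexing and slicing operate on whole code points at
every width, so a character can never be split in half.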
> This would be a potentially incompatible change to code. It may be
> worth requiring some sort of token at the top of the script, same as
> "use strict" - something like "use strict unicode" - to engage this
> behaviour. Scripts depending on this would then still function in
> other engines, but with the potential to break on non-BMP characters.
>
> Some handy info on the subject:
>
> http://unspecified.wordpress.com/2012/04/19/the-importance-of-language-level-abstract-unicode-strings/
> http://www.python.org/dev/peps/pep-0393/ - the Python Enhancement
> Proposal discussing the new string type (has lots of specifics but
> also the concept discussion)
>
> I'd love to see V8 lead the JavaScript world in true Unicode handling.
> Use of PEP-393 strings (I'm in two minds as to whether they should be
> called that or "Pike strings") would be a great step forward for the
> whole world.
>
> Chris Angelico
>
> --
> v8-users mailing list
> [email protected]
> http://groups.google.com/group/v8-users
>

