Re: [v8-users] String representation: why UTF-16?

Chris Angelico Fri, 21 Dec 2012 15:13:43 -0800

On Sat, Dec 22, 2012 at 9:52 AM, Joshua Bell <[email protected]> wrote:
> You should take a look at
> http://wiki.ecmascript.org/doku.php?id=harmony:unicode_supplementary_characters
> if you haven't, and look at the es-discuss archives
> https://mail.mozilla.org/listinfo/es-discuss for various discussions of
> improving Unicode handling in ES6.


I'm glad there's discussion on the subject, at least! Of course, the
compatibility problems are very much there, as this section points out:
http://norbertlindenberg.com/2012/05/ecmascript-supplementary-characters/index.html#UTF32

> The short version is that the next version of ECMAScript is gaining some
> capabilities to handle non-BMP code points more sensibly, but these will be
> rather limited and provide close to the bare minimum necessary for
> processing strings with "astral" data.
>
> I realize that's somewhat orthogonal to your point which is about v8
> internals, but ECMAScript itself is still firmly mired in the world of
> 16-bit code units. FWIW, Web APIs are also sticking with DOMStrings
> comprised of 16-bit code units.
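
To make the "16-bit code units" point concrete, here's a rough sketch
in Python rather than ECMAScript. The helper name utf16_units is my
own invention for illustration; it just lists the code units a string
would occupy in UTF-16, so you can see a non-BMP character turn into a
surrogate pair.

```python
def utf16_units(text):
    """Return the UTF-16 code units (as ints) needed to store text.

    Hypothetical helper for illustration only; not part of any library.
    """
    data = text.encode("utf-16-le")  # little-endian, no byte-order mark
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]


bmp = "A"              # U+0041, inside the Basic Multilingual Plane
astral = "\U0001F600"  # U+1F600, outside the BMP

print([hex(u) for u in utf16_units(bmp)])     # one code unit
print([hex(u) for u in utf16_units(astral)])  # two code units (a surrogate pair)
print(len(astral))                            # Python counts one code point
```

A language that indexes strings by 16-bit code unit reports a length
of 2 for that second string, which is the behaviour being discussed.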

The main problem is backward compatibility. I'll see if I can join the
ES discussion (as if I don't already have more mailing lists than I
can keep up with!), but this is also an implementation issue. The
flexible string representation depends on strings being immutable, as
they are in both Python and Pike, and ECMAScript fits that too. It
would also handle the common case very efficiently: when a UTF-8
string contains no bytes >0x7F, the original string buffer can be
used to represent the string itself (assuming that it's owned by the
right subsystem, etc).
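
Here's a minimal sketch of the idea, in Python, loosely modelled on
the flexible representation Python adopted in 3.3 (PEP 393). It is not
how V8 or any real engine does it; choose_width and decode_utf8 are
hypothetical names, and the (width, payload) return shape is an
assumption made purely for illustration.

```python
def choose_width(text):
    """Bytes per code point for the narrowest fixed-width storage."""
    highest = max(map(ord, text), default=0)
    if highest <= 0xFF:
        return 1  # Latin-1 range: one byte per code point
    if highest <= 0xFFFF:
        return 2  # BMP only: two bytes per code point, no surrogates needed
    return 4      # astral code points present: four bytes per code point


def decode_utf8(buf):
    """Hypothetical helper: turn UTF-8 bytes into (width, payload)."""
    if buf.isascii():
        # Fast path: no byte is >0x7F, so the UTF-8 buffer already is the
        # one-byte-per-code-point form. Reuse it without copying; this is
        # only safe because the buffer (like the string) is immutable.
        return 1, buf
    text = buf.decode("utf-8")
    width = choose_width(text)
    codec = {1: "latin-1", 2: "utf-16-le", 4: "utf-32-le"}[width]
    return width, text.encode(codec)


for sample in ("hello", "caf\u00e9", "\u20ac", "\U0001F600"):
    width, payload = decode_utf8(sample.encode("utf-8"))
    print(repr(sample), width, len(payload))
```

Every string is fixed-width internally, so indexing by code point
stays O(1), and only strings that actually contain astral characters
pay the four-byte cost.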

I'd like to see this as an openly backward-incompatible change. It's
the easiest way forward - acknowledge that the previous behaviour is
buggy, and make it possible to run a script in non-buggy mode.

ChrisA

-- 
v8-users mailing list
[email protected]
http://groups.google.com/group/v8-users
