2012-04-25 12:09, Marc Durdin wrote:

> Probably the most egregious example I know of is JavaScript.
> As far as I know, JavaScript still only groks UCS-2. I'd love to be wrong.

The ECMAScript standard neither requires nor forbids support for non-BMP characters: “A conforming implementation of this Standard shall interpret characters in conformance with the Unicode Standard, Version 3.0 or later and ISO/IEC 10646-1 with either UCS-2 or UTF-16 as the adopted encoding form, implementation level 3. If the adopted ISO/IEC 10646-1 subset is not otherwise specified, it is presumed to be the BMP subset, collection 300. If the adopted encoding form is not otherwise specified, it presumed to be the UTF-16 encoding form.”
  http://www.ecma-international.org/publications/standards/Ecma-262.htm

In practice, modern implementations support UTF-16 and the full Unicode coding space. There are of course problems with fonts, the native “character” type is still a 16-bit code unit, and things are generally clumsy, but still. You can even have non-BMP characters directly as data in a UTF-8 encoded HTML document, and when you access such data in client-side JavaScript, the browser will have internally converted the data to UTF-16 format, so the JavaScript code sees a non-BMP character as two code units, or “JavaScript characters.”

Demo:

<!doctype html>
<meta charset=utf-8>
<p id=p>&#x1D64F;</p> <!-- U+1D64F as a numeric character reference -->
<script>
var s = document.getElementById('p').innerHTML;
document.write(s.charCodeAt(0)); // high surrogate: 55349 (0xD835)
document.write(', ');
document.write(s.charCodeAt(1)); // low surrogate: 56911 (0xDE4F)
</script>

In modern browsers, this displays U+1D64F (mathematical sans-serif bold italic capital T) and then the two numbers, 55349 and 56911, that constitute its UTF-16 encoded representation (the surrogate pair 0xD835 0xDE4F).
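
If you want the code point back rather than the two code units, you have to combine the surrogate pair yourself, since charCodeAt works only at the code unit level. A minimal sketch (toCodePoint is just an illustrative name of my own, not a built-in):

var s = '\uD835\uDE4F'; // the same data as in the demo, written as explicit code units
// Combine a high surrogate (0xD800..0xDBFF) and a low surrogate (0xDC00..0xDFFF)
// into the code point they encode together.
function toCodePoint(hi, lo) {
  return (hi - 0xD800) * 0x400 + (lo - 0xDC00) + 0x10000;
}
document.write(toCodePoint(s.charCodeAt(0), s.charCodeAt(1)).toString(16)); // 1d64f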

You could even use the non-BMP character as such in a JavaScript string literal in a UTF-8 encoded document, like s = '𝙏'. Though support is not required, modern browsers deal with this. Technically, a JavaScript string literal consists of code units, but each non-BMP character just generates two code units.
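
A small sketch of what that means in practice (nothing here beyond what the demo already shows; the comparisons assume an engine that accepts the literal):

var s = '𝙏';            // non-BMP character directly in the UTF-8 encoded source
var t = '\uD835\uDE4F'; // the same string written as two explicit code units
document.write(s === t);  // true: the literal produced the same surrogate pair
document.write(', ');
document.write(s.length); // 2: length counts code units, not characters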

Yucca



