2012-04-25 12:09, Marc Durdin wrote:

> Probably the most egregious example I know of is JavaScript.
> As far as I know, JavaScript still only groks UCS-2. I'd love to be wrong.

The ECMAScript standard neither requires nor forbids support for non-BMP characters: “A conforming implementation of this Standard shall interpret characters in conformance with the Unicode Standard, Version 3.0 or later and ISO/IEC 10646-1 with either UCS-2 or UTF-16 as the adopted encoding form, implementation level 3. If the adopted ISO/IEC 10646-1 subset is not otherwise specified, it is presumed to be the BMP subset, collection 300. If the adopted encoding form is not otherwise specified, it presumed to be the UTF-16 encoding form.”
  http://www.ecma-international.org/publications/standards/Ecma-262.htm

In practice, modern implementations support UTF-16 and the full Unicode coding space. There are of course problems with fonts, the native “character” type is still a 16-bit code unit, and things are generally clumsy, but still. You can even have non-BMP characters directly as data in a UTF-8 encoded HTML document, and when you access such data in client-side JavaScript, the browser will have internally converted the data to UTF-16 format, so the JavaScript code sees a non-BMP character as two code units, or “JavaScript characters.”

Demo:

<!doctype html>
<meta charset=utf-8>
<p id=p>&#x1D64F;</p> <!-- U+1D64F as a numeric character reference -->
<script>
var s = document.getElementById('p').innerHTML;
document.write(s.charCodeAt(0)); // high surrogate: 55349 (0xD835)
document.write(', ');
document.write(s.charCodeAt(1)); // low surrogate: 56911 (0xDE4F)
</script>

In modern browsers, this displays U+1D64F (mathematical sans-serif bold italic capital T) and then the two numbers, 55349 and 56911, that constitute its UTF-16 encoded representation (the surrogate pair 0xD835 0xDE4F).
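
If you want the code point back rather than the two code units, you have to combine the surrogate pair yourself, since charCodeAt works only at the code unit level. A minimal sketch (toCodePoint is just an illustrative name of my own, not a built-in):

var s = '\uD835\uDE4F'; // the same data as in the demo, written as explicit code units
// Combine a high surrogate (0xD800..0xDBFF) and a low surrogate (0xDC00..0xDFFF)
// into the code point they encode together.
function toCodePoint(hi, lo) {
  return (hi - 0xD800) * 0x400 + (lo - 0xDC00) + 0x10000;
}
document.write(toCodePoint(s.charCodeAt(0), s.charCodeAt(1)).toString(16)); // 1d64f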

You could even use the non-BMP character as such in a JavaScript string literal in a UTF-8 encoded document, like s = '𝙏'. Though support is not required, modern browsers deal with this. Technically, a JavaScript string literal consists of code units, but each non-BMP character just generates two code units.
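
A small sketch of what that means in practice (nothing here beyond what the demo already shows; the comparisons assume an engine that accepts the literal):

var s = '𝙏';            // non-BMP character directly in the UTF-8 encoded source
var t = '\uD835\uDE4F'; // the same string written as two explicit code units
document.write(s === t);  // true: the literal produced the same surrogate pair
document.write(', ');
document.write(s.length); // 2: length counts code units, not characters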

Yucca



