From: "Markus Scherer" <[EMAIL PROTECTED]>
> Paul Hastings wrote:
> > would it be correct to say that javascript "natively" supports unicode?
> 
> ECMAScript, of which JavaScript and JScript are implementations, is defined
> on 16-bit Unicode scripts and using 16-bit Unicode strings.
> 
> In other words, the basic encoding support is there, but there are basically
> no Unicode-specific APIs in the standard. No character properties, no
> collation that is guaranteed to do more than strcmp, etc. Script writers
> have to rely on implementation-specific functions or supply their own.

It would be more correct to say that ECMAScript handles text using the UTF-16 encoding 
form, and so can handle any Unicode character. However, it's true that ECMAScript will 
let you create invalid Unicode strings, since it allows strings in which surrogate code 
units are not properly paired.
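For example (a minimal sketch; the particular code point U+1D56B is chosen only for 
illustration), ECMAScript will happily build such strings from raw 16-bit code units:

    // ECMAScript strings are sequences of 16-bit code units, so an unpaired
    // surrogate is accepted even though it does not form a valid Unicode string.
    var lone = String.fromCharCode(0xD800);            // high surrogate, no low surrogate
    var pair = String.fromCharCode(0xD835, 0xDD6B);    // surrogate pair for U+1D56B

    // Both are legal ECMAScript string values; no error is raised.
    // lone.length == 1   (one unpaired code unit)
    // pair.length == 2   (one character, but two code units)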

This says nothing about the internal encoding of strings within ECMAScript engines: an 
engine could just as well use CESU-8 internally, but that internal encoding will be 
hidden from scripts.

So the situation of ECMAScript is exactly analogous to that of Java (in which the 
built-in type "char" is an unsigned 16-bit integer, and the String type is handled in 
terms of "char" code units with UTF-16). However, the serialization of compiled Java 
classes internally encodes these strings with a modified form of UTF-8, which is 
deserialized to UTF-16 when the class is loaded.

You will have a similar situation on Windows with the Win32 API, and in its C/C++ 
binding using TCHAR (and the _T() macro for string constants) with the _UNICODE 
compile-time define, or on any system where the ANSI C type wchar_t is defined as a 
16-bit integer.

Note that we are speaking here about code units, not code points. Code units, not code 
points, are what programming languages use to handle strings. As code units are well 
defined in Unicode in relation to an encoding form, any language or system can be made 
compliant and fully support Unicode, provided it also offers library functions for 
string handling that implement the Unicode-defined algorithms (which are described in 
terms of code points).
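A small sketch of what this means in practice for ECMAScript (again using the arbitrary 
supplementary character U+1D56B for illustration):

    // Built-in string operations count UTF-16 code units, not code points.
    var s = "A\uD835\uDD6B";   // "A" followed by U+1D56B (stored as a surrogate pair)

    // s.length         == 3        (three code units, but only two code points)
    // s.charCodeAt(1)  == 0xD835   (a high surrogate, not a whole code point)
    // s.substring(1, 2)            (splits the pair, leaving an unpaired surrogate)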

It's up to the library (not the language) to make its implementation of Unicode with 
code units comply with the standard algorithms based on code points. Of course it is 
much easier to implement these algorithms with 16-bit code units than with 8-bit code 
units. But the language itself has no other special Unicode-compliance characteristics.
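As an illustration of that library-level work, here is a minimal sketch (codePoints is 
a hypothetical helper, not a standard ECMAScript function) that reassembles code points 
from the 16-bit code units, which is the first step before any code-point-based Unicode 
algorithm can be applied:

    // Walk a string as UTF-16 code units and return an array of code points,
    // combining each high/low surrogate pair into one supplementary code point.
    function codePoints(s) {
      var result = [];
      for (var i = 0; i < s.length; i++) {
        var unit = s.charCodeAt(i);
        if (unit >= 0xD800 && unit <= 0xDBFF && i + 1 < s.length) {
          var next = s.charCodeAt(i + 1);
          if (next >= 0xDC00 && next <= 0xDFFF) {
            result.push((unit - 0xD800) * 0x400 + (next - 0xDC00) + 0x10000);
            i++;        // skip the low surrogate we just consumed
            continue;
          }
        }
        result.push(unit);  // BMP code point, or an unpaired surrogate kept as-is
      }
      return result;
    }

    // codePoints("A\uD835\uDD6B")  -->  [0x41, 0x1D56B]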
