On Wednesday, March 03, 2004 11:22 PM
Peter Kirk wrote:

>>> Does it also mean wchar_t is 4 bytes if __STDC_ISO_10646__ is
>>> defined? or does it only mean wchar_t hold the character in
>>> ISO_10646 (which mean it could be 2 bytes, 4 bytes or more than
>>> that?)
>>
> On 03/03/2004 11:27, Antoine Leca wrote:
>
>> The later. But if wchar_t is 16 bits, it can only encode Unicode 3.0
>> or before. ie no UTF-16 support.
>>
> Surely if wchar_t is 16 bits, it CAN be used to encode the whole of
> Unicode with UTF-16, i.e. with supplementary plane characters
> represented as "surrogate pairs" in pairs of wchar_t.

OK, right, the programmer CAN put whatever she wants into a wchar_t (or
an unsigned short, for that matter). I was speaking about what the
compiler+libc expects to find and to handle correctly. Sorry for the
inexact wording.

> Whether these
> characters SHOULD be represented as UTF-16 code units in a wchar_t
> string (or whether representation should be either UCS-2 or UTF-32)
> is a separate issue, probably related to how the associated libraries
> handle the code units for surrogates.

And also to the level of support the compiler offers for the \U00xxxxxx
notation. As I wrote in other posts, an otherwise compliant compiler,
 - using 16-bit wchar_t, and
 - defining __STDC_ISO_10646__ to something (which should be less than
   200111L, the date of publication of ISO/IEC 10646-2:2001, the first
   one that defined the use of the external planes)
cannot conformingly interpret the \U00xxyyyy notation in a L"" string
constant if xx is not 00, because it would then fail to conform to the
requirement that any character be represented in a single wchar_t
(more exactly, it can do it, but should emit some warning, because the
character does not fit into one wchar_t).

I usually say then that a compiler with 16-bit wchar_t can only encode
UCS-2, not UTF-16. In other words, the management of UTF-16, such as
keeping the pair of surrogates together, pairing them when transcoding
to something else such as UTF-8, etc., should be done by the user (or
by externally provided libraries, obviously), because there is no way to
tell whether the standard library does it or not.

That said, it CAN be done, as Peter rightly said. And the rest of the
job, that is, the handling of BMP codepoints, can be left to the
compiler/system libraries, thanks to the support advertised by the
definition of __STDC_ISO_10646__.
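
To make the "done by the user" part concrete, here is a minimal sketch of
pairing surrogates and transcoding the result to UTF-8. It is only an
illustration: the helper names combine_surrogates and encode_utf8 are
hypothetical (they come from no standard library), and it uses uint16_t
rather than wchar_t so that it does not depend on the width of wchar_t on
a given platform.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: combine a UTF-16 surrogate pair into one code point.
 * Returns 0 on success, -1 if (hi, lo) is not a high/low surrogate pair. */
static int combine_surrogates(uint16_t hi, uint16_t lo, uint32_t *cp)
{
    if (hi < 0xD800 || hi > 0xDBFF || lo < 0xDC00 || lo > 0xDFFF)
        return -1;
    /* 10 payload bits from each unit, offset by 0x10000 */
    *cp = 0x10000u + ((((uint32_t)hi - 0xD800u) << 10) |
                      ((uint32_t)lo - 0xDC00u));
    return 0;
}

/* Hypothetical helper: encode one code point as UTF-8.
 * Returns the number of bytes written (1 to 4), or 0 if cp is a lone
 * surrogate or lies beyond U+10FFFF. */
static size_t encode_utf8(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;               /* unpaired surrogate: not encodable */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

For instance, the pair 0xD801 0xDC00 combines to U+10400, which encodes as
the four bytes F0 90 90 80. Nothing here relies on the standard library
knowing about surrogates, which is the point made above.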

On the other hand, a (hypothetical, as Nelson showed) compiler/library
that defines __STDC_ISO_10646__ to be 200111L (and provides a 32-bit or
wider wchar_t, of course) does assure that all the handling of the
surrogates is done correctly by the standard library and associated
support. As such, iswupper(L'\U00010400') (DESERET CAPITAL LETTER LONG I)
should not return 0.

Antoine