Folks,

Since "ISO/IEC 9899 - Programming Language C" was quoted, I wonder if you are aware of the efforts of SC22/WG14 to develop a Technical Report that deals with the problems discussed in this thread.

The document is

    ISO/IEC DTR 19769 - Extensions for the programming language C to support new character data types

The project is currently in DTR ballot and, once approved, will certainly take some time to be implemented in C compilers and in operating systems. But it gives a good indication of the direction in which formal standardization is going with character data types in the C language.

Here are some excerpts from DTR 19769:

Quote:

3 The new typedefs

This Technical Report introduces the following two new typedefs, char16_t and char32_t:

    typedef T1 char16_t;
    typedef T2 char32_t;

where T1 has the same type as uint_least16_t and T2 has the same type as uint_least32_t.

The new typedefs guarantee certain widths for the data types, whereas the width of wchar_t is implementation defined. The data values are unsigned, while char and wchar_t could take signed values.

This Technical Report also introduces the new header:

    <uchar.h>

The new typedefs, char16_t and char32_t, are defined in <uchar.h>.

4 Encoding

C99 subclause 6.10.8 specifies that the value of the macro __STDC_ISO_10646__ shall be "an integer constant of the form yyyymmL (for example, 199712L), intended to indicate that values of type wchar_t are the coded representations of the characters defined by ISO/IEC 10646, along with all amendments and technical corrigenda as of the specified year and month."

C99 subclause 6.4.5p5 specifies that wide string literals are initialized with a sequence of wide characters as defined by the mbstowcs function with an implementation-defined current locale.

Analogous to this macro, this Technical Report introduces two new macros. If the header <uchar.h> defines the macro __STDC_UTF_16__, values of type char16_t shall have UTF-16 encoding. This allows the use of UTF-16 in char16_t even when wchar_t uses a non-Unicode encoding.
In certain cases the compile-time conversion to UTF-16 may be restricted to members of the basic character set and universal character names (\Unnnnnnnn and \unnnn) because for these the conversion to UTF-16 is defined unambiguously.

If the header <uchar.h> defines the macro __STDC_UTF_32__, values of type char32_t shall have UTF-32 encoding.

If the header <uchar.h> does not define the macro __STDC_UTF_16__, the encoding of char16_t is implementation defined. Similarly, if the header <uchar.h> does not define the macro __STDC_UTF_32__, the encoding of char32_t is implementation defined. An implementation may define other macros to indicate a different encoding.

Unquote

The document, which of course is copyrighted by ISO, starts with a nice introduction that defines the problem. In addition to the excerpts above, it also addresses the following subjects:

5 String literals and character constants
    5.1 String literals and character constants notations
    5.2 The string concatenation
6 Library functions
    6.1 The mbrtoc16 function
    6.2 The c16rtomb function
    6.3 The mbrtoc32 function
    6.4 The c32rtomb function
7 ANNEX A Unicode encoding forms: UTF-16, UTF-32

Best regards
Arnold

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Nelson H. F. Beebe
Sent: Wednesday, March 03, 2004 1:49 PM
To: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Subject: Re: What's in a wchar_t string ...

"Frank Yung-Fong Tang" <[EMAIL PROTECTED]> asks on Wed, 3 Mar 2004 12:38:49 -0500:

>> Does it also mean wchar_t is 4 bytes if __STDC_ISO_10646__ is defined?
>> or does it only mean wchar_t hold the character in ISO_10646
>> (which mean it could be 2 bytes, 4 bytes or more than that?)

Here is the exact text from

    INTERNATIONAL STANDARD ISO/IEC 9899
    Second edition 1999-12-01
    Programming languages -- C

>> ...
>> __STDC_ISO_10646__ An integer constant of the form yyyymmL (for
>>                    example, 199712L), intended to indicate
>>                    that values of type wchar_t are the coded
>>                    representations of the characters defined
>>                    by ISO/IEC 10646, along with all amendments
>>                    and technical corrigenda as of the
>>                    specified year and month.
>> ...

It says nothing more about the size of wchar_t, or what encodings are used: note the vague language "coded representations...". This means effectively that the implementation, not the Standard, decides.

Very few current Unix C or C++ compilers even define the symbol __STDC_ISO_10646__; the C/C++ feature test package at

    ftp://ftp.math.utah.edu/pub/features
    http://www.math.utah.edu/pub/features

probes that macro value, and many others. My logs of its runs in about 90 build environments show definitions with values 200009 for GNU gcc versions 3.x (all platforms), Intel icc versions 7.x and 8.0 (Intel IA-32 and IA-64), and Portland Group pgcc versions 4.x and 5.x (Intel IA-32). On all of these, it reports that sizeof(wchar_t) = 4, but of course, that says nothing whatever about the encoding.

-------------------------------------------------------------------------------
- Nelson H. F. Beebe                    Tel: +1 801 581 5254                  -
- University of Utah                    FAX: +1 801 581 4148                  -
- Department of Mathematics, 110 LCB    Internet e-mail: [EMAIL PROTECTED]    -
- 155 S 1400 E RM 233                       [EMAIL PROTECTED] [EMAIL PROTECTED] -
- Salt Lake City, UT 84112-0090, USA    URL: http://www.math.utah.edu/~beebe  -
-------------------------------------------------------------------------------

