What's in a wchar_t string on unix?

What you put or find in a wchar_t is application dependent. The only guarantee is that each character written in the source and compiled with the appropriate source charset occupies a single code unit (not necessarily a codepoint). That charset is not necessarily Unicode.

At run time, the standard library functions that take or return wide strings only expect those strings to be encoded according to the current locale (again, not necessarily Unicode). So if you run your program in an environment where the locale is ISO-8859-2, you'll find code units whose values between 0 and 255 match their positions in the ISO-8859-2 standard, but you won't find the corresponding character codepoints as defined by Unicode.

A wchar_t can therefore be used with any charset whose minimum code unit size is less than or equal to the size of the wchar_t type. This may be a Unicode encoding form or any other encoding (except UTF-32 if wchar_t is defined as a 16-bit integer type, which is not enough to represent every single Unicode codepoint).
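
To make the size question concrete, here is a minimal sketch in C of how a program could inspect its own wchar_t. The function names are hypothetical, chosen only for illustration. The __STDC_ISO_10646__ macro is not mentioned above: it is the C99 way for an implementation to declare that wchar_t values are ISO 10646 codepoints. The last function rests on a loud assumption: that the wide encoding is UTF-32 when wchar_t can hold 0x10FFFF and UTF-16 otherwise, which nothing in the C standard guarantees.

```c
#include <assert.h>
#include <limits.h>   /* CHAR_BIT */
#include <stddef.h>   /* size_t, wchar_t */
#include <wchar.h>    /* WCHAR_MAX */

/* Width of wchar_t in bits on this implementation. */
size_t wchar_bits(void)
{
    return sizeof(wchar_t) * CHAR_BIT;
}

/* Returns 1 if the implementation defines __STDC_ISO_10646__ (C99),
 * i.e. it declares that wchar_t values are ISO 10646 codepoints.
 * Returns 0 if the encoding is left implementation-defined. */
int wchar_is_iso10646(void)
{
#ifdef __STDC_ISO_10646__
    return 1;
#else
    return 0;
#endif
}

/* Number of wchar_t code units needed for one Unicode codepoint.
 * ASSUMPTION (not guaranteed by the standard): the wide encoding is
 * UTF-32 when wchar_t can hold 0x10FFFF, and UTF-16 otherwise. */
int units_for_code_point(unsigned long cp)
{
    assert(cp <= 0x10FFFFUL);          /* outside the Unicode range */
    if ((unsigned long)WCHAR_MAX >= 0x10FFFFUL)
        return 1;                      /* one unit holds any codepoint */
    return cp > 0xFFFFUL ? 2 : 1;      /* surrogate pair above the BMP */
}
```

Calling wchar_bits() and wchar_is_iso10646() from your own main() tells you what your platform promises. If the macro is not defined, the only safe reading is the one above: the content depends on the current locale.
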
wchar_t is therefore merely a convenience for Unicode, since it is generally larger than char, but its presence does not mean that it supports UTF-16 or UTF-32 (in ANSI C, wchar_t is allowed to be the same size as char). So you'll still be platform dependent if you want to store a single character in a wchar_t variable. However, a "wide" string constant (an array of wchar_t) should be able to store and represent any Unicode character or codepoint, possibly by mapping one codepoint to several wchar_t code units...

Unlike Java's "char" type, which is always an unsigned 16-bit integer on all platforms, there's no standard size for wchar_t in C and C++...

----- Original Message -----
From: Rick Cameron
To: [EMAIL PROTECTED]
Sent: Monday, March 01, 2004 8:13 PM
Subject: What's in a wchar_t string on unix?

Hi, all

This may be an FAQ, but I couldn't find the answer on unicode.org. It seems that most flavours of unix define wchar_t to be 4 bytes. If the locale is set to be Unicode, what's in a wchar_t string? Is it UTF-32, or UTF-16 with the code units zero-extended to 4 bytes?

Cheers

- rick cameron

