[EMAIL PROTECTED] wrote:
> 
> Antoine Leca wrote:
> >   char C_thai[] =
> > "\u0E40\u0E02\u0E17\u0E32\u0E49\u0E1B\u0E07\u0E1C\u0E33";
> 
> Would the Unicode values be converted to the local SBCS/MBCS character set?

In this case, yes (assuming a normal C compiler).

With wchar_t / L"...", they are converted to the local "wide character set",
which happens to be Unicode on most boxes, with the following main exceptions:

- some (cheap) C compilers do not have any special support for wchar_t,
 so it defaults to the same type as char, and is usually 8 bits wide;

- with East Asian C compilers, wchar_t is either Unicode or
 a flat character coding, in which every character, whether coded as SBCS
 or DBCS, is stored with its nominal legacy code in a 16-bit or 32-bit cell
 (this differs from MBCS in that the ASCII characters are stored
 in cells the same width as DBCS characters);

- EBCDIC implementations have their own rules (for obvious reasons), which
 I do not know exactly (I am not sure they are consistent).

C99 also specifies that if __STDC_ISO_10646__ is defined, then the wchar_t
values are the Unicode code points (to tell whether that means UTF-16 or
UTF-32, look at WCHAR_MAX, which shows whether wchar_t is 16-bit or 32-bit).


 
> If yes:
> 
> Is the definition of this locale info part of the C99 standard itself, or is
> it operating system's locale?

It is "implementation-defined". Which means:
- it is not required in any way by the C99 Standard itself (except if
 __STDC_ISO_10646__ is defined);
- it is required to be stated in full words in the documentation for the compiler;
- it can vary with compilation options; often the OS's current locale is
 the default value, which can be overridden.

 
> And what happens to Unicode values that cannot be converted in that
> character set?

The compiler is required to fall back to something (it cannot refuse to
compile, nor can it simply drop the character); it is allowed to "fall back"
to a different character depending on the typed character, though; so for example,

  #include <stdio.h>
  int main() {  printf("%ls\n", L"\u00C0 table!");  return 0;  }

can produce, among others (shown here UTF-8 encoded):

À table!
A table!
à table!
 table!



I could go on about this subject (all of this should eventually be
written up in a FAQ anyway), but I do not want to flood the list with a
marginally interesting subject.


Antoine
