Re: UTF8 locale & shell encoding

Philippe Verdy Fri, 16 Jan 2004 09:13:07 -0800

From: "Jon Hanna" <[EMAIL PROTECTED]>
> > It would be good to say that this depends on the compiler tool you use,
> > and its version...
>
> True, I was referring here to Visual C++. The naming convention has been
> relatively stable for the last few versions IIRC.
>
> > There's nothing less portable _on Windows_ than the "standard
> > C/C++ library", which tries to mimic more or less successfully what is
> > offered on Unix/Linux and other POSIX systems...
>
> It's not a good idea to code as if other values have any degree of
> cross-compiler and/or cross-platform stability unless you are explicitly
> coding to a standard which does define them (as I believe POSIX does) in
> addition to standards related to C++ itself.


Note the words that I underlined: _on Windows_

The most portable way _on Windows_ to convert between UTF-8 and the ACP/OEMCP
code pages remains the MultiByteToWideChar API. Sure, it is a Windows-specific
API, but it is supported by all C/C++ compilers for Windows, whatever their
level of support for locales (most compilers on Windows have very basic
support for locales, and weak or no support for locales other than "C"). So
any code that depends on locale names on Windows is very likely to fail,
because POSIX locales other than "C" are simply not supported.
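As a rough illustration of that route, here is a minimal sketch that converts a UTF-8 string to the ANSI code page by going through UTF-16. The function name utf8_to_acp, the fixed 512-unit intermediate buffer and the non-Windows stub are assumptions made for this example only; a real program would size its buffers dynamically.

```c
#ifdef _WIN32
#include <windows.h>

/* Hypothetical helper: convert NUL-terminated UTF-8 to the system ANSI
   code page (ACP). Returns the number of bytes written, including the
   terminating NUL, or -1 on failure. */
int utf8_to_acp(const char *utf8, char *out, int out_size)
{
    wchar_t wide[512];  /* fixed size keeps the sketch short */
    int wlen, n;

    if (out_size <= 0)
        return -1;

    /* Step 1: UTF-8 -> UTF-16. Passing -1 converts the NUL as well. */
    wlen = MultiByteToWideChar(CP_UTF8, 0, utf8, -1, wide, 512);
    if (wlen == 0)
        return -1;

    /* Step 2: UTF-16 -> ACP. Characters missing from the code page are
       replaced by the system default character. */
    n = WideCharToMultiByte(CP_ACP, 0, wide, wlen, out, out_size, NULL, NULL);
    return n == 0 ? -1 : n;
}
#else
/* Stub so the file still compiles elsewhere: the API is Windows-only. */
int utf8_to_acp(const char *utf8, char *out, int out_size)
{
    (void)utf8; (void)out; (void)out_size;
    return -1;
}
#endif
```

Replacing CP_ACP with CP_OEMCP in the second call would target the OEM code page used by the console instead.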

Of course you can choose which compiler and version to use if you build your
own binaries. But if you want to make the _source_ code portable, you then
need a compromise: state explicitly which compiler and version you support
with this source code.

The question from Deepak is then correctly answered: POSIX locales are not a
great help on Windows, where there is not even a system environment setting
to define one. You need a POSIX emulation layer to artificially infer a POSIX
locale from the system locale information exposed by Win32 APIs like
GetACP() or GetOEMCP(), and by other APIs that return the user's regional
settings for language and number/date formatting. Such an emulation layer is
built into the Windows port of the Java VM and its core libraries.
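To give an idea of what the smallest piece of such an emulation layer might look like, the sketch below turns a Windows code page number into a POSIX-style codeset name. The helper name codepage_to_codeset and the "CPnnnn" naming scheme are assumptions for illustration; actual emulation layers use their own tables and names. It also assumes a C99 snprintf.

```c
#include <stdio.h>

/* Hypothetical helper: write a POSIX-style codeset name for a Windows
   code page number into buf. Returns 0 on success, -1 if buf is too
   small. The "CPnnnn" naming is an assumption for this sketch. */
int codepage_to_codeset(unsigned cp, char *buf, size_t size)
{
    int n;

    if (cp == 65001u)           /* 65001 is the value of CP_UTF8 */
        n = snprintf(buf, size, "UTF-8");
    else
        n = snprintf(buf, size, "CP%u", cp);

    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}

/* On Windows the argument would come from the system, for example:
       codepage_to_codeset(GetACP(), name, sizeof name);
   The language and territory parts of the locale name would have to be
   derived separately from the user's regional settings. */
```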

Of course you can use the wcs* and mbs* functions on Windows, but the
conversion of character encodings is something you must build yourself, or
do with the support functions built into the standard C library of a specific
compiler and version. Most of the time these can only convert between the
ACP code page (used by the mbs* functions and the ANSI version of the Win32
API) and UTF-16 (used by the wcs* functions and the _UNICODE version of the
Win32 API). Standard C libraries for Windows that are based on (char*)
strings assume most of the time that filenames are given in the local ANSI
code page (see the result of GetACP()), that output to a console or to
DOS-emulation functions uses the OEMCP, and that calls to the _UNICODE
versions of the Win32 API with (wchar_t*) strings use UTF-16.
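For the "build it yourself" option, a hand-written UTF-8 to UTF-16 decoder could look roughly like the sketch below. It depends on no locale support at all. The function name utf8_to_utf16 and its error convention are assumptions for this example, and it uses uint16_t rather than wchar_t so that it behaves the same with any compiler.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode NUL-terminated UTF-8 into UTF-16 code units.
   Returns the number of units written (terminator excluded), or -1 if the
   input is malformed or the output buffer (cap units) is too small. */
long utf8_to_utf16(const unsigned char *in, uint16_t *out, size_t cap)
{
    /* Smallest code point allowed for each sequence length. */
    static const uint32_t min_cp[4] = { 0, 0x80, 0x800, 0x10000 };
    size_t n = 0;

    while (*in) {
        unsigned char b = *in++;
        uint32_t cp;
        int extra, i;

        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else return -1;              /* stray continuation or invalid lead */

        for (i = 0; i < extra; i++) {
            if ((*in & 0xC0) != 0x80)
                return -1;           /* also catches a truncated sequence */
            cp = (cp << 6) | (uint32_t)(*in++ & 0x3F);
        }

        /* Reject overlong forms, surrogates and out-of-range values. */
        if (cp < min_cp[extra] || cp > 0x10FFFF ||
            (cp >= 0xD800 && cp <= 0xDFFF))
            return -1;

        if (cp < 0x10000) {
            if (n + 1 >= cap) return -1;
            out[n++] = (uint16_t)cp;
        } else {                     /* encode as a surrogate pair */
            if (n + 2 >= cap) return -1;
            cp -= 0x10000;
            out[n++] = (uint16_t)(0xD800 | (cp >> 10));
            out[n++] = (uint16_t)(0xDC00 | (cp & 0x3FF));
        }
    }
    if (n >= cap) return -1;
    out[n] = 0;
    return (long)n;
}
```

Converting the result further to the ACP or OEMCP still needs code page tables, which is the part that is hard to do without the system API.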

Fortunately, the MultiByteToWideChar() API (and its reverse,
WideCharToMultiByte()) works correctly even on Windows 95, provided that it
is only used to convert between UTF-16 on one side and ACP, OEMCP or UTF-8
on the other side.

Support for other code pages (including Windows-1252 on non-European versions
of Windows) is not guaranteed: it won't work on Windows 95, and may work on
Windows NT/2000/XP provided that these extra code pages have been installed
by the Administrator in the regional settings control panel.
