Codepages and locales

Bertho Stultiens Mon, 29 May 2000 04:44:39 -0700
Hi Y'all,

I am currently implementing the Wine Message Compiler (wmc, as an
alternative to mc.exe). I need to implement quite a bit of Unicode
support for it to function correctly. Unicode requires a lot of
conversion tables, and I generated them for nearly all codepages (from
ftp.unicode.org).

There is supposedly a difference between the A/W versions of
FormatMessage, and the messagetable resources *can* be in Unicode
format, contrary to the comments in the Wine source. The original message
compiler has extra switches for this (-u and -U). However, I do not know
whether it is actually used, other than on winnt. Can someone confirm
this?

The complex DBCS codepages (cp932..cp950; including leadbytes for
extension) generate tables which are approx 128kB each in size (total
about 500kB). More than 80% of the program is occupied by the tables.
The tables are, strictly speaking, also required for wrc and wine{lib}.
Wouldn't it be better to make the codepages "loadable"? This would save
quite a bit of static (r/o) data in the executable and make it possible
to share. If so, what should the file/data format and API interface for
the tables look like (I do not directly mean MultiByteToWideChar and
friends, but they should be considered too)? I would prefer to
encapsulate the data into ELF shared libraries so that we can take
advantage of the .rodata mapping being OS maintained. Otherwise, you
either need to write into the tables for indirection, or create extra
tables at runtime on which you can build indirection tables later.
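
To make the indirection point concrete, here is a minimal sketch of a
table that uses offsets instead of pointers, so it needs no fixups and
can sit untouched in .rodata. The layout, the names demo_table and
cp_to_unicode, and the three sample entries are assumptions for
illustration only, not an existing Wine format. The one DBCS entry shown
is cp932 0x81 0x40, which maps to U+3000.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical flat layout (all entries are 16-bit):
 *   [0..255]    per-byte offset of a 256-entry trail block,
 *               or 0 if the byte is a single-byte character
 *   [256..511]  Unicode values for single-byte characters
 *   [512.. ]    trail blocks, one per lead byte
 * Unmapped entries are simply 0 in this demo.
 */
const uint16_t demo_table[768] = {
    [0x81]        = 512,     /* 0x81 is a lead byte, block at 512 */
    [256 + 0x41]  = 0x0041,  /* 'A' maps to U+0041 */
    [512 + 0x40]  = 0x3000   /* 0x81 0x40 maps to U+3000 */
};

/* Convert one character; *used receives the number of bytes consumed. */
uint16_t cp_to_unicode(const uint16_t *tbl, const unsigned char *s,
                       size_t len, size_t *used)
{
    uint16_t off;

    assert(tbl != NULL && s != NULL && used != NULL && len > 0);
    off = tbl[s[0]];
    if (off == 0) {          /* single-byte character */
        *used = 1;
        return tbl[256 + s[0]];
    }
    if (len < 2) {           /* truncated DBCS sequence */
        *used = 1;
        return 0xFFFD;
    }
    *used = 2;
    return tbl[off + s[1]];  /* offset, not pointer: no relocation */
}
```

Because every reference inside the table is an index, the same bytes
could be mapped from a shared library or from a plain data file without
any runtime patching.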

Then there is the issue of ToUpper/ToLower, strcoll, etc... Should we
rely on the collate-info from libc and the system language setting, or
build that into wine as well (also with the glibc bug in mind)? An
alternative would be to switch entirely to glibc's support for i18n and
l10n, but that would require lots of extra conversions because of
wchar_t w.r.t. WCHAR. Yet another problem here is the lack of per-thread
language separation in glibc.
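
As an illustration of the extra conversions: WCHAR is 16 bits wide,
while glibc's wchar_t is 32 bits, so every string handed to a glibc
wide-character function would first have to be widened. A minimal
sketch, assuming a 32-bit wchar_t and treating the 16-bit input as
UTF-16 (the function name wchar16_to_wchar is made up for this example):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Widen n 16-bit units into dst, combining surrogate pairs.
 * dst must have room for n entries. Returns the number of
 * wchar_t values written. Assumes wchar_t holds 32 bits.
 */
size_t wchar16_to_wchar(const uint16_t *src, size_t n, wchar_t *dst)
{
    size_t i = 0, out = 0;

    assert(src != NULL && dst != NULL);
    while (i < n) {
        uint32_t c = src[i++];
        if (c >= 0xD800 && c <= 0xDBFF && i < n &&
            src[i] >= 0xDC00 && src[i] <= 0xDFFF) {
            /* high + low surrogate form one code point */
            c = 0x10000 + ((c - 0xD800) << 10) + (src[i++] - 0xDC00);
        }
        dst[out++] = (wchar_t)c;
    }
    return out;
}
```

A copy like this (and the reverse on the way back) would be needed
around every call that crosses the WCHAR/wchar_t boundary.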

Also, one of the conclusions of the recent discussion on ANSI/widechar
functions was that we need to support codepages. I would add that the
language support should be fixed at the same time.

BTW, I also noticed that the implementation of MultiByte*, LeadByte*
and friends is way off from what it is supposed to do. Some things
work, but most of it gives wrong information. And, there is only one
codepage, 437, which causes a couple of problems with my foreign
stuff...
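
For reference, a minimal sketch of what a per-codepage lead-byte test
could look like. It uses the same convention as the LeadByte array in
the Windows CPINFO structure: pairs of (first, last) bytes, terminated
by a pair of zeros. The ranges shown are those of cp932; the names
CP_MAX_LEADBYTES, cp932_leadbytes and is_lead_byte are made up for this
example.

```c
#include <assert.h>
#include <stddef.h>

#define CP_MAX_LEADBYTES 12   /* same size as CPINFO's LeadByte array */

/* cp932 lead-byte ranges: 0x81-0x9F and 0xE0-0xFC, then terminator */
const unsigned char cp932_leadbytes[CP_MAX_LEADBYTES] = {
    0x81, 0x9F, 0xE0, 0xFC, 0, 0
};

/* Return 1 if c falls inside one of the (first, last) ranges. */
int is_lead_byte(const unsigned char *ranges, unsigned char c)
{
    size_t i;

    assert(ranges != NULL);
    /* a first byte of 0 marks the terminating pair */
    for (i = 0; i + 1 < CP_MAX_LEADBYTES && ranges[i] != 0; i += 2) {
        if (c >= ranges[i] && c <= ranges[i + 1])
            return 1;
    }
    return 0;
}
```

With one such array per codepage, the answer depends on the codepage
asked for instead of being hardwired to a single one.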

Another thing that I noticed was that a lot of the NLS data (ole/nls/*)
is plain wrong. I need the language/codepage identification for wmc. A
lot of different countries have exactly the same language-id, which is
not possible in real life... Are these files used? Is anybody
maintaining them? Are there plans for changing the
location/content/interpretation? And, shouldn't they also be runtime
loadable, instead of occupying memory? In the meantime, I extracted all data
for wmc with a small win98 hack program. The same can be done for all
other locale-values for the 70 or so locales supported by my win98 copy
(according to what EnumSystemLocales() says).

An alternative to windows extraction and ftp.unicode.org is to get things
from ftp.dkuug.dk/i18n for all sorts of mappings and standards involving
codepages and locales. Their data may be more complete, but it also takes
more effort to convert into tables that are usable for us.

Greetings Bertho

PS, the message compiler "mc.exe" from the SDK tools crashes on native
win98, but runs fine under Wine. Way to go!