Hi,
Ho,
As you might know, since version 2.6 GLib uses UTF-8 as the file name encoding for all (hopefully) of its API on Windows. It provides so-called gstdio wrappers in <glib/gstdio.h> for the standard POSIX and C functions that take pathnames as arguments, like g_open().
On Unix, these wrappers are simply #defines for the actual C or POSIX function. On Windows, they convert from UTF-8 to wide characters and call the C library's wide character function, for instance _wopen() in g_open(). (Let's ignore Win9x for now.)
That ignorance can be performed without any difficulty. That Win9x thing is not an operating system. It is a bad joke.
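To make the wrapper behaviour described above concrete, here is a minimal sketch. It is not GLib's actual implementation, and the name sketch_open_utf8 is made up for illustration. On Windows it converts the UTF-8 name with MultiByteToWideChar() and calls _wopen(); elsewhere it is just a #define for open().

```c
#include <errno.h>
#include <fcntl.h>

#ifdef _WIN32
#include <windows.h>
#include <io.h>
#include <stdlib.h>

/* Hypothetical helper: open a file whose name is given in UTF-8. */
static int sketch_open_utf8(const char *utf8_name, int flags, int mode)
{
    wchar_t *wide;
    int fd, saved_errno;
    /* First call: ask how many wide characters are needed (incl. NUL). */
    int n = MultiByteToWideChar(CP_UTF8, 0, utf8_name, -1, NULL, 0);

    if (n == 0) {
        errno = EINVAL;              /* name is not valid UTF-8 */
        return -1;
    }
    wide = malloc((size_t)n * sizeof *wide);
    if (wide == NULL) {
        errno = ENOMEM;
        return -1;
    }
    /* Second call: do the actual conversion. */
    MultiByteToWideChar(CP_UTF8, 0, utf8_name, -1, wide, n);

    fd = _wopen(wide, flags, mode);
    saved_errno = errno;             /* free() may clobber errno */
    free(wide);
    errno = saved_errno;
    return fd;
}
#else
#include <unistd.h>

/* On Unix the wrapper is simply the POSIX function. */
#define sketch_open_utf8(name, flags, mode) open((name), (flags), (mode))
#endif
```

A caller would use it exactly like open(), for example sketch_open_utf8(name, O_RDONLY, 0), and pass UTF-8 on every platform.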
There were two reasons for this change:
1) Windows file names *are* in Unicode in the file system, so it's certainly most correct to handle them as Unicode and not shoehorn them into a restricted codepage representation. For instance, this makes it possible to support file names with Cyrillic letters on a Western European Windows box. I think it is also relatively common in CJK locales to use characters not in the corresponding double-byte codepage.
You bet. I have a box with British English Windows and still have files which employ Japanese katakana, hiragana and/or kanji in their names.
2) In the double-byte code pages the trailing byte can be '\\', which otherwise is a directory separator. This means that all code that scans pathnames byte by byte looking for backslashes (either stepping through a string manually, or using strchr() or strrchr()) is broken by design, and would need to be rewritten heavily with ugly ifdefs to use multi-byte string functions on Win32. There are a lot of such places. UTF-8 doesn't have any such issue.
Precisely, Unicode does not have the issue. UTF-8, UTF-16 and UTF-32 are just coding forms for the same standard. They are algorithmically convertible.
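A small illustration of the trailing-byte problem from point 2), as a sketch only. The helper name last_backslash and the sample path are made up. The path contains KATAKANA LETTER SO (U+30BD), which is 0x83 0x5C in Shift-JIS (code page 932) and 0xE3 0x82 0xBD in UTF-8.

```c
#include <string.h>

/* Byte-wise scan, as a lot of existing code does it:
 * offset of the last '\\' byte in path, or -1 if there is none. */
static int last_backslash(const char *path)
{
    const char *p = strrchr(path, '\\');
    return p ? (int)(p - path) : -1;
}

/* "C:\<SO>.txt" in Shift-JIS: the trail byte 0x5C looks like a backslash. */
static const char sjis_path[] = "C:\\" "\x83\x5C" ".txt";

/* The same name in UTF-8: every byte of a multi-byte sequence is >= 0x80,
 * so it can never be mistaken for '\\' (0x5C). */
static const char utf8_path[] = "C:\\" "\xE3\x82\xBD" ".txt";
```

With these strings, last_backslash(sjis_path) yields 4, which points into the middle of the character, while last_backslash(utf8_path) yields 2, the real separator.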
Now, upper level GNOME libraries that use GLib can mostly be converted trivially to use the gstdio wrappers. (I use "GNOME" in a loose sense here. Of course a GNOME desktop as such doesn't and won't exist on Windows, but many of the GNOME libraries are being ported to Windows so that it will be possible to build GNOME applications on Windows.)
The problem is libraries that don't use GLib but are widely used by GNOME libraries, for instance libxml2.
Yes.
As the GNOME libs get "UTF-8 aware", i.e. are converted to use the gstdio wrappers, what should be done with pathnames passed to libxml2? If I convert them to system codepage, this means it won't work to have XML files with pathnames that aren't representable in the system codepage. This will not be good, as the intention otherwise is to make everything work just fine with any non-ASCII file name.
I found one earlier message to this list about this issue, http://mail.gnome.org/archives/xml/2001-October/msg00072.html . There the suggested solution was to override libxml2's default I/O interface. Presumably this would be by calling xmlRegisterInputCallbacks() with an open callback that would call the gstdio wrappers, but otherwise would be more or less a copy of the default xmlFileOpen(). Is this still the recommended approach?
Plugging in your own I/O is still the recommended approach. I hope that will someday change on all platforms. I would love to see Unicode as mandatory for file name storage everywhere. In fact, I would love it if all non-Unicode encodings would just vanish.
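For what it's worth, the callback set could look roughly like the sketch below. The function signatures follow the callback typedefs in <libxml/xmlIO.h>. The sketch_ names are made up, and fopen() stands in for g_fopen() so that the fragment builds without GLib or libxml2; the registration call is therefore only shown in a comment. Treat it as a starting point, not as a copy of xmlFileOpen().

```c
#include <stdio.h>
#include <string.h>

/* Stand-in so the sketch builds without GLib. With GLib this would be
 * g_fopen() from <glib/gstdio.h>, which expects a UTF-8 file name. */
#define SKETCH_FOPEN_UTF8(name, mode) fopen((name), (mode))

/* Match callback: claim plain paths and leave "scheme://" names
 * (http, ftp, file URIs) to the handlers that are already registered. */
static int sketch_match(const char *filename)
{
    return filename != NULL && strstr(filename, "://") == NULL;
}

/* Open callback: the returned pointer is the context for read/close. */
static void *sketch_open(const char *filename)
{
    return SKETCH_FOPEN_UTF8(filename, "rb");
}

/* Read callback: number of bytes read, 0 at end of file, -1 on error. */
static int sketch_read(void *context, char *buffer, int len)
{
    FILE *f = context;
    size_t n;

    if (f == NULL || buffer == NULL || len < 0)
        return -1;
    n = fread(buffer, 1, (size_t)len, f);
    return (n == 0 && ferror(f)) ? -1 : (int)n;
}

/* Close callback: 0 on success, -1 on error. */
static int sketch_close(void *context)
{
    return (context != NULL && fclose(context) == 0) ? 0 : -1;
}

/* With libxml2 available, registration would be:
 *   xmlRegisterInputCallbacks(sketch_match, sketch_open,
 *                             sketch_read, sketch_close);
 */
```

The match callback deliberately declines anything that looks like a URI, so the existing handlers for those keep working.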
Now, using Unicode file names by default would certainly make libxml2 inoperable on all Windows incarnations which don't use the NTFS filesystem. I would welcome that.
But there are embedded platforms. Never forget, libxml2 does not only power the desktops like KDE or GNOME, it is also used on embedded hardware. How many of these can afford to support full Unicode range, given the memory and storage constraints?
Ciao, Igor
