Hi all,

Here is a possible way to address this issue:

The XML4C implementation assumes that the target platform has a native Unicode data type and a set of C/C++ APIs that can process Unicode in a locale-sensitive fashion.

- Solaris, AIX, and Linux provide a native Unicode data type (and process code) by extending the ANSI/ISO/IEC C/C++ standard definition of the wchar_t data type. This is perfectly legal per the standard. However, relying on this extension limits the portability of xml4c to platforms that make the same extension to wchar_t.

- HP-UX does not extend wchar_t to be Unicode. I know this is true for traditional locales (like Japanese Shift-JIS); I left v-mail with the libc team in Cupertino to see if perhaps this extension applies for UTF-8 locales. This is not the XML4C team's problem. This is simply a lack of needed features on HP-UX. But since we can't control whether or not HP-UX provides these features...

- We will need to implement a set of Unicode APIs in order to use xml4c on non-ASCII, non-western European data.

A good way to make this more visible (and easier to address) would be to create an abstraction layer - an XML4C-specific Unicode data type and set of Unicode processing routines.

So... I suggest that current usage of wchar_t in xml4c be replaced with a new set of API names:

    wcstombs() --> xml4c_UnicodeToMbs()
    towupper() --> xml4c_UnicodeToUpper()
    ...

Then, create a directory in the source for each platform to which xml4c will be ported. In that directory, the implementation/mappings of the xml4c_Unicode*() APIs for that platform must be defined.

For AIX, Solaris, and Linux, it sounds like this will be a simple #define wrapper for their wchar_t routines. This takes advantage of their extensions, providing a level of abstraction without sacrificing performance.

The source for xml4c is then truly portable to any ANSI/ISO/IEC C/C++ conformant platform; part of the port is to define the platform mappings for the required Unicode APIs and data type.
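
As a rough illustration of the mapping idea above, a per-platform header might look something like the sketch below. Only xml4c_UnicodeToMbs() and xml4c_UnicodeToUpper() are names from the proposal; the type name xml4c_UnicodeChar, the helper xml4c_UnicodeLen(), and the macro XML4C_WCHAR_IS_NOT_UNICODE are made up for this example. The sketch also uses inline functions where the proposal mentions #define wrappers, purely to get type checking; it is not how xml4c is actually organized.

```cpp
// Hypothetical per-platform mapping header (names marked "hypothetical"
// do not come from the xml4c sources).
#include <cstddef>
#include <cstdlib>
#include <cwchar>
#include <cwctype>

#if !defined(XML4C_WCHAR_IS_NOT_UNICODE)  // hypothetical configuration macro

// Platforms whose wchar_t holds Unicode (AIX, Solaris, Linux in the
// discussion above): forward straight to the wide character routines.
typedef wchar_t xml4c_UnicodeChar;        // hypothetical type name

inline xml4c_UnicodeChar xml4c_UnicodeToUpper(xml4c_UnicodeChar c)
{
    return static_cast<xml4c_UnicodeChar>(
        std::towupper(static_cast<std::wint_t>(c)));
}

inline std::size_t xml4c_UnicodeToMbs(char* dst,
                                      const xml4c_UnicodeChar* src,
                                      std::size_t maxBytes)
{
    return std::wcstombs(dst, src, maxBytes);
}

// Hypothetical helper standing in for wcslen().
inline std::size_t xml4c_UnicodeLen(const xml4c_UnicodeChar* s)
{
    return std::wcslen(s);
}

#else

// Platforms whose wchar_t is not Unicode: the port supplies its own
// implementations of these declarations in its platform directory.
typedef unsigned short xml4c_UnicodeChar; // hypothetical type name

xml4c_UnicodeChar xml4c_UnicodeToUpper(xml4c_UnicodeChar c);
std::size_t xml4c_UnicodeToMbs(char* dst,
                               const xml4c_UnicodeChar* src,
                               std::size_t maxBytes);
std::size_t xml4c_UnicodeLen(const xml4c_UnicodeChar* s);

#endif
```

With a layout along these lines, a port to a platform such as HP-UX would define the macro and provide the three functions itself, while the rest of the source would call only the xml4c_Unicode*() names.
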
This becomes very visible to anyone contemplating a port of xml4c.

A quick analysis of the source tree shows that the following files contain usage of the wchar_t data type and/or APIs. These files would need to be modified to use the new xml4c_Unicode*() APIs:

    ./xml4csrc3_0_0/samples/domcount/domcount.cpp
    ./xml4csrc3_0_0/samples/enumval/enumval.cpp
    ./xml4csrc3_0_0/samples/memparse/memparse.cpp
    ./xml4csrc3_0_0/samples/pparse/pparse.cpp
    ./xml4csrc3_0_0/samples/redirect/redirect.cpp
    ./xml4csrc3_0_0/samples/redirect/redirecthandlers.cpp
    ./xml4csrc3_0_0/samples/saxcount/saxcount.cpp
    ./xml4csrc3_0_0/samples/saxprint/saxprint.cpp
    ./xml4csrc3_0_0/samples/stdinparse/stdinparse.cpp

    ./xml4csrc3_0_0/src/util/compilers/borlandcdefs.hpp
    ./xml4csrc3_0_0/src/util/compilers/vcppdefs.hpp
    ./xml4csrc3_0_0/src/util/transcoders/iconv/iconvtransservice.cpp
    ./xml4csrc3_0_0/src/util/transcoders/win32/win32transservice.cpp
    ./xml4csrc3_0_0/tools/nls/xlat/xlat.cpp
    ./xml4csrc3_0_0/tools/nls/xlat/xlat_cppsrc.cpp
    ./xml4csrc3_0_0/tools/nls/xlat/xlat_msgcatalog.cpp
    ./xml4csrc3_0_0/tools/nls/xlat/xlat_win32rc.cpp

Note that the first block of files is in the */samples directory; these are code examples and do not affect HP's use of xml4c. The second block of files is used in the parser's implementation. The APIs used in these files (and hence the ones for which we'll need to implement an HP wrapper) are currently:

    isw*()
    mbstowcs()
    mbtowc()
    towupper()
    wcscmp()
    wcslen()
    wcstombs()
    wprintf()
    wint_t    /* data type */
    wchar_t   /* data type */

Note that this list could potentially grow over time to include the entire set of wide character APIs from libc.

Regards,
Mike Krause
Software Engineer
Hewlett-Packard Company

PS: Thanks Dean and folks at IBM for the good discussion on this subject! We're looking forward to helping out with a solution to this problem on the HP-UX platform.

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
Sent: Wednesday, January 12, 2000 6:09 PM
To: [EMAIL PROTECTED]
Subject: More Xerces-C character encoding discussion [Xerces-C : HPUX]

We have been discussing character encoding issues on their platforms with the HP folks. Some of this has been tangentially discussed already, but this time I'll concentrate on the HP-specific issues. Anyone with any comments, please feel free to speak up.

Issue: HP platforms don't necessarily store Unicode in their wchar_t type. What they store is actually locale specific, and I assume it's never actually Unicode in any locale. This definitely raises some issues, because the parser is very Unicode-centric.

Possible Solutions:

1 - Definition of XMLCh. I don't think that such platforms would want to float XMLCh to wchar_t. They should set XMLCh to a 16-bit unsigned value. This setting is controlled in the per-compiler file. The XML parser code will automatically readjust to this, though there might be some remaining issues in some of the platform-specific files like the pluggable transcoders, which will get worked out as we find them. But the rest of the XML parser code and DOM code will automagically just compile with XMLCh set to either a 16-bit or 32-bit value. This will prevent any accidental interoperability of the local non-Unicode wide character APIs and the Unicode APIs of the parser. All of the parser APIs would then only accept unsigned shorts, and the parser would also spit out unsigned short XML data, so it would be obvious where and when you needed to transcode in and out of the system, and L"foo" won't be passable to any XMLCh API.

2 - Calls to the System. All calls to system or runtime APIs from the parser itself go through the base abstractions that are plugged into the bottom of the parser. In particular, all of the system APIs are called in the per-platform support file.
So, for a platform such as HP, these support files will certainly have to preflight the incoming XMLCh data before passing it on to the system APIs that they call. By providing such preflighting code in the HP platform support file, a large share of the problems will be taken care of.

3 - Transcoders. There are issues wrt the plugged-in transcoder implementation. It is likely that the HP platform will have to provide its own iconv-based transcoder service implementation. This implementation will have to put a buffer between incoming Unicode and the local iconv APIs, and between any outgoing transcoded text that needs to come back to the parser in Unicode format. Providing this specialized transcoder implementation will handle the bulk of the remaining issues. Whether this means that the existing iconv-based transcoder is just spiffed up with some conditional code or not, I don't know. If supporting these extra steps imposed any significant extra complexity or overhead, I would argue for the HP folks maintaining their own iconv-based transcoder implementation. But they can always just do the work and let's see what the differences are. If they are reasonable to get into the existing iconv transcoder, then we can go that way.

4 - Unicode normalization. The XML parser assumes that all plugged-in transcoding services pre-normalize all text that they transcode into the Unicode encoding. If a service does not, then the parser will make no attempt to compensate for this. So, if you provide a transcoding service, and normalization is important to you, you might have to do some post-processing of transcoded text blocks to normalize them before returning the block of Unicode characters to the parser. The HP folks believe that the HP implementation of iconv does not do this. We do not know if the other Unixes do so.

5 - The Samples. Basically, we are leaning towards saying that on platforms such as the HP ones, where wide chars are not Unicode, the samples just won't work.
We are loath to turn the samples into overly complicated lessons on internationalization, when they are really intended to be simple demonstrations of how to use the parser. Making them industrial strength would not necessarily be in the best interests of keeping them relatively straightforward. We will probably just have to document this fact, and provide some basic guidance about the real effort required in writing fully portable code on top of the parser. I don't think that we can provide any really deep tutorial on the subject, though, since it would be a book unto itself and that's not what we are geared for. Perhaps the Internationalization folks might provide some good links to send people to look at, though probably all of their discussion is oriented towards wchar_t being Unicode as well. But at least we can provide a little warning that such platforms present special concerns.

6 - Short Character Constants. Basically, most incoming APIs of the parser have an alternate method that takes short characters. These characters are transcoded to Unicode using the 'local code page transcoder'. This transcoder is obtained by the parser by asking the installed transcoding service to provide one. The platform initialization code of each particular platform should do whatever is required to make sure that this transcoder is doing the right thing for whatever encoding a short character constant (i.e. "foo") means on that platform. If this means consulting locale data or whatever, this can be done by the platform implementation's initialization code. The parser does not get involved in such things.

* We want to stress that there should be NO calls to system APIs or wide character runtime APIs in the parser itself. If you find any, please report them since they are bugs. Any such work done by the parser should be done via the provided abstraction classes in the util/ directory, mostly XMLString and the transcoding service abstractions.
If this is not strictly followed, then #2 and #3 won't work correctly in these types of situations.

I think that, if the HP platform utilities are written to take this issue into account, and they provide a transcoder aware of the issues, that will probably be sufficient to solve the vast bulk of the issues, and perhaps all of them (at least all of the ones that we believe should be dealt with). It will always be required on such platforms to transcode data going into the parser or coming out of it. This only leaves the access by the parser to system and transcoding services. As long as appropriately aware versions of such services are plugged into the parser from the bottom, everything should work out.

Anyway, these are some of the obvious issues, and this is intended to just kick-start the discussion. Please respond to this document with any thoughts you have on the subject and let's beat them out.

----------------------------------------
Dean Roddey
Software Weenie
IBM Center for Java Technology - Silicon Valley
[EMAIL PROTECTED]