RE: More Xerces-C character encoding discussion

KRAUSE,MIKE (HP-FtCollins,ex1) 18 Jan 2000 00:15:02 -0000

Hi all,

Here is a possible way to address this issue:


The XML4C implementation assumes that the target platform has a native
Unicode data type and a set of C/C++ APIs that can process Unicode in a
locale-sensitive fashion.

   - Solaris, AIX, and Linux provide a native Unicode data type (and process
     code) by extending the ANSI/ISO/IEC C/C++ standard definition of
     the wchar_t data type.  This is perfectly legal per the standard.
     However, relying on this extension limits the portability of xml4c
     to platforms that make the same extension to wchar_t.

   - HP-UX does not extend wchar_t to be Unicode.  I know this is true
     for traditional locales (like Japanese Shift-JIS); I left v-mail with
     the libc team in Cupertino to see if perhaps this extension applies for
     UTF-8 locales.

     This is not the XML4C team's problem.  This is simply a lack of
     needed features on HP-UX.  But since we can't control whether or
     not HP-UX provides these features...

   - We will need to implement a set of Unicode APIs in order to use
     xml4c on non-ASCII, non-Western-European data.

A good way to make this more visible (and easier to address) would be to
create an abstraction layer - an XML4C-specific Unicode data type and
set of Unicode processing routines.  So...  I suggest that current usage
of wchar_t in xml4c be replaced with a new set of API names:

   wcstombs() -->  xml4c_UnicodeToMbs()
   towupper() -->  xml4c_UnicodeToUpper()
   ...

Then, create a directory in the source for each platform to which xml4c
will be ported.  In that directory, the implementation/mappings of the
xml4c_Unicode*() APIs for that platform must be defined.  For AIX,
Solaris, and Linux, it sounds like this will be a simple #define wrapper
for their wchar_t routines.  This takes advantage of their extensions,
providing a level of abstraction without sacrificing performance.  The
source for xml4c is now truly portable to any ANSI/ISO/IEC C/C++
conformant platform; part of the port is to define the platform mappings
for the required Unicode APIs and data type.  This becomes very visible
to anyone contemplating a port of xml4c.
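
To make the mapping idea concrete, a per-platform header might look roughly
like the sketch below.  Only the two xml4c_Unicode*() function names come
from the proposal above; the xml4c_Unicode type name, the
XML4C_WCHAR_IS_UNICODE switch, and the ASCII-only fallback bodies are
assumptions made purely for illustration, not existing xml4c code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cwctype>

#if defined(XML4C_WCHAR_IS_UNICODE)   // hypothetical port-time switch
// Platforms whose wchar_t already holds Unicode: thin #define wrappers
// over the native wide-character routines, with no extra call overhead.
typedef wchar_t xml4c_Unicode;
#define xml4c_UnicodeToMbs(dst, src, n)  std::wcstombs((dst), (src), (n))
#define xml4c_UnicodeToUpper(ch)         std::towupper(ch)
#else
// Platforms whose wchar_t is not Unicode: the port supplies its own
// routines.  These bodies are ASCII-only placeholders, not a real port.
typedef unsigned short xml4c_Unicode;

inline xml4c_Unicode xml4c_UnicodeToUpper(xml4c_Unicode ch)
{
    // Placeholder: a real port needs full Unicode case-mapping tables.
    if (ch >= 'a' && ch <= 'z')
        return static_cast<xml4c_Unicode>(ch - 'a' + 'A');
    return ch;
}

inline std::size_t xml4c_UnicodeToMbs(char* dst, const xml4c_Unicode* src,
                                      std::size_t n)
{
    // Placeholder: copies 7-bit characters and reports anything else as an
    // error, using the same (size_t)-1 convention as wcstombs().
    assert(dst != 0 && src != 0);
    std::size_t i = 0;
    for (; i < n && src[i] != 0; ++i) {
        if (src[i] > 0x7F)
            return static_cast<std::size_t>(-1);
        dst[i] = static_cast<char>(src[i]);
    }
    if (i < n)
        dst[i] = '\0';
    return i;
}
#endif
```

Callers use only the xml4c_Unicode*() names, so the same parser source
builds against either branch.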


A quick analysis of the source tree shows that the following files
contain usage of wchar_t data type and/or APIs.  These files would need
to be modified to use the new xml4c_Unicode*() APIs:

   ./xml4csrc3_0_0/samples/domcount/domcount.cpp
   ./xml4csrc3_0_0/samples/enumval/enumval.cpp
   ./xml4csrc3_0_0/samples/memparse/memparse.cpp
   ./xml4csrc3_0_0/samples/pparse/pparse.cpp
   ./xml4csrc3_0_0/samples/redirect/redirect.cpp
   ./xml4csrc3_0_0/samples/redirect/redirecthandlers.cpp
   ./xml4csrc3_0_0/samples/saxcount/saxcount.cpp
   ./xml4csrc3_0_0/samples/saxprint/saxprint.cpp
   ./xml4csrc3_0_0/samples/stdinparse/stdinparse.cpp

   ./xml4csrc3_0_0/src/util/compilers/borlandcdefs.hpp
   ./xml4csrc3_0_0/src/util/compilers/vcppdefs.hpp
   ./xml4csrc3_0_0/src/util/transcoders/iconv/iconvtransservice.cpp
   ./xml4csrc3_0_0/src/util/transcoders/win32/win32transservice.cpp
   ./xml4csrc3_0_0/tools/nls/xlat/xlat.cpp
   ./xml4csrc3_0_0/tools/nls/xlat/xlat_cppsrc.cpp
   ./xml4csrc3_0_0/tools/nls/xlat/xlat_msgcatalog.cpp
   ./xml4csrc3_0_0/tools/nls/xlat/xlat_win32rc.cpp

Note that the first block of files is in the */samples directory; these
are code examples and do not affect HP's use of xml4c.  The second block
of files is used in the parser's implementation.  The APIs used in
these files (and hence the ones for which we'll need to implement an HP
wrapper) are currently:

   isw*()
   mbstowcs()
   mbtowc()
   towupper()
   wcscmp()
   wcslen()
   wcstombs()
   wprintf()
   wint_t  /* data type */
   wchar_t /* data type */

Note that this list could potentially grow over time to include the entire
set of wide character APIs from libc.

Regards,

Mike Krause
Software Engineer
Hewlett-Packard Company

PS:  Thanks Dean and folks at IBM for the good discussion on this subject!
We're looking forward to helping out with a solution to this problem on the
HP-UX platform.

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
Sent: Wednesday, January 12, 2000 6:09 PM
To: [EMAIL PROTECTED]
Subject: More Xerces-C character encoding discussion





[Xerces-C : HPUX]

We have been talking with the HP folks about the character
encoding issues on their platforms. Some of this has been tangentially
discussed already, but this time I'll concentrate on the HP specific
issues. Anyone with any comments please feel free to speak up.

Issue:

HP platforms don't necessarily store Unicode in their wchar_t type. What
they store is actually locale-specific, and I assume it's never actually
Unicode in any locale. This definitely raises some issues, because of the
fact that the parser is very Unicode-centric.


Possible Solutions:

1 - Definition of XMLCh.

I don't think that such platforms would want to float XMLCh to wchar_t. They
should set XMLCh to a 16 bit unsigned value. This setting is controlled in
the per-compiler file. The XML parser code will automatically readjust to
this, though there might be some remaining issues in some of the platform
specific files like the pluggable transcoders, which will get worked out as
we find them. But the rest of the XML parser code and DOM code will
automagically just compile with XMLCh set to either a 16 or 32 bit value.

This will prevent any accidental interoperability of the local non-Unicode
wide character APIs and the Unicode APIs of the parser. All of the parser
APIs would then only accept unsigned shorts and it would also spit out
unsigned short XML data, so it would be obvious where and when you needed
to transcode in and out of the system, and L"foo" won't be passable to any
XMLCh API.
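
As a rough illustration of that point, here is a minimal sketch. It assumes
the per-compiler file pins XMLCh to unsigned short; xmlchLength() and kFoo
are hypothetical names standing in for any parser API and any string
constant, and are not part of the parser.

```cpp
#include <cassert>
#include <cstddef>

// Assumption: XMLCh is pinned to a 16-bit unsigned type in the
// per-compiler file instead of floating to wchar_t.
typedef unsigned short XMLCh;

// Hypothetical helper standing in for any API that accepts XMLCh text.
inline std::size_t xmlchLength(const XMLCh* text)
{
    assert(text != 0);
    std::size_t len = 0;
    while (text[len] != 0)
        ++len;
    return len;
}

// Unicode code units spelled out explicitly; independent of what the
// local wchar_t happens to hold.
static const XMLCh kFoo[] = { 0x0066, 0x006F, 0x006F, 0x0000 };  // "foo"

// xmlchLength(L"foo");
// The line above is rejected by a standard C++ compiler, because
// const wchar_t* does not convert to const unsigned short*.
```

The compile error is the useful part: it marks each place where local wide
text would otherwise have slipped into a Unicode API untranscoded.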

2 - Calls to the System:

All calls to system or runtime APIs from the parser itself go through the
base abstractions that are plugged into the bottom of the parser. In
particular all of the system APIs are called in the per-platform support
file. So, for a platform such as HP, certainly these support files will
have to preflight the incoming XMLCh data before passing it on to the
system APIs that they call. By providing such pre-flighting code in the HP
platform support file, a large amount of the problems will be taken care
of.
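
A minimal sketch of what such pre-flighting could look like in a platform
support file follows. The function names are hypothetical, and the
conversion is an ASCII-only placeholder; a real HP port would presumably
hand the text to its transcoder at that point.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

typedef unsigned short XMLCh;  // assumed 16-bit, as suggested in item 1

// Hypothetical preflight step: turn Unicode XMLCh text into the local
// narrow form that a system API expects.  Placeholder logic: only 7-bit
// characters are handled; anything else is reported as a failure.
inline bool preflightToLocal(const XMLCh* in, std::string& out)
{
    assert(in != 0);
    out.clear();
    for (; *in != 0; ++in) {
        if (*in > 0x7F)
            return false;              // would need real transcoding
        out.push_back(static_cast<char>(*in));
    }
    return true;
}

// Hypothetical platform-support entry point: the parser passes XMLCh in,
// and the system call only ever sees preflighted local text.
inline std::FILE* openFileForRead(const XMLCh* fileName)
{
    std::string local;
    if (!preflightToLocal(fileName, local))
        return 0;
    return std::fopen(local.c_str(), "rb");
}
```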

3 - Transcoders.

There are issues wrt the plugged in transcoder implementation. It is likely
that the HP platform will have to provide its own Iconv based transcoder
service implementation. This implementation will have to put a buffer
between incoming Unicode and the local iconv APIs, and between any outgoing
transcoded text that needs to come back to the parser in Unicode format.
Providing this specialized transcoder implementation will handle the bulk
of the remaining issues.

Whether this means that the existing Iconv based transcoder is just spiffed
up with some conditional code or not, I don't know. If supporting these
extra steps imposed any significant extra complexity or overhead, I would
argue for the HP folks maintaining their own Iconv based transcoder
implementation. But they can always just do the work and let's see what the
differences are. If they are reasonable to get into the existing iconv
transcoder, then we can go that way.

4 - Unicode normalization.

The XML parser assumes that all plugged in
transcoding services pre-normalize all code that it transcodes into the
Unicode encoding. If it does not, then the parser will make no attempt to
compensate for this. So, if you provide a transcoding service, and
normalization is important to you, you might have to do some post
processing of transcoded text blocks to pre-normalize them before returning
the block of Unicode characters to the parser. The HP folks believe that
the HP implementation of iconv does not do this. We do not know if the other
Unixes do so.

5 - The Samples.

Basically, we are leaning towards saying that on platforms such as the HP
ones, where wide chars are not Unicode, the samples just won't work. We
are loath to turn the samples into overly complicated lessons on
internationalization, when they are really intended to be simple demonstrations
of how to use the parser. Making them industrial strength would not
necessarily be in the best interests of keeping them relatively
straightforward.

We will probably just have to document this fact, and provide some basic
guidance about the real effort required in writing fully portable code on
top of the parser. Though I don't think that we can provide any really deep
tutorial on the subject, since it would be a book unto itself and that's
not what we are geared for. Perhaps the Internationalization folks might
provide some good links to send people to look at, though probably all of
their discussion is oriented towards wchar_t being Unicode as well.
But at least we can provide a little warning that such platforms present
special concerns.

6 -  Short Character Constants

Basically, most incoming APIs of the parser have an alternate method that
takes a short character. This character is transcoded to Unicode using the
'local code page transcoder'. This transcoder is obtained by the parser by
asking the installed transcoding service to provide one. The platform code
initialization of each particular platform should do whatever is required
to make sure that this transcoder is doing the right thing for whatever
encoding a short character constant (i.e. "foo") means on that platform. If
this means consulting locale data or whatever, this can be done by the
platform implementation's initialization code. The parser does not get
involved in such things.
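
The pairing described above can be sketched as follows. All class and
method names here are hypothetical and the ASCII transcoder is only a
stand-in; the actual parser classes and the transcoder a platform installs
will differ.

```cpp
#include <cassert>
#include <vector>

typedef unsigned short XMLCh;  // assumed 16-bit, as suggested in item 1

// Hypothetical interface for the 'local code page transcoder'.
class LocalTranscoder {
public:
    virtual ~LocalTranscoder() {}
    // Returns zero-terminated Unicode text for a local narrow string.
    virtual std::vector<XMLCh> toUnicode(const char* local) const = 0;
};

// Stand-in for a platform whose narrow literals are plain ASCII.  A real
// platform would consult locale data here if it needs to.
class AsciiTranscoder : public LocalTranscoder {
public:
    virtual std::vector<XMLCh> toUnicode(const char* local) const
    {
        assert(local != 0);
        std::vector<XMLCh> out;
        for (; *local != '\0'; ++local)
            out.push_back(static_cast<XMLCh>(
                static_cast<unsigned char>(*local)));
        out.push_back(0);
        return out;
    }
};

// Hypothetical parser-side class with the usual pair of methods.
class Element {
public:
    explicit Element(const LocalTranscoder& t) : fTranscoder(t) {}

    // Primary method: takes Unicode text directly.
    void setName(const XMLCh* name)
    {
        fName.clear();
        while (*name != 0)
            fName.push_back(*name++);
        fName.push_back(0);
    }

    // Alternate method: takes a short character string and transcodes it
    // with whatever transcoder the platform installed.
    void setName(const char* name)
    {
        const std::vector<XMLCh> wide = fTranscoder.toUnicode(name);
        setName(&wide[0]);   // never empty: always holds the terminator
    }

    const std::vector<XMLCh>& name() const { return fName; }

private:
    const LocalTranscoder& fTranscoder;
    std::vector<XMLCh>     fName;
};
```

The parser-side class never looks at the locale itself; it only calls
through the transcoder it was given.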


* We want to stress that there should be NO calls to system APIs or wide
character runtime APIs in the parser itself. If you find any, please report
them since they are bugs. Any such work done by the parser should be done
via the provided abstraction classes in the util/ directory, mostly
XMLString and the transcoding service abstractions. If this is not strictly
followed, then #2 and #3 won't work correctly in these types of situations.



I think that, if the HP platform utilities are written to take this issue
into account, and they provide a transcoder aware of the issues, probably
that will be sufficient to solve the vast bulk of the issues, and perhaps
all of them (at least all of the ones that we believe should be dealt
with.) It will always be required on such platforms to transcode data going
into the parser or coming out of it. This only leaves the access by the
parser to system and transcoding services. As long as appropriately aware
versions of such services are plugged into the parser from the bottom,
everything should work out.

Anyway, these are some of the obvious issues, and this is intended to just kick
start the discussion. Please respond to this document with any thoughts you
have on the subject and let's beat them out.

----------------------------------------
Dean Roddey
Software Weenie
IBM Center for Java Technology - Silicon Valley
[EMAIL PROTECTED]
