Some proposals for changes to the Xerces-C system

roddey 5 Jan 2000 18:36:41 -0000

I would like to put forward a discussion of some proposed changes we feel
are necessary for the Xerces-C parser code base. Please comment on these
changes soon, as we would like to get them into the Xerces 1.1.x (3.1.x for
XML4C) code base, which will probably end up being a reference release for
XML4C (i.e. it will have to live for a long time in that form, so we want
to fix this problem before then.)

The problem surrounds the issue of the definition of XMLCh. When I first
designed this stuff, I figured that XMLCh would float to whatever wchar_t
is on the particular compiler. This would allow XMLCh to be passed straight
to the wide character APIs. However, I failed to get this idea across
correctly, and most of the platforms ended up just defining XMLCh to be an
unsigned 16 bit value. This is fine technically, since the code will
compile with XMLCh set to either a 16 bit or 32 bit type. However, it does
mean that all XMLCh text must be transcoded before it can be passed to the
wide character APIs on those platforms where wchar_t is not an unsigned 16
bit value.
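
To make that transcoding cost concrete, here is a minimal sketch of the
widening step a caller would need when XMLCh is 16 bit but wchar_t is not.
The names XMLCh16 and toWide are hypothetical and are not part of the
Xerces-C API. The sketch assumes the 16 bit text is UTF-16 and that wchar_t
holds Unicode code points on the platform.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical name for a 16 bit XMLCh; not an actual Xerces-C type.
typedef unsigned short XMLCh16;

// Widen null terminated 16 bit text into a wchar_t string.
// Assumption: the input is UTF-16 and wchar_t stores Unicode code points.
inline std::wstring toWide(const XMLCh16* src) {
    std::wstring out;
    for (std::size_t i = 0; src[i] != 0; ++i) {
        unsigned long cp = src[i];
        // Combine a surrogate pair only when wchar_t is wide enough to
        // hold the resulting code point; otherwise copy units through.
        if (sizeof(wchar_t) > 2 && cp >= 0xD800 && cp <= 0xDBFF
            && src[i + 1] >= 0xDC00 && src[i + 1] <= 0xDFFF) {
            cp = 0x10000 + ((cp - 0xD800) << 10) + (src[i + 1] - 0xDC00);
            ++i;
        }
        out.push_back(static_cast<wchar_t>(cp));
    }
    return out;
}
```

Every string crossing the boundary pays for a copy like this one, which is
the overhead the proposal is meant to remove on platforms where wchar_t
holds Unicode.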

For the next release, we would like to deal with this issue by floating
XMLCh to wchar_t on all platforms as the default. I.e. all the binary drops
would come this way and the source drops would be set up this way by
default (though if you build it yourself you can certainly change it if you
have to.) VC++ and Borland C++ already work this way, i.e. XMLCh and
wchar_t are the same size. The other compiler files would be changed to
define XMLCh as wchar_t on their platforms.
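
As a rough sketch of what such a per-compiler definition could look like
(the macro SKETCH_XMLCH_16BIT is made up for illustration and is not an
actual Xerces-C build switch), the default floats to wchar_t and a
self-built version can opt back into 16 bits:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical build switch, not a real Xerces-C macro: define it to force
// a 16 bit XMLCh instead of the proposed wchar_t default.
#if defined(SKETCH_XMLCH_16BIT)
typedef unsigned short XMLCh;
#else
typedef wchar_t XMLCh;
#endif

// True when XMLCh has the same storage size as wchar_t, which is the
// minimum needed to hand XMLCh text to wide character APIs untouched.
inline bool xmlchIsWcharSized() {
    return sizeof(XMLCh) == sizeof(wchar_t);
}
```

With the switch left undefined, XMLCh and wchar_t are the same type, so
the size check holds on any platform.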

Does anyone have any problem with this? The following things would be
affected:

1) DOM will be bigger on platforms with a 32 bit wchar_t, and be unchanged
on platforms with a 16 bit wchar_t. If you need to deal with mondo sized
DOMs, you might consider the pain of transcoding all text before you can
use it with the wide char APIs better than the memory usage. In these
cases, you would want to build your own version with a 16 bit XMLCh.
2) XMLCh will be passable directly to the wide character APIs (as long as
your platform/compiler assumes that wchar_t holds Unicode code points.)
2.a) XMLCh text will also be directly interoperable with any other third
party code on your platform written to use wchar_t as its character type
(as long as your platform/compiler assumes that wchar_t holds Unicode code
points.)
3) Very importantly, things like L"SomeWideString" will be directly
passable to parser APIs (as long as your compiler generates Unicode code
points for L prefixed constants.)
4) We will get rid of the StrX() helper class in all of the samples, which
will instead just pass XMLCh straight to the wide character APIs. I.e. the
samples will be written to assume the equivalence of XMLCh and wchar_t,
which would be the default anyway and would make the samples the most
understandable.
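
To illustrate points 2 through 4, here is a small sketch that assumes
XMLCh is wchar_t. The functions elementNameLength and sameName are
hypothetical stand-ins for code that receives XMLCh text from the parser;
they are not parser APIs. Only the standard wcslen and wcscmp calls are
real.

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>

typedef wchar_t XMLCh;  // assumed default under the proposal

// Hypothetical helper: XMLCh* is wchar_t* here, so the C wide character
// API accepts it directly, with no transcoding helper in between.
inline std::size_t elementNameLength(const XMLCh* name) {
    return std::wcslen(name);
}

// Hypothetical helper: compares XMLCh text against another wide string,
// which may be an L prefixed constant.
inline bool sameName(const XMLCh* a, const XMLCh* b) {
    return std::wcscmp(a, b) == 0;
}
```

A call such as elementNameLength(L"SomeWideString") compiles without any
cast or conversion, which is the convenience the samples would rely on.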


I will ensure that the code will be maintained such that it works correctly
with either a 16 or 32 bit XMLCh, so that you'll always have the choice.
This is a relatively straightforward issue in the parser, since Unicode
code points are still being manipulated regardless of the size of the unit
in which they are stored. It's handled mostly just in the course of the
transcoding work already done to internalize other encodings to the
internal XMLCh format.

Of course, code built on top of the parser might choose to pick one or the
other scheme. If they picked one, my suggestion would be to assume XMLCh is
the same as wchar_t, but of course that cannot be enforced. Much code of
course can be written such that it is not dependent in any way on the size
of the XMLCh type, and that is always desirable. But, from a binary
compatibility standpoint, it would be desirable to at least use the
default XMLCh mapping in order to work with the official binary builds of
Xerces-C and XML4C (not to mention all of the other wchar_t based code out
there.)
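
As a sketch of what size-independent code can look like, the helpers below
never assume how wide a character unit is. The names unitCount and
sameText are invented for this example, and CharT stands in for XMLCh
under either mapping.

```cpp
#include <cassert>
#include <cstddef>

// Counts the units in a null terminated string without assuming how wide
// each unit is.
template <typename CharT>
std::size_t unitCount(const CharT* text) {
    std::size_t n = 0;
    while (text[n] != 0) {
        ++n;
    }
    return n;
}

// Compares two null terminated strings unit by unit, again with no
// dependence on the unit size.
template <typename CharT>
bool sameText(const CharT* a, const CharT* b) {
    while (*a != 0 && *a == *b) {
        ++a;
        ++b;
    }
    return *a == *b;
}
```

The same source compiles whether the text is stored in 16 bit units or in
wchar_t, so code written this way survives a change in the XMLCh mapping.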

The reason for all the caveats above about whether your compiler represents
wide chars as Unicode code points is that we know so far that some HP
systems do not do this. Such systems represent a significant problem. They
use wchar_t, but don't store Unicode in it. Therefore, they will not want
to have XMLCh defined as wchar_t, because of the horrendous ambiguity it
entails. These folks will probably have to define XMLCh as a 16 bit
unsigned, and live with the fact that they must transcode all text in and
out of the parser. But such systems are definitely the odd
man out and we cannot see adding any significant complexity to the code in
order to deal with them out of the box. Unicode is the future, and we feel
that we should concentrate our efforts there. I could not imagine any
system being built at this time or in the foreseeable future not using
Unicode as its intrinsic code page.


Anyway, this is where we would like to go, and we know that a lot of people
have asked for this change so that they don't have to transcode just to
call local wide character APIs. But we want to give anyone a ch


----------------------------------------
Dean Roddey
Software Weenie
IBM Center for Java Technology - Silicon Valley
[EMAIL PROTECTED]