I've been out on vacation for a few days, but I'm back now and continuing
to work on the output formatting stuff. I'm still working on the low-level
pieces that provide the basic formatting support for things to come later.
I'm incorporating this into SAXPrint for testing, so you might want to play
with it. The current SAXPrint code in the repository has a new parameter:

     -x=encodingname

where encodingname is the output encoding you want to print the file in.
The resulting output should be in the encoding you indicate and should have
the appropriate characters escaped so that it's legal XML again.

Current limitations:

1. The ability to escape characters not representable in the target
encoding is not in there yet, so such a character is treated as an error
for now. Of course, encodings like UTF-16B, UTF-16L, or UTF-8 can
represent anything, but if you choose some other encoding, it might be a
problem.

2. It does not automatically pick up the source encoding. The reason for
this is that the plugged in display handler in this sample is just one that
delegates to cout. Obviously cout cannot handle UTF-16 or UCS-4 and such.
So, if you choose one of those, you are going to get gorp for output. So
I'm making you select the output format for now, and will auto-select UTF-8
if you don't provide one.

3. Since the SAX output doesn't let us know whether "" or '' was used on an
attribute, the output always does "" and escapes any " within the attribute
value.

4. Even though it is theoretically legal to have "&someref;" where the
content of someref has a " inside of it, the SAXPrint code is not smart
enough to keep up with nesting of entities, so it will still escape a "
that came from an entity ref inside the attribute value.
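
To make points 3 and 4 concrete, here is a minimal standalone sketch of the
"always use double quotes and escape" approach. The helper names
escapeAttrValue and formatAttr are made up for illustration and are not
part of SAXPrint or XMLFormatter; the real code may well do this
differently.

```cpp
#include <string>

// Hypothetical helper (not the SAXPrint code): escape an attribute value
// so that it can always be written between double quotes.
std::string escapeAttrValue(const std::string& value)
{
    std::string out;
    out.reserve(value.size());
    for (char ch : value)
    {
        switch (ch)
        {
            case '&':  out += "&amp;";  break;  // would start a reference
            case '<':  out += "&lt;";   break;  // not allowed in attr values
            case '"':  out += "&quot;"; break;  // would end the quoted value
            default:   out += ch;       break;
        }
    }
    return out;
}

// Hypothetical helper: emit name="value", always with double quotes,
// since SAX does not report which quote style the source used.
std::string formatAttr(const std::string& name, const std::string& value)
{
    return name + "=\"" + escapeAttrValue(value) + "\"";
}
```

Calling formatAttr("foo", "<\"") gives foo="&lt;&quot;", which is the
attribute form used in the tmp.xml example in this message.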


As an example, the file tmp.xml:

------------------------
<?xml version='1.0' encoding="ISO-8859-1"?>

<root foo="&lt;&quot;">
    &amp;&lt;
</root>
------------------------

when run like this:

------------------------
SAXPrint -x=ISO-8859-1 tmp.xml
------------------------

will come out like this:

------------------------
<?xml version='1.0' encoding="ISO-8859-1"?>

<root foo="&lt;&quot;">
    &amp;&lt;
</root>
------------------------

which is of course a legal file that can be parsed again. If you look
inside the SAXPrint handler class, you'll see that it's almost all just
delegation to an XMLFormatter object that was created by the print handler
during construction. There are some enums in XMLFormatter that indicate
what style of escaping you want, and how to deal with chars that cannot be
represented. The first flag works, but the second flag will be ignored for
now. I'll work on that next, but it's going to be a slower process (at
runtime, I mean) to use the flag that says to do unrepresentable chars as
char refs.
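
For anyone wondering what "unrepresentable chars as char refs" would look
like in the output, here is a rough standalone sketch. It is not the
XMLFormatter implementation: the function name toLatin1WithCharRefs is
hypothetical, the target is hard-wired to ISO-8859-1 (where exactly the
code points up to 0xFF are representable), and the input is assumed to be
already decoded to Unicode code points.

```cpp
#include <cstdio>
#include <string>

// Hypothetical sketch: write Unicode text as ISO-8859-1 bytes, turning any
// code point the encoding cannot hold into a numeric character reference.
std::string toLatin1WithCharRefs(const std::u32string& text)
{
    std::string out;
    for (char32_t cp : text)
    {
        if (cp <= 0xFF)
        {
            // Representable: ISO-8859-1 bytes equal the code point value.
            out += static_cast<char>(static_cast<unsigned char>(cp));
        }
        else
        {
            // Not representable: emit &#xHHHH; instead of failing.
            char buf[16];
            std::snprintf(buf, sizeof(buf), "&#x%lX;",
                          static_cast<unsigned long>(cp));
            out += buf;
        }
    }
    return out;
}
```

Note that every character has to be checked against the target encoding
before it is written, which is extra work compared with transcoding a
whole buffer in one go.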

Anyway, if you want to start playing with SAXPrint or playing with the
XMLFormatter stuff in some of your own code, feel free to do so and to give
some feedback. Just be aware that it could be a bit unstable at this early
point.

If you just want lower-level transcoding support, the formatter stuff is
based on the new two-way transcoding as well, so you might want to play
with some of that too. You can now create an XMLTranscoder for a named
encoding and use it for transcoding in both directions, which means you
can use it to transcode Unicode to your target encoding. Your only options
at this level are to either have unrepresentable chars be replaced with a
replacement char, or to have it be an error.
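
The two choices at the transcoder level can be pictured with the following
standalone sketch. This is not the XMLTranscoder API: the UnrepPolicy enum,
the transcodeToLatin1 function, and the '?' default replacement are all
assumptions made up for illustration, and the target is again fixed to
ISO-8859-1.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical policy enum: what to do with a char the target can't hold.
enum class UnrepPolicy { Replace, Fail };

// Hypothetical sketch: transcode Unicode code points to ISO-8859-1 bytes.
std::string transcodeToLatin1(const std::u32string& text,
                              UnrepPolicy policy,
                              char replacement = '?')
{
    std::string out;
    for (char32_t cp : text)
    {
        if (cp <= 0xFF)
        {
            // Representable: the byte value equals the code point.
            out += static_cast<char>(static_cast<unsigned char>(cp));
        }
        else if (policy == UnrepPolicy::Replace)
        {
            // Option 1: substitute a replacement char and keep going.
            out += replacement;
        }
        else
        {
            // Option 2: treat the unrepresentable char as an error.
            throw std::runtime_error(
                "character not representable in ISO-8859-1");
        }
    }
    return out;
}
```

With UnrepPolicy::Replace, the text "a", U+20AC, "b" comes out as "a?b";
with UnrepPolicy::Fail, the same input throws.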

----------------------------------------
Dean Roddey
Software Weenie
IBM Center for Java Technology - Silicon Valley
[EMAIL PROTECTED]