Since encoding support is an oft raised issue (like the way I used that word 'oft' to make it sound like I'm intelligent?), in this episode of Xerces-C Tech Talk, our fearless parser discusses how to tame evil encodings and make them work for you. Ok, so that was a little clichéd, but damnit Jim, I'm a software engineer, not a prose writer. Anyway, I've obviously had too much coffee, so let's move on to the subject at hand. This document describes how text encodings are used by the parser, and the steps it takes to read in an external XML entity (meaning a file, a memory buffer, or some other external representation of XML text.)

There are gazzillions of ways to represent a character or symbol of a language in binary form in storage. Well, ok, maybe there are only bazzillions, but there are still a lot of them. Some of them, such as ASCII, are very well known, since they've been around since the dawn of computerdom. But XML, being designed as the Borg of text formats, must be able to handle XML text stored in any conceivable character encoding. An XML parser must be able to figure out (usually without any help from the studio audience) what encoding each XML entity is in, and how to read that XML text into some internal format.

When the parser encounters an external entity that must be parsed, it creates an XMLReader object for that entity. XMLReader objects represent entities inside the parser, and provide the mechanisms to do all of the work we are discussing here. Each parser has a reader manager object, which is really just a stack of readers. Since an entity can reference an entity which can reference an entity and so on, the parser obviously must have a way to store what it's doing, parse a new entity, and then come back to where it left off. This stack of readers provides that capability. When the parser needs to look at the next character of input, it asks the reader manager for the next character.
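The reader stack just described can be sketched roughly like this. This is a simplified stand-in, not the real XMLReader/ReaderMgr API; all of the names here are invented for illustration:

```cpp
#include <stack>
#include <string>
#include <utility>

// Hypothetical, heavily simplified stand-ins for XMLReader and the
// reader manager; the real Xerces classes carry far more state.
struct Reader {
    std::string content;   // decoded characters for this entity
    size_t      pos = 0;   // next character to hand out
    bool atEnd() const { return pos >= content.size(); }
    char next() { return content[pos++]; }
};

class ReaderMgr {
public:
    void push(Reader r) { readers_.push(std::move(r)); }

    // Hand out the next character, popping exhausted readers so the
    // parser transparently resumes the enclosing entity.
    bool nextChar(char& out) {
        while (!readers_.empty() && readers_.top().atEnd())
            readers_.pop();
        if (readers_.empty())
            return false;      // the original reader is done: parse over
        out = readers_.top().next();
        return true;
    }
private:
    std::stack<Reader> readers_;
};
```

The point of the design is that the high level parsing code only ever sees "give me the next character"; the pushing and popping of nested entities happens entirely below that interface.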
The reader manager in turn asks the reader on the top of the stack for its next character. This way, the parser is, for the most part, unaware of the nesting of entities.

When a new reader is created for a newly referenced external entity, the reader constructor does a series of probes to figure out what the heck it's dealing with. In some cases, if the input source for that entity is provided by client code and provides an encoding override, then no work is done and the client code's word is taken on the encoding. Otherwise, the reader must figure out how to interpret the data it's been given. So the first thing the reader does is read a buffer of raw binary data from the input stream it was given. In order to drastically improve performance, each reader maintains a relatively large buffer of raw data, so that it can read from the source in large chunks. This is particularly important for socket based remote data, where a read operation might involve a good bit of work. The parser will then work out of this large raw buffer until it's empty, then fill it again.

Once it has that first buffer of raw data, the reader will try to 'auto-sense' the basic encoding. In the src/framework/ directory, there is a class called XMLRecognizer. This class contains the smarts to do the auto-sensing operation. It is provided as a public class to allow client code to auto-sense formats if they choose to. The reason I said 'basic' encoding is that each XML entity can have, in its XMLDecl or TextDecl, an encoding="" statement that says exactly what encoding it is in. However, in order to read that first line, we have to know enough about the encoding of the file to decode the first line and find out whether an encoding="" is present. The 'basic' encoding, well... basically, I guess, tells us what family of encodings the file is in. There is a small set of encodings from which most everything out there descends in one way or another.
Since the Decl lines can only contain a very limited set of characters, once we figure out the encoding family, we can get through that first Decl line without knowing the exact encoding. The families of encodings are: UTF-8/ASCII, UTF-16, UCS-4, and EBCDIC. So, if we can figure out which of these families the entity's encoding belongs to, we can get started.

In order to figure this out, we use two tricks. One is the BOM and the other is the XMLDecl/TextDecl itself. Since many XML entities start with this decl, the first characters in the file are often "<?xml ", in some encoding. If so, then the first bytes of the file will follow a known pattern. For instance, if it's in the ASCII/UTF-8 family, it will start with the bytes: 0x3C, 0x3F, 0x78, 0x6D, 0x6C, 0x20. If we see one of these patterns of bytes, we know it's in that particular encoding family. If no decl is present, then the XML spec says that the file must be in UTF-8.

However, as a convenience, many parsers will also look for a BOM, or Byte Order Mark, which is prepended to most UTF-16 text. The BOM is the value 0xFEFF, and indicates that the file is highly likely to be a UTF-16 file. Depending on the endianness of the machine that stored the data, the bytes of this value will be either 0xFE, 0xFF or 0xFF, 0xFE. This tells us the endianness of the data itself and lets us know which way to decode it. By the way, though UCS-4 has no BOM, we check for both little and big endian Decl byte sequences so that we know which endianness it is in as well. The other families are either single byte, or are single byte within the limited set of characters allowed by the XML spec within the decl line.

So, now we've either figured out that the text is in one of the basic encoding families, or we've not figured it out and assumed it's UTF-8. If we figured it out by there being an XMLDecl/TextDecl present, then we need to be able to look through it and find the encoding="" string, if it's present.
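The byte probing just described could look something like this sketch. The real XMLRecognizer handles more cases and the names here are mine, but the byte patterns are the ones given above (the EBCDIC bytes are the cp037 encoding of "<?xm"):

```cpp
#include <cstddef>

enum class EncFamily { UTF8, UTF16LE, UTF16BE, UCS4LE, UCS4BE, EBCDIC };

// Probe the first raw bytes of an entity for a BOM or a known
// encoding of the start of an XMLDecl/TextDecl.
inline EncFamily senseFamily(const unsigned char* b, size_t n) {
    // UTF-16 byte order marks: 0xFE 0xFF (big endian), 0xFF 0xFE (little).
    if (n >= 2 && b[0] == 0xFE && b[1] == 0xFF) return EncFamily::UTF16BE;
    if (n >= 2 && b[0] == 0xFF && b[1] == 0xFE) return EncFamily::UTF16LE;

    // "<?xml " in ASCII/UTF-8: 3C 3F 78 6D 6C 20
    static const unsigned char ascii[] = {0x3C, 0x3F, 0x78, 0x6D, 0x6C, 0x20};
    if (n >= 6) {
        bool hit = true;
        for (int i = 0; i < 6; ++i)
            if (b[i] != ascii[i]) hit = false;
        if (hit) return EncFamily::UTF8;
    }

    // "<?" in UTF-16 without a BOM: 00 3C 00 3F (BE) or 3C 00 3F 00 (LE).
    if (n >= 4 && b[0]==0x00 && b[1]==0x3C && b[2]==0x00 && b[3]==0x3F)
        return EncFamily::UTF16BE;
    if (n >= 4 && b[0]==0x3C && b[1]==0x00 && b[2]==0x3F && b[3]==0x00)
        return EncFamily::UTF16LE;

    // '<' in UCS-4: 00 00 00 3C (BE) or 3C 00 00 00 (LE). No BOM to help
    // us here, so we check both endian byte sequences of the decl itself.
    if (n >= 4 && b[0]==0x00 && b[1]==0x00 && b[2]==0x00 && b[3]==0x3C)
        return EncFamily::UCS4BE;
    if (n >= 4 && b[0]==0x3C && b[1]==0x00 && b[2]==0x00 && b[3]==0x00)
        return EncFamily::UCS4LE;

    // "<?xm" in EBCDIC (cp037): 4C 6F A7 94
    if (n >= 4 && b[0]==0x4C && b[1]==0x6F && b[2]==0xA7 && b[3]==0x94)
        return EncFamily::EBCDIC;

    // No recognizable pattern: per the XML spec, assume UTF-8.
    return EncFamily::UTF8;
}
```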
The way our parser does this is to manually transcode the Decl line, from within the XMLReader constructor. It puts this decoded text into another buffer, of XMLCh characters. XMLCh is the internal representation of characters inside the parser. The size of the XMLCh type can vary, but it always holds Unicode characters. Since Unicode can represent the characters of all of the possible encodings, using Unicode internally allows the parser to be written to a single character type, vastly simplifying it.

The reader constructor then returns and processing proceeds normally; after some other bookkeeping, the high level parsing code begins trying to parse this new entity and starts asking for characters. At this point, one of two things happens. If an XMLDecl/TextDecl was found and manually pre-transcoded, then the initial parsing code will have enough characters available to it to parse the Decl line. During this process, it will see any encoding="" string. When it does, it will call back into the current reader and ask it to update itself to use this new encoding. At that point, the reader will use the encoding name to create an XMLTranscoder object (which it gets from the installed transcoding service) and will store this for subsequent transcoding duties. If there was no Decl line, then no data was pre-transcoded. So, when the first character is asked for, it will be assumed that the auto-sensed encoding is the correct one, and an XMLTranscoder for that encoding will be created.

When either the end of the Decl line is hit, or the first character is requested and no Decl line was present, the installed transcoder will be used to transcode data from the reader's raw data buffer to its internalized character buffer, about 4K characters at a time. The reader then spools out characters from this internalized buffer until it becomes empty. It then tries to transcode another 4K characters from the raw buffer. If the raw buffer empties out, then it too is refilled.
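The two-buffer flow above can be sketched as follows, using Latin-1 as a stand-in for the installed XMLTranscoder and char16_t for XMLCh. The class name is invented and the real chunk size is about 4K; it is shrunk in the usage example to force several refills:

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Illustrative pipeline: raw bytes are transcoded a chunk at a time into
// an internal character buffer, and the parser spools characters from
// that buffer until it empties, then another chunk is transcoded.
class CharSpool {
public:
    explicit CharSpool(std::vector<unsigned char> raw, size_t chunk = 4096)
        : raw_(std::move(raw)), chunk_(chunk) {}

    bool nextChar(char16_t& out) {
        if (pos_ == chars_.size() && !refill())
            return false;          // raw data exhausted: reader is done
        out = chars_[pos_++];
        return true;
    }
private:
    // Transcode up to 'chunk_' more source bytes. Latin-1 maps each
    // byte directly to the Unicode code point of the same value.
    bool refill() {
        chars_.clear();
        pos_ = 0;
        while (rawPos_ < raw_.size() && chars_.size() < chunk_)
            chars_.push_back(static_cast<char16_t>(raw_[rawPos_++]));
        return !chars_.empty();
    }
    std::vector<unsigned char> raw_;
    std::vector<char16_t> chars_;
    size_t chunk_, pos_ = 0, rawPos_ = 0;
};
```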
Eventually, no more raw data can be gotten, hence no more characters can be gotten, so the parser knows that this reader is all washed up. It pops the reader off the top of the reader manager's stack and starts working again with the previous reader. This whole sequence repeats itself again and again until the end of the original reader is hit, at which time the end of the parse is at hand, and there is much rejoicing and feasting if it all went well.

When an encoding="" string is seen, and the parser calls back to the reader with the encoding name, the reader will ask the currently installed transcoding service (each per-platform utilities file can choose which one it wants to install) to create an XMLTranscoder for that encoding. This call first goes to the platform independent part of the transcoding service. This code will first check whether the encoding is one of the intrinsically supported encodings. If so, it will short circuit the request and return one of the XMLTranscoder derivatives from the src/util directory. As of 3.1.0, the parser intrinsically supports UTF-8, ASCII, UTF-16, UCS-4, EBCDIC-US, and ISO-8859-1 (Latin1.) If the encoding is not one of these, then the per-platform transcoding service is asked to create a transcoder for the encoding. If it cannot or does not, then the parser will issue an error that a transcoder for encoding X cannot be created. If the encoding string is one of a set of endian ambiguous encodings, such as UTF-16 or CP1200 or UCS-4, then the previously sensed endianness is used to know which flavor of transcoder to create. Where possible, you should always use an endian specific name, such as "UTF-16LE", so that there is no ambiguity.

Related Random Facts: This little addendum is not strictly related to the topic, but it's interesting so I threw it in. Even if the sequence of referenced entities creates a legal sequence of XML characters, that's not necessarily good enough.
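Before getting to that addendum, here is a rough sketch of the endian resolution just described. The function name and the treatment of unknown names are assumptions for illustration, not the real transcoding service lookup:

```cpp
#include <cctype>
#include <string>

// Illustrative only: resolve endian-ambiguous encoding names ("UTF-16",
// "CP1200", "UCS-4") using the endianness sensed during auto-detection,
// and pass unambiguous names through untouched.
inline std::string resolveEncoding(std::string name, bool sensedBigEndian) {
    // Encoding names are matched case insensitively.
    for (char& ch : name)
        ch = static_cast<char>(std::toupper(static_cast<unsigned char>(ch)));

    if (name == "UTF-16" || name == "CP1200")
        return sensedBigEndian ? "UTF-16BE" : "UTF-16LE";
    if (name == "UCS-4")
        return sensedBigEndian ? "UCS-4BE" : "UCS-4LE";

    // Unambiguous names need no help; anything non-intrinsic would be
    // handed to the per-platform transcoding service at this point.
    return name;
}
```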
The XML spec requires that all entities contain 'balanced markup'. So you cannot have the opening quote of a string in one entity and the closing quote in another. You cannot have the start tag of an element in one entity, and the end tag of that element in another entity. You cannot have the open angle bracket of an element decl in one entity and the close bracket in another, and so on. So, the parser has to do a lot of checking to ensure that these rules are met.

The Xerces parser does this by assigning a unique id to each new reader that is created. It uses a synchronized counter (a static member of the reader manager) to create a new id, which it sets on each reader that it creates and pushes onto the stack. So now, for instance, when the high level parsing code sees a < character in the DTD and knows it's about to start parsing some markup decl, it can ask the reader manager for the id of the current reader, which it stores in a local variable. When it gets to the > character, it gets the id again. If it does not get back the same id, it knows that the entity is unbalanced, and it issues a 'Partial Markup' error.

----------------------------------------
Dean Roddey
Software Weenie
IBM Center for Java Technology - Silicon Valley
[EMAIL PROTECTED]