Xerces-C Tech Talk: Input Sources

roddey 27 Jan 2000 20:23:21 -0000

Just a little refresher for how input sources are used in the C++ parser,
for those new to Xerces-C and covering the recent changes for those who
have been using it.

The parser works completely in terms of 'input sources', as defined by the
SAX API. An input source is a very abstract concept that just represents
some source of XML data to be parsed. Where that data is and how it's
gotten to is something that the parser doesn't care about. It just wants to ask
for data from an XML entity, so that it can parse it. The input source
presents that '10,000 foot level' view of XML data that the parser cares
about.

The input source has a couple of jobs:

1) It holds the SYSTEM and PUBLIC ids that represent the XML data source.
These may be parsed from an external entity declaration in the DTD (e.g.
<!ENTITY FooBar SYSTEM "http://booger.com/path/someentity.xml">), or they
may be provided programmatically by the client program.

2) It acts as a factory for input streams that can read data from the
source. The parser works in terms of input streams; in the C++ parser,
these derive from the abstract class BinInputStream. When the parser wants
to get to the data represented by an input source, it asks the input source
object to create a new stream that can access that data. The parser doesn't
care about the details of how that data gets read. It just knows that it
has a stream that is capable of reading it.

3) It allows you to force the encoding. Usually, the parser figures out
what encoding an XML entity is in by poking around in the XML data. It
looks for key beginning sequences and for the encoding="" statement to
figure this out. However, if you know you are parsing XML with a particular
encoding, you can force the encoding up front by calling setEncoding() on
the input source before you give it to the parser to parse. The parser will
then skip any internal auto-sense of the encoding and just take your word
for it. This might be desirable if you know the file has been transcoded
such that the encoding="" line is no longer correct.


Since the base InputSource class is abstract, you must create a derived
class that knows how to handle some sort of data source. The parser (as of
the upcoming 3.1.0/1.1.0 release) comes with:

1 - LocalFileInputSource - For files on the local file system. This is the
optimized way to parse local files.
2 - MemBufInputSource - For parsing from in-memory buffers (which you
might have built programmatically or just read into memory from some
other source).
3 - URLInputSource - For parsing from some URL based source.

In many cases, you will just call the parser like this:

    myParser.parse("SomePathOrURL");

Since the parser works only in terms of input sources, this type of API is
merely a convenience for you. Internally, the parser must try to figure out
if this is a local file or a URL and create the correct type of input
source to use. If you know what type of entity you are parsing, it will
generally be more efficient to create the input source yourself, to avoid
this ambiguity.

What the parser does to figure this out itself is:

1 - Try to parse it as a URL.
2 - If this works, assume it's a URL and create a URLInputSource
3 - Else, assume it's a file and give it our best shot with a
LocalFileInputSource
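
As a rough illustration of that decision, here is a self-contained C++
sketch. It is not the parser's actual logic: looksLikeUrl() and
chooseSourceKind() are hypothetical helpers, and "parses as a URL" is
approximated by nothing more than checking for a leading scheme followed by
"://", which is an assumption made to keep the example short.

```cpp
#include <cassert>
#include <cctype>
#include <string>

enum class SourceKind { Url, LocalFile };

// Crude stand-in for "try to parse it as a URL": accept only strings that
// start with a scheme (letter, then letters/digits/+/-/.) followed by "://".
bool looksLikeUrl(const std::string& id) {
    const std::string::size_type sep = id.find("://");
    if (sep == std::string::npos || sep == 0) return false;
    if (!std::isalpha(static_cast<unsigned char>(id[0]))) return false;
    for (std::string::size_type i = 1; i < sep; ++i) {
        const unsigned char c = static_cast<unsigned char>(id[i]);
        if (!std::isalnum(c) && c != '+' && c != '-' && c != '.') return false;
    }
    return true;
}

// Steps 1 to 3 above: URL if it looks like one, otherwise treat it as a file.
SourceKind chooseSourceKind(const std::string& id) {
    return looksLikeUrl(id) ? SourceKind::Url : SourceKind::LocalFile;
}
```

If you already know which kind you have, building the input source yourself
skips this guess entirely.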


When any XML entity references another entity, it can choose to either give
a fully qualified path or URL, or a partially qualified path or URL. Fully
qualified SYSTEM ids are taken at face value and used as is. Relative ids
are assumed to be relative to the path of the entity which references them.
So, if "foo.xml" references "../bar.xml" which references "../baz.xml",
then baz.xml will be two directory levels above foo.xml. This is because
../bar.xml is relative to foo.xml, so it's one level up, and ../baz.xml is
relative to bar.xml, so it's up another level.
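
The foo/bar/baz chain can be worked through with a small self-contained
C++ sketch. resolveAgainst() is a hypothetical helper, not something from
the parser, and it assumes plain '/'-separated paths with only "." and ".."
segments to normalize (no URL schemes, no drive letters).

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Resolve `relative` against the directory that contains `base`.
std::string resolveAgainst(const std::string& base, const std::string& relative) {
    // A fully qualified id is taken at face value.
    if (!relative.empty() && relative[0] == '/') return relative;

    // Keep everything in `base` up to and including its last '/'.
    const std::string::size_type slash = base.rfind('/');
    const std::string dir =
        (slash == std::string::npos) ? std::string() : base.substr(0, slash + 1);
    const std::string joined = dir + relative;
    const bool absolute = !joined.empty() && joined[0] == '/';

    // Split on '/', folding "." and ".." segments as we go.
    std::vector<std::string> parts;
    std::string::size_type start = 0;
    while (start <= joined.size()) {
        std::string::size_type end = joined.find('/', start);
        if (end == std::string::npos) end = joined.size();
        const std::string seg = joined.substr(start, end - start);
        if (seg == "..") {
            if (!parts.empty() && parts.back() != "..") parts.pop_back();
            else if (!absolute) parts.push_back("..");
        } else if (!seg.empty() && seg != ".") {
            parts.push_back(seg);
        }
        start = end + 1;
    }

    std::string out = absolute ? "/" : "";
    for (std::size_t i = 0; i < parts.size(); ++i) {
        if (i != 0) out += '/';
        out += parts[i];
    }
    return out;
}
```

Resolving "../bar.xml" against "/docs/a/b/foo.xml" gives "/docs/a/bar.xml",
and resolving "../baz.xml" against that result gives "/docs/baz.xml", which
is two directory levels above foo.xml.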

* Note that this did not work correctly in versions prior to the upcoming
3.1.0/1.1.0 release. In previous versions, this was incorrectly implemented
such that all relative references were relative to the primary document.
So, though this is a change in semantics, it is in the direction of
correctness.


You can play many tricks with input sources. Since writing your own input
source allows you to insert your own types of binary input streams into the
parser, it allows you to get data into the parser from any source which can
supply that data. You could have an input source for a database that
extracts a record into memory and returns a memory input stream to parse
it. You could have an input source which gets a mail message from another
machine, pulls some XML payload from it, and returns a memory input stream
to parse it. You can have an input source that effectively parses a
continuous stream of data which is broken into legal chunks of XML: your
stream class recognizes some embedded end-of-entity marker and stops
reading there, and your input source just keeps handing out a handle to the
same source of data to each successive input stream requested by the
parser.
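
Here is a minimal, self-contained C++ sketch of that last trick. It does
not use the Xerces-C classes: ChunkStream and ChunkedSource are
hypothetical stand-ins for a BinInputStream and InputSource pair, the
single-byte marker is an assumption, and readAll() just plays the role of
the parser draining a stream.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <memory>
#include <sstream>
#include <string>

// Reads from a shared std::istream and reports end of data as soon as it
// sees the marker byte (or the underlying stream runs dry).
class ChunkStream {
public:
    ChunkStream(std::istream& shared, char marker)
        : shared_(shared), marker_(marker) {}

    std::size_t readBytes(unsigned char* toFill, std::size_t maxToRead) {
        std::size_t n = 0;
        while (!done_ && n < maxToRead) {
            const int c = shared_.get();
            if (c == std::char_traits<char>::eof() ||
                static_cast<char>(c) == marker_) {
                done_ = true;  // the marker is consumed, never handed out
            } else {
                toFill[n++] = static_cast<unsigned char>(c);
            }
        }
        return n;
    }

private:
    std::istream& shared_;
    char marker_;
    bool done_ = false;
};

// Hands out a new ChunkStream over the same underlying data each time.
class ChunkedSource {
public:
    ChunkedSource(std::istream& shared, char marker)
        : shared_(shared), marker_(marker) {}
    std::unique_ptr<ChunkStream> makeStream() {
        return std::unique_ptr<ChunkStream>(new ChunkStream(shared_, marker_));
    }
private:
    std::istream& shared_;
    char marker_;
};

// Stand-in for the parser: drain one stream until it reports end of data.
std::string readAll(ChunkStream& s) {
    std::string out;
    unsigned char buf[8];
    std::size_t n;
    while ((n = s.readBytes(buf, sizeof(buf))) > 0)
        out.append(reinterpret_cast<const char*>(buf), n);
    return out;
}
```

Each stream picks up exactly where the previous one stopped, so successive
parses see successive documents. A real marker would have to be something
that can never occur inside the XML itself.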

Be aware that, as the code stands right now, if you force the encoding on
an entity, all internal smarts about the encoding are skipped. So, if you
force the encoding to UTF-16LE, and there is a BOM, the parser won't try to
skip it, and the parse will fail. So currently, if you do this, skip the
BOM yourself. If there is a BOM, and you don't know the endianness, check
the BOM and set the appropriate encoding for that endianness. The reasoning
here is that by your forcing the encoding, you are telling the parser to
stay out of it. If you installed your own transcoder, and set an encoding
of "foobar", the parser wouldn't have any idea what that meant and whether
an 0xFFFE at the start of the file is a BOM or not. Any attempt to second
guess by special casing some forced encoding names would only be a partial
solution. There is currently some discussion on this issue with one of our
customers. If you have any thoughts or opinions on this, please post them.
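
For the "check the BOM yourself" case, something along these lines would
do as a starting point. It is a self-contained sketch: BomInfo and
sniffUtf16Bom() are hypothetical names, and it only looks for the two
UTF-16 byte order marks.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

struct BomInfo {
    std::string encoding;  // empty when no UTF-16 BOM was found
    std::size_t skip;      // number of leading bytes to drop before parsing
};

// Inspect the first two bytes of a buffer for a UTF-16 byte order mark.
// U+FEFF serialized little-endian is FF FE; big-endian is FE FF.
// Note: this does not distinguish UTF-16LE from UTF-32LE (FF FE 00 00).
BomInfo sniffUtf16Bom(const unsigned char* data, std::size_t len) {
    if (len >= 2) {
        if (data[0] == 0xFF && data[1] == 0xFE) return {"UTF-16LE", 2};
        if (data[0] == 0xFE && data[1] == 0xFF) return {"UTF-16BE", 2};
    }
    return {"", 0};
}
```

You would then hand the parser the buffer starting at data + skip (with
the length reduced to match) and force the reported encoding on the input
source, so the parser never sees the BOM.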


----------------------------------------
Dean Roddey
Software Weenie
IBM Center for Java Technology - Silicon Valley
[EMAIL PROTECTED]