Re: SAX parser: Multiple documents from a single stream

roddey 11 Jan 2000 22:21:59 -0000

Use of XML parsers with such a continuous stream has been much discussed,
and as far as I know there hasn't really been a solution. In the C++
parser, we have no built-in way to deal with such a thing.

Part of the problem arises from the fact that, in many cases, it's
impossible to tell the end of one document from the start of another. For
instance:

<doc1>Some text</doc1>

<!-- Who does this belong to? -->

<doc2>Some text</doc2>

The comment in the middle can legally be either trailing miscellaneous
content of the first document or prolog of the second. Even if each one of
them starts with an XML decl, the parser can't be expected to guess where
it's supposed to stop parsing in a continuous stream. How does it know that
the XML decl is not just a malformed PI or an incorrectly placed XMLDecl of
some sort?

The only way I could see that it would work, for the C++ parser anyway, is:

1) Create your own ContinuousInputSource class, derived from InputSource.
This will give you a chance to create your own types of streams, since
input sources are the factories for input streams. So it will have a data
member that holds the info for the source of the data, some sort of handle
one would assume, that it gives to each new stream it creates.

2) Create a special ContinuousBinInputStream, derived from our
BinInputStream. Each instance of this class just streams from the data
source given to it by the input source class upon creation. It does not try
to own it or close it; it just streams from it.

3) There will be some marker in the incoming data. When the stream hits
this, it quits returning data, which causes the parse to quit. The parser
will now close down that stream, but that's ok because the stream doesn't
try to close down the source of the data. It just leaves it pointing at
wherever it was last left.

4) Now, if the input source class just keeps giving back new streams that
work on the same actual source of data, the next stream it gives back will
just start spooling data out of that same input where the last one left
off. So the next parse operation will start parsing from the point in the
data source where the previous one stopped, and will stop when it too hits
the separator marker.


So the trick is to have something in the data stream that will trigger
your input stream class to stop returning data and claim that the input has
ended. It will then skip over that separator and be ready for the next
stream to be created.
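
To make the separator idea concrete, here is a minimal sketch that uses only
the standard library, so it can be tried without the parser. Everything in it
is an assumption for illustration: MarkerBoundedStream is a hypothetical
stand-in for the BinInputStream-derived class (it does not derive from the
real one), the separator is assumed to sit on a line of its own, and
drainOneDocument stands in for one parse call.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <utility>

// Hypothetical stand-in for the BinInputStream-derived class. It borrows the
// shared data source by reference and never closes or owns it.
class MarkerBoundedStream
{
public:
    MarkerBoundedStream(std::istream& src, std::string marker)
        : fSrc(src), fMarker(std::move(marker)), fPos(0), fDone(false)
    {
        assert(!fMarker.empty());
    }

    // Copy up to maxToRead bytes into toFill. Returns 0 once the separator
    // line (or the real end of the source) has been reached.
    std::size_t readBytes(char* toFill, std::size_t maxToRead)
    {
        std::size_t count = 0;
        while (count < maxToRead)
        {
            if (fPos == fPending.size() && (fDone || !refill()))
                break;
            toFill[count++] = fPending[fPos++];
        }
        return count;
    }

private:
    // Pull the next line from the shared source. A separator line is consumed
    // and dropped, so the source is left just past it for the next stream.
    bool refill()
    {
        std::string line;
        if (!std::getline(fSrc, line) || line == fMarker)
        {
            fDone = true;
            return false;
        }
        fPending = line + '\n';
        fPos = 0;
        return true;
    }

    std::istream& fSrc;
    std::string   fMarker;
    std::string   fPending;
    std::size_t   fPos;
    bool          fDone;
};

// Stand-in for one parse: make a fresh stream over the shared source and
// read from it until it claims the input has ended.
std::string drainOneDocument(std::istream& src, const std::string& marker)
{
    MarkerBoundedStream strm(src, marker);
    std::string doc;
    char buf[8];
    std::size_t got;
    while ((got = strm.readBytes(buf, sizeof buf)) != 0)
        doc.append(buf, got);
    return doc;
}
```

Calling drainOneDocument repeatedly on the same std::istream hands back one
document per call, because each new stream object picks up wherever the
previous one left the source. A separator that can show up in the middle of a
line would need a real matching scheme instead of the whole-line comparison
used here.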

//
//  Create a continuous input source object and give it a handle
//  to the data source.
//
ContinuousInputSource curSrc(someHandle);

// All the work to create and set up the parser...
SAXParser parserToUse;

while (true)
{
    //
    //  Do the parse with the same input source over and over. Your
    //  input stream class could throw an exception to get out of the
    //  loop when the data really ended.
    //
    parserToUse.parse(curSrc);
}
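
As a rough illustration of getting out of that loop, the sketch below uses
only the standard library. The names are all hypothetical: SourceExhausted is
an exception type invented for this example, readOneDocument stands in for a
single parse, and the separator is again assumed to be a line of its own. How
an exception thrown from a real input stream travels back out of the parser
is not something this sketch shows.

```cpp
#include <istream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical exception: the shared data source has nothing left at all.
struct SourceExhausted : std::runtime_error
{
    SourceExhausted() : std::runtime_error("data source exhausted") {}
};

// Stand-in for one parse: collect lines up to the separator line.
// Throws SourceExhausted if the source is already at its end.
std::string readOneDocument(std::istream& src, const std::string& marker)
{
    std::string doc;
    std::string line;
    bool sawAnything = false;
    while (std::getline(src, line))
    {
        sawAnything = true;
        if (line == marker)
            return doc;
        doc += line;
        doc += '\n';
    }
    if (!sawAnything)
        throw SourceExhausted();
    return doc;  // last document, not followed by a separator
}

// The "parse over and over" loop, ended by the exception.
std::vector<std::string> readAllDocuments(std::istream& src,
                                          const std::string& marker)
{
    std::vector<std::string> docs;
    try
    {
        while (true)
            docs.push_back(readOneDocument(src, marker));
    }
    catch (const SourceExhausted&)
    {
        // Normal termination: the data really ended.
    }
    return docs;
}
```

The point of the separate exception type is that "this document ended" and
"there is no more data" are different events; the first just ends one pass of
the loop, the second ends the loop.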


In the ContinuousInputSource's factory method, it would be something like
this. It would just create a new input stream, of your own derived type,
and give it a copy of the handle it should stream data from.

BinInputStream* ContinuousInputSource::makeStream() const
{
    return new ContinuousBinInputStream(fSrcHandle);
}


----------------------------------------
Dean Roddey
Software Weenie
IBM Center for Java Technology - Silicon Valley
[EMAIL PROTECTED]



Adrian Brogan <[EMAIL PROTECTED]> on 01/11/2000 07:15:28 AM

Please respond to [EMAIL PROTECTED]

To:   "'[EMAIL PROTECTED]'" <[EMAIL PROTECTED]>
cc:
Subject:  SAX parser: Multiple documents from a single stream



If the SAX parser is taking input from a continuous stream of data, is
there a way to configure/use/get the SAX parser to stop at the start of
each new document on the stream?

I.e., what I want to do is parse one document, then do some of my own
processing on it, then return to the SAX parser and repeat this process.

Thanks

Adrian
-----------------------------------------------------------------------
Adrian Brogan (Development Team Leader)
E-mail: [EMAIL PROTECTED]