Re: RE: how to access the raw text that generated a sax event

tedsandler Tue, 23 Apr 2002 13:31:03 -0700


> 
> From: Dean Roddey <[EMAIL PROTECTED]>
> Date: 2002/04/23 Tue PM 03:33:45 EDT
> To: [EMAIL PROTECTED]
> Subject: Re: RE: how to access the raw text that generated a sax event
> 
> The source offset stuff is always relative to the entity, so if you have
> internal or external entity references and such, you are going to have to
> keep up with that fact. So if an entity reference to an internal general
> entity contains elements (and it pretty much has to contain whole elements),
> those offsets will be relative to that entity's base, so you'd have to have
> access to the entity definitions to get back to the raw stuff.
> 
> Also, if you store away raw offsets, it might not make much sense to you
> because there will be no expansion of entities of any sort in that raw data,
> so if you go back and look at it, you will have to be responsible for entity
> expansion if you want to know what its real content is. And what if those
> entities are external? You'll have to go through the DTD info and find the
> references to those external entities and expand those (which have entity
> references in them, etc...), i.e. you'll end up writing a small XML parser if
> the data contains much in the way of entity references.


That's what I was afraid of.  Here's the problem though:

I am working on a system that will be responsible for splitting large XML files into 
record sized chunks.  These chunks will be handed off to end-users who want the option 
of parsing them with whatever parser they choose.

Furthermore, what constitutes a record is not known ahead of time, other than that a 
record will be contained within a single element.  The end users must be free to 
declare, at runtime, what elements delimit a record.

As far as I can tell, there are 2 ways of meeting these requirements:

1) stream the raw text of the records to the end-users.

2) catch the parsed XML with the various handlers and reconstitute it as XML so it can
be forwarded on to the end-users.

I thought that the first method would be easier, faster, and more elegant as it would 
prevent munging of the original text (i.e. prevent <emptyTag abc="1" def="2"/> from 
being changed into <emptyTag abc="1" def="2"></emptyTag> and vice versa).  However, 
it's now starting to seem like I was mistaken.
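
To make method 2 concrete, here is a minimal sketch of re-serializing parser events into per-record chunks. It is deliberately independent of any real parser: RecordSplitter, Attrs, and escapeXml are hypothetical names invented for this illustration, and the three callbacks only mimic the shape of SAX-style element and character events. The set of record element names is supplied at runtime, as the requirements demand.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical attribute list: (name, value) pairs in document order.
using Attrs = std::vector<std::pair<std::string, std::string>>;

// Escape characters that must not appear literally in XML text or in
// double-quoted attribute values.
std::string escapeXml(const std::string& in, bool inAttribute) {
    std::string out;
    for (char c : in) {
        switch (c) {
            case '&': out += "&amp;"; break;
            case '<': out += "&lt;"; break;
            case '>': out += "&gt;"; break;
            case '"':
                if (inAttribute) out += "&quot;"; else out += c;
                break;
            default: out += c;
        }
    }
    return out;
}

// Hypothetical sink for SAX-style callbacks. Everything inside a record
// element is written back out as XML; one string is collected per record.
class RecordSplitter {
public:
    explicit RecordSplitter(std::set<std::string> recordNames)
        : recordNames_(std::move(recordNames)) {
        assert(!recordNames_.empty());  // at least one delimiter is required
    }

    void startElement(const std::string& name, const Attrs& attrs) {
        // Ignore markup until a record-delimiting element opens.
        if (depth_ == 0 && recordNames_.count(name) == 0) return;
        ++depth_;
        current_ += "<" + name;
        for (const auto& a : attrs)
            current_ += " " + a.first + "=\"" + escapeXml(a.second, true) + "\"";
        current_ += ">";
    }

    void characters(const std::string& text) {
        if (depth_ > 0) current_ += escapeXml(text, false);
    }

    void endElement(const std::string& name) {
        if (depth_ == 0) return;  // closing tag outside any record
        current_ += "</" + name + ">";
        if (--depth_ == 0) {      // the record element itself just closed
            records_.push_back(current_);
            current_.clear();
        }
    }

    const std::vector<std::string>& records() const { return records_; }

private:
    std::set<std::string> recordNames_;
    std::string current_;
    std::vector<std::string> records_;
    int depth_ = 0;
};
```

The sketch also shows the munging mentioned above: an empty element always comes back as a start tag plus an end tag, and anything not delivered through these three callbacks (comments, processing instructions, CDATA boundaries, the original entity references) is absent from the output unless further callbacks are handled.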


 
> Anyway, the whole concept of getting back to the original raw XML text is
> counter to what an XML parser is supposed to do, so it's never going to be
> easy because it wasn't designed to make that easy or useful to do. I always
> argued that we never even put any of that stuff in there, but some big
> customers forced us to at least allow it. But it's kludgy and it adds
> overhead to the input streaming system to have to keep track of how many
> bytes each transcoded XML character eats from the input.


Agreed, but unfortunately, that doesn't change the requirements of what I've been 
asked to do.

As such, my questions are the following:

a) Is there any way to implement method #1 above?  If so, what are the risks?  What 
might it get wrong?  (my sense is that it could lead to unexpandable entities for the 
end-users.)

b) How would you implement method #2 above?  What are its risks?
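
For question (a), the risk of unexpandable entities can at least be detected before a chunk is handed off. The sketch below assumes the whole document is a single entity held in one buffer and that the begin and end byte offsets of a record have already been obtained somehow; sliceRecord and unresolvedEntityRefs are hypothetical helpers, not part of any parser API. The scan is naive: it does not skip comments or CDATA sections, where an ampersand is literal text.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Copy the bytes [begin, end) out of the original document buffer.
std::string sliceRecord(const std::string& doc, std::size_t begin, std::size_t end) {
    assert(begin <= end && end <= doc.size());
    return doc.substr(begin, end - begin);
}

// Return the names of general entity references in `chunk` that a parser
// could not resolve without the original DTD, i.e. everything except the
// five predefined entities and numeric character references.
std::vector<std::string> unresolvedEntityRefs(const std::string& chunk) {
    static const std::set<std::string> predefined = {
        "amp", "lt", "gt", "apos", "quot"};
    std::vector<std::string> found;
    std::size_t pos = 0;
    while ((pos = chunk.find('&', pos)) != std::string::npos) {
        std::size_t semi = chunk.find(';', pos + 1);
        if (semi == std::string::npos) break;  // malformed tail, stop scanning
        std::string name = chunk.substr(pos + 1, semi - pos - 1);
        // Names starting with '#' are character references such as &#169;.
        if (!name.empty() && name[0] != '#' && predefined.count(name) == 0)
            found.push_back(name);
        pos = semi + 1;
    }
    return found;
}
```

A chunk for which unresolvedEntityRefs returns a non-empty list cannot be parsed standalone; it would need either the relevant entity declarations shipped alongside it or a fallback to method 2 for that record.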

Many thanks for your help,
-ted sandler





> 
> --------------------------
> Dean Roddey
> The Charmed Quark Controller
> Charmed Quark Software
> [EMAIL PROTECTED]
> http://www.charmedquark.com
> 
> "If it don't have a control port, don't buy it!"
> 
> 
> ----- Original Message -----
> From: <[EMAIL PROTECTED]>
> To: <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
> Sent: Tuesday, April 23, 2002 11:13 AM
> Subject: Re: RE: how to access the raw text that generated a sax event
> 
> 
> >
> > The other potential solution I've found is the XMLScanner's "getSrcOffset"
> > method.  My only fear in using it is that it will give weird results if an
> > XML document is comprised of more than 1 entity.
> >
> > Does "getSrcOffset" treat the document as a continuous sequence of bytes,
> > or is it more low-level than that?
> >
> > -ted
> >
> >
> > >
> > > From: "Murphy, James" <[EMAIL PROTECTED]>
> > > Date: 2002/04/23 Tue PM 02:01:40 EDT
> > > To: "'[EMAIL PROTECTED]'" <[EMAIL PROTECTED]>
> > > Subject: RE: how to access the raw text that generated a sax event
> > >
> > > Looking through the source...
> > >
> > > BinInputStream::curPos() const; looks promising since the built in input
> > > sources actually implement it!  So you should be able to call this in your
> > > SAX event handler methods if you provide your event handler class with the
> > > InputSource you use to parse.
> > >
> > > I haven't tried it yet but I think it will just work.
> > >
> > > Jim
> > >
> > >
> > >
> > >
> > > > -----Original Message-----
> > > > From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
> > > > Sent: Tuesday, April 23, 2002 1:33 PM
> > > > To: [EMAIL PROTECTED]
> > > > Subject: Re: how to access the raw text that generated a sax event
> > > >
> > > >
> > > > "Murphy, James" <[EMAIL PROTECTED]> writes:
> > > >
> > > > > I thought this would be really handy when parsing from a continuous
> > > > > buffer like a MemBufInputSource or a LocalFileInputSource.  I have a
> > > > > situation where I SAX parse _very_ large XML instances looking for
> > > > > small repeating fragments.  These fragments are operated on
> > > > > individually by making a DOM and operating on those nodes in all
> > > > > sorts of application-defined ways.
> > > > >
> > > > > If I had the functionality described by Ted, I could SAX the file and
> > > > > save off the starting and ending offsets into the large document.  Post
> > > > > that info to a thread pool to process the fragments asynchronously.
> > > > > In fact, I can use my Win32 memory mapped file input source to SAX the
> > > > > original large file and serve as a source to the DOM parser during the
> > > > > per work item processing.  The way I'm doing it now involves _way_ too
> > > > > many buffer copies to be really fast - but it could be.
> > > >
> > > > Hey Jim,
> > > >
> > > > I agree. For the MAGE object model we're going to be routinely parsing
> > > > big chunks of scientific data, maybe 0.5 Gb => 2.0 Gb, and looking for
> > > > certain pieces of the data. I'd like to be able to do lazy parsing,
> > > > and just store the byte offsets to the bits that I want.
> > > >
> > > > There *has* to be some easy modification that we can make to subclass
> > > > InputSource or XMLScanner to get this working. I don't know enough
> > > > about the internals of how the scanner works, but if someone can clue
> > > > me in a bit, I'd be happy to implement this.
> > > >
> > > > jas.
> > > >
> > > > ---------------------------------------------------------------------
> > > > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > > > For additional commands, e-mail: [EMAIL PROTECTED]
> > > >