Re: RE: how to access the raw text that generated a sax event

Dean Roddey Tue, 23 Apr 2002 12:17:55 -0700

The source offset stuff is always relative to the entity, so if you have
internal or external entity references and such, you are going to have to
keep track of that fact. So if an internal general entity that you reference
contains elements (and it pretty much has to contain whole elements),
those offsets will be relative to that entity's base, so you'd have to have
access to the entity definitions to get back to the raw stuff.
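
To make that bookkeeping concrete, here is a minimal C++ sketch that records
a position as an (entity, offset) pair instead of a bare offset. SourcePos,
EntityTable and rawSlice are hypothetical names made up for illustration and
are not part of the Xerces API. The sketch also assumes you have already
collected the raw text of each entity yourself.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <optional>
#include <string>

// Hypothetical record: an offset only means something together with the
// entity it was measured in.
struct SourcePos {
    std::string entity;  // entity name; "" stands for the document entity
    std::size_t offset;  // offset from the start of that entity's raw text
};

// Hypothetical table: entity name -> raw, unexpanded text of that entity.
using EntityTable = std::map<std::string, std::string>;

// Returns the raw text in [begin.offset, end.offset) of one entity, or
// std::nullopt if the positions are in different entities, the entity is
// unknown, or the range does not fit inside the entity's text.
std::optional<std::string> rawSlice(const EntityTable& entities,
                                    const SourcePos& begin,
                                    const SourcePos& end) {
    if (begin.entity != end.entity) return std::nullopt;
    assert(begin.offset <= end.offset);  // caller error, not a data problem
    auto it = entities.find(begin.entity);
    if (it == entities.end()) return std::nullopt;
    const std::string& text = it->second;
    if (end.offset > text.size()) return std::nullopt;
    return text.substr(begin.offset, end.offset - begin.offset);
}
```

With this layout, offset 16 in an entity named "chunk" and offset 16 in the
document entity are two different places, so a start/end pair is only usable
when both ends carry the same entity name.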


Also, if you store away raw offsets, it might not make much sense to you
because there will be no expansion of entities of any sort in that raw data,
so if you go back and look at it, you will have to be responsible for entity
expansion if you want to know what its real content is. And what if those
entities are external? You'll have to go through the DTD info and find the
references to those external entities and expand those (which may have entity
references in them, etc...), i.e. you'll end up writing a small XML parser if
the data contains much in the way of entity references.
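
As a rough illustration of the work that lands on you, here is a C++ sketch
that expands internal general entity references in a piece of raw text.
expandEntities and EntityTable are hypothetical names, not Xerces calls. It
deliberately ignores external entities, parameter entities and character
references, which is exactly the part that turns this into a small parser.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical table: entity name -> raw replacement text.
using EntityTable = std::map<std::string, std::string>;

// Replaces every "&name;" whose name is in the table with the (recursively
// expanded) replacement text. Unknown references are copied through
// unchanged. depthLeft bounds the recursion so a self-referencing entity
// cannot loop forever.
std::string expandEntities(const std::string& raw,
                           const EntityTable& entities,
                           int depthLeft = 8) {
    assert(depthLeft >= 0);
    std::string out;
    std::size_t i = 0;
    while (i < raw.size()) {
        if (raw[i] == '&' && depthLeft > 0) {
            const std::size_t semi = raw.find(';', i + 1);
            if (semi != std::string::npos) {
                auto it = entities.find(raw.substr(i + 1, semi - i - 1));
                if (it != entities.end()) {
                    out += expandEntities(it->second, entities, depthLeft - 1);
                    i = semi + 1;
                    continue;
                }
            }
        }
        out += raw[i++];  // ordinary character, or a reference we cannot expand
    }
    return out;
}
```

Note that once text has been expanded this way, offsets into the result no
longer line up with offsets into the raw data.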

Anyway, the whole concept of getting back to the original raw XML text is
counter to what an XML parser is supposed to do, so it's never going to be
easy because it wasn't designed to make that easy or useful to do. I always
argued that we should never even put any of that stuff in there, but some big
customers forced us to at least allow it. But it's kludgey and it adds
overhead to the input streaming system to have to keep track of how many
bytes each transcoded XML character eats from the input.
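
For a feel of what that per-character accounting involves, here is a C++
sketch that walks UTF-8 input and notes the byte offset where each character
starts. It is not how the Xerces readers are implemented. utf8SeqLength and
charByteOffsets are hypothetical helpers, and the sketch assumes the input is
UTF-8 and does not validate continuation bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Number of bytes in the UTF-8 sequence that starts with this lead byte.
// Returns 1 for anything that is not a valid lead byte so that a caller
// always moves forward.
std::size_t utf8SeqLength(unsigned char lead) {
    if (lead < 0x80) return 1;            // 0xxxxxxx: ASCII
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 1;                             // stray continuation or invalid byte
}

// For each character in the input, records the byte offset at which its
// encoded form begins.
std::vector<std::size_t> charByteOffsets(const std::string& bytes) {
    std::vector<std::size_t> offsets;
    std::size_t pos = 0;
    while (pos < bytes.size()) {
        const std::size_t len =
            utf8SeqLength(static_cast<unsigned char>(bytes[pos]));
        assert(len >= 1 && len <= 4);  // guarantees forward progress
        offsets.push_back(pos);
        pos += len;
    }
    return offsets;
}
```

Character index and byte offset drift apart as soon as the input contains
anything outside ASCII, so the count has to be kept for every character that
is transcoded.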

--------------------------
Dean Roddey
The Charmed Quark Controller
Charmed Quark Software
[EMAIL PROTECTED]
http://www.charmedquark.com

"If it don't have a control port, don't buy it!"


----- Original Message -----
From: <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
Sent: Tuesday, April 23, 2002 11:13 AM
Subject: Re: RE: how to access the raw text that generated a sax event


>
> The other potential solution I've found is the XMLScanner's "getSrcOffset"
> method.  My only fear in using it is that it will give weird results if an
> XML document is comprised of more than 1 entity.
>
> Does "getSrcOffset" treat the document as a continuous sequence of bytes,
> or is it more low-level than that?
>
> -ted
>
>
> >
> > From: "Murphy, James" <[EMAIL PROTECTED]>
> > Date: 2002/04/23 Tue PM 02:01:40 EDT
> > To: "'[EMAIL PROTECTED]'" <[EMAIL PROTECTED]>
> > Subject: RE: how to access the raw text that generated a sax event
> >
> > Looking through the source...
> >
> > BinInputStream::curPos() const; looks promising since the built in input
> > sources actually implement it!  So you should be able to call this in your
> > SAX event handler methods if you provide your event handler class with the
> > InputSource you use to parse.
> >
> > I haven't tried it yet but I think it will just work.
> >
> > Jim
> >
> >
> >
> >
> > > -----Original Message-----
> > > From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
> > > Sent: Tuesday, April 23, 2002 1:33 PM
> > > To: [EMAIL PROTECTED]
> > > Subject: Re: how to access the raw text that generated a sax event
> > >
> > >
> > > "Murphy, James" <[EMAIL PROTECTED]> writes:
> > >
> > > > > I thought this would be really handy when parsing from a
> > > > > continuous buffer like a MemBufInputSource or a
> > > > > LocalFileInputSource.  I have a situation where I SAX parse _very_
> > > > > large XML instances looking for small repeating fragments.  These
> > > > > fragments are operated on individually by making a DOM and
> > > > > operating on those nodes in all sorts of application defined ways.
> > > > >
> > > > > If I had the functionality described by Ted, I could SAX the file
> > > > > and save off the starting and ending offsets into the large
> > > > > document, then post that info to a thread pool to process the
> > > > > fragments asynchronously.  In fact, I can use my Win32 memory
> > > > > mapped file input source to SAX the original large file and serve
> > > > > as a source to the DOM parser during the per work item processing.
> > > > > The way I'm doing it now involves _way_ too many buffer copies to
> > > > > be really fast - but it could be.
> > >
> > > Hey Jim,
> > >
> > > I agree. For the MAGE object model we're going to be routinely parsing
> > > > big chunks of scientific data, maybe 0.5 GB to 2.0 GB, and looking for
> > > certain pieces of the data. I'd like to be able to do lazy parsing,
> > > and just store the byte offsets to the bits that I want.
> > >
> > > There *has* to be some easy modification that we can make to subclass
> > > InputSource or XMLScanner to get this working. I don't know enough
> > > about the internals of how the scanner works, but if someone can clue
> > > me in a bit, I'd be happy to implement this.
> > >
> > > jas.
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > > For additional commands, e-mail: [EMAIL PROTECTED]
> > >

