RE: RE: how to access the raw text that generated a sax event

Murphy, James Tue, 23 Apr 2002 14:17:32 -0700

Fair enough, Dean - I'm sympathetic to your point that Xerces was designed
from an InfoSet perspective.  That's cool - but when we are writing for
performance we are willing to make some Faustian bargains.  Especially
since, like Jason, our environment stipulates single entities anyway.
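
For the single-entity case, the bargain amounts to roughly the sketch
below: keep the whole document in one contiguous buffer, remember a
start/end byte offset per fragment, and hand out views into the buffer
instead of copies. This is plain C++17 with no Xerces calls. ByteRange,
findFragments and slice are names made up for illustration, and the
substring search is only a stand-in for a SAX handler that would record
the parser's source offset in its start/end element callbacks.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Half-open byte range [begin, end) into the one buffer holding the document.
struct ByteRange {
    std::size_t begin;
    std::size_t end;
};

// Stand-in for the SAX pass (an assumption, not Xerces code): a real handler
// would record offsets reported by the parser. A plain substring search plays
// that role here; it ignores nesting, attributes, comments and CDATA.
std::vector<ByteRange> findFragments(std::string_view doc, std::string_view tag) {
    const std::string open = "<" + std::string(tag) + ">";
    const std::string close = "</" + std::string(tag) + ">";
    std::vector<ByteRange> ranges;
    std::size_t pos = 0;
    while ((pos = doc.find(open, pos)) != std::string_view::npos) {
        const std::size_t stop = doc.find(close, pos);
        if (stop == std::string_view::npos) break;  // unterminated fragment
        ranges.push_back({pos, stop + close.size()});
        pos = stop + close.size();
    }
    return ranges;
}

// Returns a view into the original buffer; no bytes are copied.
std::string_view slice(std::string_view doc, ByteRange r) {
    assert(r.begin <= r.end && r.end <= doc.size());
    return doc.substr(r.begin, r.end - r.begin);
}
```

Each ByteRange is small enough to post to a work queue, and the worker
can feed slice(doc, range) to whatever parses the fragment, as long as
the buffer outlives the work items. None of this holds once the document
spans more than one entity.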


Jim

> -----Original Message-----
> From: Dean Roddey [mailto:[EMAIL PROTECTED]]
> Sent: Tuesday, April 23, 2002 3:34 PM
> To: [EMAIL PROTECTED]
> Subject: Re: RE: how to access the raw text that generated a sax event
> 
> 
> The source offset stuff is always relative to the entity, so if you have
> internal or external entity references and such, you are going to have to
> keep up with that fact. So if an entity reference to an internal general
> entity contains elements (and it pretty much has to contain whole elements),
> those offsets will be relative to that entity's base, so you'd have to have
> access to the entity definitions to get back to the raw stuff.
> 
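To make the entity-relative point concrete, here is a minimal sketch in
plain C++17 with no Xerces calls. EntityPos, EntityTable and rawText are
names made up for illustration, and the table of raw entity text is an
assumption: something the application would have to build itself from
the entity definitions. The point is only that an offset is meaningless
unless it is stored together with the entity it was measured in.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <string_view>

// A position only means something together with its entity.
struct EntityPos {
    std::string entity;   // hypothetical key; "" stands for the document entity
    std::size_t offset;   // bytes from the start of that entity's own text
    std::size_t length;   // number of bytes in the span
};

// Hypothetical store of each entity's raw, unexpanded text.
using EntityTable = std::map<std::string, std::string>;

// Returns the raw text for a position, or an empty view if the entity is
// unknown or the span runs past the end of that entity's text.
std::string_view rawText(const EntityTable& entities, const EntityPos& pos) {
    const auto it = entities.find(pos.entity);
    if (it == entities.end()) return {};
    const std::string& text = it->second;
    if (pos.offset > text.size() || pos.length > text.size() - pos.offset) return {};
    return std::string_view(text).substr(pos.offset, pos.length);
}
```

With a document entity "<doc>&chunk;</doc>" and an entity "chunk"
holding "<item>a</item>", offset 0 names different bytes depending on
which entity it is resolved against, which is why a bare offset is not
enough once entity references are involved.
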
> Also, if you store away raw offsets, it might not make much sense to you
> because there will be no expansion of entities of any sort in that raw data,
> so if you go back and look at it, you will have to be responsible for entity
> expansion if you want to know what its real content is. And what if those
> entities are external? You'll have to go through the DTD info and find the
> references to those external entities and expand those (which have entity
> references in them, etc...), i.e. you'll end up writing a small XML parser if
> the data contains much in the way of entity references.
> 
> Anyway, the whole concept of getting back to the original raw XML text is
> counter to what an XML parser is supposed to do, so it's never going to be
> easy because it wasn't designed to make that easy or useful to do. I always
> argued that we should never even put any of that stuff in there, but some big
> customers forced us to at least allow it. But it's kludgey and it adds
> overhead to the input streaming system to have to keep track of how many
> bytes each transcoded XML character eats from the input.
> 
> --------------------------
> Dean Roddey
> The Charmed Quark Controller
> Charmed Quark Software
> [EMAIL PROTECTED]
> http://www.charmedquark.com
> 
> "If it don't have a control port, don't buy it!"
> 
> 
> ----- Original Message -----
> From: <[EMAIL PROTECTED]>
> To: <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
> Sent: Tuesday, April 23, 2002 11:13 AM
> Subject: Re: RE: how to access the raw text that generated a sax event
> 
> 
> >
> > The other potential solution I've found is the XMLScanner's "getSrcOffset"
> > method.  My only fear in using it is that it will give weird results if an
> > XML document is comprised of more than 1 entity.
> >
> > Does "getSrcOffset" treat the document as a continuous sequence of bytes,
> > or is it more low-level than that?
> >
> > -ted
> >
> >
> > >
> > > From: "Murphy, James" <[EMAIL PROTECTED]>
> > > Date: 2002/04/23 Tue PM 02:01:40 EDT
> > > To: "'[EMAIL PROTECTED]'" <[EMAIL PROTECTED]>
> > > Subject: RE: how to access the raw text that generated a sax event
> > >
> > > Looking through the source...
> > >
> > > > BinInputStream::curPos() const; looks promising since the built-in input
> > > > sources actually implement it!  So you should be able to call this in your
> > > > SAX event handler methods if you provide your event handler class with the
> > > > InputSource you use to parse.
> > >
> > > I haven't tried it yet but I think it will just work.
> > >
> > > Jim
> > >
> > >
> > >
> > >
> > > > -----Original Message-----
> > > > > From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
> > > > > Sent: Tuesday, April 23, 2002 1:33 PM
> > > > > To: [EMAIL PROTECTED]
> > > > > Subject: Re: how to access the raw text that generated a sax event
> > > >
> > > >
> > > > "Murphy, James" <[EMAIL PROTECTED]> writes:
> > > >
> > > > > > I thought this would be really handy when parsing from a continuous
> > > > > > buffer like a MemBufInputSource or a LocalFileInputSource.  I have a
> > > > > > situation where I SAX parse _very_ large XML instances looking for
> > > > > > small repeating fragments.  These fragments are operated on
> > > > > > individually by making a DOM and operating on those nodes in all
> > > > > > sorts of application-defined ways.
> > > > > >
> > > > > > If I had the functionality described by Ted, I could SAX the file and
> > > > > > save off the starting and ending offsets into the large document, then
> > > > > > post that info to a thread pool to process the fragments
> > > > > > asynchronously.  In fact, I can use my Win32 memory mapped file input
> > > > > > source to SAX the original large file and serve as a source to the DOM
> > > > > > parser during the per work item processing.  The way I'm doing it now
> > > > > > involves _way_ too many buffer copies to be really fast - but it
> > > > > > could be.
> > > >
> > > > Hey Jim,
> > > >
> > > > > I agree. For the MAGE object model we're going to be routinely parsing
> > > > > big chunks of scientific data, maybe 0.5 Gb => 2.0 Gb, and looking for
> > > > > certain pieces of the data. I'd like to be able to do lazy parsing,
> > > > > and just store the byte offsets to the bits that I want.
> > > > >
> > > > > There *has* to be some easy modification that we can make to subclass
> > > > > InputSource or XMLScanner to get this working. I don't know enough
> > > > > about the internals of how the scanner works, but if someone can clue
> > > > > me in a bit, I'd be happy to implement this.
> > > >
> > > > jas.
> > > >
> > > > 
> ---------------------------------------------------------------------
> > > > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > > > For additional commands, e-mail: 
> [EMAIL PROTECTED]
> > > >
> > >
> > > 
> ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > > For additional commands, e-mail: [EMAIL PROTECTED]
> > >
> > >
> >
> >
> > 
> ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
> >
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 

