You're right of course, that's a very sensible approach.  

But my client has an XML-based product to handle communication between
trading partners.  The benefits of XML are significant since it is an
integration product, and honestly the instance sizes are usually very
manageable.  But 5% of the time we get these monsters to deal with.  Even
more honestly, if I were to solve this better I'd limit the size of input
documents the server can accept - killing a couple of birds with one stone.
But at the moment clients seem not to like that. :)  
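
For reference, the offset idea from earlier in this thread (SAX the large
instance once, save the start and end byte offsets of each small fragment,
then hand those ranges to workers) might look roughly like the sketch
below.  It is only an illustration under loud assumptions: FragmentSpan,
findFragments and sliceFragment are hypothetical names, not Xerces APIs, a
plain substring search stands in for the SAX pass, the document is assumed
to be a single entity held in one contiguous buffer, and C++17 is assumed
for std::string_view.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Half-open byte range [begin, end) into the original document buffer.
struct FragmentSpan {
    std::size_t begin;
    std::size_t end;
};

// Hypothetical stand-in for the SAX pass: locates non-nested
// <tag>...</tag> fragments by substring search. This is NOT an XML
// parser; it ignores attributes, comments, CDATA and entity references.
std::vector<FragmentSpan> findFragments(std::string_view doc,
                                        const std::string& tag) {
    const std::string open = "<" + tag + ">";
    const std::string close = "</" + tag + ">";
    std::vector<FragmentSpan> spans;
    std::size_t pos = 0;
    while ((pos = doc.find(open, pos)) != std::string_view::npos) {
        const std::size_t stop = doc.find(close, pos + open.size());
        if (stop == std::string_view::npos) break;  // unterminated fragment
        spans.push_back({pos, stop + close.size()});
        pos = stop + close.size();
    }
    return spans;
}

// Returns a view of one fragment; no bytes are copied.
std::string_view sliceFragment(std::string_view doc, FragmentSpan span) {
    assert(span.begin <= span.end && span.end <= doc.size());
    return doc.substr(span.begin, span.end - span.begin);
}
```

In a real run the spans would come from whatever offsets the parser
reports rather than from a string search, and each view could be handed to
a worker as the input for a per-fragment DOM parse.  The entity caveat
Dean describes below still applies: a raw byte range is only meaningful
this way when the document is a single entity.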

Jim


> -----Original Message-----
> From: Dean Roddey [mailto:[EMAIL PROTECTED]]
> Sent: Tuesday, April 23, 2002 6:51 PM
> To: [EMAIL PROTECTED]
> Subject: Re: RE: how to access the raw text that generated a sax event
> 
> 
> Of course, the counter argument to that is: Use a format that's
> designed to handle that reasonably. XML isn't, so why use it if it's
> not an optimal (or even reasonable) format to use for this kind of
> thing?
> 
> --------------------------
> Dean Roddey
> The Charmed Quark Controller
> Charmed Quark Software
> [EMAIL PROTECTED]
> http://www.charmedquark.com
> 
> "If it don't have a control port, don't buy it!"
> 
> 
> ----- Original Message -----
> From: "Murphy, James" <[EMAIL PROTECTED]>
> To: <[EMAIL PROTECTED]>
> Sent: Tuesday, April 23, 2002 2:31 PM
> Subject: RE: RE: how to access the raw text that generated a sax event
> 
> 
> > Fair enough Dean - I'm sympathetic to your point that Xerces was
> > designed from an InfoSet perspective.  That's cool - but when we are
> > writing for performance we are willing to make some Faustian
> > bargains.  Especially since, like Jason, our environment stipulates
> > single entities anyway.
> >
> > Jim
> >
> > > -----Original Message-----
> > > From: Dean Roddey [mailto:[EMAIL PROTECTED]]
> > > Sent: Tuesday, April 23, 2002 3:34 PM
> > > To: [EMAIL PROTECTED]
> > > Subject: Re: RE: how to access the raw text that generated a sax event
> > >
> > >
> > > The source offset stuff is always relative to the entity, so if you
> > > have internal or external entity references and such, you are going
> > > to have to keep up with that fact. So if an entity reference to an
> > > internal general entity contains elements (and it pretty much has
> > > to contain whole elements), those offsets will be relative to that
> > > entity's base, so you'd have to have access to the entity
> > > definitions to get back to the raw stuff.
> > >
> > > Also, if you store away raw offsets, it might not make much sense
> > > to you because there will be no expansion of entities of any sort
> > > in that raw data, so if you go back and look at it, you will have
> > > to be responsible for entity expansion if you want to know what its
> > > real content is. And what if those entities are external? You'll
> > > have to go through the DTD info and find the references to those
> > > external entities and expand those (which have entity references in
> > > them, etc...), i.e. you'll end up writing a small XML parser if the
> > > data contains much in the way of entity references.
> > >
> > > Anyway, the whole concept of getting back to the original raw XML
> > > text is counter to what an XML parser is supposed to do, so it's
> > > never going to be easy because it wasn't designed to make that easy
> > > or useful to do. I always argued that we should never even put any
> > > of that stuff in there, but some big customers forced us to at
> > > least allow it. But it's kludgey and it adds overhead to the input
> > > streaming system to have to keep track of how many bytes each
> > > transcoded XML character eats from the input.
> > >
> > > --------------------------
> > > Dean Roddey
> > > The Charmed Quark Controller
> > > Charmed Quark Software
> > > [EMAIL PROTECTED]
> > > http://www.charmedquark.com
> > >
> > > "If it don't have a control port, don't buy it!"
> > >
> > >
> > > ----- Original Message -----
> > > From: <[EMAIL PROTECTED]>
> > > To: <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
> > > Sent: Tuesday, April 23, 2002 11:13 AM
> > > Subject: Re: RE: how to access the raw text that generated a sax event
> > >
> > >
> > > >
> > > > The other potential solution I've found is the XMLScanner's
> > > > "getSrcOffset" method.  My only fear in using it is that it will
> > > > give weird results if an XML document is comprised of more than
> > > > 1 entity.
> > > >
> > > > Does "getSrcOffset" treat the document as a continuous sequence
> > > > of bytes, or is it more low-level than that?
> > > >
> > > > -ted
> > > >
> > > >
> > > > >
> > > > > From: "Murphy, James" <[EMAIL PROTECTED]>
> > > > > Date: 2002/04/23 Tue PM 02:01:40 EDT
> > > > > To: "'[EMAIL PROTECTED]'" <[EMAIL PROTECTED]>
> > > > > Subject: RE: how to access the raw text that generated a sax event
> > > > >
> > > > > Looking through the source...
> > > > >
> > > > > BinInputStream::curPos() const; looks promising since the
> > > > > built-in input sources actually implement it!  So you should be
> > > > > able to call this in your SAX event handler methods if you
> > > > > provide your event handler class with the InputSource you use
> > > > > to parse.
> > > > >
> > > > > I haven't tried it yet but I think it will just work.
> > > > >
> > > > > Jim
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
> > > > > > Sent: Tuesday, April 23, 2002 1:33 PM
> > > > > > To: [EMAIL PROTECTED]
> > > > > > Subject: Re: how to access the raw text that generated a sax event
> > > > > >
> > > > > >
> > > > > > "Murphy, James" <[EMAIL PROTECTED]> writes:
> > > > > >
> > > > > > > I thought this would be really handy when parsing from a
> > > > > > > continuous buffer like a MemBufInputSource or a
> > > > > > > LocalFileInputSource.  I have a situation where I SAX parse
> > > > > > > _very_ large XML instances looking for small repeating
> > > > > > > fragments.  These fragments are operated on individually by
> > > > > > > making a DOM and operating on those nodes in all sorts of
> > > > > > > application defined ways.
> > > > > > >
> > > > > > > If I had the functionality described by Ted, I could SAX
> > > > > > > the file and save off the starting and ending offsets into
> > > > > > > the large document, then post that info to a thread pool to
> > > > > > > process the fragments asynchronously.  In fact, I can use
> > > > > > > my Win32 memory mapped file input source to SAX the
> > > > > > > original large file and serve as a source to the DOM parser
> > > > > > > during the per work item processing.  The way I'm doing it
> > > > > > > now involves _way_ too many buffer copies to be really fast
> > > > > > > - but it could be.
> > > > > >
> > > > > > Hey Jim,
> > > > > >
> > > > > > I agree. For the MAGE object model we're going to be
> > > > > > routinely parsing big chunks of scientific data, maybe
> > > > > > 0.5 Gb => 2.0 Gb, and looking for certain pieces of the data.
> > > > > > I'd like to be able to do lazy parsing, and just store the
> > > > > > byte offsets to the bits that I want.
> > > > > >
> > > > > > There *has* to be some easy modification that we can make to
> > > > > > subclass InputSource or XMLScanner to get this working. I
> > > > > > don't know enough about the internals of how the scanner
> > > > > > works, but if someone can clue me in a bit, I'd be happy to
> > > > > > implement this.
> > > > > >
> > > > > > jas.
> > > > > >
> > > > > >
> > > > > > ---------------------------------------------------------------------
> > > > > > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > > > > > For additional commands, e-mail: [EMAIL PROTECTED]
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > >
> > > >
> > > >
> > > >
> > >
> > >
> > >
> >
> >
> 
> 
> 

