If you can impose certain restrictions, don't even use the XML parser. Just
do a quick and dirty scan based on the known limitations of the format, and
break it up yourself at maximum speed.
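
As a rough sketch of what such a scan might look like (this is not Xerces
code, and it only holds under restrictions you'd have to guarantee yourself):
the C++ below walks a buffer once and records the byte range of every
<name>...</name> fragment. The names Span, endsName and scanFragments are
made up for this example. It assumes a single entity, an ASCII-compatible
encoding such as UTF-8, no nesting of the target element inside itself, end
tags written without whitespace, and no '>' or look-alike tag text inside
attribute values, comments, or CDATA sections.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Half-open byte range [begin, end) of one fragment inside the buffer.
struct Span {
    std::size_t begin;
    std::size_t end;
};

// True if c may directly follow an element name inside a start tag.
static bool endsName(char c) {
    return c == '>' || c == '/' || c == ' ' || c == '\t' ||
           c == '\r' || c == '\n';
}

// Collects the byte range of every <name ...>...</name> (or <name/>)
// fragment in doc. Hypothetical helper: it relies on the restrictions
// listed above and does no entity expansion or well-formedness checking.
std::vector<Span> scanFragments(std::string_view doc, std::string_view name) {
    const std::string open = "<" + std::string(name);
    const std::string close = "</" + std::string(name) + ">";
    std::vector<Span> spans;
    std::size_t pos = 0;
    while ((pos = doc.find(open, pos)) != std::string_view::npos) {
        const std::size_t after = pos + open.size();
        // Skip longer names that merely share the prefix, e.g. <itemList>.
        if (after >= doc.size() || !endsName(doc[after])) {
            pos = after;
            continue;
        }
        // End of the start tag (assumes no '>' inside attribute values).
        const std::size_t tagEnd = doc.find('>', after);
        if (tagEnd == std::string_view::npos) break;
        std::size_t end;
        if (doc[tagEnd - 1] == '/') {
            end = tagEnd + 1;              // empty element: <name/>
        } else {
            const std::size_t endTag = doc.find(close, tagEnd + 1);
            if (endTag == std::string_view::npos) break;  // truncated input
            end = endTag + close.size();
        }
        spans.push_back({pos, end});
        pos = end;                         // fragments never nest, so resume here
    }
    return spans;
}
```

With the offsets in hand, each range could be handed to a worker and parsed
on its own, e.g. via a MemBufInputSource pointed at that slice of the buffer.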

--------------------------
Dean Roddey
The Charmed Quark Controller
Charmed Quark Software
[EMAIL PROTECTED]
http://www.charmedquark.com

"If it don't have a control port, don't buy it!"


----- Original Message -----
From: "Murphy, James" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Tuesday, April 23, 2002 3:57 PM
Subject: RE: RE: how to access the raw text that generated a sax event


> You're right of course, that's a very sensible approach.
>
> But my client has an XML-based product to handle communication between
> trading partners.  The benefits of XML are significant since it is an
> integration product, and honestly the instance sizes are usually very
> manageable.  But 5% of the time we get these monsters to deal with.  Even
> more honestly, if I were to solve this better I'd limit the size of input
> documents the server can accept - killing a couple of birds with one stone.
> But at the moment clients seem not to like that. :)
>
> Jim
>
>
> > -----Original Message-----
> > From: Dean Roddey [mailto:[EMAIL PROTECTED]]
> > Sent: Tuesday, April 23, 2002 6:51 PM
> > To: [EMAIL PROTECTED]
> > Subject: Re: RE: how to access the raw text that generated a sax event
> >
> >
> > Of course, the counter argument to that is: use a format that's designed
> > to handle that reasonably. XML isn't, so why use it if it's not an optimal
> > (or even reasonable) format for this kind of thing?
> >
> >
> > ----- Original Message -----
> > From: "Murphy, James" <[EMAIL PROTECTED]>
> > To: <[EMAIL PROTECTED]>
> > Sent: Tuesday, April 23, 2002 2:31 PM
> > Subject: RE: RE: how to access the raw text that generated a sax event
> >
> >
> > > Fair enough Dean - I'm sympathetic to your point that Xerces was
> > > designed from an InfoSet perspective.  That's cool - but when we are
> > > writing for performance we are willing to make some Faustian bargains.
> > > Especially since, like Jason's, our environment stipulates single
> > > entities anyway.
> > >
> > > Jim
> > >
> > > > -----Original Message-----
> > > > From: Dean Roddey [mailto:[EMAIL PROTECTED]]
> > > > Sent: Tuesday, April 23, 2002 3:34 PM
> > > > To: [EMAIL PROTECTED]
> > > > Subject: Re: RE: how to access the raw text that generated a sax event
> > > >
> > > >
> > > > The source offset stuff is always relative to the entity, so if you
> > > > have internal or external entity references and such, you are going
> > > > to have to keep track of that fact. So if an entity reference to an
> > > > internal general entity contains elements (and it pretty much has to
> > > > contain whole elements), those offsets will be relative to that
> > > > entity's base, so you'd have to have access to the entity definitions
> > > > to get back to the raw stuff.
> > > >
> > > > Also, if you store away raw offsets, they might not make much sense
> > > > to you, because there will be no expansion of entities of any sort in
> > > > that raw data. If you go back and look at it, you will have to be
> > > > responsible for entity expansion if you want to know what its real
> > > > content is. And what if those entities are external? You'll have to
> > > > go through the DTD info, find the references to those external
> > > > entities, and expand those (which have entity references in them,
> > > > etc.), i.e. you'll end up writing a small XML parser if the data
> > > > contains much in the way of entity references.
> > > >
> > > > Anyway, the whole concept of getting back to the original raw XML
> > > > text is counter to what an XML parser is supposed to do, so it's
> > > > never going to be easy, because it wasn't designed to make that easy
> > > > or useful to do. I always argued that we should never even put any of
> > > > that stuff in there, but some big customers forced us to at least
> > > > allow it. But it's kludgey, and it adds overhead to the input
> > > > streaming system to have to keep track of how many bytes each
> > > > transcoded XML character eats from the input.
> > > >
> > > >
> > > > ----- Original Message -----
> > > > From: <[EMAIL PROTECTED]>
> > > > To: <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
> > > > Sent: Tuesday, April 23, 2002 11:13 AM
> > > > Subject: Re: RE: how to access the raw text that generated a sax event
> > > >
> > > >
> > > > >
> > > > > The other potential solution I've found is the XMLScanner's
> > > > > "getSrcOffset" method.  My only fear in using it is that it will
> > > > > give weird results if an XML document is made up of more than one
> > > > > entity.
> > > > >
> > > > > Does "getSrcOffset" treat the document as a continuous sequence of
> > > > > bytes, or is it more low-level than that?
> > > > >
> > > > > -ted
> > > > >
> > > > >
> > > > > >
> > > > > > From: "Murphy, James" <[EMAIL PROTECTED]>
> > > > > > Date: 2002/04/23 Tue PM 02:01:40 EDT
> > > > > > To: "'[EMAIL PROTECTED]'" <[EMAIL PROTECTED]>
> > > > > > Subject: RE: how to access the raw text that generated a sax event
> > > > > >
> > > > > > Looking through the source...
> > > > > >
> > > > > > BinInputStream::curPos() const; looks promising since the
> > > > > > built-in input sources actually implement it!  So you should be
> > > > > > able to call this in your SAX event handler methods if you
> > > > > > provide your event handler class with the InputSource you use to
> > > > > > parse.
> > > > > >
> > > > > > I haven't tried it yet but I think it will just work.
> > > > > >
> > > > > > Jim
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
> > > > > > > Sent: Tuesday, April 23, 2002 1:33 PM
> > > > > > > To: [EMAIL PROTECTED]
> > > > > > > Subject: Re: how to access the raw text that generated a sax event
> > > > > > >
> > > > > > >
> > > > > > > "Murphy, James" <[EMAIL PROTECTED]> writes:
> > > > > > >
> > > > > > > > I thought this would be really handy when parsing from a
> > > > > > > > continuous buffer like a MemBufInputSource or a
> > > > > > > > LocalFileInputSource.  I have a situation where I SAX parse
> > > > > > > > _very_ large XML instances looking for small repeating
> > > > > > > > fragments.  These fragments are operated on individually by
> > > > > > > > building a DOM and operating on its nodes in all sorts of
> > > > > > > > application-defined ways.
> > > > > > > >
> > > > > > > > If I had the functionality described by Ted, I could SAX the
> > > > > > > > file and save off the starting and ending offsets into the
> > > > > > > > large document, then post that info to a thread pool to
> > > > > > > > process the fragments asynchronously.  In fact, I can use my
> > > > > > > > Win32 memory-mapped file input source both to SAX the
> > > > > > > > original large file and to serve as a source to the DOM
> > > > > > > > parser during the per-work-item processing.  The way I'm
> > > > > > > > doing it now involves _way_ too many buffer copies to be
> > > > > > > > really fast - but it could be.
> > > > > > >
> > > > > > > Hey Jim,
> > > > > > >
> > > > > > > I agree. For the MAGE object model we're going to be routinely
> > > > > > > parsing big chunks of scientific data, maybe 0.5 GB to 2.0 GB,
> > > > > > > and looking for certain pieces of the data. I'd like to be
> > > > > > > able to do lazy parsing, and just store the byte offsets to
> > > > > > > the bits that I want.
> > > > > > >
> > > > > > > There *has* to be some easy modification that we can make to
> > > > > > > subclass InputSource or XMLScanner to get this working. I
> > > > > > > don't know enough about the internals of how the scanner
> > > > > > > works, but if someone can clue me in a bit, I'd be happy to
> > > > > > > implement this.
> > > > > > >
> > > > > > > jas.
> > > > > > >
> > > > > > >
> > > > > > > ---------------------------------------------------------------------
> > > > > > > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > > > > > > For additional commands, e-mail: [EMAIL PROTECTED]

