If you can impose certain restrictions, don't even use the XML parser. Just do a fast and dirty scan based on the known limitations of the format, and break it up yourself at maximum speed.
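As a rough illustration of that kind of scan, here is a minimal sketch in C++17. It is not Xerces code. The helper name `scanFragments` and the element name `item` are made up for the example, and it only holds up under restrictions like the ones discussed in this thread: a single entity, no entity references that matter, no nested elements of the same name, no empty-element tags such as `<item/>`, no comments or CDATA sections containing look-alike tags, and an ASCII-compatible encoding such as UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// Hypothetical helper: returns the [begin, end) byte offsets of every
// <name ...>...</name> fragment in doc, using plain substring search.
// Assumes the restrictions listed above; it is NOT a conforming XML parser.
std::vector<std::pair<std::size_t, std::size_t>>
scanFragments(std::string_view doc, std::string_view name)
{
    assert(!name.empty());
    const std::string open  = "<" + std::string(name);
    const std::string close = "</" + std::string(name) + ">";

    std::vector<std::pair<std::size_t, std::size_t>> out;
    std::size_t pos = 0;
    while ((pos = doc.find(open, pos)) != std::string_view::npos) {
        const std::size_t after = pos + open.size();
        if (after >= doc.size())
            break;  // buffer ends inside a start tag

        // Accept only a complete tag name: "<item>" or "<item attr...>",
        // not a longer name such as "<itemList>".
        const char c = doc[after];
        if (c != '>' && c != ' ' && c != '\t' && c != '\n' && c != '\r') {
            pos = after;
            continue;
        }

        std::size_t end = doc.find(close, after);
        if (end == std::string_view::npos)
            break;  // no matching end tag; stop scanning
        end += close.size();

        out.emplace_back(pos, end);  // raw byte range of this fragment
        pos = end;
    }
    return out;
}
```

Calling `scanFragments(buffer, "item")` on a memory-mapped file would give byte ranges that could be handed to worker threads, each parsing only its own slice. If any of the stated restrictions does not hold, the offsets can be wrong, which is the trade-off being made for speed.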
--------------------------
Dean Roddey
The Charmed Quark Controller
Charmed Quark Software
[EMAIL PROTECTED]
http://www.charmedquark.com

"If it don't have a control port, don't buy it!"

----- Original Message -----
From: "Murphy, James" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Tuesday, April 23, 2002 3:57 PM
Subject: RE: RE: how to access the raw text that generated a sax event

> You're right of course, that's a very sensible approach.
>
> But my client has an XML based product to handle communication between
> trading partners. The benefits of XML are significant since it is an
> integration product and honestly the instance sizes are usually very
> manageable. But 5% of the time we get these monsters to deal with. Even
> more honestly, if I were to solve this better I'd limit the size of input
> documents the server can accept - killing a couple birds with one stone.
> But, at the moment clients seem to not like that. :)
>
> Jim
>
> > -----Original Message-----
> > From: Dean Roddey [mailto:[EMAIL PROTECTED]]
> > Sent: Tuesday, April 23, 2002 6:51 PM
> > To: [EMAIL PROTECTED]
> > Subject: Re: RE: how to access the raw text that generated a sax event
> >
> > Of course, the counter argument to that is: Use a format that's designed to
> > handle that reasonably. XML isn't, so why use it if it's not an optimal (or
> > even reasonable) format to use for this kind of thing?
> >
> > ----- Original Message -----
> > From: "Murphy, James" <[EMAIL PROTECTED]>
> > To: <[EMAIL PROTECTED]>
> > Sent: Tuesday, April 23, 2002 2:31 PM
> > Subject: RE: RE: how to access the raw text that generated a sax event
> >
> > > Fair enough Dean - I'm sympathetic to your point that Xerces was designed
> > > from an InfoSet perspective.
> > > That's cool - but when you are writing for
> > > performance we are willing to make some Faustian bargains. Especially
> > > since, like Jason, our environment stipulates single entities anyway.
> > >
> > > Jim
> > >
> > > > -----Original Message-----
> > > > From: Dean Roddey [mailto:[EMAIL PROTECTED]]
> > > > Sent: Tuesday, April 23, 2002 3:34 PM
> > > > To: [EMAIL PROTECTED]
> > > > Subject: Re: RE: how to access the raw text that generated a sax event
> > > >
> > > > The source offset stuff is always relative to the entity, so if you have
> > > > internal or external entity references and such, you are going to have to
> > > > keep up with that fact. So if an entity reference to an internal general
> > > > entity contains elements (and it pretty much has to contain whole elements),
> > > > those offsets will be relative to that entity's base, so you'd have to have
> > > > access to the entity definitions to get back to the raw stuff.
> > > >
> > > > Also, if you store away raw offsets, it might not make much sense to you
> > > > because there will be no expansion of entities of any sort in that raw data,
> > > > so if you go back and look at it, you will have to be responsible for entity
> > > > expansion if you want to know what its real content is. And what if those
> > > > entities are external? You'll have to go through the DTD info and find the
> > > > references to those external entities and expand those (which have entity
> > > > references in them, etc...), i.e. you'll end up writing a small XML parser if
> > > > the data contains much in the way of entity references.
> > > >
> > > > Anyway, the whole concept of getting back to the original raw XML text is
> > > > counter to what an XML parser is supposed to do, so it's never going to be
> > > > easy because it wasn't designed to make that easy or useful to do.
> > > > I always argued that we should never even put any of that stuff in there,
> > > > but some big customers forced us to at least allow it. But it's kludgey
> > > > and it adds overhead to the input streaming system to have to keep track
> > > > of how many bytes each transcoded XML character eats from the input.
> > > >
> > > > ----- Original Message -----
> > > > From: <[EMAIL PROTECTED]>
> > > > To: <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
> > > > Sent: Tuesday, April 23, 2002 11:13 AM
> > > > Subject: Re: RE: how to access the raw text that generated a sax event
> > > >
> > > > > The other potential solution I've found is the XMLScanner's "getSrcOffset"
> > > > > method. My only fear in using it is that it will give weird results if an
> > > > > XML document is comprised of more than 1 entity.
> > > > >
> > > > > Does "getSrcOffset" treat the document as a continuous sequence of bytes,
> > > > > or is it more low-level than that?
> > > > >
> > > > > -ted
> > > > >
> > > > > > From: "Murphy, James" <[EMAIL PROTECTED]>
> > > > > > Date: 2002/04/23 Tue PM 02:01:40 EDT
> > > > > > To: "'[EMAIL PROTECTED]'" <[EMAIL PROTECTED]>
> > > > > > Subject: RE: how to access the raw text that generated a sax event
> > > > > >
> > > > > > Looking through the source...
> > > > > >
> > > > > > BinInputStream::curPos() const; looks promising since the built in input
> > > > > > sources actually implement it! So you should be able to call this in your
> > > > > > SAX event handler methods if you provide your event handler class with the
> > > > > > InputSource you use to parse.
> > > > > >
> > > > > > I haven't tried it yet but I think it will just work.
> > > > > >
> > > > > > Jim
> > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
> > > > > > > Sent: Tuesday, April 23, 2002 1:33 PM
> > > > > > > To: [EMAIL PROTECTED]
> > > > > > > Subject: Re: how to access the raw text that generated a sax event
> > > > > > >
> > > > > > > "Murphy, James" <[EMAIL PROTECTED]> writes:
> > > > > > >
> > > > > > > > I thought this would be really handy when parsing from a continuous buffer
> > > > > > > > like a MemBufInputSource or a LocalFileInputSource. I have a situation
> > > > > > > > where I SAX parse _very_ large XML instances looking for small repeating
> > > > > > > > fragments. These fragments are operated on individually by making a DOM and
> > > > > > > > operating on those nodes in all sorts of application defined ways.
> > > > > > > >
> > > > > > > > If I had the functionality described by Ted, I could SAX the file and save
> > > > > > > > off the starting and ending offsets into the large document. Post that info
> > > > > > > > to a thread pool to process the fragments asynchronously. In fact, I can
> > > > > > > > use my Win32 memory mapped file input source to SAX the original large file
> > > > > > > > and serve as a source to the DOM parser during the per work item processing.
> > > > > > > > The way I'm doing it now involves _way_ too many buffer copies to be really
> > > > > > > > fast - but it could be.
> > > > > > >
> > > > > > > Hey Jim,
> > > > > > >
> > > > > > > I agree.
> > > > > > > For the MAGE object model we're going to be routinely parsing
> > > > > > > big chunks of scientific data, maybe 0.5 GB to 2.0 GB, and looking for
> > > > > > > certain pieces of the data. I'd like to be able to do lazy parsing,
> > > > > > > and just store the byte offsets to the bits that I want.
> > > > > > >
> > > > > > > There *has* to be some easy modification that we can make to subclass
> > > > > > > InputSource or XMLScanner to get this working. I don't know enough
> > > > > > > about the internals of how the scanner works, but if someone can clue
> > > > > > > me in a bit, I'd be happy to implement this.
> > > > > > >
> > > > > > > jas.
> > > > > > >
> > > > > > > ---------------------------------------------------------------------
> > > > > > > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > > > > > > For additional commands, e-mail: [EMAIL PROTECTED]
