The other potential solution I've found is XMLScanner's getSrcOffset() method. My only fear in using it is that it will give weird results if an XML document is composed of more than one entity.
Does getSrcOffset() treat the document as a continuous sequence of bytes, or is it more low-level than that?

-ted

> From: "Murphy, James" <[EMAIL PROTECTED]>
> Date: 2002/04/23 Tue PM 02:01:40 EDT
> To: "'[EMAIL PROTECTED]'" <[EMAIL PROTECTED]>
> Subject: RE: how to access the raw text that generated a sax event
>
> Looking through the source...
>
> BinInputStream::curPos() const; looks promising, since the built-in input
> sources actually implement it. So you should be able to call this in your
> SAX event handler methods if you provide your event handler class with the
> InputSource you use to parse.
>
> I haven't tried it yet, but I think it will just work.
>
> Jim
>
> > -----Original Message-----
> > From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
> > Sent: Tuesday, April 23, 2002 1:33 PM
> > To: [EMAIL PROTECTED]
> > Subject: Re: how to access the raw text that generated a sax event
> >
> > "Murphy, James" <[EMAIL PROTECTED]> writes:
> >
> > > I thought this would be really handy when parsing from a continuous
> > > buffer like a MemBufInputSource or a LocalFileInputSource. I have a
> > > situation where I SAX parse _very_ large XML instances looking for
> > > small repeating fragments. These fragments are operated on
> > > individually by making a DOM to operate on those nodes in all sorts
> > > of application-defined ways.
> > >
> > > If I had the functionality described by Ted, I could SAX the file
> > > and save off the starting and ending offsets into the large
> > > document, then post that info to a thread pool to process the
> > > fragments asynchronously. In fact, I can use my Win32 memory-mapped
> > > file input source to SAX the original large file and have it serve
> > > as a source to the DOM parser during the per-work-item processing.
> > > The way I'm doing it now involves _way_ too many buffer copies to
> > > be really fast - but it could be.
> >
> > Hey Jim,
> >
> > I agree.
> > For the MAGE object model we're going to be routinely parsing big
> > chunks of scientific data, maybe 0.5 GB to 2.0 GB, and looking for
> > certain pieces of the data. I'd like to be able to do lazy parsing,
> > and just store the byte offsets to the bits that I want.
> >
> > There *has* to be some easy modification that we can make to subclass
> > InputSource or XMLScanner to get this working. I don't know enough
> > about the internals of how the scanner works, but if someone can clue
> > me in a bit, I'd be happy to implement this.
> >
> > jas.
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
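To make the pattern under discussion concrete, here is a minimal, self-contained sketch of the record-offsets-then-parse-lazily idea. It does not use Xerces at all: findFragments, extractFragment, and the naive tag search are illustrative stand-ins for a real SAX pass that would obtain byte positions from something like BinInputStream::curPos() or XMLScanner::getSrcOffset().

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// One streaming pass over the big buffer records the byte span of each
// interesting fragment; the heavy per-fragment work (e.g. building a DOM)
// happens later against the same buffer. This is a toy scanner, not Xerces:
// a real implementation would record offsets inside SAX event callbacks.
std::vector<std::pair<std::size_t, std::size_t>>
findFragments(const std::string& doc, const std::string& tag) {
    const std::string open = "<" + tag + ">";
    const std::string close = "</" + tag + ">";
    std::vector<std::pair<std::size_t, std::size_t>> spans;
    std::size_t pos = 0;
    while ((pos = doc.find(open, pos)) != std::string::npos) {
        std::size_t end = doc.find(close, pos);
        if (end == std::string::npos) break;  // unbalanced: stop scanning
        end += close.size();                  // span is [start, end)
        spans.emplace_back(pos, end);
        pos = end;
    }
    return spans;
}

// Later, a worker pulls one fragment out of the large document by offset
// alone. With a memory-mapped file input source this would be a pointer
// plus a length, with no copy of anything outside the fragment.
std::string extractFragment(const std::string& doc,
                            std::pair<std::size_t, std::size_t> span) {
    return doc.substr(span.first, span.second - span.first);
}
```

Each recorded span could then be posted to a thread pool, as Jim describes, with workers DOM-parsing only their fragment. Note that with a real parser the reported position may reflect read-ahead buffering, and, as Ted worries above, may behave differently across multiple entities; both are worth verifying before relying on the offsets.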
