Right, that makes sense after another look. As you say, since readBytes()
is used to fill a local buffer, curPos() advances in buffer-sized chunks
rather than being incremented at each SAX event.
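
To convince myself, here is a small toy model of that behaviour. It is not
Xerces code: CountingStream, ChunkedReader and positions_seen are
hypothetical names, and read_bytes()/cur_pos() only mimic the idea of
readBytes()/curPos(). The reader hands out one byte at a time but refills
from the stream a whole buffer at a time, so the stream position seen by
the consumer jumps in steps of the buffer size.

```python
import io


class CountingStream:
    """Hypothetical stand-in for an input stream with readBytes()/curPos()."""

    def __init__(self, data):
        self._src = io.BytesIO(data)

    def read_bytes(self, max_bytes):
        return self._src.read(max_bytes)

    def cur_pos(self):
        return self._src.tell()


class ChunkedReader:
    """Hands out one byte at a time, but refills from the stream in chunks."""

    def __init__(self, stream, buf_size):
        self._stream = stream
        self._buf_size = buf_size
        self._buf = b""
        self._idx = 0

    def next_byte(self):
        if self._idx >= len(self._buf):
            # Buffer exhausted: pull a whole chunk from the stream.
            self._buf = self._stream.read_bytes(self._buf_size)
            self._idx = 0
            if not self._buf:
                return None  # end of stream
        value = self._buf[self._idx]
        self._idx += 1
        return value


def positions_seen(data, buf_size):
    """Record the stream position reported after each byte is consumed."""
    stream = CountingStream(data)
    reader = ChunkedReader(stream, buf_size)
    seen = []
    while reader.next_byte() is not None:
        seen.append(stream.cur_pos())
    return seen


print(positions_seen(b"<a><b/></a>", 4))
```

With an 11-byte document and a 4-byte buffer, the reported position only
ever takes the values 4, 8 and 11, no matter which byte the consumer is
actually looking at.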

I like Dean's suggestion of writing your own quick-and-dirty scanner by
hand to rip through the large documents and shred them into more manageable
fragments.
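
Something along these lines might be enough, as a rough sketch rather than
a real parser. The function name shred() and the <item> element are made up
for illustration. It assumes the fragments of interest are delimited by a
literal start tag and end tag, are never nested, never appear inside
comments or CDATA sections, and that the encoding is ASCII-compatible. It
reads the input in chunks and yields each fragment together with its byte
offset in the original stream.

```python
import io


def shred(stream, start_tag, end_tag, chunk_size=64 * 1024):
    """Yield (byte_offset, fragment) for each start_tag ... end_tag span.

    Assumes literal, non-nested tags; an unterminated fragment at the end
    of the input is silently dropped.
    """
    buf = b""
    base = 0  # byte offset in the stream of buf[0]
    while True:
        chunk = stream.read(chunk_size)
        buf += chunk
        while True:
            start = buf.find(start_tag)
            if start < 0:
                # No start tag yet: keep only a tail that could hold a
                # start tag split across two chunks.
                drop = max(0, len(buf) - (len(start_tag) - 1))
                base += drop
                buf = buf[drop:]
                break
            end = buf.find(end_tag, start)
            if end < 0:
                # Fragment is incomplete: discard what precedes it and
                # wait for more data.
                base += start
                buf = buf[start:]
                break
            end += len(end_tag)
            yield base + start, buf[start:end]
            base += end
            buf = buf[end:]
        if not chunk:
            break


doc = b"<root><item>a</item>\n<item>bb</item></root>"
for offset, fragment in shred(io.BytesIO(doc), b"<item>", b"</item>", chunk_size=5):
    print(offset, fragment)
```

Each fragment could then be handed to the parser on its own, and since the
scanner works on raw bytes, the offsets are byte offsets regardless of how
the characters are encoded.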

Jim



> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
> Sent: Tuesday, April 23, 2002 10:32 PM
> To: [EMAIL PROTECTED]
> Subject: Re: how to access the raw text that generated a sax event
> 
> 
> "Jason E. Stewart" <[EMAIL PROTECTED]> writes:
> 
> > Any ideas what to do?
> 
> I finally broke down and read the source code for XMLScanner and
> XMLReader and I'm convinced that without a major rewrite, this is
> not possible.
> 
> Basically, the XMLReader calls readBytes() on the stream to fill up a
> buffer - so curPos() could never help us as the stream is read in
> chunks of the buffer size.
> 
> Then, the XMLReader maintains two internal buffers: one of raw bytes
> and the other of transcoded characters. When the transcoded buffer
> starts running low, it transcodes another bufferful from the raw
> buffer and all the information about how many characters have been
> read so far is thrown away. 
> 
> Also when the raw buffer is running low it reads in more data from the
> stream, and it too throws away all the information about how many
> bytes have been processed so far.
> 
> It would be possible to save this information when the buffers are
> refilled, but it *still* wouldn't give us the info that we
> want. Because the XMLScanner gets all its data from the transcoded
> character buffer, at best we could hope to find out what
> *character* position we are at in the file. But we can only be sure of
> the character <-> byte mapping for fixed-width encodings, which
> won't help us for UTF-16 or UTF-8, but I guess it would work for
> ISO-8859-1 and ASCII.
> 
> So I'm out of ideas.
> jas.
> 
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 
