Re: read byte offset information during xml parsing

Michael Glavassevich 4 Jan 2005 21:10:47 -0000

All of the fields on DOMLocator are allowed to be -1 or null if the 
relevant data is not available. That includes byte offsets, so conforming 
implementations are allowed to return -1 if they don't report this 
information.
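
For illustration only (this is not from the original message), here is a minimal Java sketch of how calling code might guard against the -1 case. It uses the standard org.w3c.dom.DOMLocator interface; the class name LocatorOffsets and the helper fixedLocator are hypothetical and exist only so the example runs on its own.

```java
import org.w3c.dom.DOMLocator;
import org.w3c.dom.Node;

public class LocatorOffsets {
    /** Returns the byte offset as text, or "unavailable" when the locator reports -1. */
    public static String describeByteOffset(DOMLocator locator) {
        int offset = locator.getByteOffset();
        return offset < 0 ? "unavailable" : Integer.toString(offset);
    }

    /** Hypothetical stand-in for a parser-supplied locator; every other field is "not available". */
    public static DOMLocator fixedLocator(final int byteOffset) {
        return new DOMLocator() {
            public int getLineNumber() { return -1; }
            public int getColumnNumber() { return -1; }
            public int getByteOffset() { return byteOffset; }
            public int getUtf16Offset() { return -1; }
            public Node getRelatedNode() { return null; }
            public String getUri() { return null; }
        };
    }

    public static void main(String[] args) {
        // A locator from an implementation that does not track byte offsets.
        System.out.println(describeByteOffset(fixedLocator(-1)));
        // A locator from an implementation that does.
        System.out.println(describeByteOffset(fixedLocator(128)));
    }
}
```

In real code the locator would come from a DOMError passed to a DOMErrorHandler rather than from a hand-written stub.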


Xiaoming Liu <[EMAIL PROTECTED]> wrote on 01/04/2005 03:31:02 PM:

> Thanks, that makes the matter clear.
> 
> I do have last question though, since DOM level 3 supports byte offset in
> DOMLocator [1], does xerces have a plan of supporting DOM level 3?
> 
> [1]
> http://www.w3.org/TR/2004/REC-DOM-Level-3-Core-20040407/core.html#Interfaces-DOMLocator
> 
> Xiaoming
> 
> 
> On Tue, 4 Jan 2005, Andy Clark wrote:
> 
> > Xiaoming Liu wrote:
> > > It seems to me that in order to fast access very large XML files, byte
> > > offset is an efficient way. Probably it's also doable by character offset,
> > > however I didn't know a java class providing character-based random
> > > access.
> >
> > Reporting byte offsets is just not possible given a number
> > of factors. The primary factor being that Xerces does not
> > control the decoding of the source bytes to characters.
> >
> > In order to maintain the proper byte location within the
> > stream, the parser would need to know *exactly* how many
> > bytes were read by the underlying input stream. Since we
> > rely on the decoders present in Java, this just isn't
> > possible.
> >
> > I once had the idea of putting a special byte-counting
> > input stream filter between the underlying stream and the
> > reader that converted the bytes to chars. I thought that
> > I could read one char at a time and then look at the
> > current byte offset to see how many bytes were actually
> > used to encode that single character. But this didn't
> > work.
> >
> > It turned out that the underlying decoders were buffering
> > internally. So even if I asked the reader for a single char,
> > the reader would buffer 1K or 2K of data. This fact makes
> > it impossible to do anything using the default readers to
> > report true byte offsets.
> >
> > > By the way, the java-based XP parser does provide a way to locate
> > > byteoffset of starting element event, oddly, it doesn't provide ways to
> > > locate endElement and other events [1]. Since XP is not actively
> >
> > My guess is that the byte offsets reported by XP are only
> > valid for specific encodings: fixed byte length encodings
> > like 1-byte or 2-byte character encodings (e.g. US-ASCII,
> > ISO-8859-1, UTF-16, UCS4, etc) *and* assuming no Unicode
> > character normalization. Unless, of course, he does
> > the character conversions himself and can keep track of
> > the true byte offsets.
> >
> > If you still need to report byte offsets with Xerces, I
> > think the only way to do it properly is to pre-normalize
> > your docs to a fixed length encoding and then use the
> > Xerces feature that reports character offsets. Then it's
> > just a matter of multiplying the character offset by the
> > char "width" of that encoding. Make sense?
> >
> > --
> > Andy Clark * [EMAIL PROTECTED]
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
> >
> >
> 
> 
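
The buffering problem Andy describes in the quoted text can be reproduced with a small sketch. This is an illustrative reconstruction, not his original code: CountingDemo and CountingInputStream are hypothetical names, and the exact number of bytes pulled in depends on the JDK's decoder buffer size.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class CountingDemo {
    /** Counts every byte that is pulled through this filter. */
    static class CountingInputStream extends FilterInputStream {
        long count = 0;

        CountingInputStream(InputStream in) { super(in); }

        @Override
        public int read() throws IOException {
            int b = super.read();
            if (b >= 0) count++;
            return b;
        }

        @Override
        public int read(byte[] buf, int off, int len) throws IOException {
            int n = super.read(buf, off, len);
            if (n > 0) count += n;
            return n;
        }
    }

    /** Reads a single char through a Reader and reports how many bytes the decoder consumed. */
    public static long bytesConsumedForOneChar(byte[] data) {
        try {
            CountingInputStream counter =
                    new CountingInputStream(new ByteArrayInputStream(data));
            Reader reader = new InputStreamReader(counter, StandardCharsets.UTF_8);
            reader.read(); // ask for exactly one character
            return counter.count;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] doc = "<doc><item>hello</item><item>world</item></doc>"
                .getBytes(StandardCharsets.UTF_8);
        // The first character '<' is one byte in UTF-8, yet the decoder reads ahead.
        System.out.println("bytes consumed for one char: " + bytesConsumedForOneChar(doc));
    }
}
```

The counter reflects what the decoder has buffered, not what the parser has actually consumed, so it cannot serve as a per-character byte offset.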

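The fixed-width workaround suggested at the end of the quoted message can be sketched as follows. The sketch assumes the document has been transcoded to UTF-16BE without a byte order mark, and that the parser's character offset is a zero-based count of Java chars (UTF-16 code units). Both are assumptions, and FixedWidthOffsets is a hypothetical name.

```java
import java.nio.charset.StandardCharsets;

public class FixedWidthOffsets {
    /** Bytes per Java char in UTF-16BE (assumed encoding, no BOM). */
    static final int UTF16_WIDTH = 2;

    /** Converts a zero-based character offset to a byte offset in the transcoded document. */
    public static int byteOffset(int charOffset) {
        return charOffset * UTF16_WIDTH;
    }

    /** Decodes charLen characters starting at the byte offset derived from charOffset. */
    public static String sliceAt(String xml, int charOffset, int charLen) {
        byte[] bytes = xml.getBytes(StandardCharsets.UTF_16BE);
        return new String(bytes, byteOffset(charOffset), charLen * UTF16_WIDTH,
                StandardCharsets.UTF_16BE);
    }

    public static void main(String[] args) {
        // \u00e9 is a non-ASCII character; in UTF-8 it would shift later byte offsets.
        String xml = "<doc><name>caf\u00e9</name><item>hello</item></doc>";
        int charOffset = xml.indexOf("<item>"); // stand-in for a parser-reported offset
        System.out.println("char offset " + charOffset
                + " maps to byte offset " + byteOffset(charOffset));
        System.out.println("bytes at that offset decode to: " + sliceAt(xml, charOffset, 6));
    }
}
```

With the byte offset in hand, something like java.io.RandomAccessFile.seek could jump straight to the element in the transcoded file.
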
Michael Glavassevich
XML Parser Development
IBM Toronto Lab
E-mail: [EMAIL PROTECTED]


