All of the fields on DOMLocator are allowed to be -1 or null if the relevant data is not available. That includes byte offsets, so conforming implementations are allowed to return -1 if they don't report this information.
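
As a rough illustration of what that means for calling code, here is a minimal sketch that treats a negative byte offset as "not reported" instead of as a real position. `org.w3c.dom.DOMLocator` and its `getByteOffset()` method come from the DOM Level 3 Core Java binding; the class name `LocatorOffsets` and the helper `describeByteOffset` are made up for this example.

```java
import org.w3c.dom.DOMLocator;

public final class LocatorOffsets {
    private LocatorOffsets() {}

    /**
     * Hypothetical helper: returns the byte offset as text, or
     * "unavailable" when the implementation reports -1.
     */
    public static String describeByteOffset(DOMLocator locator) {
        int offset = locator.getByteOffset();
        // A conforming implementation may return -1 when it does not
        // track byte offsets, so never use the raw value unchecked.
        return offset < 0 ? "unavailable" : Integer.toString(offset);
    }
}
```

A caller would typically get the locator from `DOMError.getLocation()` inside a `DOMErrorHandler` and fall back to line and column numbers when the byte offset is unavailable.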

Xiaoming Liu <[EMAIL PROTECTED]> wrote on 01/04/2005 03:31:02 PM:

> Thanks, that makes the matter clear.
>
> I do have last question though, since DOM level 3 supports byte offset in
> DOMLocator [1], does xerces have a plan of supporting DOM level 3?
>
> [1]
> http://www.w3.org/TR/2004/REC-DOM-Level-3-Core-20040407/core.html#Interfaces-DOMLocator
>
> Xiaoming
>
>
> On Tue, 4 Jan 2005, Andy Clark wrote:
>
> > Xiaoming Liu wrote:
> > > It seems to me that in order to fast access very large XML files, byte
> > > offset is an efficient way. Probably it's also doable by character offset,
> > > however I didn't know a java class providing character-based random
> > > access.
> >
> > Reporting byte offsets is just not possible given a number
> > of factors. The primary factor being that Xerces does not
> > control the decoding of the source bytes to characters.
> >
> > In order to maintain the proper byte location within the
> > stream, the parser would need to know *exactly* how many
> > bytes were read by the underlying input stream. Since we
> > rely on the decoders present in Java, this just isn't
> > possible.
> >
> > I once had the idea of putting a special byte-counting
> > input stream filter between the underlying stream and the
> > reader that converted the bytes to chars. I thought that
> > I could read one char at a time and then look at the
> > current byte offset to see how many bytes were actually
> > used to encode that single character. But this didn't
> > work.
> >
> > It turned out that the underlying decoders were buffering
> > internally. So even if I asked the reader for a single char,
> > the reader would buffer 1K or 2K of data. This fact makes
> > it impossible to do anything using the default readers to
> > report true byte offsets.
> >
> > > By the way, the java-based XP parser does provide a way to locate
> > > byteoffset of starting element event, oddly, it doesn't provide ways to
> > > locate endElement and other events [1].
> > > Since XP is not actively
> >
> > My guess is that the byte offsets reported by XP are only
> > valid for specific encodings: fixed byte length encodings
> > like 1-byte or 2-byte character encodings (e.g. US-ASCII,
> > ISO-8859-1, UTF-16, UCS4, etc) *and* assuming no Unicode
> > character normalization. Unless, of course, that he does
> > the character conversions himself and can keep track of
> > the true byte offsets.
> >
> > If you still need to report byte offsets with Xerces, I
> > think the only way to do it properly is to pre-normalize
> > your docs to a fixed length encoding and then use the
> > Xerces feature that reports character offsets. Then it's
> > just a matter of multiplying the character offset by the
> > char "width" of that encoding. Make sense?
> >
> > --
> > Andy Clark * [EMAIL PROTECTED]

Michael Glavassevich
XML Parser Development
IBM Toronto Lab
E-mail: [EMAIL PROTECTED]

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
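
The two points in Andy Clark's quoted message can be sketched in a few lines of Java. This is only an illustration, not Xerces code: `ByteOffsetSketch`, `CountingInputStream`, `bytesPulledForOneChar` and `byteOffset` are hypothetical names, and the `bomLength` parameter is an added assumption for documents that start with a byte order mark. The first method shows why a byte-counting filter over-reports (the standard `InputStreamReader` reads ahead), and the second shows the suggested workaround of multiplying a character offset by the width of a fixed-length encoding.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public final class ByteOffsetSketch {
    private ByteOffsetSketch() {}

    /** Hypothetical filter that counts the bytes handed to its consumer. */
    static final class CountingInputStream extends FilterInputStream {
        long count;

        CountingInputStream(InputStream in) {
            super(in);
        }

        @Override
        public int read() throws IOException {
            int b = super.read();
            if (b >= 0) {
                count++;
            }
            return b;
        }

        @Override
        public int read(byte[] buf, int off, int len) throws IOException {
            int n = super.read(buf, off, len);
            if (n > 0) {
                count += n;
            }
            return n;
        }
    }

    /** Reads a single char and reports how many bytes the reader pulled to do so. */
    public static long bytesPulledForOneChar(byte[] data, Charset charset) throws IOException {
        CountingInputStream counter = new CountingInputStream(new ByteArrayInputStream(data));
        try (Reader reader = new InputStreamReader(counter, charset)) {
            reader.read(); // ask for one char only
            return counter.count; // usually far more than one char's worth
        }
    }

    /** Byte offset for a fixed-width encoding: optional BOM plus chars times width. */
    public static long byteOffset(long charOffset, int bytesPerChar, int bomLength) {
        return bomLength + charOffset * bytesPerChar;
    }

    public static void main(String[] args) throws IOException {
        byte[] doc = "<root><child/></root>".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println("bytes pulled for one char: "
                + bytesPulledForOneChar(doc, StandardCharsets.ISO_8859_1));
        System.out.println("byte offset of char 6 in UTF-16BE: " + byteOffset(6, 2, 0));
    }
}
```

The multiplication only holds when every character offset unit occupies the same number of bytes, which is why the document has to be converted to a fixed-length encoding first.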
