Michael,

Thanks a lot for pointing out the reference. I would like to illustrate our use case to see whether it is a generic requirement that other folks may also have.

We are working on very large XML files containing many records. To access such a file randomly, we would like to build an index of the byte offset and length of each record. java.io.RandomAccessFile provides an API to seek and read a file by bytes easily, but we cannot generate the index with off-the-shelf Java XML parsers. It seems to me that byte offsets are an efficient way to get fast access to very large XML files. It is probably also doable with character offsets, but I do not know of a Java class that provides character-based random access.

By the way, the Java-based XP parser does provide a way to locate the byte offset of a start-element event; oddly, it doesn't provide ways to locate end-element and other events [1]. Since XP is not actively maintained, I would like to hear the opinions of the Xerces experts. I did implement a simple parser so that we could continue our work, but it would be nice if this feature were supported by a well-developed XML parser.

[1] http://www.jclark.com/xml/xp/api/packages.html

Xiaoming

On Mon, 3 Jan 2005, Michael Glavassevich wrote:

> Hello Xiaoming,
>
> In general byte offsets aren't available to the parser. The document
> scanners read from a java.io.Reader so the byte to character decoding is
> being done at a lower level. The parser only sees the decoded characters.
> If what you actually want is the character offset, we made some changes to
> XNI last year (they're in CVS) to expose this information in
> org.apache.xerces.xni.XMLLocator. If you're using DOM Level 3, we've also
> made the character offset available to DOMLocator [1]. You'll get this
> functionality from the latest jars [2].
>
> [1]
> http://www.w3.org/TR/2004/REC-DOM-Level-3-Core-20040407/core.html#Interfaces-DOMLocator
> [2] http://brutus.apache.org/gump/public-jars/xml-xerces2/jars/
>
> Xiaoming Liu <[EMAIL PROTECTED]> wrote on 01/03/2005 06:14:25 PM:
>
> > hi,
> >
> > I am looking for a Java XML parser which supports reading byte offset
> > information during xml parsing, e.g. in '<foo><bar></bar></foo>', the
> > parser can report '<bar>' starts from byte 5; and '</bar>' starts from
> > byte 10 .
> >
> > I went through standard APIs like DOM, SAX, and XMLPull and cannot find
> > related APIs. In Sax, the nearest interface is org.xml.sax.Locator. I also
> > checked Xerces XNI and found the nearest class is
> > org.apache.xerces.xni.XMLLocator. In either class, only line number and
> > column number are reported.
> >
> > However, similar functions are provided in other languages, such as the
> > "XML_GetCurrentByteIndex" of expat parser (C, perl).
> >
> > so my question is whether there is a Java XML Parser reporting byte offset
> > information during parsing, and if not, is there any plan to
> > implement this feature?
> >
> > many thanks,
> > Xiaoming
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
>
> Michael Glavassevich
> XML Parser Development
> IBM Toronto Lab
> E-mail: [EMAIL PROTECTED]
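
For illustration only, here is a minimal sketch of the lookup side of the byte-offset index described at the top of this message. It assumes the index (a byte offset and a byte length per record) has already been produced by some other means, and that the file is UTF-8 encoded. The class name ByteOffsetIndexSketch and the method readRecord are hypothetical names made up for this example; only java.io.RandomAccessFile comes from the discussion itself. The sample document and offsets are the ones from the original question.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical example class; not part of Xerces, XP, or any other parser.
public class ByteOffsetIndexSketch {

    /**
     * Seeks to the given byte offset and decodes exactly `length` bytes.
     * Assumption: the record is UTF-8 encoded and the (offset, length)
     * pair came from an index built elsewhere.
     */
    public static String readRecord(Path file, long offset, int length) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            byte[] buf = new byte[length];
            raf.seek(offset);    // jump straight to the record, no parsing
            raf.readFully(buf);  // throws EOFException if the index entry is out of range
            return new String(buf, StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        // Sample document from the original question.
        Path tmp = Files.createTempFile("records", ".xml");
        Files.write(tmp, "<foo><bar></bar></foo>".getBytes(StandardCharsets.UTF_8));

        // '<bar>' starts at byte 5; '<bar></bar>' spans 11 bytes.
        System.out.println(readRecord(tmp, 5L, 11));

        Files.delete(tmp);
    }
}
```

The sketch also shows why byte offsets rather than character offsets matter here: RandomAccessFile.seek works in bytes, so a character offset would only be usable directly when every character in the file occupies a fixed number of bytes.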
