I would appreciate round-tripping support in Xerces. It is really necessary for XML editors/tools -- broken user indentation is annoying.
+1

Regards,
Libor

Alex Rosen wrote:
> A few weeks ago I e-mailed this list, asking about adding round-tripping
> support to Xerces - i.e. the ability to output the exact same XML file as
> was read in, or at least very close to it. In other words, preserving more
> of the non-infoset information that normally gets dropped.
>
> I spent some time working on this, and have a prototype done, which uses
> Augmentations to pass in more information about the "raw text" of the
> original document than Xerces normally gives. An example is the amount of
> whitespace between attributes. Saving this extra information (and using it
> on output) means that if the user puts each attribute on its own line, that
> will be preserved on output, instead of collapsing them back onto one line.
> These sorts of modifications are semantically equivalent, but it really
> annoys users when you reformat their document out from under them.
>
> The particular project that needs this is a dom4j project, so I also created
> a special dom4j reader that takes this extra information that's given by the
> parser and stores it in each dom4j node it creates, and a writer that uses
> this saved information to write out a more accurate version of the output
> document. (This could easily be extended to DOM and JDOM.) I've attached an
> example. Sample.xml is the source file, rt-output.xml is the output using
> the new round-trip-enabled Xerces/dom4j code, and the other two are the
> output using standard Xerces/dom4j (in both standard and pretty-printing
> modes). Not everything is identical, but it's much, much better.
>
> I think it would be nice if this feature were added to Xerces. I think it
> fulfills a significant need, and I don't think it adds any overhead when
> it's not turned on, and probably minimal overhead with it turned on. It
> currently doesn't cover many of the less-used areas of XML (notations, etc.)
> but I think it does a very good job of covering the common cases.
>
> There also happened to be a similar thread going on at the same time as my
> original post, that I'd like to respond to:
>
> http://marc.theaimsgroup.com/?l=xerces-j-dev&m=103029884901546&w=2
>
>>I can understand the cases in which people would like to
>>be able to do this but I also realize what it would take
>>to implement it. ;)
>
> I don't think the implementation is too bad. It's not trivial, but not
> unreasonably complex, I don't think.
>
>>The "limited usefulness" that I was referring to was the
>>fact that reporting character offsets only works if the
>>parsed source is already a character stream. If it's
>>anything else (say a byte stream in UTF8 or Shift_JIS)
>>then the application can't map those offsets back to the
>>source without re-reading the file.
>
> But there's *always* a character stream (Reader). Xerces creates one if it's
> not handed one. The easy way is to have Xerces send the actual text along to
> the user. (The other way is to have the user override createReader() to get
> his hands on the relevant character stream, which turns out to be a little
> ugly, but works fine.) Thus it's always applicable, even when you hand
> Xerces an InputStream. And I think it would be useful to a significant
> number of users.
>
> So is there any chance of this modification making it into Xerces? I'd be
> happy to send a patch once it's cleaned up a bit.
>
> Thanks,
> Alex

--
Libor Kramolis, Software Engineer      | <[EMAIL PROTECTED]>
NetBeans/Sun Microsystems, XML Project | http://xml.netbeans.org/
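
For anyone following the thread, the point in the quoted message that a byte stream always ends up behind a Reader, so character offsets can be mapped back to raw text, can be sketched roughly as below. This is only an illustration using the JDK's java.io classes: `RecordingReader`, `rawText` and `decodeFully` are hypothetical names made up for this sketch, not Xerces API, and the loop in `decodeFully` merely stands in for a parser pulling characters.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/**
 * Hypothetical helper (not part of Xerces): remembers every character that is
 * read through it, so a character offset can later be turned back into the
 * original text. skip() and mark()/reset() are not tracked in this sketch.
 */
class RecordingReader extends FilterReader {
    private final StringBuilder seen = new StringBuilder();

    RecordingReader(Reader in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int c = super.read();
        if (c != -1) {
            seen.append((char) c);
        }
        return c;
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) {
            seen.append(buf, off, n);
        }
        return n;
    }

    /** Raw source text for the character offsets [start, end). */
    String rawText(int start, int end) {
        return seen.substring(start, end);
    }

    /** Wraps a byte stream in a Reader and drains it, as a stand-in for a parser. */
    static RecordingReader decodeFully(InputStream bytes, Charset cs) {
        RecordingReader r = new RecordingReader(new InputStreamReader(bytes, cs));
        char[] buf = new char[1024];
        try {
            while (r.read(buf) != -1) {
                // A real parser would consume buf here.
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return r;
    }

    public static void main(String[] args) {
        // \u00e9 takes two bytes in UTF-8 but is one character, so byte offsets
        // and character offsets differ from that point on.
        String xml = "<\u00e9  x=\"1\"\n   y=\"2\"/>";
        byte[] utf8 = xml.getBytes(StandardCharsets.UTF_8);
        RecordingReader r = decodeFully(new ByteArrayInputStream(utf8), StandardCharsets.UTF_8);

        System.out.println("bytes: " + utf8.length + ", chars: " + xml.length());
        // Whitespace between the first attribute and the second (offsets 9..13).
        System.out.println("gap length: " + r.rawText(9, 13).length());
    }
}
```

Since the recording happens after decoding, the same offsets work whether the caller supplied a Reader or an InputStream, which seems to be the property the quoted message relies on.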
