I've put two zip files on my personal home page that AEA Technology is donating to the Apache project. Neither is anything particularly special, but they would let anyone who wants to dive in and look at the hot spot in XMLAttr::set, or at lexical validation of reals, start with code that works.
The first zip file contains a modified version of XMLAttr.hpp and XMLAttr.cpp that eliminates the delete and new that currently occur when processing every attribute. The donated code allocates an initial chunk of memory for the name, value, etc. of each XMLAttr and only deletes and reallocates when an encountered attribute part is larger than the currently allocated chunk. In that case, it allocates the next multiple of the chunk size that is large enough. Since the vast majority of attribute names and values are fairly short (on the order of 1-32 characters), this typically results in no allocations after the initial population of the XMLAttr pool. In sample files that were heavy with attributes, this increased the speed of SAXCount by 20%. The code has a preprocessor definition in XMLAttr.hpp that lets you switch between the original and the new behavior. Of course, that would have to be removed and the code cleaned up before any type of integration.

The second file is a Visual C++ 6 console application that implements and benchmarks lexical validation and comparison of real values. With slight modification (some booleans to determine whether periods, exponents, NaNs and Infinities are allowed), it could validate all numeric types (real, decimal, integer). If it is necessary to exactly reproduce the effects of IEEE rounding on comparisons, it would also be relatively easy to determine whether rounding would be significant and, in those few cases, defer to a conversion to double followed by a comparison. Benchmarking on VC6 showed that lexical validation could be 20 times faster than using atof() and 5 times faster than using VarR8FromStr(), and that using atof() for numeric validation could take as much time as parsing.

Since schema datatypes are first being implemented in Java, I don't expect this code to do much more than serve as a potential resource for whoever does the implementation for Xerces-J. However, maybe it will give you something to think about.
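For readers who only want the gist of the XMLAttr change, here is a minimal sketch of the allocation scheme described above. It is not the code from the zip file: the class name ChunkedBuffer, its members, and the default chunk size of 32 are made up for illustration, and it uses plain char where the real XMLAttr stores XMLCh.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper (not the donated XMLAttr code): a character buffer that
// is reallocated only when a value no longer fits, and then grows to the next
// multiple of a fixed chunk size.
class ChunkedBuffer {
public:
    explicit ChunkedBuffer(std::size_t chunk = 32)
        : fChunk(chunk), fCapacity(chunk), fData(0), fAllocations(1) {
        assert(chunk > 0);
        fData = new char[chunk];   // the one up-front allocation
        fData[0] = '\0';
    }

    ~ChunkedBuffer() { delete[] fData; }

    // Copy 'len' characters plus a terminator, reusing the existing storage
    // whenever it is large enough.
    void set(const char* src, std::size_t len) {
        const std::size_t needed = len + 1;
        if (needed > fCapacity) {
            // Round up to the next multiple of the chunk size.
            const std::size_t newCap = ((needed + fChunk - 1) / fChunk) * fChunk;
            char* grown = new char[newCap];
            delete[] fData;
            fData = grown;
            fCapacity = newCap;
            ++fAllocations;
        }
        std::memcpy(fData, src, len);
        fData[len] = '\0';
    }

    const char* get() const { return fData; }
    std::size_t capacity() const { return fCapacity; }
    unsigned allocations() const { return fAllocations; }  // for illustration only

private:
    ChunkedBuffer(const ChunkedBuffer&);             // copying not needed here
    ChunkedBuffer& operator=(const ChunkedBuffer&);

    std::size_t fChunk;
    std::size_t fCapacity;
    char*       fData;
    unsigned    fAllocations;
};
```

Because the buffer never shrinks, a pooled object that has once seen a long value keeps the larger chunk, so later short values cost only a copy.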
Here are the links (not guaranteed to be there for any length of time):
http://home.houston.rr.com/curta/XMLAttr.zip
http://home.houston.rr.com/curta/realvalid.zip
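To give a flavor of what lexical validation with a few booleans might look like, here is a rough sketch. It is not the code in realvalid.zip: the names NumericRules and isValidNumber are hypothetical, the special values are assumed to be spelled "NaN", "INF" and "-INF", and surrounding whitespace is assumed to have been stripped already. It only checks the shape of the string and never converts it to a double.

```cpp
#include <cstring>

// Hypothetical switches deciding which numeric forms are acceptable.
struct NumericRules {
    bool allowPeriod;     // "1.5"
    bool allowExponent;   // "1E3"
    bool allowSpecial;    // "NaN", "INF", "-INF" (assumed spellings)
};

inline bool isDigit(char c) { return c >= '0' && c <= '9'; }

// Returns true if the whole string is an optionally signed run of digits,
// optionally containing a period and an exponent when the rules permit them.
bool isValidNumber(const char* s, const NumericRules& r) {
    if (r.allowSpecial &&
        (std::strcmp(s, "NaN") == 0 || std::strcmp(s, "INF") == 0 ||
         std::strcmp(s, "-INF") == 0))
        return true;

    if (*s == '+' || *s == '-') ++s;

    bool sawDigit = false;
    while (isDigit(*s)) { ++s; sawDigit = true; }

    if (*s == '.') {
        if (!r.allowPeriod) return false;
        ++s;
        while (isDigit(*s)) { ++s; sawDigit = true; }
    }
    if (!sawDigit) return false;   // rejects "", "+", "." and similar

    if (*s == 'e' || *s == 'E') {
        if (!r.allowExponent) return false;
        ++s;
        if (*s == '+' || *s == '-') ++s;
        if (!isDigit(*s)) return false;   // exponent needs at least one digit
        while (isDigit(*s)) ++s;
    }
    return *s == '\0';             // nothing may follow the number
}
```

With all three flags set this accepts real values, and with all three cleared it accepts only integers. The single pass over the characters is where the speed advantage over a full conversion would come from.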
