Actually, the lexical comparison routines are more stringent than atof(). I gave the existing
conversions every possible benefit (feeding them only valid data, not forcing them to convert from
Unicode, etc.), so the benchmarks understate the advantage of lexical comparison. Also, atof()
could not be depended on for validation, since it may accept characters that are prohibited by XML
Schema's lexical pattern.
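To make that concrete, here is a rough sketch of the kind of lexical check I mean. This is not the
routine I benchmarked (that one works on Unicode XMLCh strings and also handles INF, -INF, and
NaN); it is an ASCII-only illustration, with a hypothetical helper name, of why a schema-aware
check is stricter than atof():

    #include <ctype.h>

    // Hypothetical helper, not the benchmarked routine.  Accepts only strings
    // matching roughly the XML Schema float/double lexical pattern:
    //   ('+'|'-')? digits? ('.' digits?)? (('e'|'E') ('+'|'-')? digits)?
    // with at least one digit in the mantissa and nothing trailing.
    // atof()/strtod(), by contrast, skip leading whitespace, stop quietly at
    // trailing characters, and (depending on the C library) may accept other
    // forms that the schema lexical space prohibits.
    bool isSchemaFloatLexical(const char* s)
    {
        const char* p = s;
        if (*p == '+' || *p == '-')                   // optional sign
            ++p;
        const char* mantissa = p;
        while (isdigit((unsigned char)*p)) ++p;       // integer digits
        if (*p == '.') {                              // optional fraction
            ++p;
            while (isdigit((unsigned char)*p)) ++p;
        }
        if (p == mantissa ||                          // no mantissa at all
            (p == mantissa + 1 && *mantissa == '.'))  // just a lone '.'
            return false;
        if (*p == 'e' || *p == 'E') {                 // optional exponent
            ++p;
            if (*p == '+' || *p == '-') ++p;
            if (!isdigit((unsigned char)*p)) return false;
            while (isdigit((unsigned char)*p)) ++p;
        }
        return *p == '\0';                            // reject trailing junk
    }

For example, this rejects " 1.5", "1.5 junk", and "1,5", all of which atof() will quietly turn
into a number (or part of one).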
Definitely, when you access a particular piece of information (through a type-aware DOM, etc.),
you will be forced to do a conversion to a platform-supported floating point type. However, that
conversion could be done on demand, since there may be situations where you are only accessing a
tiny fraction of the content (or are not interested in the floating point values at all).
Conversion would be required for access, but not for validation.

The situations where a lexical comparison against min/max bounds will differ from the results of a
conversion followed by a comparison are going to be vanishingly rare, and they are detectable. So
you could still use lexical comparison for 99.99999999% of boundary evaluations and fall back to a
conversion-based comparison in the 0.0000000001% where a) you need 6 or 15 significant digits to
find a difference, b) the exponents differ by one and one value is 1.00000xxx while the other is
9.99999xxxx, or c) one value is zero and the other is less than 1e-38 or 1e-308. I'd prefer that
the schema folks say that those scenarios are not significant; however, you could replicate the
conversion-based results in all cases by catching those situations where rounding or underflow
could affect the outcome. (A small illustration of the rounding case is appended at the end of
this message.)

It is definitely an apples-to-oranges comparison; however, for validation you only need a roughly
spherical fruit, so it makes sense to use the much cheaper one. Basically, my big concern is making
validation fast enough that it doesn't get disabled entirely. The very rare cases where lexical
validation doesn't precisely match conversion and comparison are not significant enough to justify
doubling the time it takes to load data.

-----Original Message-----
From: Mike Pogue [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, February 15, 2000 8:17 PM
To: [EMAIL PROTECTED]
Subject: Re: Numeric comparison benchmarks (C++)

I suspect that converting to a double/float type also does *other* validation that a lexical
min/max routine wouldn't, such as detecting whether any invalid-in-this-context characters were
used.

So, a comparison of the performance of "lexical min/max" vs. "the full-blown try to convert it to
double/float" is probably an apples-to-oranges comparison.

My experience with customer feedback so far is that they do not want lexical types. They want
direct mappings to built-in types (double, float, etc.), and accessors for them.

Mike

"Arnold, Curt" wrote:
>
> Some background:
>
> I've been lobbying the XML Schema working group to reestablish the "real"
> datatype that had been in Datatypes until the 17 Dec 1999 draft. Prior to
> the 17 Dec draft, there had been a "real" datatype that was an arbitrary
> range and precision floating point value. There was a minAbsoluteValue
> facet that I didn't like (for details see
> http://lists.w3.org/Archives/Public/www-xml-schema-comments/1999OctDec/0024.html).
>
> In the 17 Dec draft, the minAbsoluteValue facet disappeared, the double and
> float datatypes were added as primitive datatypes (corresponding to IEEE),
> and the "real" datatype was removed. I don't have any problems with the
> first two, but I still think that there is substantial justification for an
> arbitrary range and precision floating point type.
>
> One of the points I tried to make (for others see
> http://lists.w3.org/Archives/Public/www-xml-schema-comments/2000JanMar/0133.html)
> was that a lexical-based comparison for evaluation of min and max
> constraints could be substantially faster than conversion to double and
> then IEEE floating point comparison.
> And that, unless you really desired to mimic the rounding effects of IEEE,
> you should not have to pay the penalty. Basically, a lexical-based
> comparison would see 1.00000000000000000000000001 as greater than one and
> would accept it if you had a <minExclusive value="1.0"/>. An IEEE-based
> comparison would have to realize that 1.00000000000000000001 rounds to 1.0
> and that 1.0 is not greater than 1.0.
>
> Unfortunately, I could not provide benchmarks to quantify the difference.
> So today, I finally did some benchmarks. Hopefully, these can be useful
> independent of the schema debate. All times are reported in GetTickCount()
> values for a Pentium II 400 running Windows 2000 Professional, with code
> compiled using VC6 with the default Release settings.
>
> First was a timing of 600000 floating point comparisons against 0 (0 was
> converted or parsed outside the loop). All values were within the range and
> precision of double, and no NaNs, Infinities, or illegal lexicals were in
> the test set.
>
> a) 600000 Unicode strings compared using a home-grown lexical comparison:
>    300 ticks
> b) Equivalent ASCII strings converted using atof() and compared: 5308 ticks
> c) Equivalent ASCII strings converted using sscanf() and compared: 6379 ticks
> d) Unicode strings converted using VarR8FromStr (COM Automation support
>    routine) and compared: 1552 ticks
>
> I wasn't successful in the time I had allotted to benchmark conversion
> using std::basic_istream<XMLCh>.
>
> So conversion was between 5 and 20 times slower than a lexically-based
> comparison. VarR8FromStr is much more efficient than the C RTL's atof()
> function, but at the cost of platform independence.
>
> atof() or sscanf() will also not give you consistent results if C++'s
> double is not IEEE on a particular platform, whereas the lexical comparison
> should give identical results on all platforms.
>
> To put this in perspective relative to parsing time, I benchmarked reading
> a data file for one of our programs (262000 numbers compared against a
> preconverted 0) in various modes.
>
> Expat, no numeric comparison: 2610
> Expat, lexical comparison: 2804
> Expat, VarR8FromStr: 3525
> Expat, atof comparison: 5384
>
> Xerces (1_1_0_d05), non-validating, no comparison: 6679
> Xerces, validating, no comparison: 6789
> Xerces, non-validating, lexical range: 8800
> Xerces, validating, lexical: 9000
> Xerces, validating, VarR8: 9900
> Xerces, non-validating, VarR8: 9850
>
> So, using conversion for bounds checking can almost double the parse time
> for a numerically intensive file.
>
> I'd like to see:
>
> 1) Putting a real type back in, with float, double, and decimal derived
>    from real.
> 2) Using lexical validation for real.
> 3) Using lexical validation as a first (and typically only) phase for
>    validation of the float and double types. From the lexical validation,
>    you can detect whether you are potentially in an area where rounding
>    could give you different answers. Since rounding should affect only a
>    tiny fraction of comparisons, a conversion-based comparison would rarely
>    ever be needed. Alternatively, min and max constraints could be
>    explicitly stated in the schema draft to be performed lexically.
> 4) Not depending on the C RTL for conversion of doubles in type-aware DOMs,
>    etc.
>
> I'll make my lexical comparison code available to the Apache project, if
> anyone is interested.
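As promised above, here is a small illustration of the rounding case, i.e. the kind of situation
where a converted comparison and a lexical comparison can disagree. It assumes IEEE 754 doubles
and uses atof() for the conversion; the "lexical" check is hand-rolled for just this pair of
strings and is not the general routine I benchmarked:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main()
    {
        const char* lexical      = "1.00000000000000000000000001";
        const char* minExclusive = "1.0";

        // Converted comparison: the extra digits are lost in rounding,
        // so converted == bound == 1.0 and the minExclusive test fails.
        double converted = atof(lexical);
        double bound     = atof(minExclusive);
        printf("converted > bound: %s\n", converted > bound ? "yes" : "no");

        // Lexical comparison: the candidate extends the bound's digits and
        // has a nonzero digit past them, so it is lexically greater.
        // (Hand-rolled for this specific pair; not a general comparator.)
        size_t blen = strlen(minExclusive);
        bool lexGreater =
            strncmp(lexical, minExclusive, blen) == 0 &&
            strspn(lexical + blen, "0") < strlen(lexical) - blen;
        printf("lexically > bound: %s\n", lexGreater ? "yes" : "no");
        return 0;
    }

On an IEEE platform the first line prints "no" and the second prints "yes", which is exactly the
divergence described above; catching cases like this (along with the exponent and underflow cases)
is what would let a lexical validator fall back to conversion only when it actually matters.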