Some background:

I've been lobbying the XML Schema working group to reestablish the "real"
datatype that had been in Datatypes until the 17 Dec 1999 draft.  Prior to the
17 Dec draft, there had been a "real" datatype that was an arbitrary-range,
arbitrary-precision floating point value.  There was a minAbsoluteValue facet
that I didn't like (for details, see
http://lists.w3.org/Archives/Public/www-xml-schema-comments/1999OctDec/0024.html).

In the 17 Dec draft, the minAbsoluteValue facet disappeared, the double and
float datatypes were added as primitive datatypes (corresponding to IEEE), and
the "real" datatype was removed.  I don't have any problems with the first two
changes, but I still think there is substantial justification for an
arbitrary-range, arbitrary-precision floating point type.

One of the points I tried to make (for others, see
http://lists.w3.org/Archives/Public/www-xml-schema-comments/2000JanMar/0133.html)
was that a lexical-based comparison for evaluating min and max constraints
could be substantially faster than conversion to double followed by an IEEE
floating point comparison, and that unless you really desire to mimic the
rounding effects of IEEE, you should not have to pay that penalty.  Basically,
a lexical-based comparison would see 1.00000000000000000000000001 as greater
than one and would accept it against a <minExclusive value="1.0"/>.  An
IEEE-based comparison would have to realize that 1.00000000000000000000000001
rounds to 1.0 and that 1.0 is not greater than 1.0.
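
To make the idea concrete, here is a minimal sketch of such a lexical
comparison.  It assumes plain, non-negative decimal literals (no sign,
exponent, NaN or INF handling) and uses std::string for brevity where the
real code works on Unicode (XMLCh) strings; it is not the code I benchmarked.

#include <cassert>
#include <string>

// Compare two non-negative decimal literals (e.g. "1.0", "1.000...01")
// without converting them to double.  Returns <0, 0 or >0 like strcmp.
// Sketch only: no sign, exponent, NaN/INF or whitespace handling.
int lexicalCompare(const std::string& a, const std::string& b)
{
    // Split into integer and fraction parts.
    std::string::size_type pa = a.find('.'), pb = b.find('.');
    std::string ia = a.substr(0, pa);
    std::string ib = b.substr(0, pb);
    std::string fa = (pa == std::string::npos) ? "" : a.substr(pa + 1);
    std::string fb = (pb == std::string::npos) ? "" : b.substr(pb + 1);

    // Drop leading zeros so "01.5" and "1.5" compare equal.
    ia.erase(0, ia.find_first_not_of('0'));
    ib.erase(0, ib.find_first_not_of('0'));

    // A longer integer part means a larger value.
    if (ia.size() != ib.size())
        return ia.size() < ib.size() ? -1 : 1;
    if (int c = ia.compare(ib))
        return c;

    // Integer parts equal: compare fraction digits, padding the shorter
    // fraction with '0' so "1.0" and "1.00" compare equal.
    std::string::size_type n = fa.size() > fb.size() ? fa.size() : fb.size();
    for (std::string::size_type i = 0; i < n; ++i) {
        char ca = i < fa.size() ? fa[i] : '0';
        char cb = i < fb.size() ? fb[i] : '0';
        if (ca != cb)
            return ca < cb ? -1 : 1;
    }
    return 0;
}

int main()
{
    // 1.00000000000000000000000001 is lexically greater than 1.0, even
    // though both round to the same IEEE double.
    assert(lexicalCompare("1.00000000000000000000000001", "1.0") > 0);
    assert(lexicalCompare("1.0", "1.00") == 0);
    return 0;
}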

Unfortunately, I could not provide benchmarks to quantify the difference.  So
today I finally ran some benchmarks.  Hopefully these can be useful
independent of the schema debate.  All times are reported as GetTickCount()
values for a Pentium II 400 running Windows 2000 Professional, with code
compiled using VC6 with the default Release settings.

First was a timing of 600000 floating point comparisons against 0 (the 0 was
converted or parsed outside the loop).  All values were within the range and
precision of double, and no NaNs, Infinities or illegal lexical values were in
the test set.

a) 600000 Unicode strings compared using a home-grown lexical comparison: 300
ticks
b) Equivalent ASCII strings converted using atof() and compared: 5308 ticks
c) Equivalent ASCII strings converted using sscanf() and compared: 6379 ticks
d) Unicode strings converted using VarR8FromStr (COM Automation support
routine) and compared: 1552 ticks

In the time I had allotted, I was not able to benchmark conversion using
std::basic_istream<XMLCh>.
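
For reference, the timing loops were all of roughly this shape.  This is a
simplified sketch, not the actual harness: timeComparisons() is a hypothetical
driver, it reuses the lexicalCompare() sketch above, and GetTickCount() and
VarR8FromStr tie it to Windows just as in the real measurements.

#include <cstdio>    // printf()
#include <cstdlib>   // atof()
#include <string>
#include <vector>
#include <windows.h> // GetTickCount(); Windows-specific, as is VarR8FromStr

int lexicalCompare(const std::string& a, const std::string& b); // sketch above

// Hypothetical driver: time comparisons against a preconverted 0.
void timeComparisons(const std::vector<std::string>& values)
{
    const double zero = 0.0;              // converted outside the loop
    const std::string zeroLexical("0");   // lexical form of the bound
    volatile int accepted = 0;            // keep the optimizer honest

    DWORD start = GetTickCount();
    for (std::size_t i = 0; i < values.size(); ++i) {
        // Conversion-based check: parse, then compare as IEEE doubles.
        // The VarR8FromStr variant replaces atof() with the COM Automation
        // routine, which takes a wide string, an LCID, flags and a double*.
        if (atof(values[i].c_str()) > zero)
            ++accepted;
    }
    DWORD conversionTicks = GetTickCount() - start;

    start = GetTickCount();
    for (std::size_t i = 0; i < values.size(); ++i) {
        // Lexical check: compare the lexical forms directly, no conversion.
        if (lexicalCompare(values[i], zeroLexical) > 0)
            ++accepted;
    }
    DWORD lexicalTicks = GetTickCount() - start;

    printf("conversion: %lu ticks, lexical: %lu ticks\n",
           (unsigned long)conversionTicks, (unsigned long)lexicalTicks);
}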

So conversion was between 5 and 20 times slower than a lexically-based
comparison.  VarR8FromStr is much more efficient than the C RTL's atof()
function, but at the cost of platform independence.

atof() and sscanf() will also not give consistent results on platforms where
C++'s double is not IEEE, whereas the lexical comparison should give identical
results on all platforms.

To put this in perspective relative to parsing time, I benchmarked reading a
data file from one of our programs (262000 numbers compared against a
preconverted 0) in various modes.

Expat, no numeric comparison: 2610
Expat, lexical comparison: 2804
Expat, VarR8FromStr comparison: 3525
Expat, atof comparison: 5384

Xerces (1_1_0_d05), non-validating, no comparison: 6679
Xerces, validating, no comparison: 6789
Xerces, non-validating, lexical comparison: 8800
Xerces, validating, lexical comparison: 9000
Xerces, validating, VarR8FromStr comparison: 9900
Xerces, non-validating, VarR8FromStr comparison: 9850

So, using conversion for bounds checking can almost double the parse time for
a numerically intensive file.

I'd like to see:

1) A real datatype put back in, with float, double and decimal derived from
real.
2) Lexical validation used for real.
3) Lexical validation used as a first (and typically only) phase for
validation of the float and double types.  From the lexical validation you can
detect whether you are potentially in an area where rounding could give you
different answers.  Since rounding should affect only a tiny fraction of
comparisons, a conversion-based comparison would rarely ever be needed (a
rough sketch of this two-phase approach follows this list).  Alternatively,
min and max constraints could be explicitly stated in the schema draft to be
evaluated lexically.
4) No dependence on the C RTL for conversion of double for type-aware DOMs,
etc.
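
Here is the rough sketch of the two-phase check described in point 3, for a
minExclusive facet on the double type.  The names satisfiesMinExclusive() and
significantDigits() are hypothetical, it reuses the lexicalCompare() sketch
above, and it assumes non-negative, in-range lexical values; the point is only
that a simple DBL_DIG test on the lexical forms is enough to spot the rare
cases where IEEE rounding could change the answer.

#include <cfloat>   // DBL_DIG
#include <cstdlib>  // strtod()
#include <string>

int lexicalCompare(const std::string& a, const std::string& b); // sketch above

// Hypothetical helper: significant decimal digits in a plain decimal
// literal (no sign, exponent, NaN or INF), i.e. the digits from the first
// through the last non-zero digit.
static int significantDigits(const std::string& s)
{
    std::string::size_type first = s.find_first_of("123456789");
    if (first == std::string::npos)
        return 0;                            // the value is zero
    std::string::size_type last = s.find_last_of("123456789");
    int count = 0;
    for (std::string::size_type i = first; i <= last; ++i)
        if (s[i] != '.')
            ++count;
    return count;
}

// Sketch of a two-phase <minExclusive value="..."/> check for double,
// assuming non-negative, in-range lexical values.
bool satisfiesMinExclusive(const std::string& value, const std::string& facet)
{
    // Phase 1: lexical.  Correctly rounded decimal-to-double conversion is
    // monotonic, so a value that is lexically <= the facet value can never
    // become greater after rounding.
    if (lexicalCompare(value, facet) <= 0)
        return false;

    // A value that is lexically greater is also greater after rounding,
    // unless rounding could collapse both sides onto the same double.  Any
    // two distinct decimals of at most DBL_DIG significant digits round to
    // distinct doubles, so in that (common) case we are done.
    if (significantDigits(value) <= DBL_DIG &&
        significantDigits(facet) <= DBL_DIG)
        return true;

    // Phase 2 (rare): one side carries more digits than a double can
    // distinguish, so mimic IEEE rounding by converting and comparing.
    return strtod(value.c_str(), 0) > strtod(facet.c_str(), 0);
}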

I'll make my lexical comparison code available to the Apache project, if
anyone is interested.
