I suspect that converting to a double/float type also does *other* validation
that a lexical min/max routine wouldn't, such as detecting whether any
invalid-in-this-context characters were used.

So, a comparison of the performance of "lexical min/max" vs. "the full-blown
attempt to convert it to double/float" is probably an apples-to-oranges
comparison.

My experience with customer feedback so far is that they do not want lexical
types.  They want direct mappings to built-in types (double, float, etc.), and
accessors for them.

Mike

"Arnold, Curt" wrote:
> 
> Some background:
> 
> I've been lobbying the XML Schema working group to reestablish the "real"
> datatype that had been in Datatypes until the 17 Dec 1999 draft.  Prior to
> the 17 Dec draft, there had been a "real" datatype that was an arbitrary
> range and precision floating point value.  There was a minAbsoluteValue
> facet that I didn't like (for details see
> http://lists.w3.org/Archives/Public/www-xml-schema-comments/1999OctDec/0024.html).
> 
> In the 17 Dec draft, the minAbsoluteValue facet disappeared, the double and
> float datatypes were added as primitive datatypes (corresponding to IEEE),
> and the "real" datatype was removed.  I don't have any problems with the
> first two, but I still think that there is substantial justification for an
> arbitrary range and precision floating point type.
> 
> One of the points I tried to make (for others, see
> http://lists.w3.org/Archives/Public/www-xml-schema-comments/2000JanMar/0133.html)
> was that a lexical-based comparison for evaluation of min and max
> constraints could be substantially faster than conversion to double followed
> by an IEEE floating point comparison, and that unless you really desired to
> mimic the rounding effects of IEEE, you should not have to pay that penalty.
> Basically, a lexical-based comparison would see 1.00000000000000000000000001
> as greater than one and would accept it if you had a
> <minExclusive value="1.0"/>.  An IEEE-based comparison would have to realize
> that 1.00000000000000000001 rounds to 1.0 and that 1.0 is not greater than
> 1.0.
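
A quick illustration of that rounding point (rough throwaway code of my own,
not part of the original benchmark; it assumes IEEE doubles and the C library's
strtod):

    #include <cstdio>
    #include <cstdlib>
    #include <cstring>

    int main()
    {
        // Within double's range but beyond its precision, so strtod
        // rounds it to exactly 1.0.
        const char* value = "1.00000000000000000001";
        double d = std::strtod(value, 0);

        // IEEE-based check: d == 1.0, so <minExclusive value="1.0"/> rejects it.
        std::printf("converts to exactly 1.0? %s\n", d == 1.0 ? "yes" : "no");

        // A lexical check sees the extra non-zero digit after "1.0" and
        // treats the value as greater than 1.0, so it would be accepted.
        std::printf("lexically identical to \"1.0\"? %s\n",
                    std::strcmp(value, "1.0") == 0 ? "yes" : "no");
        return 0;
    }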
> 
> Unfortunately, I could not provide benchmarks to quantify the difference.  So
> today, I finally did some benchmarks.  Hopefully, these can be useful
> independent of the schema debate.  All times are reported in GetTickCount()
> values for a Pentium II 400 running Windows 2000 Professional, with code
> compiled using VC6 with the default Release settings.
> 
> First was a timing of 600000 floating point comparisons against 0 (the 0 was
> converted or parsed outside the loop).  All values were within the range and
> precision of double, and no NaNs, Infinities or illegal lexicals were in the
> test set.
> 
> a) 600000 Unicode strings compared using a home-grown lexical comparison: 300 ticks
> b) Equivalent ASCII strings converted using atof() and compared: 5308 ticks
> c) Equivalent ASCII strings converted using sscanf() and compared: 6379 ticks
> d) Unicode strings converted using VarR8FromStr (a COM Automation support routine): 1552 ticks
> 
> In the time I had allotted, I wasn't able to benchmark conversion using
> std::basic_istream<XMLCh>.
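
The timing loops were presumably of roughly the following shape; this is my own
reconstruction rather than the actual benchmark code, with placeholder data and
the atof() variant shown:

    #include <windows.h>   // GetTickCount
    #include <cstdio>
    #include <cstdlib>
    #include <string>
    #include <vector>

    int main()
    {
        // Placeholder data set; the real benchmark used 600000 distinct values.
        std::vector<std::string> values(600000, "123.456");
        const double bound = 0.0;           // converted outside the loop
        long above = 0;

        DWORD start = GetTickCount();
        for (size_t i = 0; i < values.size(); ++i)
        {
            // Variant (b): convert with atof(), then compare against 0.
            if (std::atof(values[i].c_str()) > bound)
                ++above;
        }
        DWORD elapsed = GetTickCount() - start;

        std::printf("%ld above bound, %lu ticks\n", above, (unsigned long)elapsed);
        return 0;
    }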
> 
> So conversion was between 5 and 20 times slower than a lexically-based
> comparison.  VarR8FromStr is much more efficient than the C RTL's atof()
> function, but at the cost of platform independence.
> 
> atof() or sscanf() will also not give you consistent results across
> platforms if C++'s double is not IEEE on a particular platform, whereas the
> lexical comparison should give identical results on all platforms.
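
For concreteness, here is a simplified sketch of that kind of digit-string
comparison (my own illustration, not the home-grown routine benchmarked above;
it handles only plain optionally-signed decimals, with no exponents, NaN or
INF):

    #include <string>

    // Split an optionally-signed plain decimal ("-12.50") into sign, integer
    // digits, and fraction digits, normalizing away leading/trailing zeros so
    // "01.50" and "1.5" compare equal.
    static void splitLexical(const std::string& s, bool& negative,
                             std::string& intPart, std::string& fracPart)
    {
        std::string::size_type pos = 0;
        negative = !s.empty() && s[0] == '-';
        if (!s.empty() && (s[0] == '-' || s[0] == '+'))
            ++pos;

        std::string::size_type dot = s.find('.', pos);
        intPart  = (dot == std::string::npos) ? s.substr(pos)
                                              : s.substr(pos, dot - pos);
        fracPart = (dot == std::string::npos) ? std::string() : s.substr(dot + 1);

        std::string::size_type firstNonZero = intPart.find_first_not_of('0');
        intPart = (firstNonZero == std::string::npos) ? std::string()
                                                      : intPart.substr(firstNonZero);
        std::string::size_type lastNonZero = fracPart.find_last_not_of('0');
        fracPart = (lastNonZero == std::string::npos) ? std::string()
                                                      : fracPart.substr(0, lastNonZero + 1);
        if (intPart.empty() && fracPart.empty())
            negative = false;                  // treat "-0" and "+0" as "0"
    }

    // Numeric-order comparison done purely on the digit strings: returns a
    // negative, zero, or positive value like strcmp.
    int lexicalCompare(const std::string& a, const std::string& b)
    {
        bool negA, negB;
        std::string intA, fracA, intB, fracB;
        splitLexical(a, negA, intA, fracA);
        splitLexical(b, negB, intB, fracB);

        if (negA != negB)
            return negA ? -1 : 1;
        int sign = negA ? -1 : 1;              // both negative flips the ordering

        // A longer normalized integer part means a larger magnitude.
        if (intA.size() != intB.size())
            return intA.size() < intB.size() ? -sign : sign;
        int c = intA.compare(intB);
        if (c != 0)
            return c < 0 ? -sign : sign;

        // Equal integer parts: with trailing zeros stripped, plain
        // lexicographic order on the fractions matches numeric order.
        c = fracA.compare(fracB);
        if (c != 0)
            return c < 0 ? -sign : sign;
        return 0;
    }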
> 
> To put this in perspective relative to parsing time, I benchmarked reading a
> data file for one of our programs (262000 numbers compared against a
> preconverted 0) in various modes.
> 
> Expat, no numeric comparison: 2610
> Expat, lexical comparison: 2804
> Expat, VarR8FromStr: 3525
> Expat, atof comparison: 5384
> 
> Xerces (1_1_0_d05), non-validating, no comparison: 6679
> Xerces, validating, no comparison: 6789
> Xerces, non-validating, lexical range: 8800
> Xerces, validating, lexical: 9000
> Xerces, validating, VarR8: 9900
> Xerces, non-validating, VarR8: 9850
> 
> So, using conversion for bounds checking can almost double the parse time
> for a numerically intensive file.
> 
> I'd like to see:
> 
> 1) Put a "real" type back in, and make float, double and decimal derived
> from real.
> 2) Use lexical validation for real.
> 3) Use lexical validation as a first (and typically only) phase for
> validation of the float and double types.  From the lexical validation, you
> can detect whether you are potentially in an area where rounding could give
> you different answers.  Since rounding should affect only a tiny fraction of
> comparisons, a conversion-based comparison would rarely ever be needed (see
> the sketch following this list).  Alternatively, min and max constraints
> could be explicitly stated in the schema draft to be performed lexically.
> 4) Don't depend on the C RTL for conversion of doubles for type-aware DOMs,
> etc.
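
To make point 3 concrete, here is a rough sketch of such a two-phase check.
It is my own illustration, not anything from Curt or Xerces; it reuses the
lexicalCompare() sketch above, and the 17-character shared-prefix test is a
deliberately crude stand-in for a real borderline detector (it misses, for
example, values that straddle the bound by less than half an ulp):

    #include <cstdlib>
    #include <string>

    // From the earlier lexical-comparison sketch.
    int lexicalCompare(const std::string& a, const std::string& b);

    // Crude borderline test: if the two lexicals share a prefix longer than
    // double's ~17 significant digits, the IEEE-rounded values may compare
    // equal even though the lexicals differ.
    static bool roundingCouldMatter(const std::string& value, const std::string& facet)
    {
        std::string::size_type i = 0;
        while (i < value.size() && i < facet.size() && value[i] == facet[i])
            ++i;
        return i > 17;
    }

    bool satisfiesMinExclusive(const std::string& value, const std::string& facet)
    {
        if (lexicalCompare(value, facet) <= 0)
            return false;                      // lexically at or below the bound

        if (!roundingCouldMatter(value, facet))
            return true;                       // clearly above: no conversion needed

        // Rare borderline case: mimic IEEE rounding with a real conversion.
        return std::strtod(value.c_str(), 0) > std::strtod(facet.c_str(), 0);
    }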
> 
> I'll make my lexical comparison code available to the Apache project if
> anyone is interested.
