I do find it surprising that the bit-twiddling is faster. How much of the win comes from avoiding a useless SMI->FP->int32 conversion vs. avoiding the FISTTP instruction?
A branch mispredict is several times slower than an FP conversion, so be careful that your benchmarks are realistic. It might also be worth running the code under VTune to see if the old code was stalling due to an unfortunate microarchitectural issue that could be fixed with a little instruction scheduling. I can imagine that fnstsw, having a value dependency on the exception state, might be delayed for the latency of the previous instruction(s).

I see some benchmarks have code like (x & 0xFFFFFFFF), or the same but with dynamic values. It would be a huge advantage in this case if the FP->int conversion took care of the ToInt32 / ToUint32 conversions specified for the bitops. I was considering investigating whether the low 32 bits of FISTTP mem64 would always be the right value. I think they would be for values with magnitude < 2^51, but I was not sure outside that range.

http://codereview.chromium.org/506052

--
v8-dev mailing list
[email protected]
http://groups.google.com/group/v8-dev
