Thanks for clarifying these results and for providing the modified 3d-morph.
When you get a chance, could you provide new profile results with MathRound removed? And could you provide the perf results with the event counters enabled, so we can see cache effects? Thanks!

On Fri, Jun 6, 2014 at 6:56 AM, Yang Guo <[email protected]> wrote:

> Hi Raymond,
>
> The modified 3d-morph is attached.
>
> The code from 0xa to 0x47 is a stack check (at the entry to the function, to
> detect stack overflow) and unboxing of the argument into a double register
> (double numbers are usually boxed in V8 and stored on the heap, except for
> certain kinds of arrays and in optimized code).
>
> The code from 0xd5 to 0x147 is indeed a MathRound. Replacing it with a floor
> (updated CL) actually gives a slight boost: the modified 3d-morph goes from
> 8250ms to 8050ms, and the unmodified one now alternates between 15ms and 16ms.
>
> Yes, those comparisons are bounds checks. Unfortunately, out-of-bounds reads
> on typed arrays in JavaScript have to return undefined. We already eliminate
> some of the redundant bounds checks, but not all of them can be eliminated.
> Of course the generated code for JavaScript is a lot larger than that for C,
> no surprise there; JavaScript is a dynamic language after all. And you are
> right that we should probably focus on the things that add overhead.
>
> Moving the calculation to C wouldn't make things faster, though, since the
> switch to C code is rather expensive, and C code cannot be inlined either.
>
> Yang
>
> On Fri, Jun 6, 2014 at 12:52 AM, Raymond Toy <[email protected]> wrote:
>
>> Can you explain what some of the code is in the prof results you sent?
>>
>> What is all the stuff from address 0xa to 0x47 doing?
>>
>> What is 0xd5 to 0x147 doing? I'm guessing it's doing MathRound, but it seems
>> that can be done with just one or two instructions. And the original code
>> was Math.floor(x + 0.5). If MathRound is rounding to even, then that is not
>> what we want.
>>
>> There are various bits of code comparing ebx to small positive constants.
>> Is that a bounds check on the kTrig array?
>>
>> When I compare this disassembly with what gcc produces for the original
>> fdlibm code, gcc's output seems to be much smaller and simpler. The actual
>> computation parts, however, appear roughly equal. It's probably all the
>> stuff around them that makes V8 run slower than I would have expected.
>>
>> On Thu, Jun 5, 2014 at 8:28 AM, Yang Guo <[email protected]> wrote:
>>
>>> Here's a profile of the 64-bit build. MathSinSlow takes most of the time,
>>> and the file includes a disassembly of the generated code, with each
>>> instruction annotated with profiling stats. Note that this runs a version
>>> of SunSpider's 3d-morph altered to run longer, which gives more profiling
>>> samples.
>>>
>>> Yang
>>>
>>> On Thu, Jun 5, 2014 at 5:23 PM, <[email protected]> wrote:
>>>
>>>> On 2014/06/04 16:30:37, Raymond Toy wrote:
>>>>
>>>>> On 2014/06/04 07:19:29, Yang wrote:
>>>>> > On 2014/06/03 16:51:30, Raymond Toy wrote:
>>>>> > > On 2014/06/03 07:01:45, Yang wrote:
>>>>> > > > https://codereview.chromium.org/303753002/diff/40001/src/math.js
>>>>> > > > File src/math.js (right):
>>>>> > > >
>>>>> > > > https://codereview.chromium.org/303753002/diff/40001/src/math.js#newcode262
>>>>> > > > src/math.js:262: }
>>>>> > > > On 2014/06/02 17:26:11, Raymond Toy wrote:
>>>>> > > > > As you mentioned via email, you've removed the 3rd iteration.
>>>>> > > > > This is really needed if you want to be able to reduce multiples
>>>>> > > > > of pi/2 accurately.
>>>>> > > > That's true. However, the reduction step is not exposed as a
>>>>> > > > library function. From what I have seen, the third step seems to
>>>>> > > > only affect y1. With a y0 really close to y1, it does not change
>>>>> > > > the result of sine or cosine. This is also why I was asking for a
>>>>> > > > test case where removing this third step would make a difference.
>>>>> > >
>>>>> > > I don't understand what you mean by "y0 really close to y1". What
>>>>> > > are you saying?
>>>>> > >
>>>>> > > tan(Math.PI*45/2) requires the 3rd iteration. ieee754_rem_pio2
>>>>> > > returns [45, -9.790984586812941e-16, -6.820314736619894e-32].
>>>>> > >
>>>>> > > If you ignore the y1 result, we have
>>>>> > > kernel_tan(-9.790984586812941e-16, 0e0, -1) -> 1021347742030824.2
>>>>> > >
>>>>> > > If you include the y1 result:
>>>>> > > kernel_tan(-9.790984586812941e-16, -6.820314736619894e-32, -1) ->
>>>>> > > 1021347742030824.1
>>>>> >
>>>>> > I somehow didn't type what I thought. I meant to say: if y0 is really
>>>>> > close to 0, there does not seem to be any point in investing in the
>>>>> > third iteration. (I am aware that omitting y1 changes the result in
>>>>> > some cases; I'm not arguing that.)
>>>>> >
>>>>> > So in the example here, if I omit the third iteration, I get
>>>>> > [45, -9.790984586812941e-16, -6.820199415561299e-32].
>>>>> >
>>>>> > y0 is the same, y1 differs slightly, but the end result is still
>>>>> > 1021347742030824.1.
>>>>>
>>>>> While I understand your desire to reduce the complexity, you are
>>>>> modifying an algorithm written by an expert. I think the burden is on
>>>>> you to prove that by removing the third iteration you do not change the
>>>>> value of y0.
>>>>>
>>>>> Also, where is this coming from? In reality, how often will you compute
>>>>> sin(x) where x is very near a multiple of pi/2 (where the third
>>>>> iteration is needed)?
>>>>>
>>>>> I suspect it occurs more often than we might expect, but if you're doing
>>>>> that, you're also computing zillions more values that are not a multiple
>>>>> of pi/2.
>>>>>
>>>>> For example, in 3d-morph, we compute sin((n-1)*pi/15) for n = 0 to 119.
>>>>> Thus, out of 120 values, the argument is a multiple of pi just 8 times.
>>>>> If the cost of the reduction for multiples of pi/2 AND the computation
>>>>> of sin were reduced to exactly zero, you would save only about 6.6% in
>>>>> runtime.
>>>>>
>>>>> I think there are more important things to look at. We need profile
>>>>> results. We need to understand what is really expensive in the
>>>>> reduction, not what we think is expensive.
>>>>
>>>> I added back the third iteration and tweaked some places, so that the
>>>> runtime is now down to 16ms (vs. the current 12ms).
>>>>
>>>> https://codereview.chromium.org/303753002/
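
[Editor's note] To make the 6.6% estimate in the quoted discussion concrete, here is a minimal JavaScript sketch of the argument pattern Raymond describes (it is not the actual SunSpider source; the variable names are illustrative):

  // sin((n - 1) * PI / 15) for n = 0..119, as described above. Count the
  // arguments that are exact multiples of PI, i.e. the only cases here where
  // the extra reduction iteration could even matter.
  let multiplesOfPi = 0;
  for (let n = 0; n < 120; n++) {
    const k = n - 1;                     // the argument is k * PI / 15
    if (k % 15 === 0) multiplesOfPi++;   // k * PI / 15 is a multiple of PI exactly when 15 divides k
    Math.sin(k * Math.PI / 15);          // the call the benchmark actually pays for
  }
  console.log(multiplesOfPi);            // 8, i.e. 8 / 120 of the calls, roughly 6.6%

Even if those 8 calls were free, the loop could only get about 6.6% faster, which is the point being made about where the optimization effort should go.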

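[Editor's note] The bounds checks discussed earlier in the thread follow from language-level semantics rather than from anything specific to this CL: an out-of-bounds read on a typed array is not an error in JavaScript, it simply yields undefined, so the compiler must either keep a check on each table access or prove the index is in range. A minimal sketch of that behaviour, using a small stand-in for the real kTrig lookup table:

  // Out-of-bounds typed-array reads yield undefined instead of throwing.
  const kTrig = new Float64Array(4);  // stand-in; the real table is larger
  kTrig[0] = 0.5;

  console.log(kTrig[0]);              // 0.5
  console.log(kTrig[10]);             // undefined -- no exception, no wraparound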