(Yang sent me the new profile results in a private email.) This looks much better. The only question I have now is what the code from 0xed to 0x105 is doing. Something related to converting a float to an integer; perhaps boxing the result?
Otherwise, it looks roughly like what gcc does, with a few extra moves and the bounds checks for kTrig.

On Friday, June 6, 2014 9:28:53 AM UTC-7, Yang Guo wrote:
>
> Argh. I even prepared it, but totally forgot to send it to you. Will do
> when I get home.
>
> Yang
> On Jun 6, 2014 6:03 PM, "Raymond Toy" <[email protected]> wrote:
>
>> Thanks for clarifying these results and for providing the modified
>> 3d-morph.
>>
>> When you get a chance could you provide new profile results with
>> MathRound removed? And can you provide the perf results with the event
>> counters enabled so we can see cache effects?
>>
>> Thanks!
>>
>>
>> On Fri, Jun 6, 2014 at 6:56 AM, Yang Guo <[email protected]> wrote:
>>
>>> Hi Raymond,
>>>
>>> the modified 3d-morph is attached.
>>>
>>> The code from 0xa to 0x47 is a stack check (at the entry to the function,
>>> to detect stack overflow) and the unboxing of the argument into a double
>>> register (double numbers are usually boxed in V8 and stored on the heap,
>>> except for certain kinds of arrays and in optimized code).
>>>
>>> The code from 0xd5 to 0x147 is indeed a MathRound. Replacing it with a
>>> floor (updated CL) actually gives a slight boost. The modified 3d-morph
>>> goes from 8250ms to 8050ms, and the unmodified one now alternates between
>>> 15ms and 16ms.
>>>
>>> Yes, those comparisons are bounds checks. Unfortunately, out-of-bounds
>>> reads on typed arrays in JavaScript should return undefined. We already
>>> eliminate some of the redundant bounds checks, but not all can be
>>> eliminated. Of course the generated code for JavaScript is a lot larger
>>> than that for C, no surprise there. JavaScript is a dynamic language after
>>> all. And you are right in that we probably should focus on the things that
>>> add overhead.
>>>
>>> Moving the calculation to C wouldn't make things faster though, since
>>> the switch to C code is rather expensive, and C code cannot be inlined
>>> either.
>>>
>>> Yang
>>>
>>>
>>>
>>> On Fri, Jun 6, 2014 at 12:52 AM, Raymond Toy <[email protected]> wrote:
>>>
>>>> Can you explain what some of the code is in the prof results you sent?
>>>>
>>>> What is all the stuff from address 0xa to 0x47 doing?
>>>>
>>>> What is 0xd5 to 0x147 doing? I'm guessing it's doing MathRound, but it
>>>> seems that can be done with just one or two instructions. And the
>>>> original code was Math.floor(x + 0.5). If MathRound is rounding to even,
>>>> then that is not what we want.
>>>>
>>>> There are various bits of code comparing ebx to small positive
>>>> constants. Is that a bounds check on the kTrig array?
>>>>
>>>> When I compare this disassembly with what gcc produces on the original
>>>> fdlibm code, gcc seems to be much smaller and simpler. The actual
>>>> computation parts, however, appear roughly equal. It's all the stuff
>>>> around it that makes V8 probably run slower than I would have expected.
>>>>
>>>>
>>>>
>>>> On Thu, Jun 5, 2014 at 8:28 AM, Yang Guo <[email protected]> wrote:
>>>>
>>>>> Here's a profile of the 64bit build. MathSinSlow takes most of the
>>>>> time, and the file includes a disassembly of the generated code, with
>>>>> each instruction annotated with profiling stats. Note that this runs an
>>>>> altered version of SunSpider's 3d-morph to run longer, giving more
>>>>> profiling samples.
>>>>>
>>>>> Yang
>>>>>
>>>>>
>>>>> On Thu, Jun 5, 2014 at 5:23 PM, <[email protected]> wrote:
>>>>>
>>>>>> On 2014/06/04 16:30:37, Raymond Toy wrote:
>>>>>>
>>>>>>> On 2014/06/04 07:19:29, Yang wrote:
>>>>>>> > On 2014/06/03 16:51:30, Raymond Toy wrote:
>>>>>>> > > On 2014/06/03 07:01:45, Yang wrote:
>>>>>>> > > > https://codereview.chromium.org/303753002/diff/40001/src/math.js
>>>>>>> > > > File src/math.js (right):
>>>>>>> > > >
>>>>>>> > > > https://codereview.chromium.org/303753002/diff/40001/src/math.js#newcode262
>>>>>>> > > > src/math.js:262: }
>>>>>>> > > > On 2014/06/02 17:26:11, Raymond Toy wrote:
>>>>>>> > > > > As you mentioned via email, you've removed the 3rd iteration. This is
>>>>>>> > > > > really needed if you want to be able to reduce multiples of pi/2
>>>>>>> > > > > accurately.
>>>>>>> > > >
>>>>>>> > > > That's true. However, the reduction step is not exposed as a library
>>>>>>> > > > function. From what I have seen, the third step seems to only affect y1.
>>>>>>> > > > With a y0 really close to y1, it does not change the result of sine or
>>>>>>> > > > cosine. This is also why I was asking for a test case where removing
>>>>>>> > > > this third step would make a difference.
>>>>>>> > >
>>>>>>> > > I don't understand what you mean by "y0 really close to y1". What are you
>>>>>>> > > saying?
>>>>>>> > >
>>>>>>> > >
>>>>>>> > > tan(Math.PI*45/2) requires the 3rd iteration. ieee754_rem_pio2 returns
>>>>>>> > > [45, -9.790984586812941e-16, -6.820314736619894e-32]
>>>>>>> > >
>>>>>>> > > If you ignore the y1 result, we have
>>>>>>> > > kernel_tan(-9.790984586812941e-16, 0e0, -1) -> 1021347742030824.2
>>>>>>> > >
>>>>>>> > > If you include the y1 result:
>>>>>>> > > kernel_tan(-9.790984586812941e-16,-6.820314736619894e-32, -1) ->
>>>>>>> > > 1021347742030824.1
>>>>>>> >
>>>>>>> > I somehow didn't type what I thought. I meant to say: if y0 is really close
>>>>>>> > to 0, there does not seem to be any point to invest in the third loop. (I am
>>>>>>> > aware that omitting y1 changes the result in some cases. I'm not arguing
>>>>>>> > this).
>>>>>>> >
>>>>>>> > So in the example here, if I omit the third iteration, I get
>>>>>>> > [45, -9.790984586812941e-16, -6.820199415561299e-32]
>>>>>>> >
>>>>>>> > y0 is the same, y1 differs slightly, but the end result is still
>>>>>>> > 1021347742030824.1.
>>>>>>>
>>>>>>> While I understand your desire to reduce the complexity, you are
>>>>>>> modifying an algorithm written by an expert. I think the burden is on
>>>>>>> you to prove that by removing the third iteration you do not change the
>>>>>>> value of y0.
>>>>>>>
>>>>>>> Also, where is this coming from? In reality, how often will you compute
>>>>>>> sin(x) where x is very near a multiple of pi/2 (where the third
>>>>>>> iteration is needed)?
>>>>>>>
>>>>>>> I suspect it occurs more often than we might expect, but also that if
>>>>>>> you're doing that, I think you're also computing zillions more values
>>>>>>> that are not a multiple of pi/2.
>>>>>>>
>>>>>>> For example, in 3d-morph, we compute sin((n-1)*pi/15) for n = 0 to 119.
>>>>>>> Thus out of 120 values, we have a multiple of pi just 8 times. If the
>>>>>>> cost of reduction for multiples of pi/2 AND the computation of sin were
>>>>>>> reduced to exactly zero, you would save about just 6.6% in runtime.
>>>>>>>
>>>>>>> I think there are more important things to look at. We need profile
>>>>>>> results. We need to understand what is really expensive in the
>>>>>>> reduction, not what we think is expensive.
>>>>>>
>>>>>> I added back the third iteration, and tweaked some places, so that
>>>>>> the runtime is now down to 16ms (vs the current 12ms).
>>>>>>
>>>>>> https://codereview.chromium.org/303753002/
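
A note on the rounding question quoted above (whether MathRound might round to even where the original code used Math.floor(x + 0.5)): at the language level, JavaScript's Math.round breaks ties toward +Infinity, so it agrees with Math.floor(x + 0.5) on ordinary ties. The following is only a minimal sketch, assuming the MathRound seen in the disassembly follows the language-level Math.round semantics, which the thread does not confirm. floorRound is a hypothetical helper name used for illustration.

```javascript
// Hypothetical helper: the expression from the original code, Math.floor(x + 0.5).
function floorRound(x) {
  return Math.floor(x + 0.5);
}

// Ties: Math.round picks the integer closer to +Infinity, not the even one.
const ties = [0.5, 1.5, 2.5, -0.5, -1.5, -2.5];
const tieResults = ties.map((x) => ({
  x: x,
  round: Math.round(x),
  floor: floorRound(x),
}));

// An input where the two expressions differ: edge + 0.5 is not representable
// as a double and is rounded up to 1 before Math.floor is applied.
const edge = 0.49999999999999994;
const edgeRound = Math.round(edge);
const edgeFloor = floorRound(edge);

console.log(tieResults);
console.log("round:", edgeRound, "floor(x + 0.5):", edgeFloor);
```

The two expressions can still differ in edge cases, such as the input shown above, the sign of zero for inputs in [-0.5, 0), and very large magnitudes where x + 0.5 is not exactly representable.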
