Argh. I even prepared it, but totally forgot to send it to you. Will do
when I get home.

Yang
On Jun 6, 2014 6:03 PM, "Raymond Toy" <[email protected]> wrote:

> Thanks for clarifying these results and for providing the modified
> 3d-morph.
>
> When you get a chance could you provide new profile results with MathRound
> removed? And can you provide the pref results with the event counters
> enabled so we can see cache effects?
>
> Thanks!
>
>
> On Fri, Jun 6, 2014 at 6:56 AM, Yang Guo <[email protected]> wrote:
>
>> Hi Raymond,
>>
>> the modified 3d-morph is attached.
>>
>> The code from 0xa to 0x47 are a stack check (at the entry to function to
>> detect stack overflow) and unboxing the argument into a double register
>> (double numbers are usually boxed in V8 and stored on the heap, except for
>> certain kinds of arrays and in optimized code).
>>
>> The code from 0xd5 to 0x147 is indeed a MathRound. Replacing it with a
>> floor (updated CL) actually gives a slight boost. The modified 3d-morph
>> goes from 8250ms to 8050ms, and the unmodified one now alternates between
>> 15ms and 16ms.
>>
>> Yes, those comparisons are bounds checks. Unfortunately, out-of-bound
>> reads on typed arrays in Javascript should return undefined. We already
>> eliminate some of the redundant bounds checks, but not all can be
>> eliminated. Of course the generated code for Javascript is a lot larger
>> than that for C, no surprise there. Javascript is a dynamic language after
>> all. And are right in that we probably should focus on the things that add
>> overhead.
>>
>> Moving the calculation to C wouldn't make things faster though, since the
>> switch to C code is rather expensive, and C code cannot be inlined either.
>>
>> Yang
>>
>>
>>
>> On Fri, Jun 6, 2014 at 12:52 AM, Raymond Toy <[email protected]> wrote:
>>
>>> Can you explain what some of the code is in the prof results you sent?
>>>
>>> What is all the stuff from address 0xa to 0x47 doing?
>>>
>>> What is 0xd5 to 0x147 doing? I'm guessing it's doing MathRound, but it
>>> seems that can be done with just one or two instructions.  And the original
>>> code was Math.floor(x + 0.5).  If MathRound is rounding to even, then that
>>> is not what we want.
>>>
>>> There are some various bits of code comparing ebx to small positive
>>> constants Is that a bounds check on the kTrig array?
>>>
>>> When I compare this disassembly with what gcc produces on the original
>>> fdlibm code, gcc seems to be much smaller and simpler.  The actual
>>> computation parts, however, appear roughly equal.  It's all the stuff
>>> around it that makes V8 probably run slower than I would have expected.
>>>
>>>
>>>
>>> On Thu, Jun 5, 2014 at 8:28 AM, Yang Guo <[email protected]> wrote:
>>>
>>>> Here's a profile of the 64bit build. MathSinSlow takes most of the
>>>> time, and the file includes a disassembly of the generated code, with each
>>>> instruction annotated with profiling stats. Note that this runs an altered
>>>> version of SunSpider's 3d-morph to run longer, giving more profiling
>>>> samples.
>>>>
>>>> Yang
>>>>
>>>>
>>>> On Thu, Jun 5, 2014 at 5:23 PM, <[email protected]> wrote:
>>>>
>>>>> On 2014/06/04 16:30:37, Raymond Toy wrote:
>>>>>
>>>>>> On 2014/06/04 07:19:29, Yang wrote:
>>>>>> > On 2014/06/03 16:51:30, Raymond Toy wrote:
>>>>>> > > On 2014/06/03 07:01:45, Yang wrote:
>>>>>> > > > https://codereview.chromium.org/303753002/diff/40001/src/
>>>>>> math.js
>>>>>> > > > File src/math.js (right):
>>>>>> > > >
>>>>>> > > >
>>>>>> https://codereview.chromium.org/303753002/diff/40001/src/
>>>>>> math.js#newcode262
>>>>>> > > > src/math.js:262: }
>>>>>> > > > On 2014/06/02 17:26:11, Raymond Toy wrote:
>>>>>> > > > > As you mentioned via email, you've removed the 3rd iteration.
>>>>>> This is
>>>>>> > really
>>>>>> > > > > needed if you want to be able to reduce multiples of pi/2
>>>>>> accurately.
>>>>>> > > >
>>>>>> > > > That's true. However, the reduction step is not exposed as a
>>>>>> library
>>>>>> > function.
>>>>>> > > > From what I have seen, the third step seems to only affect y1.
>>>>>> With a y0
>>>>>> > > really
>>>>>> > > > close to y1, it does not change the result of sine or cosine.
>>>>>> This is
>>>>>>
>>>>> also
>>>>>
>>>>>> > why
>>>>>> > > I
>>>>>> > > > was asking for a test case where removing this third step would
>>>>>> make a
>>>>>> > > > difference.
>>>>>> > >
>>>>>> > > I don't understand what you mean by "y0 really close to y1".
>>>>>>  What are you
>>>>>> > > saying?
>>>>>> > >
>>>>>> > >
>>>>>> > > tan(Math.PI*45/2) requires the 3rd iteration. ieee754_rem_pio2
>>>>>> returns
>>>>>> > > [45, -9.790984586812941e-16, -6.820314736619894e-32]
>>>>>> > >
>>>>>> > > If you ignore the y1 result, we have
>>>>>> > > kernel_tan(-9.790984586812941e-16, 0e0, -1) -> 1021347742030824.2
>>>>>> > >
>>>>>> > > If you include the y1 result:
>>>>>> > > kernel_tan(-9.790984586812941e-16,-6.820314736619894e-32, -1) ->
>>>>>> > > 1021347742030824.1
>>>>>> >
>>>>>> > I somehow didn't type what I thought. I meant to say: if y0 is
>>>>>> really close
>>>>>>
>>>>> to
>>>>>
>>>>>> > 0, there does not seem to be any point to invest in the third loop.
>>>>>> (I am
>>>>>> aware
>>>>>> > that omitting y1 changes the result in some cases. I'm not arguing
>>>>>> this).
>>>>>> >
>>>>>> > So in the example here, if I omit the third iteration, I get
>>>>>> > [45, -9.790984586812941e-16, -6.820199415561299e-32]
>>>>>> >
>>>>>> > y0 is the same, y1 differs slightly, but the end result is still
>>>>>> > 1021347742030824.1.
>>>>>>
>>>>>
>>>>>  While I understand your desire to reduce the complexity, you are
>>>>>> modifying an
>>>>>> algorithm written by an expert.  I think the burden is on you to
>>>>>> prove that by
>>>>>> removing the third iteration you do not change the value of y0.
>>>>>>
>>>>>
>>>>>  Also, where is this coming from?  In reality, how often will you
>>>>>> compute
>>>>>>
>>>>> sin(x)
>>>>>
>>>>>> where x is very near a multiple of pi/2 (where the third iteration is
>>>>>> needed)?
>>>>>>
>>>>>
>>>>>  I suspect it occurs more often than we might expect, but also that if
>>>>>> you're
>>>>>> doing that, I think you're also computing zillions more values that
>>>>>> are not a
>>>>>> multiple of pi/2.
>>>>>>
>>>>>
>>>>>  For example, in 3d-morph, we compute sin((n-1)*pi/15) for n = 0 to
>>>>>> 119.  Thus
>>>>>> out of 120 values, we have a multiple of pi just 8 times out of 120.
>>>>>> If the
>>>>>>
>>>>> cost
>>>>>
>>>>>> of reduction for multiples of pi/2 AND the computation of sin were
>>>>>> reduced to
>>>>>> exactly zero, you would save about just 6.6% in runtime.
>>>>>>
>>>>>
>>>>>  I think there are more important things to look at.  We need profile
>>>>>> results.
>>>>>>
>>>>> We
>>>>>
>>>>>> need to understand what is really expensive in the reduction, not
>>>>>> what we
>>>>>>
>>>>> think
>>>>>
>>>>>> is expensive.
>>>>>>
>>>>>
>>>>> I added back the third iteration, and tweaked some places, so that the
>>>>> runtime
>>>>> is now down to 16ms (vs the current 12ms).
>>>>>
>>>>> https://codereview.chromium.org/303753002/
>>>>>
>>>>
>>>>
>>>
>>
>

-- 
-- 
v8-dev mailing list
[email protected]
http://groups.google.com/group/v8-dev
--- 
You received this message because you are subscribed to the Google Groups 
"v8-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
For more options, visit https://groups.google.com/d/optout.

Reply via email to