On Mon, Mar 17, 2014 at 3:35 PM, Jason Moore <[email protected]> wrote:
> Ok, if we had some concrete examples of that non-cancellation then we could
> do something about it. This goes for everything that we've been
> implementing: if we don't have a good benchmark problem, then it is hard to
> design something that solves it. We can look into this more when it becomes
> an issue, I guess.
Just try this:
In [1]: from sympy import *
In [2]: var("x")
Out[2]: x
In [3]: e = legendre_poly(50, x)
In [4]: e.subs(x, 0.9)
Out[4]: 1.25000000000000
In [5]: e.subs(x, S(9)/10).n()
Out[5]: -0.170037659943837
and it only gets worse:
In [6]: e = legendre_poly(100, x)
In [7]: e.subs(x, 0.9)
Out[7]: -3.65635994747142e+17
In [8]: e.subs(x, S(9)/10).n()
Out[8]: 0.102265820558719
In other words, double precision is simply not enough at all, and
that's just the Legendre polynomial of order 50, which is not a very
high order. It looks like this:
In [9]: legendre_poly(50, x)
Out[9]: 12611418068195524166851562157*x**50/140737488355328 -
156050375086257748529223875175*x**48/140737488355328 +
226836112238787036521861509275*x**46/35184372088832 -
823773249709279237895181270525*x**44/35184372088832 +
4189728463575151392735706892025*x**42/70368744177664 -
7928255400303748020099876118755*x**40/70368744177664 +
5790298887862287879848224131675*x**38/35184372088832 -
6684039602901787158511168414725*x**36/35184372088832 +
24770264410753681822717859419275*x**34/140737488355328 -
18602568051449552212241926551825*x**32/140737488355328 +
1423900270604780539702468452115*x**30/17592186044416 -
712769410486857922635873160725*x**28/17592186044416 +
583174972216520118520259858775*x**26/35184372088832 -
194391657405506706173419952925*x**24/35184372088832 +
26248579962778792027330678575*x**22/17592186044416 -
5693353963757653481984400705*x**20/17592186044416 +
7838675747202566388239392275*x**18/140737488355328 -
1052956443654076082002306425*x**16/140737488355328 +
26998883170617335435956575*x**14/35184372088832 -
2052546673789621992207225*x**12/35184372088832 +
222078820442811559812585*x**10/70368744177664 -
8065816723104536070675*x**8/70368744177664 +
90048990529077755175*x**6/35184372088832 -
1067774591253886425*x**4/35184372088832 +
20146690401016725*x**2/140737488355328 -
15801325804719/140737488355328
By "minus" sign, I mean all the minus signs in there, which cause the
cancellation.
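To make the cancellation reproducible in a script, here is a minimal sketch using stock SymPy: substituting a double-precision Float evaluates each huge alternating-sign term at limited precision, while keeping the argument an exact rational and evaluating once at the end with guard digits avoids the cancellation:

```python
from sympy import Float, Rational, legendre_poly, symbols

x = symbols("x")
e = legendre_poly(50, x)

# Substituting a ~15-digit Float evaluates each huge alternating-sign
# term in limited precision, so the terms cancel catastrophically.
bad = e.subs(x, Float(0.9))  # comes out wildly wrong (1.25 in the session above)

# Keeping the argument exact and evaluating once at the end, with
# plenty of guard digits, keeps the cancellation under control.
good = e.subs(x, Rational(9, 10)).evalf(50)

print(bad, good)
```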
Ondrej
>
> Yes, we should have an option in the code generation to generate a function
> that can evaluate the Jacobian of the RHS. This can be used directly by
> implicit ODE integration routines, which are best suited for stiff
> problems.
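A minimal sketch of what that could look like with plain SymPy (the single-pendulum RHS here is a made-up toy example, not the generated code from this thread):

```python
from sympy import Matrix, cos, sin, symbols

q, u, g, l = symbols("q u g l")

# Toy RHS for a single pendulum: qdot = u, udot = -(g/l)*sin(q).
rhs = Matrix([u, -g / l * sin(q)])

# Analytic Jacobian of the RHS with respect to the state vector;
# this is what an implicit/stiff integrator wants alongside the RHS.
jac = rhs.jacobian(Matrix([q, u]))

print(jac)  # Matrix([[0, 1], [-g*cos(q)/l, 0]])
```

The same `jacobian` call on the full symbolic RHS is what a code-generation option would feed to the C printer.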
>
>
> Jason
> moorepants.info
> +01 530-601-9791
>
>
> On Mon, Mar 17, 2014 at 5:26 PM, Gilbert Gede <[email protected]> wrote:
>>
>> Jason, I've definitely seen examples where small errors in subtraction can
>> lead to larger errors later on - especially if things are supposed to cancel
>> out. In our domain, I think it pops up in mass/inertia division operations.
>> I'm not sure how often it happens with real-world examples vs.
>> academic/idealized examples though.
>>
>> I would definitely agree that step-size (for stiff systems) is the bigger
>> challenge in practice - although I believe providing the Jacobian of your
>> RHS function can help with that. That might be something we want to automate
>> in the code-gen module.
>>
>> BTW, thanks for starting this discussion. Very informative.
>>
>> -Gilbert
>>
>>
>> On Mon, Mar 17, 2014 at 2:13 PM, Jason Moore <[email protected]> wrote:
>>>
>>> For numerical ODE integration, I don't think double precision presents
>>> nearly as much of an issue in accuracy as picking the correct step size. We
>>> usually fight the step size issue, and once that is good, the accuracy is
>>> adequate for the solutions we need. I've never run into an issue with double
>>> precision being the limiting factor, so I don't really know. I think we can
>>> often get away with much lower accuracy than double precision for realistic
>>> simulations of typical multibody systems.
>>>
>>> Wow, so a minus sign can cause errors so big that you only get 1 or 2
>>> digits of accuracy. That's new to me. I'm not sure what minus signs you are
>>> talking about. What would you do instead of minus signs?
>>>
>>> But like I said, numerical accuracy is generally not an issue for our
>>> systems, unless the ODE integration routine is bad.
>>>
>>>
>>> Jason
>>> moorepants.info
>>> +01 530-601-9791
>>>
>>>
>>> On Mon, Mar 17, 2014 at 1:00 PM, Ondřej Čertík <[email protected]>
>>> wrote:
>>>>
>>>> On Mon, Mar 17, 2014 at 9:21 AM, Jason Moore <[email protected]>
>>>> wrote:
>>>> > I'm still digesting what Matthew and Max wrote. Lots of new words for
>>>> > me :)
>>>> > But here is a simple example taken from the C code we generate for a
>>>> > simple 2-link pendulum.
>>>> >
>>>> > First the C code with SymPy's CSE expressions automatically generated:
>>>> >
>>>> > #include <math.h>
>>>> > #include "multibody_system_c.h"
>>>> >
>>>> > void mass_forcing(double constants[6], // constants = [g, m0, l0, m1,
>>>> > l1,
>>>> > m2]
>>>> > double coordinates[3], // coordinates = [q0, q1, q2]
>>>> > double speeds[3], // speeds = [u0, u1, u2]
>>>> > double mass_matrix[36], // computed
>>>> > double forcing_vector[6]) // computed
>>>> > {
>>>> > // common subexpressions
>>>> > double z_0 = coordinates[1];
>>>> > double z_1 = sin(z_0);
>>>> > double z_2 = constants[2]*z_1;
>>>> > double z_3 = -constants[3]*z_2 - constants[5]*z_2;
>>>> > double z_4 = coordinates[2];
>>>> > double z_5 = sin(z_4);
>>>> > double z_6 = -constants[4]*constants[5]*z_5;
>>>> > double z_7 = pow(constants[2], 2);
>>>> > double z_8 = constants[2]*constants[4]*constants[5];
>>>> > double z_9 = cos(z_0);
>>>> > double z_10 = cos(z_4);
>>>> > double z_11 = z_8*(z_1*z_5 + z_10*z_9);
>>>> > double z_12 = speeds[1];
>>>> > double z_13 = speeds[2];
>>>> > double z_14 = pow(z_12, 2);
>>>> > double z_15 = constants[2]*z_14*z_9;
>>>> > double z_16 = pow(z_13, 2);
>>>> > double z_17 = constants[4]*constants[5]*z_10;
>>>> > double z_18 = constants[0]*constants[2]*z_9;
>>>> > double z_19 = z_5*z_9;
>>>> > double z_20 = z_1*z_10;
>>>> >
>>>> > // mass matrix
>>>> > mass_matrix[0] = 1;
>>>> > mass_matrix[1] = 0;
>>>> > mass_matrix[2] = 0;
>>>> > mass_matrix[3] = 0;
>>>> > mass_matrix[4] = 0;
>>>> > mass_matrix[5] = 0;
>>>> > mass_matrix[6] = 0;
>>>> > mass_matrix[7] = 1;
>>>> > mass_matrix[8] = 0;
>>>> > mass_matrix[9] = 0;
>>>> > mass_matrix[10] = 0;
>>>> > mass_matrix[11] = 0;
>>>> > mass_matrix[12] = 0;
>>>> > mass_matrix[13] = 0;
>>>> > mass_matrix[14] = 1;
>>>> > mass_matrix[15] = 0;
>>>> > mass_matrix[16] = 0;
>>>> > mass_matrix[17] = 0;
>>>> > mass_matrix[18] = 0;
>>>> > mass_matrix[19] = 0;
>>>> > mass_matrix[20] = 0;
>>>> > mass_matrix[21] = constants[1] + constants[3] + constants[5];
>>>> > mass_matrix[22] = z_3;
>>>> > mass_matrix[23] = z_6;
>>>> > mass_matrix[24] = 0;
>>>> > mass_matrix[25] = 0;
>>>> > mass_matrix[26] = 0;
>>>> > mass_matrix[27] = z_3;
>>>> > mass_matrix[28] = constants[3]*z_7 + constants[5]*z_7;
>>>> > mass_matrix[29] = z_11;
>>>> > mass_matrix[30] = 0;
>>>> > mass_matrix[31] = 0;
>>>> > mass_matrix[32] = 0;
>>>> > mass_matrix[33] = z_6;
>>>> > mass_matrix[34] = z_11;
>>>> > mass_matrix[35] = pow(constants[4], 2)*constants[5];
>>>> >
>>>> > // forcing vector
>>>> > forcing_vector[0] = speeds[0];
>>>> > forcing_vector[1] = z_12;
>>>> > forcing_vector[2] = z_13;
>>>> > forcing_vector[3] = constants[3]*z_15 + constants[5]*z_15 +
>>>> > z_16*z_17;
>>>> > forcing_vector[4] = -constants[3]*z_18 - constants[5]*z_18 +
>>>> > z_16*z_8*(z_19 - z_20);
>>>> > forcing_vector[5] = -constants[0]*z_17 + z_14*z_8*(-z_19 + z_20);
>>>> > }
>>>> >
>>>> >
>>>> > Now I manually group these expression evaluations into "stacks", i.e.
>>>> > those calls which could happen in parallel (there is of course a more
>>>> > complicated dependency graph you could draw to maximize the time that
>>>> > your cores have work to do).
>>>> >
>>>> > // These are not computations but just value assignments.
>>>> > z_0 = coordinates[1];
>>>> > z_4 = coordinates[2];
>>>> > z_12 = speeds[1];
>>>> > z_13 = speeds[2];
>>>> > mass_matrix[0] = 1;
>>>> > mass_matrix[1] = 0;
>>>> > mass_matrix[2] = 0;
>>>> > mass_matrix[3] = 0;
>>>> > mass_matrix[4] = 0;
>>>> > mass_matrix[5] = 0;
>>>> > mass_matrix[6] = 0;
>>>> > mass_matrix[7] = 1;
>>>> > mass_matrix[8] = 0;
>>>> > mass_matrix[9] = 0;
>>>> > mass_matrix[10] = 0;
>>>> > mass_matrix[11] = 0;
>>>> > mass_matrix[12] = 0;
>>>> > mass_matrix[13] = 0;
>>>> > mass_matrix[14] = 1;
>>>> > mass_matrix[15] = 0;
>>>> > mass_matrix[16] = 0;
>>>> > mass_matrix[17] = 0;
>>>> > mass_matrix[18] = 0;
>>>> > mass_matrix[19] = 0;
>>>> > mass_matrix[20] = 0;
>>>> > mass_matrix[24] = 0;
>>>> > mass_matrix[25] = 0;
>>>> > mass_matrix[26] = 0;
>>>> > mass_matrix[30] = 0;
>>>> > mass_matrix[31] = 0;
>>>> > mass_matrix[32] = 0;
>>>> > forcing_vector[0] = speeds[0];
>>>> > forcing_vector[1] = z_12;
>>>> > forcing_vector[2] = z_13;
>>>> >
>>>> > // These are computations that involve the initial values passed into
>>>> > the
>>>> > // function, i.e. stack #1.
>>>> > z_7 = pow(constants[2], 2);
>>>> > z_8 = constants[2]*constants[4]*constants[5];
>>>> > z_14 = pow(z_12, 2);
>>>> > z_16 = pow(z_13, 2);
>>>> > mass_matrix[21] = constants[1] + constants[3] + constants[5];
>>>> > mass_matrix[35] = pow(constants[4], 2)*constants[5];
>>>> >
>>>> > // Stack #2
>>>> > z_1 = sin(z_0);
>>>> > z_5 = sin(z_4);
>>>> > z_9 = cos(z_0);
>>>> > z_10 = cos(z_4);
>>>> > z_2 = constants[2]*z_1;
>>>> > mass_matrix[28] = constants[3]*z_7 + constants[5]*z_7;
>>>> >
>>>> > // Stack #3
>>>> > z_3 = -constants[3]*z_2 - constants[5]*z_2;
>>>> > z_6 = -constants[4]*constants[5]*z_5;
>>>> > z_11 = z_8*(z_1*z_5 + z_10*z_9);
>>>> > z_15 = constants[2]*z_14*z_9;
>>>> > z_17 = constants[4]*constants[5]*z_10;
>>>> > z_18 = constants[0]*constants[2]*z_9;
>>>> > z_19 = z_5*z_9;
>>>> > z_20 = z_1*z_10;
>>>> >
>>>> > // Stack #4
>>>> > mass_matrix[22] = z_3;
>>>> > mass_matrix[23] = z_6;
>>>> > mass_matrix[27] = z_3;
>>>> > mass_matrix[29] = z_11;
>>>> > mass_matrix[33] = z_6;
>>>> > mass_matrix[34] = z_11;
>>>> > forcing_vector[3] = constants[3]*z_15 + constants[5]*z_15 + z_16*z_17;
>>>> > forcing_vector[4] = -constants[3]*z_18 - constants[5]*z_18 +
>>>> > z_16*z_8*(z_19
>>>> > - z_20);
>>>> > forcing_vector[5] = -constants[0]*z_17 + z_14*z_8*(-z_19 + z_20);
>>>> >
>>>> >
>>>> > So this simplified example of the dependencies in the CSE's shows that
>>>> > if I
>>>> > had enough cores available I could parallelize each stack, potentially
>>>> > increasing the execution speed. So instead of 31 evaluations, you
>>>> > could have
>>>> > 4 evaluations in parallel, ideally a 7.75x speedup. For more
>>>> > complicated
>>>> > problems, there could be thousands and thousands of these CSEs, but
>>>> > I'll
>>>> > need to generate their dependencies with code to see if things stack
>>>> > this
>>>> > nicely for the big problems. I suspect the dependency chain could be
>>>> > such
>>>> > that the higher number stacks could have hundreds of expressions
>>>> > whereas the
>>>> > lower stacks have fewer, or vice versa.
>>>> >
>>>> > How do I generate a DAG for long expressions in SymPy? Is this part of
>>>> > the
>>>> > internal architecture of SymPy expressions? I don't understand how the
>>>> > cse()
>>>> > code works yet either, but it seems like this information should be
>>>> > computed
>>>> > already. I just need to visualize the graph for some of our bigger
>>>> > problems.
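One way to get those stacks automatically is to treat the `cse()` replacement list itself as the DAG: each `(symbol, subexpression)` pair is a node, and its edges are the earlier cse symbols that show up in `free_symbols`. A sketch (the toy expressions are made up for illustration):

```python
from sympy import cos, cse, sin, symbols

a, b = symbols("a b")
exprs = [(sin(a) + cos(b)) ** 2, sin(a) + cos(b), sin(a) * cos(b)]

# cse() returns the replacements in topological order, so a single
# pass computes each node's depth in the dependency DAG.
replacements, reduced = cse(exprs)

depth = {}
for sym, sub in replacements:
    deps = sub.free_symbols & set(depth)  # edges to earlier CSE symbols
    depth[sym] = 1 + max((depth[d] for d in deps), default=0)

# Nodes at the same depth have no dependencies on each other, so each
# depth level is one "stack" that could be evaluated in parallel.
stacks = {}
for sym, d in depth.items():
    stacks.setdefault(d, []).append(sym)
print(stacks)
```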
>>>> >
>>>> > Also, regarding the number of scalars and the number of operations in
>>>> > each: here is a bigger problem with 2000 or so CSEs:
>>>> >
>>>> >
>>>> > https://github.com/moorepants/dissertation/blob/master/src/extensions/arms/ArmsDynamics.c
>>>>
>>>> One thing I wonder is: how accurate are your double precision
>>>> results from the C code?
>>>> Did you try to compare them with high-accuracy (e.g. 100-digit)
>>>> Floats in SymPy?
>>>>
>>>> The "minus" sign always causes some numerical cancellation.
>>>>
>>>> In my experience, if the symbolic expressions grow into hundreds of
>>>> terms and you use even a few "minus" signs in there, you always get
>>>> numerical cancellations that are sometimes just too big for double
>>>> precision to handle, and you get bogus numbers, or only 1 or 2 digits
>>>> of accuracy.
>>>>
>>>> One has to be very careful about this.
>>>>
>>>>
>>>> Otherwise I think it's a great idea to parallelize the evaluation.
>>>> Btw, with CSymPy I am also interested in parallelizing the symbolic
>>>> manipulation, including the numerical evaluation (both double
>>>> precision and higher accuracy). One can do it at the "library" level
>>>> as well as at the code generation level, which is the problem you are
>>>> tackling now.
>>>>
>>>> Ondrej
>>>>
>>>> >
>>>> > This problem has 12 scalars that have 2000+ CSE's and there are 5840
>>>> > additions and subtractions, 9847 multiplications and divisions, 14
>>>> > cosines,
>>>> > and 14 sines. So roughly 1300 operations per scalar.
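Counts like these can be pulled out of SymPy directly with `count_ops` (a small polynomial is used here as a stand-in for the real generated expressions):

```python
from sympy import count_ops, legendre_poly, symbols

x = symbols("x")
e = legendre_poly(10, x)

# visual=True gives a breakdown by operation type (ADD, MUL, POW, ...),
# which is what you need to estimate operations per generated scalar.
print(count_ops(e, visual=True))
print(count_ops(e))  # single weighted total
```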
>>>> >
>>>> >
>>>> > Jason
>>>> > moorepants.info
>>>> > +01 530-601-9791
>>>> >
>>>> >
>>>> > On Mon, Mar 17, 2014 at 12:06 AM, Matthew Rocklin <[email protected]>
>>>> > wrote:
>>>> >>
>>>> >> Response from Max follows (for some reason he was getting bounced by
>>>> >> the
>>>> >> mailing list).
>>>> >>
>>>> >>
>>>> >> On Sun, Mar 16, 2014 at 8:55 PM, Max Hutchinson <[email protected]>
>>>> >> wrote:
>>>> >>>
>>>> >>> tl;dr it depends on the DAG, but improved ILP is likely possible
>>>> >>> (if
>>>> >>> difficult) and there could be room for multi-core parallelism as
>>>> >>> well.
>>>> >>>
>>>> >>> As I understand it, we're talking about a long computation applied
>>>> >>> to
>>>> >>> short input vectors. If the computation can be applied to many
>>>> >>> input
>>>> >>> vectors at once, independent of each other, then all levels of
>>>> >>> parallelism
>>>> >>> (multiple instructions, multiple cores, multiple sockets, multiple
>>>> >>> nodes)
>>>> >>> can be used. This is data-parallelism, which is great! However, it
>>>> >>> doesn't
>>>> >>> sound like this is the case.
>>>> >>>
>>>> >>> It sounds like you're thinking of building a DAG of these CSEs and
>>>> >>> trying
>>>> >>> to use task-parallelism over independent parts of it (automatically
>>>> >>> using
>>>> >>> sympy or theano or what have you). The tension here is going to be
>>>> >>> between
>>>> >>> locality and parallelism: how much compute hardware can you spread
>>>> >>> your data
>>>> >>> across without losing the nice cache performance that your small
>>>> >>> input
>>>> >>> vectors gain you. I'd bet that going off-socket is way too wide.
>>>> >>> Modern
>>>> >>> multi-core architectures have core-local L2 and L1 caches, so if
>>>> >>> your input
>>>> >>> data fits nicely into L2 and your DAG isn't really local, you
>>>> >>> probably won't
>>>> >>> get anything out of multiple cores. Your last stand is single-core
>>>> >>> parallelism (instruction-level parallelism), which sympy et al may
>>>> >>> or may
>>>> >>> not be well equipped to influence.
>>>> >>>
>>>> >>> To start, I'd recommend that you take a look at your DAGs and try to
>>>> >>> figure out how large the independent chunks are. Then, estimate the
>>>> >>> amount
>>>> >>> of instruction level parallelism when you run in 'serial' (which you
>>>> >>> can do
>>>> >>> with flop-counting). If your demonstrated ILP is less than your
>>>> >>> independent
>>>> >>> chunk size, then at least improved ILP should be possible.
>>>> >>> Automatically
>>>> >>> splitting up these DAGs and expressing them in a low-level enough
>>>> >>> way to
>>>> >>> affect ILP is a considerable task, though.
>>>> >>>
>>>> >>> To see if multi-core parallelism is worth it, you need to estimate
>>>> >>> how
>>>> >>> many extra L3 loads you'd incur by spreading your data over multiple
>>>> >>> L2s. I
>>>> >>> don't have great advice for that, maybe someone else here does. The
>>>> >>> good
>>>> >>> news is that if your problem has this level of locality, then you
>>>> >>> can
>>>> >>> probably get away with emitting C code with pthreads or even openmp.
>>>> >>> Just
>>>> >>> bear in mind the thread creation/annihilation overhead (standing
>>>> >>> thread-pools are your friend) and pin them to cores.
>>>> >>>
>>>> >>> Good luck,
>>>> >>> Max
>>>> >>
>>>> >> --
>>>> >> You received this message because you are subscribed to the Google
>>>> >> Groups
>>>> >> "sympy" group.
>>>> >> To unsubscribe from this group and stop receiving emails from it,
>>>> >> send an
>>>> >> email to [email protected].
>>>> >> To post to this group, send email to [email protected].
>>>> >> Visit this group at http://groups.google.com/group/sympy.
>>>> >> To view this discussion on the web visit
>>>> >>
>>>> >> https://groups.google.com/d/msgid/sympy/CAJ8oX-Hc2y9C7FO07kkeraDAv7NNRGPkMJ2DvjgF2Oq7PzeS6g%40mail.gmail.com.
>>>> >>
>>>> >> For more options, visit https://groups.google.com/d/optout.
>>>> >
>>>> >
>>>>
>>>
>>>
>>
>>
>
>