OK, if we had some concrete examples of that non-cancellation then we could
do something about it. This goes for everything that we've been
implementing: if we don't have a good benchmark problem then it is hard to
design something that solves it. We can look into this more when it becomes
an issue, I guess.
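For the record, here is a tiny self-contained illustration (a made-up textbook case, not from our generated code) of the kind of cancellation a benchmark would need to exercise: two algebraically identical formulas for the small root of x**2 - b*x + 1 = 0, one of which loses most of its digits to a subtraction of nearly equal numbers.

```python
import math

b = 1e8  # the roots are roughly 1e8 and 1e-8

# Textbook quadratic formula: b and sqrt(b*b - 4) agree to ~16 digits,
# so their difference keeps almost no correct digits.
small_naive = (b - math.sqrt(b*b - 4.0)) / 2.0

# Multiplying through by the conjugate removes the subtraction entirely.
small_stable = 2.0 / (b + math.sqrt(b*b - 4.0))

print(small_naive)   # ~7.45e-9, about 25% off the true root 1e-8
print(small_stable)  # ~1e-8, correct to machine precision
```

The stable form is a hand-derived rewrite; the open question is whether realistic multibody expressions hit the same trap and whether we could rewrite them automatically.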

Yes, we should have an option in the code generation to generate a function
that can evaluate the Jacobian of the RHS. This can be used directly by
implicit ODE integration routines, which are best suited for stiff
problems.
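A rough sketch of what that option could look like (the toy RHS below is made up; Matrix.jacobian and the codegen utility are the real SymPy pieces being reused):

```python
from sympy import Matrix, sin, symbols
from sympy.utilities.codegen import codegen

# Hypothetical state vector and RHS standing in for a generated
# multibody system: x' = f(x).
q0, q1, u0, u1, g, l = symbols('q0 q1 u0 u1 g l')
states = Matrix([q0, q1, u0, u1])
rhs = Matrix([u0, u1, -g/l*sin(q0), -g/l*sin(q1)])

# df/dx is what implicit stiff integrators ask for when the user can
# supply an analytic Jacobian.
jac = rhs.jacobian(states)

# Emit a C function that evaluates the Jacobian, reusing the existing
# code generation machinery.
(c_name, c_code), (h_name, h_code) = codegen(
    ('rhs_jacobian', jac), language='C', header=False)
print(c_code)
```

Since the Jacobian entries share many subexpressions with the RHS itself, running cse() over both together before printing would probably pay off.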


Jason
moorepants.info
+01 530-601-9791


On Mon, Mar 17, 2014 at 5:26 PM, Gilbert Gede <[email protected]> wrote:

> Jason, I've definitely seen examples where small errors in subtraction can
> lead to larger errors later on - especially if things are supposed to
> cancel out. In our domain, I think it pops up in mass/inertia division
> operations. I'm not sure how often it happens with real-world examples vs.
> academic/idealized examples though.
>
> I would definitely agree that step-size (for stiff systems) is the bigger
> challenge in practice - although I believe providing the Jacobian of your
> RHS function can help with that. That might be something we want to
> automate in the code-gen module.
>
> BTW, thanks for starting this discussion. Very informative.
>
> -Gilbert
>
>
> On Mon, Mar 17, 2014 at 2:13 PM, Jason Moore <[email protected]> wrote:
>
>> For numerical ODE integration, I don't think double precision presents
>> nearly as much of an accuracy issue as picking the correct step size. We
>> usually fight the step size issue, and once that is good the accuracy is
>> adequate for the solutions we need. I've never run into an issue with
>> double precision being limiting, so I don't really know. I think we can
>> often get away with much lower accuracy than double precision for
>> realistic simulations of typical multibody systems.
>>
>> Wow, so a minus sign can cause errors so big that you only get 1 or 2
>> digits of accuracy. That's new to me. I'm not sure what minus signs you are
>> talking about. What would you do instead of minus signs?
>>
>> But like I said, numerical accuracy is generally not an issue for our
>> systems, unless the ODE integration routine is bad.
>>
>>
>> Jason
>> moorepants.info
>> +01 530-601-9791
>>
>>
>> On Mon, Mar 17, 2014 at 1:00 PM, Ondřej Čertík 
>> <[email protected]>wrote:
>>
>>> On Mon, Mar 17, 2014 at 9:21 AM, Jason Moore <[email protected]>
>>> wrote:
>>> > I'm still digesting what Matthew and Max wrote. Lots of new words for
>>> > me :) But here is a simple example taken from C code we generate for a
>>> > simple 2-link pendulum.
>>> >
>>> > First the C code with SymPy's CSE expressions automatically generated:
>>> >
>>> > #include <math.h>
>>> > #include "multibody_system_c.h"
>>> >
>>> > void mass_forcing(double constants[6], // constants = [g, m0, l0, m1, l1, m2]
>>> >                   double coordinates[3], // coordinates = [q0, q1, q2]
>>> >                   double speeds[3], // speeds = [u0, u1, u2]
>>> >                   double mass_matrix[36], // computed
>>> >                   double forcing_vector[6]) // computed
>>> > {
>>> >     // common subexpressions
>>> >     double z_0 = coordinates[1];
>>> >     double z_1 = sin(z_0);
>>> >     double z_2 = constants[2]*z_1;
>>> >     double z_3 = -constants[3]*z_2 - constants[5]*z_2;
>>> >     double z_4 = coordinates[2];
>>> >     double z_5 = sin(z_4);
>>> >     double z_6 = -constants[4]*constants[5]*z_5;
>>> >     double z_7 = pow(constants[2], 2);
>>> >     double z_8 = constants[2]*constants[4]*constants[5];
>>> >     double z_9 = cos(z_0);
>>> >     double z_10 = cos(z_4);
>>> >     double z_11 = z_8*(z_1*z_5 + z_10*z_9);
>>> >     double z_12 = speeds[1];
>>> >     double z_13 = speeds[2];
>>> >     double z_14 = pow(z_12, 2);
>>> >     double z_15 = constants[2]*z_14*z_9;
>>> >     double z_16 = pow(z_13, 2);
>>> >     double z_17 = constants[4]*constants[5]*z_10;
>>> >     double z_18 = constants[0]*constants[2]*z_9;
>>> >     double z_19 = z_5*z_9;
>>> >     double z_20 = z_1*z_10;
>>> >
>>> >     // mass matrix
>>> >     mass_matrix[0] = 1;
>>> >     mass_matrix[1] = 0;
>>> >     mass_matrix[2] = 0;
>>> >     mass_matrix[3] = 0;
>>> >     mass_matrix[4] = 0;
>>> >     mass_matrix[5] = 0;
>>> >     mass_matrix[6] = 0;
>>> >     mass_matrix[7] = 1;
>>> >     mass_matrix[8] = 0;
>>> >     mass_matrix[9] = 0;
>>> >     mass_matrix[10] = 0;
>>> >     mass_matrix[11] = 0;
>>> >     mass_matrix[12] = 0;
>>> >     mass_matrix[13] = 0;
>>> >     mass_matrix[14] = 1;
>>> >     mass_matrix[15] = 0;
>>> >     mass_matrix[16] = 0;
>>> >     mass_matrix[17] = 0;
>>> >     mass_matrix[18] = 0;
>>> >     mass_matrix[19] = 0;
>>> >     mass_matrix[20] = 0;
>>> >     mass_matrix[21] = constants[1] + constants[3] + constants[5];
>>> >     mass_matrix[22] = z_3;
>>> >     mass_matrix[23] = z_6;
>>> >     mass_matrix[24] = 0;
>>> >     mass_matrix[25] = 0;
>>> >     mass_matrix[26] = 0;
>>> >     mass_matrix[27] = z_3;
>>> >     mass_matrix[28] = constants[3]*z_7 + constants[5]*z_7;
>>> >     mass_matrix[29] = z_11;
>>> >     mass_matrix[30] = 0;
>>> >     mass_matrix[31] = 0;
>>> >     mass_matrix[32] = 0;
>>> >     mass_matrix[33] = z_6;
>>> >     mass_matrix[34] = z_11;
>>> >     mass_matrix[35] = pow(constants[4], 2)*constants[5];
>>> >
>>> >     // forcing vector
>>> >     forcing_vector[0] = speeds[0];
>>> >     forcing_vector[1] = z_12;
>>> >     forcing_vector[2] = z_13;
>>> >     forcing_vector[3] = constants[3]*z_15 + constants[5]*z_15 + z_16*z_17;
>>> >     forcing_vector[4] = -constants[3]*z_18 - constants[5]*z_18 + z_16*z_8*(z_19 - z_20);
>>> >     forcing_vector[5] = -constants[0]*z_17 + z_14*z_8*(-z_19 + z_20);
>>> > }
>>> >
>>> >
>>> > Now I manually group these expression evaluations into "stacks", i.e.
>>> > those calls which could happen in parallel (there is of course a bit
>>> > more complicated dependency graph you can draw so that you maximize
>>> > the time that your cores have a task).
>>> >
>>> > // These are not computations but just value assignments.
>>> > z_0 = coordinates[1];
>>> > z_4 = coordinates[2];
>>> > z_12 = speeds[1];
>>> > z_13 = speeds[2];
>>> > mass_matrix[0] = 1;
>>> > mass_matrix[1] = 0;
>>> > mass_matrix[2] = 0;
>>> > mass_matrix[3] = 0;
>>> > mass_matrix[4] = 0;
>>> > mass_matrix[5] = 0;
>>> > mass_matrix[6] = 0;
>>> > mass_matrix[7] = 1;
>>> > mass_matrix[8] = 0;
>>> > mass_matrix[9] = 0;
>>> > mass_matrix[10] = 0;
>>> > mass_matrix[11] = 0;
>>> > mass_matrix[12] = 0;
>>> > mass_matrix[13] = 0;
>>> > mass_matrix[14] = 1;
>>> > mass_matrix[15] = 0;
>>> > mass_matrix[16] = 0;
>>> > mass_matrix[17] = 0;
>>> > mass_matrix[18] = 0;
>>> > mass_matrix[19] = 0;
>>> > mass_matrix[20] = 0;
>>> > mass_matrix[24] = 0;
>>> > mass_matrix[25] = 0;
>>> > mass_matrix[26] = 0;
>>> > mass_matrix[30] = 0;
>>> > mass_matrix[31] = 0;
>>> > mass_matrix[32] = 0;
>>> > forcing_vector[0] = speeds[0];
>>> > forcing_vector[1] = z_12;
>>> > forcing_vector[2] = z_13;
>>> >
>>> > // These are computations that involve the initial values passed into
>>> > // the function, i.e. stack #1.
>>> > z_7 = pow(constants[2], 2);
>>> > z_8 = constants[2]*constants[4]*constants[5];
>>> > z_14 = pow(z_12, 2);
>>> > z_16 = pow(z_13, 2);
>>> > mass_matrix[21] = constants[1] + constants[3] + constants[5];
>>> > mass_matrix[35] = pow(constants[4], 2)*constants[5];
>>> >
>>> > // Stack #2
>>> > z_1 = sin(z_0);
>>> > z_5 = sin(z_4);
>>> > z_9 = cos(z_0);
>>> > z_10 = cos(z_4);
>>> > z_2 = constants[2]*z_1;
>>> > mass_matrix[28] = constants[3]*z_7 + constants[5]*z_7;
>>> >
>>> > // Stack #3
>>> > z_3 = -constants[3]*z_2 - constants[5]*z_2;
>>> > z_6 = -constants[4]*constants[5]*z_5;
>>> > z_11 = z_8*(z_1*z_5 + z_10*z_9);
>>> > z_15 = constants[2]*z_14*z_9;
>>> > z_17 = constants[4]*constants[5]*z_10;
>>> > z_18 = constants[0]*constants[2]*z_9;
>>> > z_19 = z_5*z_9;
>>> > z_20 = z_1*z_10;
>>> >
>>> > // Stack #4
>>> > mass_matrix[22] = z_3;
>>> > mass_matrix[23] = z_6;
>>> > mass_matrix[27] = z_3;
>>> > mass_matrix[29] = z_11;
>>> > mass_matrix[33] = z_6;
>>> > mass_matrix[34] = z_11;
>>> > forcing_vector[3] = constants[3]*z_15 + constants[5]*z_15 + z_16*z_17;
>>> > forcing_vector[4] = -constants[3]*z_18 - constants[5]*z_18 + z_16*z_8*(z_19 - z_20);
>>> > forcing_vector[5] = -constants[0]*z_17 + z_14*z_8*(-z_19 + z_20);
>>> >
>>> >
>>> > So this simplified example of the dependencies in the CSEs shows that
>>> > if I had enough cores available I could parallelize each stack,
>>> > potentially increasing the execution speed. So instead of 31
>>> > evaluations, you could have 4 evaluations in parallel, ideally a
>>> > 7.75x speedup. For more complicated problems, there could be
>>> > thousands and thousands of these CSEs, but I'll need to generate
>>> > their dependencies with code to see if things stack this nicely for
>>> > the big problems. I suspect the dependency chain could be such that
>>> > the higher number stacks could have hundreds of expressions whereas
>>> > the lower stacks have fewer, or vice versa.
>>> >
>>> > How do I generate a DAG for long expressions in SymPy? Is this part
>>> > of the internal architecture of SymPy expressions? I don't understand
>>> > how the cse() code works yet either, but it seems like this
>>> > information should be computed already. I just need to visualize the
>>> > graph for some of our bigger problems.
>>> >
>>> > Also, here are the number of scalars and the number of operations in
>>> > each for a bigger problem with 2000 or so CSEs:
>>> >
>>> > https://github.com/moorepants/dissertation/blob/master/src/extensions/arms/ArmsDynamics.c
>>>
One thing I wonder is: how accurate are your double precision
results from the C code? Did you try to compare them with high
accuracy (e.g. 100 digits) in SymPy using Floats?
>>>
>>> The "minus" sign always causes some numerical cancellation.
>>>
In my experience, if the symbolic expressions grow into hundreds of
terms and you use a few "minus" signs in there, you always get
numerical cancellations that are sometimes just too big for double
precision to handle, and you get bogus numbers, or only 1 or 2 digits
of accuracy.
>>>
>>> One has to be very careful about this.
>>>
>>>
Otherwise I think it's a great idea to parallelize the evaluation.
Btw, with CSymPy I am also interested in parallelizing the symbolic
manipulation, including the numerical evaluation (both double
precision and higher accuracy). One can do it at the "library" level
as well as at the code generation level, which is the problem you are
tackling now.
>>>
>>> Ondrej
>>>
>>> >
>>> > This problem has 12 scalars that have 2000+ CSEs and there are 5840
>>> > additions and subtractions, 9847 multiplications and divisions, 14
>>> > cosines, and 14 sines. So roughly 1300 operations per scalar.
>>> >
>>> >
>>> > Jason
>>> > moorepants.info
>>> > +01 530-601-9791
>>> >
>>> >
>>> > On Mon, Mar 17, 2014 at 12:06 AM, Matthew Rocklin <[email protected]>
>>> > wrote:
>>> >>
>>> >> Response from Max follows (for some reason he was getting bounced
>>> >> by the mailing list).
>>> >>
>>> >>
>>> >> On Sun, Mar 16, 2014 at 8:55 PM, Max Hutchinson <[email protected]>
>>> >> wrote:
>>> >>>
>>> >>> tl;dr it depends on the DAG, but improved ILP is likely possible
>>> >>> (if difficult) and there could be room for multi-core parallelism
>>> >>> as well.
>>> >>>
>>> >>> As I understand it, we're talking about a long computation applied
>>> >>> to short input vectors.  If the computation can be applied to many
>>> >>> input vectors at once, independent of each other, then all levels
>>> >>> of parallelism (multiple instructions, multiple cores, multiple
>>> >>> sockets, multiple nodes) can be used.  This is data-parallelism,
>>> >>> which is great! However, it doesn't sound like this is the case.
>>> >>>
>>> >>> It sounds like you're thinking of building a DAG of these CSEs and
>>> >>> trying to use task-parallelism over independent parts of it
>>> >>> (automatically using sympy or theano or what have you).  The
>>> >>> tension here is going to be between locality and parallelism: how
>>> >>> much compute hardware can you spread your data across without
>>> >>> losing the nice cache performance that your small input vectors
>>> >>> gain you.  I'd bet that going off-socket is way too wide.  Modern
>>> >>> multi-core architectures have core-local L2 and L1 caches, so if
>>> >>> your input data fits nicely into L2 and your DAG isn't really
>>> >>> local, you probably won't get anything out of multiple cores.
>>> >>> Your last stand is single-core parallelism (instruction-level
>>> >>> parallelism), which sympy et al may or may not be well equipped
>>> >>> to influence.
>>> >>>
>>> >>> To start, I'd recommend that you take a look at your DAGs and try
>>> >>> to figure out how large the independent chunks are.  Then, estimate
>>> >>> the amount of instruction level parallelism when you run in
>>> >>> 'serial' (which you can do with flop-counting).  If your
>>> >>> demonstrated ILP is less than your independent chunk size, then at
>>> >>> least improved ILP should be possible.  Automatically splitting up
>>> >>> these DAGs and expressing them in a low-level enough way to affect
>>> >>> ILP is a considerable task, though.
>>> >>>
>>> >>> To see if multi-core parallelism is worth it, you need to estimate
>>> >>> how many extra L3 loads you'd incur by spreading your data over
>>> >>> multiple L2s.  I don't have great advice for that; maybe someone
>>> >>> else here does.  The good news is that if your problem has this
>>> >>> level of locality, then you can probably get away with emitting C
>>> >>> code with pthreads or even openmp.  Just bear in mind the thread
>>> >>> creation/annihilation overhead (standing thread-pools are your
>>> >>> friend) and pin them to cores.
>>> >>>
>>> >>> Good luck,
>>> >>> Max
>>> >>
>>> >> --
>>> >> You received this message because you are subscribed to the Google
>>> Groups
>>> >> "sympy" group.
>>> >> To unsubscribe from this group and stop receiving emails from it,
>>> send an
>>> >> email to [email protected].
>>> >> To post to this group, send email to [email protected].
>>> >> Visit this group at http://groups.google.com/group/sympy.
>>> >> To view this discussion on the web visit
>>> >>
>>> https://groups.google.com/d/msgid/sympy/CAJ8oX-Hc2y9C7FO07kkeraDAv7NNRGPkMJ2DvjgF2Oq7PzeS6g%40mail.gmail.com
>>> .
>>> >>
>>> >> For more options, visit https://groups.google.com/d/optout.
>>> >
>>> >
>>>
>>>
>>
>>
>
>
