theano.printing.pydotprint SymPy has something similar in sympy.printing.dot
On Mon, Mar 17, 2014 at 9:52 AM, Jason Moore <[email protected]> wrote: > How did you create the DAG visualization here: > > http://matthewrocklin.com/blog/work/2013/04/05/SymPy-Theano-part-3/ > > > Jason > moorepants.info > +01 530-601-9791 > > > On Mon, Mar 17, 2014 at 12:46 PM, Matthew Rocklin <[email protected]>wrote: > >> Logically we think of SymPy expressions as Trees. When you consider >> common sub-expressions then repeated branches of these trees get merged. >> The result is a DAG. >> >> >> On Mon, Mar 17, 2014 at 9:42 AM, Jason Moore <[email protected]>wrote: >> >>> Thanks Max. I'll look into the FLOP counting as you suggest. >>> >>> Matthew has created visual DAG's before, I'm just not sure what software >>> he uses for it. He'll probably pipe in about that. >>> >>> >>> Jason >>> moorepants.info >>> +01 530-601-9791 >>> >>> >>> On Mon, Mar 17, 2014 at 12:39 PM, Max Hutchinson <[email protected]>wrote: >>> >>>> Before trying to improve ILP, you should probably see how well the >>>> compiler is already doing. You can calculate a lower bound for the actual >>>> ILP by dividing your FLOP rate (FLOPs/sec) by the core frequency >>>> (cycles/sec) to get the number of FLOPs per cycle. Then compare that to >>>> the number of execution units (or whatever they're calling them now) for >>>> your processor. >>>> >>>> That number may be hard to find, and it is highly idealized, so better >>>> comparison might be your FLOP rate compared to a well known compute-bound >>>> vectorized problem: the linpack benchmark [1]. If your single core FLOP >>>> rate is near the linpack number, there isn't going to much room for >>>> single-core (ILP) improvement. Be sure to run linpack large enough to get >>>> it compute-bound. >>>> >>>> What to do to actually improve ILP is probably architecture/compiler >>>> specific and very much outside my area of expertise. Stack Overflow or the >>>> Computational Science SE [2] might be able to help. >>>> >>>> Hopefully someone else on this list can help with DAGs and SymPy. >>>> >>>> Max >>>> >>>> [1] http://www.top500.org/project/linpack/ >>>> [2] http://scicomp.stackexchange.com/ >>>> >>>> >>>> On Mon, Mar 17, 2014 at 11:13 AM, Jason Moore <[email protected]>wrote: >>>> >>>>> Thanks for that clarification on the locality. I understand that now. >>>>> >>>>> How do I generate DAG's? Is this something SymPy does automatically? >>>>> Or are there other software that does it? Or is it easy to code up myself? >>>>> >>>>> How would I "help" the compiler with more information for better ILP? >>>>> >>>>> >>>>> Jason >>>>> moorepants.info >>>>> +01 530-601-9791 >>>>> >>>>> >>>>> On Mon, Mar 17, 2014 at 11:49 AM, Max Hutchinson >>>>> <[email protected]>wrote: >>>>> >>>>>> If you think about what the DAG would look like, your 'stacks' are >>>>>> like horizontal layers in the graph. The width of each layer (length of >>>>>> each stack) gives an upper bound on the speedup, but it doesn't tell the >>>>>> whole story: you need a way to deal with data locality. >>>>>> >>>>>> For example, let's look at stack #3. You have 8 independent >>>>>> expressions, so it would seem like you should be able to use 8 pieces of >>>>>> computational hardware (let's call it core). However, z_6, z_11, and >>>>>> z_19 >>>>>> all depend on z_5. Therefore, either z_6, z_11, and z_19 need to be >>>>>> computed local to z_5, or z_5 needs to be copied somewhere else. The >>>>>> copying is much more expensive than the computing (50-100 cycles [1]), so >>>>>> if you only have 3 things that depend on z_5, you're going to want to >>>>>> just >>>>>> compute them all on the same core as z_5. >>>>>> >>>>>> The complicated thing is that z_5 and z_10 both share a dependency, >>>>>> z_4, so they should be computed locally. Now, we have to compute >>>>>> everything that depends on z_5 or z_10 on the same core. If we don't >>>>>> break >>>>>> locality anywhere, we won't have any available parallelism. This is the >>>>>> tension: copies are expensive but without them we can't expose any >>>>>> parallelism and will be stuck with one core. This is why we really need >>>>>> to >>>>>> build a DAG, not just stacks, and then try to break it into chunks with >>>>>> the >>>>>> fewest edges between them. The number of chunks is the amount of >>>>>> parallelism and the number of edges are the number of copies. >>>>>> >>>>>> Fortunately, even if the DAGs are strongly connected and you're stuck >>>>>> with one core there is still ILP. In a nutshell: each core can actually >>>>>> do >>>>>> a couple operations at the same time. The core uses a single cache, so >>>>>> the >>>>>> data is local and doesn't require copies. The compiler is supposed to >>>>>> figure out ILP for you, but you might be able to help it out using all >>>>>> the >>>>>> extra information sympy/theano knows about your computation. >>>>>> >>>>>> Max >>>>>> >>>>>> [1] >>>>>> http://stackoverflow.com/questions/4087280/approximate-cost-to-access-various-caches-and-main-memory >>>>>> >>>>>> >>>>>> On Mon, Mar 17, 2014 at 10:21 AM, Jason Moore >>>>>> <[email protected]>wrote: >>>>>> >>>>>>> I'm still digesting what Matthew and Max wrote. Lots of new words >>>>>>> for me :) But here is a simple example taken from C code we generate >>>>>>> for a >>>>>>> simple 2 link pendulum. >>>>>>> >>>>>>> First the C code with SymPy's CSE expressions automatically >>>>>>> generated: >>>>>>> >>>>>>> #include <math.h> >>>>>>> #include "multibody_system_c.h" >>>>>>> >>>>>>> void mass_forcing(double constants[6], // constants = [g, m0, l0, >>>>>>> m1, l1, m2] >>>>>>> double coordinates[3], // coordinates = [q0, q1, >>>>>>> q2] >>>>>>> double speeds[3], // speeds = [u0, u1, u2] >>>>>>> double mass_matrix[36], // computed >>>>>>> double forcing_vector[6]) // computed >>>>>>> { >>>>>>> // common subexpressions >>>>>>> double z_0 = coordinates[1]; >>>>>>> double z_1 = sin(z_0); >>>>>>> double z_2 = constants[2]*z_1; >>>>>>> double z_3 = -constants[3]*z_2 - constants[5]*z_2; >>>>>>> double z_4 = coordinates[2]; >>>>>>> double z_5 = sin(z_4); >>>>>>> double z_6 = -constants[4]*constants[5]*z_5; >>>>>>> double z_7 = pow(constants[2], 2); >>>>>>> double z_8 = constants[2]*constants[4]*constants[5]; >>>>>>> double z_9 = cos(z_0); >>>>>>> double z_10 = cos(z_4); >>>>>>> double z_11 = z_8*(z_1*z_5 + z_10*z_9); >>>>>>> double z_12 = speeds[1]; >>>>>>> double z_13 = speeds[2]; >>>>>>> double z_14 = pow(z_12, 2); >>>>>>> double z_15 = constants[2]*z_14*z_9; >>>>>>> double z_16 = pow(z_13, 2); >>>>>>> double z_17 = constants[4]*constants[5]*z_10; >>>>>>> double z_18 = constants[0]*constants[2]*z_9; >>>>>>> double z_19 = z_5*z_9; >>>>>>> double z_20 = z_1*z_10; >>>>>>> >>>>>>> // mass matrix >>>>>>> mass_matrix[0] = 1; >>>>>>> mass_matrix[1] = 0; >>>>>>> mass_matrix[2] = 0; >>>>>>> mass_matrix[3] = 0; >>>>>>> mass_matrix[4] = 0; >>>>>>> mass_matrix[5] = 0; >>>>>>> mass_matrix[6] = 0; >>>>>>> mass_matrix[7] = 1; >>>>>>> mass_matrix[8] = 0; >>>>>>> mass_matrix[9] = 0; >>>>>>> mass_matrix[10] = 0; >>>>>>> mass_matrix[11] = 0; >>>>>>> mass_matrix[12] = 0; >>>>>>> mass_matrix[13] = 0; >>>>>>> mass_matrix[14] = 1; >>>>>>> mass_matrix[15] = 0; >>>>>>> mass_matrix[16] = 0; >>>>>>> mass_matrix[17] = 0; >>>>>>> mass_matrix[18] = 0; >>>>>>> mass_matrix[19] = 0; >>>>>>> mass_matrix[20] = 0; >>>>>>> mass_matrix[21] = constants[1] + constants[3] + constants[5]; >>>>>>> mass_matrix[22] = z_3; >>>>>>> mass_matrix[23] = z_6; >>>>>>> mass_matrix[24] = 0; >>>>>>> mass_matrix[25] = 0; >>>>>>> mass_matrix[26] = 0; >>>>>>> mass_matrix[27] = z_3; >>>>>>> mass_matrix[28] = constants[3]*z_7 + constants[5]*z_7; >>>>>>> mass_matrix[29] = z_11; >>>>>>> mass_matrix[30] = 0; >>>>>>> mass_matrix[31] = 0; >>>>>>> mass_matrix[32] = 0; >>>>>>> mass_matrix[33] = z_6; >>>>>>> mass_matrix[34] = z_11; >>>>>>> mass_matrix[35] = pow(constants[4], 2)*constants[5]; >>>>>>> >>>>>>> // forcing vector >>>>>>> forcing_vector[0] = speeds[0]; >>>>>>> forcing_vector[1] = z_12; >>>>>>> forcing_vector[2] = z_13; >>>>>>> forcing_vector[3] = constants[3]*z_15 + constants[5]*z_15 + >>>>>>> z_16*z_17; >>>>>>> forcing_vector[4] = -constants[3]*z_18 - constants[5]*z_18 + >>>>>>> z_16*z_8*(z_19 - z_20); >>>>>>> forcing_vector[5] = -constants[0]*z_17 + z_14*z_8*(-z_19 + z_20); >>>>>>> } >>>>>>> >>>>>>> >>>>>>> Now I manually group these expression evaluations into "stacks", >>>>>>> i.e. those calls which could happen in parallel (there is of course a >>>>>>> bit >>>>>>> more complicated dependency graph you can draw so that you maximize the >>>>>>> time that your cores have a task). >>>>>>> >>>>>>> // These are not computations but just value assignments. >>>>>>> z_0 = coordinates[1]; >>>>>>> z_4 = coordinates[2]; >>>>>>> z_12 = speeds[1]; >>>>>>> z_13 = speeds[2]; >>>>>>> mass_matrix[0] = 1; >>>>>>> mass_matrix[1] = 0; >>>>>>> mass_matrix[2] = 0; >>>>>>> mass_matrix[3] = 0; >>>>>>> mass_matrix[4] = 0; >>>>>>> mass_matrix[5] = 0; >>>>>>> mass_matrix[6] = 0; >>>>>>> mass_matrix[7] = 1; >>>>>>> mass_matrix[8] = 0; >>>>>>> mass_matrix[9] = 0; >>>>>>> mass_matrix[10] = 0; >>>>>>> mass_matrix[11] = 0; >>>>>>> mass_matrix[12] = 0; >>>>>>> mass_matrix[13] = 0; >>>>>>> mass_matrix[14] = 1; >>>>>>> mass_matrix[15] = 0; >>>>>>> mass_matrix[16] = 0; >>>>>>> mass_matrix[17] = 0; >>>>>>> mass_matrix[18] = 0; >>>>>>> mass_matrix[19] = 0; >>>>>>> mass_matrix[20] = 0; >>>>>>> mass_matrix[24] = 0; >>>>>>> mass_matrix[25] = 0; >>>>>>> mass_matrix[26] = 0; >>>>>>> mass_matrix[30] = 0; >>>>>>> mass_matrix[31] = 0; >>>>>>> mass_matrix[32] = 0; >>>>>>> forcing_vector[0] = speeds[0]; >>>>>>> forcing_vector[1] = z_12; >>>>>>> forcing_vector[2] = z_13; >>>>>>> >>>>>>> // These are computations that involve the initial values passed >>>>>>> into the >>>>>>> // function, i.e. stack #1. >>>>>>> z_7 = pow(constants[2], 2); >>>>>>> z_8 = constants[2]*constants[4]*constants[5]; >>>>>>> z_14 = pow(z_12, 2); >>>>>>> z_16 = pow(z_13, 2); >>>>>>> mass_matrix[21] = constants[1] + constants[3] + constants[5]; >>>>>>> mass_matrix[35] = pow(constants[4], 2)*constants[5]; >>>>>>> >>>>>>> // Stack #2 >>>>>>> z_1 = sin(z_0); >>>>>>> z_5 = sin(z_4); >>>>>>> z_9 = cos(z_0); >>>>>>> z_10 = cos(z_4); >>>>>>> z_2 = constants[2]*z_1; >>>>>>> mass_matrix[28] = constants[3]*z_7 + constants[5]*z_7; >>>>>>> >>>>>>> // Stack #3 >>>>>>> z_3 = -constants[3]*z_2 - constants[5]*z_2; >>>>>>> z_6 = -constants[4]*constants[5]*z_5; >>>>>>> z_11 = z_8*(z_1*z_5 + z_10*z_9); >>>>>>> z_15 = constants[2]*z_14*z_9; >>>>>>> z_17 = constants[4]*constants[5]*z_10; >>>>>>> z_18 = constants[0]*constants[2]*z_9; >>>>>>> z_19 = z_5*z_9; >>>>>>> z_20 = z_1*z_10; >>>>>>> >>>>>>> // Stack #4 >>>>>>> mass_matrix[22] = z_3; >>>>>>> mass_matrix[23] = z_6; >>>>>>> mass_matrix[27] = z_3; >>>>>>> mass_matrix[29] = z_11; >>>>>>> mass_matrix[33] = z_6; >>>>>>> mass_matrix[34] = z_11; >>>>>>> forcing_vector[3] = constants[3]*z_15 + constants[5]*z_15 + >>>>>>> z_16*z_17; >>>>>>> forcing_vector[4] = -constants[3]*z_18 - constants[5]*z_18 + >>>>>>> z_16*z_8*(z_19 - z_20); >>>>>>> forcing_vector[5] = -constants[0]*z_17 + z_14*z_8*(-z_19 + z_20); >>>>>>> >>>>>>> >>>>>>> So this simplified example of the dependencies in the CSE's shows >>>>>>> that if I had enough cores available I could parallelize each stack, >>>>>>> potentially increasing the execution speed. So instead of 31 >>>>>>> evaluations, >>>>>>> you could have 4 evaluations in parallel, ideally a 7.75x speedup. For >>>>>>> more >>>>>>> complicated problems, there could be thousands and thousands of these >>>>>>> CSEs, >>>>>>> but I'll need to generate their dependencies with code to see if things >>>>>>> stack this nicely for the big problems. I suspect the dependency chain >>>>>>> could be such that the higher number stacks could have hundreds of >>>>>>> expressions whereas the lower stacks have fewer, or vice versa. >>>>>>> >>>>>>> How do I generate a DAG for long expressions in SymPy? Is this part >>>>>>> of the internal architecture of SymPy expressions? I don't understand >>>>>>> how >>>>>>> the cse() code works yet either, but it seems like this information >>>>>>> should >>>>>>> be computed already. I just need to visualize the graph for some of our >>>>>>> bigger problems. >>>>>>> >>>>>>> Also, the for the number of scalars and number of operations in >>>>>>> each. Here is an bigger problem with 2000 or so CSE's: >>>>>>> >>>>>>> >>>>>>> https://github.com/moorepants/dissertation/blob/master/src/extensions/arms/ArmsDynamics.c >>>>>>> >>>>>>> This problem has 12 scalars that have 2000+ CSE's and there are 5840 >>>>>>> additions and subtractions, 9847 multiplications and divisions, 14 >>>>>>> cosines, >>>>>>> and 14 sines. So roughly 1300 operations per scalar. >>>>>>> >>>>>>> >>>>>>> Jason >>>>>>> moorepants.info >>>>>>> +01 530-601-9791 >>>>>>> >>>>>>> >>>>>>> On Mon, Mar 17, 2014 at 12:06 AM, Matthew Rocklin < >>>>>>> [email protected]> wrote: >>>>>>> >>>>>>>> Response from Max follows (for some reason he was getting bounced >>>>>>>> by the mailing list). >>>>>>>> >>>>>>>> >>>>>>>> On Sun, Mar 16, 2014 at 8:55 PM, Max Hutchinson <[email protected] >>>>>>>> > wrote: >>>>>>>> >>>>>>>>> tl;dr it depends on the DAG, but improved ILP is is likely >>>>>>>>> possible (if difficult) and there could be room for multi-core >>>>>>>>> parallelism >>>>>>>>> as well. >>>>>>>>> >>>>>>>>> As I understand it, we're talking about a long computation applied >>>>>>>>> to short input vectors. If the computation can be applied to many >>>>>>>>> input >>>>>>>>> vectors at once, independent of each other, then all levels of >>>>>>>>> parallelism >>>>>>>>> (multiple instructions, multiple cores, multiple sockets, multiple >>>>>>>>> nodes) >>>>>>>>> can be used. This is data-parallelism, which is great! However, it >>>>>>>>> doesn't >>>>>>>>> sound like this is the case. >>>>>>>>> >>>>>>>>> It sounds like you're thinking of building a DAG of these CSEs and >>>>>>>>> trying to use task-parallelism over independent parts of it >>>>>>>>> (automatically >>>>>>>>> using sympy or theano or what have you). The tension here is going >>>>>>>>> to be >>>>>>>>> between locality and parallelism: how much compute hardware can you >>>>>>>>> spread >>>>>>>>> your data across without losing the nice cache performance that your >>>>>>>>> small >>>>>>>>> input vectors gain you. I'd bet that going off-socket is way too >>>>>>>>> wide. >>>>>>>>> Modern multi-core architectures have core-local L2 and L1 caches, so >>>>>>>>> if >>>>>>>>> your input data fits nicely into L2 and your DAG isn't really local, >>>>>>>>> you >>>>>>>>> probably won't get anything out of multiple-cores. Your last stand is >>>>>>>>> single-core parallelism (instruction-level >>>>>>>>> parallelism<http://en.wikipedia.org/wiki/Instruction-level_parallelism>), >>>>>>>>> which sympy et al may or may not be well equipped to influence. >>>>>>>>> >>>>>>>>> To start, I'd recommend that you take a look at your DAGs and try >>>>>>>>> to figure out how large the independent chunks are. Then, estimate >>>>>>>>> the >>>>>>>>> amount of instruction level parallelism when you run in 'serial' >>>>>>>>> (which you >>>>>>>>> can do with flop-counting). If your demonstrated ILP is less than >>>>>>>>> your >>>>>>>>> independent chunk size, then at least improved ILP should be possible. >>>>>>>>> Automatically splitting up these DAGs and expressing them in a >>>>>>>>> low-level >>>>>>>>> enough way to affect ILP is a considerable task, though. >>>>>>>>> >>>>>>>>> To see if multi-core parallelism is worth it, you need to estimate >>>>>>>>> how many extra L3 loads you'd incur by spreading your data of >>>>>>>>> multiple L2s. >>>>>>>>> I don't have great advice for that, maybe someone else here does. >>>>>>>>> The >>>>>>>>> good news is that if your problem has this level of locality, then >>>>>>>>> you can >>>>>>>>> probably get away with emitting C code with pthreads or even openmp. >>>>>>>>> Just >>>>>>>>> bear in mind the thread creation/annihilation overhead (standing >>>>>>>>> thread-pools are your friend) and pin them to cores. >>>>>>>>> >>>>>>>>> Good luck, >>>>>>>>> Max >>>>>>>>> >>>>>>>> -- >>>>>>>> You received this message because you are subscribed to the Google >>>>>>>> Groups "sympy" group. >>>>>>>> To unsubscribe from this group and stop receiving emails from it, >>>>>>>> send an email to [email protected]. >>>>>>>> To post to this group, send email to [email protected]. >>>>>>>> Visit this group at http://groups.google.com/group/sympy. >>>>>>>> To view this discussion on the web visit >>>>>>>> https://groups.google.com/d/msgid/sympy/CAJ8oX-Hc2y9C7FO07kkeraDAv7NNRGPkMJ2DvjgF2Oq7PzeS6g%40mail.gmail.com<https://groups.google.com/d/msgid/sympy/CAJ8oX-Hc2y9C7FO07kkeraDAv7NNRGPkMJ2DvjgF2Oq7PzeS6g%40mail.gmail.com?utm_medium=email&utm_source=footer> >>>>>>>> . >>>>>>>> >>>>>>>> For more options, visit https://groups.google.com/d/optout. >>>>>>>> >>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> -- >>> You received this message because you are subscribed to the Google >>> Groups "sympy" group. >>> To unsubscribe from this group and stop receiving emails from it, send >>> an email to [email protected]. >>> To post to this group, send email to [email protected]. >>> Visit this group at http://groups.google.com/group/sympy. >>> To view this discussion on the web visit >>> https://groups.google.com/d/msgid/sympy/CAP7f1Ajyc6RSDmTecMk4P9GF8%2BjF6qYQ32rNj8wZPtFh-G5zfA%40mail.gmail.com<https://groups.google.com/d/msgid/sympy/CAP7f1Ajyc6RSDmTecMk4P9GF8%2BjF6qYQ32rNj8wZPtFh-G5zfA%40mail.gmail.com?utm_medium=email&utm_source=footer> >>> . >>> >>> For more options, visit https://groups.google.com/d/optout. >>> >> >> -- >> You received this message because you are subscribed to the Google Groups >> "sympy" group. >> To unsubscribe from this group and stop receiving emails from it, send an >> email to [email protected]. >> To post to this group, send email to [email protected]. >> Visit this group at http://groups.google.com/group/sympy. >> To view this discussion on the web visit >> https://groups.google.com/d/msgid/sympy/CAJ8oX-GLamse8tVYw2WecjgQWbRxDUSCAKsRYmbC4qKBwWY%3D-w%40mail.gmail.com<https://groups.google.com/d/msgid/sympy/CAJ8oX-GLamse8tVYw2WecjgQWbRxDUSCAKsRYmbC4qKBwWY%3D-w%40mail.gmail.com?utm_medium=email&utm_source=footer> >> . >> >> For more options, visit https://groups.google.com/d/optout. >> > > -- > You received this message because you are subscribed to the Google Groups > "sympy" group. > To unsubscribe from this group and stop receiving emails from it, send an > email to [email protected]. > To post to this group, send email to [email protected]. > Visit this group at http://groups.google.com/group/sympy. > To view this discussion on the web visit > https://groups.google.com/d/msgid/sympy/CAP7f1AgODZ7c4dwL7HF6pXo%2BLpDFSFG3GhjAwmJSLjQcM-NQUg%40mail.gmail.com<https://groups.google.com/d/msgid/sympy/CAP7f1AgODZ7c4dwL7HF6pXo%2BLpDFSFG3GhjAwmJSLjQcM-NQUg%40mail.gmail.com?utm_medium=email&utm_source=footer> > . > > For more options, visit https://groups.google.com/d/optout. > -- You received this message because you are subscribed to the Google Groups "sympy" group. To unsubscribe from this group and stop receiving emails from it, send an email to [email protected]. To post to this group, send email to [email protected]. Visit this group at http://groups.google.com/group/sympy. To view this discussion on the web visit https://groups.google.com/d/msgid/sympy/CAJ8oX-HwT%2B2FDrXxe2DyHY6L09mCzUAYXain9unh-hfCCHAJeA%40mail.gmail.com. For more options, visit https://groups.google.com/d/optout.
