Response from Max follows (for some reason he was getting bounced by the mailing list).
On Sun, Mar 16, 2014 at 8:55 PM, Max Hutchinson <[email protected]> wrote:

> tl;dr: it depends on the DAG, but improved ILP is likely possible (if difficult), and there could be room for multi-core parallelism as well.

> As I understand it, we're talking about a long computation applied to short input vectors. If the computation can be applied to many input vectors at once, independently of each other, then all levels of parallelism (multiple instructions, multiple cores, multiple sockets, multiple nodes) can be used. This is data-parallelism, which is great! However, it doesn't sound like this is the case.

> It sounds like you're thinking of building a DAG of these CSEs and trying to use task-parallelism over independent parts of it (automatically, using sympy or theano or what have you). The tension here is going to be between locality and parallelism: how much compute hardware can you spread your data across without losing the nice cache performance that your small input vectors gain you? I'd bet that going off-socket is way too wide. Modern multi-core architectures have core-local L2 and L1 caches, so if your input data fits nicely into L2 and your DAG isn't very local, you probably won't get anything out of multiple cores. Your last stand is single-core parallelism (instruction-level parallelism <http://en.wikipedia.org/wiki/Instruction-level_parallelism>), which sympy et al. may or may not be well equipped to influence.

> To start, I'd recommend that you take a look at your DAGs and try to figure out how large the independent chunks are. Then estimate the amount of instruction-level parallelism when you run in 'serial' (which you can do with flop counting). If your demonstrated ILP is less than your independent chunk size, then at least improved ILP should be possible. Automatically splitting up these DAGs and expressing them in a low-level enough way to affect ILP is a considerable task, though.
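The flop-counting estimate Max describes can be sketched roughly as follows. This is an illustrative toy, not code from the thread: the helper name and the example DAG are made up, with each node standing in for one flop and edges pointing at its inputs. The ratio of total work to critical-path length bounds the average parallelism an ideal machine could extract:

```python
def ilp_estimate(deps):
    """deps maps each op to the ops it depends on (each op = 1 flop).

    Returns (work, span, ilp): total flops, longest dependency chain,
    and their ratio -- an upper bound on exploitable ILP.
    """
    memo = {}

    def depth(op):
        # Length of the longest dependency chain ending at `op`.
        if op not in memo:
            memo[op] = 1 + max((depth(d) for d in deps[op]), default=0)
        return memo[op]

    work = len(deps)
    span = max(depth(op) for op in deps)
    return work, span, work / span

# A toy CSE DAG: t1..t4 are independent leaves, t5 and t6 combine
# pairs of them, and t7 is the root.
dag = {
    "t1": [], "t2": [], "t3": [], "t4": [],
    "t5": ["t1", "t2"],
    "t6": ["t3", "t4"],
    "t7": ["t5", "t6"],
}
work, span, ilp = ilp_estimate(dag)
print(work, span, ilp)  # 7 flops over a critical path of 3, so ILP ~2.33
```

If a measurement (e.g. hardware counters) shows the serial code achieving less ILP than this bound, there is headroom of the kind Max mentions.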
> To see if multi-core parallelism is worth it, you need to estimate how many extra L3 loads you'd incur by spreading your data over multiple L2s. I don't have great advice for that; maybe someone else here does. The good news is that if your problem has this level of locality, then you can probably get away with emitting C code with pthreads or even OpenMP. Just bear in mind the thread creation/annihilation overhead (standing thread pools are your friend) and pin them to cores.

> Good luck,
> Max
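The emit-C-with-OpenMP route could look roughly like the hypothetical generator below (the function and the statement strings are invented for illustration; nothing here is from sympy or the thread). It wraps independent DAG chunks in `#pragma omp sections`; OpenMP runtimes typically keep a standing thread pool alive between parallel regions, and pinning can be requested at run time via the standard `OMP_PROC_BIND=close` / `OMP_PLACES=cores` environment variables rather than hand-written affinity code:

```python
def emit_openmp(chunks):
    """chunks: list of lists of C statements, one list per independent
    DAG chunk.  Returns C source for one parallel-sections region."""
    lines = ["#pragma omp parallel sections", "{"]
    for chunk in chunks:
        lines.append("    #pragma omp section")
        lines.append("    {")
        lines += ["        " + stmt for stmt in chunk]
        lines.append("    }")
    lines.append("}")
    return "\n".join(lines)

# Two independent CSE chunks (illustrative statements only):
code = emit_openmp([
    ["t1 = a + b;", "t2 = t1 * c;"],
    ["t3 = d + e;", "t4 = t3 * f;"],
])
print(code)
```

Each section runs on its own thread, so a chunk's intermediates stay in that core's L1/L2 — which is exactly the locality trade-off discussed above.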
