Hi,
I have a plan for how to implement data parallelism to utilize multiple
GPUs, and I'd appreciate any feedback before I start on it.
First, some background:
--I'm working within an established, fairly complex code base. It builds
NNs using Lasagne and computes gradients and other quantities to perform
somewhat sophisticated optimization. To the extent possible, I'd like not
to modify this code, only wrap it. (And for this reason I'm under the
impression that Platoon is not the solution?)
--All my Theano functions compute expectation values over big data batches,
so I'd like to split the data, run the same computation on multiple GPUs in
parallel, and average the results. (Even with multiple GPUs the data will
probably be bigger than fits in GPU memory, so I'm using inputs rather than
shared variables here and cycling through the data; a small sketch of the
loop I have in mind follows this list.)
--I already have a version running with multiprocessing, and it performs
well for both CPU-only and GPU parallelism. But the new GPU backend and the
prospect of running in a single Python process seem to offer a much cleaner
solution: easier to maintain, and more immediately portable to other cases
which I haven't yet parallelized. That is what I hope to achieve.
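To make the split-and-average idea concrete, the outer loop I have in mind
looks roughly like this (the names and the host-side averaging are just mine
for illustration; whether consecutive calls on different contexts actually
overlap, rather than running back to back, relates to my question at the end):

```python
import numpy as np

def mean_over_devices(fns, data):
    # fns: one compiled Theano function per GPU, all computing the same
    # expectation value; data: a batch too large for any single GPU.
    chunks = np.array_split(data, len(fns))
    # Each function runs on its own context; results come back to the host.
    results = [f(chunk) for f, chunk in zip(fns, chunks)]
    # Weight by chunk size so the combined value is still the expectation
    # over the full batch even when the split is uneven.
    weights = [len(c) for c in chunks]
    return np.average(results, axis=0, weights=weights)
```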
What I think I have to do is: go through the full construction process of
the NN and all related inference / training functions, once for each GPU
(context). That is,
1. Build a separate copy of the network for each GPU, with that GPU as the
*target* of the shared variables holding the network parameters. (I think
this can be done cleanly by giving Lasagne an initialization function for
the weights that already has the target set in it.)
2. Build a separate copy of the rest of the graphs, with e.g.
`.transfer('dev0')` appended to each of the input variables, and rebuild the
functions from those inputs and the NN on the corresponding GPU. While at
it, the corresponding `.transfer('dev0')` must also go on all Theano
function outputs so the functions can run concurrently. (A rough sketch of
steps 1-3 follows this list.)
3. All functions must be compiled separately for each GPU. (Even if they
are all grouped together into one Theano function.)
4. Figure out how to use the collectives so that I can all_reduce whatever
outputs I need directly on the GPUs. (Possibly the outputs need to be
refactored as updates to GPU-resident shared variables.)
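To make steps 1-3 concrete, here is a rough sketch of what I have in mind.
It is heavily simplified: the real parameters come from Lasagne, and the toy
graph, the helper name, and the flag values below are just mine for
illustration. It assumes the contexts are mapped with something like
THEANO_FLAGS="contexts=dev0->cuda0;dev1->cuda1". Step 4 (the collectives /
all_reduce part) is left out because that is exactly where I'm least sure
what the API looks like from a single process.

```python
import numpy as np
import theano
import theano.tensor as T

# Assumes THEANO_FLAGS="contexts=dev0->cuda0;dev1->cuda1", so that
# 'dev0' and 'dev1' name two GPU contexts.
contexts = ['dev0', 'dev1']

def build_for_context(ctx):
    # Step 1: parameters live on this context.  With Lasagne I would pass
    # an init function that already returns target=ctx shared variables;
    # here a single weight matrix stands in for the whole network.
    W = theano.shared(np.zeros((784, 10), dtype='float32'),
                      name='W_' + ctx, target=ctx)

    # Step 2: transfer the inputs to this context, and keep the output on
    # it too, so the per-device functions can run concurrently.
    x = T.matrix('x')
    cost = T.mean(T.nnet.softmax(x.transfer(ctx).dot(W)))  # toy stand-in graph

    # Step 3: compile one function per context.
    f = theano.function([x], cost.transfer(ctx))
    return W, f

params, fns = zip(*[build_for_context(c) for c in contexts])
```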
This is certainly doable, although slightly painful. Is there another way
that I am missing? Maybe some little trick using `givens` (I haven't been
able to make it work). *If there were some way to take a Theano function
(pre-compilation) or graph and make a copy of it but assign a different
context, that would be helpful!*
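For the record, this is roughly what I'd like to be able to write, where
`cost_dev0`, `params_dev0` and `params_dev1` are the obvious hypothetical
per-device objects. I don't know whether `theano.clone` accepts replacements
whose GpuArrayType names a different context (or whether `strict=False`
would be needed), which is essentially my question:

```python
import theano

# Clone the cost graph built for dev0, swapping its parameters for copies
# that live on another context, in the hope that the rest of the
# computation follows them there.
cost_dev1 = theano.clone(cost_dev0,
                         replace=dict(zip(params_dev0, params_dev1)))
```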
The only part I'm nervous about, performance-wise, is when some part of a
computation ends up on the CPU for whatever reason... would Theano service
only one of those at a time, and potentially slow down? This is in contrast
to the multiprocessing approach, where everything is definitely concurrent
(and I can pin the processes to non-overlapping subsets of the CPU cores to
make sure they don't interfere).
Thanks,
Adam