Hi,

  I have an idea of how to implement data parallelism to use multiple 
GPUs, and I'd appreciate any feedback before I start on it.


First, some background:
--I'm working within an established, fairly complex code base.  It builds 
NNs using Lasagne and computes gradients and other quantities to perform 
somewhat sophisticated optimization.  To the extent possible, I'd like 
not to modify this code, but only wrap it.  (And for this reason I'm under 
the impression that Platoon is not the solution?)
--All the Theano functions compute expectation values over big data batches, 
so I'd like to split the data, run the same computation on multiple GPUs in 
parallel, and average the results.  (Even with multiple GPUs, the data will 
probably be bigger than fits in memory, so I'm using inputs rather than 
shareds here, in order to cycle through the data; a sketch of the driver 
pattern I mean follows this list.)
--I already have a version running with multiprocessing, and it performs 
well for CPU-only parallelism and for GPU parallelism.  But the new GPU 
backend, and the prospect of running in only one Python process, seem to 
offer a much cleaner solution: one that would be easier to maintain and 
more immediately portable to the other specific cases I haven't yet 
parallelized.  That is what I hope to achieve here.
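
To make that concrete, here is the driver pattern I have in mind (a rough, 
untested sketch; the names are hypothetical, with `fns` holding one compiled 
Theano function per GPU context):

```python
import numpy as np

def parallel_mean(fns, data):
    # Split one big batch across the devices.
    chunks = np.array_split(data, len(fns))
    # My understanding is that if each function's outputs stay on its
    # own GPU, the calls mostly just queue work, so even this plain
    # loop should keep the devices busy concurrently.
    results = [f(chunk) for f, chunk in zip(fns, chunks)]
    # Weight by chunk size in case the split is uneven.  (Averaging
    # directly on the GPUs with the collectives is step 4 below.)
    return sum(float(np.asarray(r)) * len(c)
               for r, c in zip(results, chunks)) / len(data)
```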


What I think I have to do is go through the full construction process of 
the NN and all related inference / training functions once for each GPU 
(context).  That is (rough sketches follow the list):

1. Build a separate copy of the network, each with a different GPU as the 
`target` of the shared variables comprising the network parameters.  (I 
think this can be done cleanly by providing Lasagne with an initialization 
function for the weights, with the target already set in it.)

2. Build a separate copy of the rest of the graphs, with e.g. 
`.transfer('dev0')` appended to each of the input variables, and rebuild 
the functions from those inputs and the network copy on the corresponding 
GPU.  While we're at it, the corresponding `.transfer('dev0')` must also 
be put on all the Theano function outputs, so that the functions can run 
concurrently.

3. All functions must be compiled separately for each GPU.  (Even if they 
are all grouped together into one Theano function.)

4. Figure out how to use the collectives so that I can all_reduce whatever 
outputs I need directly on the GPUs.  (Possibly the outputs need to be 
refactored into updates to GPU-bound shared variables.)
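
Here is the kind of thing I mean for steps 1-3 (an untested sketch with a 
toy single-layer network standing in for the real model; it assumes 
THEANO_FLAGS contains something like `contexts=dev0->cuda0;dev1->cuda1`, 
and that Lasagne's `create_param` accepts callables returning ready-made 
shared variables, which I believe it does):

```python
import theano
import theano.tensor as T
import lasagne

DEVICES = ['dev0', 'dev1']  # context names from THEANO_FLAGS

def targeted_init(base, dev):
    """Wrap a Lasagne initializer so the parameter is created as a
    shared variable on the given context (step 1)."""
    def init(shape):
        return theano.shared(base(shape), target=dev)
    return init

networks, fns = [], []
for dev in DEVICES:
    # Step 1: a separate copy of the network per context, with every
    # parameter targeted at that context.
    l_in = lasagne.layers.InputLayer((None, 784))
    l_out = lasagne.layers.DenseLayer(
        l_in, 10,
        W=targeted_init(lasagne.init.GlorotUniform(), dev),
        b=targeted_init(lasagne.init.Constant(0.), dev),
        nonlinearity=lasagne.nonlinearities.softmax)

    # Step 2: transfer the inputs to the same context ...
    x = T.matrix('x')
    out = lasagne.layers.get_output(l_out, x.transfer(dev))
    loss = out.mean()  # stand-in for the real objective

    # Step 3: ... and compile one function per context, leaving the
    # output on that context so the calls can overlap.
    fns.append(theano.function([x], loss.transfer(dev)))
    networks.append(l_out)
```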
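
And for step 4, something along these lines using pygpu's collectives, 
continuing from the sketch above.  This part is the most speculative: I 
haven't verified the exact signatures, and I don't know whether the comms 
can be initialized sequentially from a single thread (NCCL may require the 
`GpuComm` constructors to run concurrently, in which case each would need 
its own thread even within one process):

```python
from pygpu import collectives
from theano.gpuarray.type import get_context

NDEV = len(DEVICES)

# One clique shared by all devices; within a single process the
# comm_id can presumably just be copied across.
first = collectives.GpuCommCliqueId(context=get_context(DEVICES[0]))
comms = []
for rank, dev in enumerate(DEVICES):
    ident = collectives.GpuCommCliqueId(context=get_context(dev))
    ident.comm_id = first.comm_id
    comms.append(collectives.GpuComm(ident, NDEV, rank))

# After running the per-device functions with updates that leave
# their results in GPU-bound shared variables (`result_shareds`,
# hypothetical), sum them in place across the devices.  Dividing by
# NDEV (e.g. inside the Theano function itself) then gives the mean.
for comm, s in zip(comms, result_shareds):
    buf = s.get_value(borrow=True, return_internal_type=True)
    comm.all_reduce(buf, op='sum', dest=buf)
```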


This is certainly doable, although slightly painful.  Is there another way 
that I am missing?  Maybe some little trick using `givens`?  (I haven't 
been able to get one to work.)  *If there were some way to take a Theano 
function (pre-compilation) or graph and make a copy of it but assign a 
different context, that would be helpful!*
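
For example, something like the following is the kind of trick I mean 
(hypothetical: suppose `loss` was built on dev0 from input `x`, with 
parameter shared variables `params0`); I haven't gotten a variant of this 
to work:

```python
import theano

# Make dev1 copies of the dev0 parameters and clone the graph with
# them swapped in, hoping the clone ends up entirely on dev1.
params1 = [theano.shared(p.get_value(), target='dev1')
           for p in params0]
loss1 = theano.clone(loss, replace=dict(zip(params0, params1)))
f1 = theano.function([x], loss1.transfer('dev1'))
```

(Presumably any `transfer` nodes on the inputs still point at dev0 after 
the clone, which may be exactly the problem.)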


The only part I'm nervous about, performance-wise, is when some part of a 
computation ends up on the CPU for whatever reason... would Theano service 
only one of those at a time, and potentially slow everything down?  That's 
in contrast to the multiprocessing approach, where everything is definitely 
concurrent (and I can pin the processes to non-overlapping subsets of the 
CPU cores to make sure they don't interfere).



Thanks,
Adam
