Hi,

I managed to get the new libgpuarray backend (with cuDNN) working on my machine. I'm using the bleeding-edge version of Theano (0.9.0beta1.dev-f09900c53b31558866dfcd42af38e6418b65729b).
When I try running the basic script for testing multi-GPU support (http://deeplearning.net/software/theano/tutorial/using_multi_gpu.html#a-simple-graph-on-two-gpus) with this setup, there doesn't seem to be any speed-up from using multiple GPUs instead of one.

Here's what my .theanorc looks like:

[global]
floatX=float32
allow_gc=False
optimizer=fast_run
contexts=dev2->cuda2;dev3->cuda3

[lib]
cnmem=0

[dnn]
enabled=auto

[nvcc]
fastmath=False

The machine has four GPUs, and I'm using the last two (the other two are partially in use for some other computation). Here is the Python script that compares the two cases:

import numpy
import theano
import time


def run_serial_computation():
    """Run two matrix multiplications serially on the same device."""
    v21 = theano.shared(numpy.random.random((1024, 1024)).astype('float32'),
                        target='dev2')
    v22 = theano.shared(numpy.random.random((1024, 1024)).astype('float32'),
                        target='dev2')
    v31 = theano.shared(numpy.random.random((1024, 1024)).astype('float32'),
                        target='dev2')
    v32 = theano.shared(numpy.random.random((1024, 1024)).astype('float32'),
                        target='dev2')

    t0 = time.time()
    f = theano.function([], [theano.tensor.dot(v21, v22),
                             theano.tensor.dot(v31, v32)])
    t1 = time.time()
    print("It took %f seconds to build the serial graph." % (t1 - t0))

    t0 = time.time()
    f()
    t1 = time.time()
    print("It took %f seconds to carry out serial matrix multiplications."
          % (t1 - t0))


def run_parallel_computation():
    """Run two matrix multiplications in parallel on different devices."""
    v21 = theano.shared(numpy.random.random((1024, 1024)).astype('float32'),
                        target='dev2')
    v22 = theano.shared(numpy.random.random((1024, 1024)).astype('float32'),
                        target='dev2')
    v31 = theano.shared(numpy.random.random((1024, 1024)).astype('float32'),
                        target='dev3')
    v32 = theano.shared(numpy.random.random((1024, 1024)).astype('float32'),
                        target='dev3')

    t0 = time.time()
    f = theano.function([], [theano.tensor.dot(v21, v22),
                             theano.tensor.dot(v31, v32)])
    t1 = time.time()
    print("It took %f seconds to build the parallel graph." % (t1 - t0))

    t0 = time.time()
    f()
    t1 = time.time()
    print("It took %f seconds to carry out parallel matrix multiplications."
          % (t1 - t0))


if __name__ == '__main__':
    run_serial_computation()
    run_parallel_computation()

And the output is the following:

Using cuDNN version 5110 on context dev2
Mapped name dev2 to device cuda2: GeForce GTX 980 Ti (0000:0A:00.0)
Using cuDNN version 5110 on context dev3
Mapped name dev3 to device cuda3: GeForce GTX 980 Ti (0000:05:00.0)
It took 0.041683 seconds to build the serial graph.
It took 0.002978 seconds to carry out serial matrix multiplications.
It took 0.015313 seconds to build the parallel graph.
It took 0.002568 seconds to carry out parallel matrix multiplications.

It looks like the GPUs are loaded and mapped correctly. I noticed that compilation in the multi-GPU case is faster than in the single-GPU case, but I assume that is not the speed-up I'm looking for. What could be wrong here?

A related question: if I extend the above script to use all four GPUs while two of them are partially in use, the speed-up should be smaller than expected, shouldn't it? Or could it even end up slower than the single-GPU case?

Any input on the matter would be great! Thanks!
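In case the measurement itself is at fault, here is a rough timing harness I'm considering as a sanity check. This is only a sketch under my own assumptions (the helper name average_call_time, the iteration count, and the idea of averaging over many calls are mine, not from the tutorial):

import time

def average_call_time(f, n_iter=100):
    """Average the wall-clock time of many calls to a compiled
    Theano function. The first call is treated as a warm-up (it may
    include one-time initialization) and is excluded from the average."""
    f()  # warm-up call, not timed
    t0 = time.time()
    for _ in range(n_iter):
        f()
    t1 = time.time()
    return (t1 - t0) / n_iter

Since a single call finishes in about 0.003 seconds here, averaging over many calls (and perhaps using larger matrices) might make any serial-versus-parallel difference stand out from the noise.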
Regards,
Srikanth
