Hi,

I managed to get the new libgpuarray backend (with cuDNN) working on my machine. I'm using the bleeding-edge version of Theano (0.9.0beta1.dev-f09900c53b31558866dfcd42af38e6418b65729b).
When I try running the basic script for testing multi-GPU support (http://deeplearning.net/software/theano/tutorial/using_multi_gpu.html#a-simple-graph-on-two-gpus) with this setup, there doesn't seem to be any speed-up from using multiple GPUs instead of one.

Here's what my .theanorc looks like:

[global]
floatX=float32
allow_gc=False
optimizer=fast_run
contexts=dev2->cuda2;dev3->cuda3

[lib]
cnmem=0

[dnn]
enabled=auto

[nvcc]
fastmath=False

The machine has four GPUs, and I'm using the last two (the other two are partially in use for some other computation). Here is the Python script that compares the two cases:

import numpy
import theano
import time


def run_serial_computation():
    """Run two matrix multiplications serially on the same device."""
    v21 = theano.shared(numpy.random.random((1024, 1024)).astype('float32'),
                        target='dev2')
    v22 = theano.shared(numpy.random.random((1024, 1024)).astype('float32'),
                        target='dev2')
    v31 = theano.shared(numpy.random.random((1024, 1024)).astype('float32'),
                        target='dev2')
    v32 = theano.shared(numpy.random.random((1024, 1024)).astype('float32'),
                        target='dev2')

    t0 = time.time()
    f = theano.function([], [theano.tensor.dot(v21, v22),
                             theano.tensor.dot(v31, v32)])
    t1 = time.time()
    print("It took %f seconds to build the serial graph." % (t1 - t0))

    t0 = time.time()
    f()
    t1 = time.time()
    print("It took %f seconds to carry out serial matrix multiplications."
          % (t1 - t0))


def run_parallel_computation():
    """Run two matrix multiplications in parallel on different devices."""
    v21 = theano.shared(numpy.random.random((1024, 1024)).astype('float32'),
                        target='dev2')
    v22 = theano.shared(numpy.random.random((1024, 1024)).astype('float32'),
                        target='dev2')
    v31 = theano.shared(numpy.random.random((1024, 1024)).astype('float32'),
                        target='dev3')
    v32 = theano.shared(numpy.random.random((1024, 1024)).astype('float32'),
                        target='dev3')

    t0 = time.time()
    f = theano.function([], [theano.tensor.dot(v21, v22),
                             theano.tensor.dot(v31, v32)])
    t1 = time.time()
    print("It took %f seconds to build the parallel graph." % (t1 - t0))

    t0 = time.time()
    f()
    t1 = time.time()
    print("It took %f seconds to carry out parallel matrix multiplications."
          % (t1 - t0))


if __name__ == '__main__':
    run_serial_computation()
    run_parallel_computation()

And the output is the following:

Using cuDNN version 5110 on context dev2
Mapped name dev2 to device cuda2: GeForce GTX 980 Ti (0000:0A:00.0)
Using cuDNN version 5110 on context dev3
Mapped name dev3 to device cuda3: GeForce GTX 980 Ti (0000:05:00.0)
It took 0.041683 seconds to build the serial graph.
It took 0.002978 seconds to carry out serial matrix multiplications.
It took 0.015313 seconds to build the parallel graph.
It took 0.002568 seconds to carry out parallel matrix multiplications.

It looks like the GPUs are loaded and mapped correctly. I noticed that compilation in the multi-GPU case is faster than in the single-GPU case, but I assume that is not the speed-up I'm looking for. What could be wrong here?

A related question: if I extend the above script to use all four GPUs while two of them are partially in use, the speed-up should be smaller than expected, shouldn't it? Or could it even end up slower than the single-GPU case?

Any input on the matter would be great! Thanks!
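In case the measurement itself is at fault, here is a rough timing harness I'm considering as a sanity check. This is only a sketch under my own assumptions (the helper name average_call_time, the iteration count, and the idea of averaging over many calls are mine, not from the tutorial):

import time

def average_call_time(f, n_iter=100):
    """Average the wall-clock time of many calls to a compiled
    Theano function. The first call is treated as a warm-up (it may
    include one-time initialization) and is excluded from the average."""
    f()  # warm-up call, not timed
    t0 = time.time()
    for _ in range(n_iter):
        f()
    t1 = time.time()
    return (t1 - t0) / n_iter

Since a single call finishes in about 0.003 seconds here, averaging over many calls (and perhaps using larger matrices) might make any serial-versus-parallel difference stand out from the noise.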
Regards,
Srikanth
