Hi,

I'm training Baidu's Deepspeech 
<https://github.com/baidu-research/ba-dls-deepspeech> on an Nvidia 1080 Ti, 
and I noticed that with larger batches (46 at most) GPU utilization is zero 
nearly half of the time, yet the training time per input keeps improving as 
the batch size grows. This looks counter-intuitive to me, and I'm not sure 
whether I could be getting more out of the hardware.
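
In case it is useful, this is roughly how I measure those numbers; 
`train_fn` and `batches` are placeholders for the compiled Theano training 
function and the data iterator, not the actual ba-dls-deepspeech API:

    import subprocess
    import time

    def gpu_utilization():
        # Instantaneous GPU utilization (%) as reported by nvidia-smi.
        out = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=utilization.gpu",
             "--format=csv,noheader,nounits"])
        return int(out.decode().strip().split()[0])

    def timed_epoch(train_fn, batches):
        utils, per_input = [], []
        for x, y in batches:
            # Sampled right before the call, i.e. between two batches.
            utils.append(gpu_utilization())
            t0 = time.time()
            train_fn(x, y)
            per_input.append((time.time() - t0) / len(x))
        print("mean GPU util: %.0f%%, mean time per input: %.4f s"
              % (float(sum(utils)) / len(utils),
                 sum(per_input) / len(per_input)))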

To give some context about the model (a rough sketch in code follows the list):

- It's five GRU layers on top of a 1D convolution.
- It uses batch normalization after each RNN layer and after the convolution.
- The loss function is CTC, implemented by Baidu's warp-ctc 
<https://github.com/baidu-research/warp-ctc>.
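
Here is a rough sketch of the network shape, written with Keras 1.x layer 
names on the Theano backend; the layer sizes and filter width are 
illustrative rather than the exact ba-dls-deepspeech values, and the CTC 
cost is attached separately through the warp-ctc bindings:

    from keras.layers import (Input, Convolution1D, BatchNormalization,
                              GRU, TimeDistributed, Dense)
    from keras.models import Model

    def build_model(feature_dim=161, conv_filters=1024, rnn_dim=1024,
                    output_dim=29):
        # (time, features) acoustic input; the time length is variable.
        acoustic = Input(shape=(None, feature_dim))

        # 1D convolution over time, followed by batch normalization.
        x = Convolution1D(conv_filters, 11, border_mode='same',
                          activation='relu')(acoustic)
        x = BatchNormalization()(x)

        # Five GRU layers, each followed by batch normalization.
        for _ in range(5):
            x = GRU(rnn_dim, return_sequences=True, activation='relu')(x)
            x = BatchNormalization()(x)

        # Per-timestep character scores; CTC (warp-ctc) consumes these.
        y_pred = TimeDistributed(Dense(output_dim))(x)
        return Model(input=acoustic, output=y_pred)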

It seems GPU utilization is zero at the start of each batch and during the 
`theano.function` call. There is no CPU computation node in the compiled 
graph, as reported by `assert_no_cpu_op`, and the mode is `FAST_RUN`. I 
also tried different optimizers: SGD with Nesterov momentum has the same 
overhead as Adam.
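
For reference, this is a minimal sketch of the kind of profiling I can run 
to narrow this down further, using Theano's built-in profiler to break out 
per-op and host/GPU transfer times (`inputs`, `cost`, `updates`, and 
`batches` are placeholders for the actual graph and data):

    import theano

    # assert_no_cpu_op=raise (via THEANO_FLAGS) is what reports that no
    # CPU computation node is present in the compiled graph.
    train_fn = theano.function(inputs, cost, updates=updates, profile=True)

    for x, y in batches:   # a few batches are enough to accumulate timings
        train_fn(x, y)

    # Per-op timings, including GpuFromHost/HostFromGpu transfers, which is
    # where a stall at the start of each batch should show up.
    train_fn.profile.summary()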

CUDA is 8.0, cuDNN is 5.1.10, libgpuarray is 0.6.5, and Theano is 0.9.0.

Thanks in advance; any hint is welcome.
