Ok. Is random number generation working in the new GPU backend yet? I can 
see some code related to it, but a call to *uniform()* produces the error 
messages "context name None not defined" and "Could not infer context from 
inputs". Looks like it's not possible to specify the target device to 
*uniform()*.
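
For reference, this is roughly the call that fails (a minimal sketch; the 
seed and size are made up, and I'm assuming MRG_RandomStreams is the right 
generator to use with the libgpuarray back-end):

    import theano
    from theano.sandbox.rng_mrg import MRG_RandomStreams

    srng = MRG_RandomStreams(seed=1234)
    # the uniform() call is where I get the context errors
    noise = srng.uniform(size=(1000,), dtype='float32')
    f = theano.function([], noise)
    print(f()[:5])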

On Monday, November 21, 2016 at 2:40:03 AM UTC+2, Pascal Lamblin wrote:
>
> Right, now I remember that the _dev20 version only works on a limited 
> number of dimensions. That would explain why adding a new axis helped. 
>
> It may already be fixed in the new GPU back-end (it needs libgpuarray; 
> then use device=cudaX instead of gpuX); otherwise, this is where 
> we should fix that. 
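> 
> For example (cuda0 assuming the GPU you want is device 0; "script.py" just 
> stands for whatever you run): 
> 
>     THEANO_FLAGS=device=cuda0 python script.py 
> 
> or, equivalently, device = cuda0 under [global] in your .theanorc. 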
>
> On Fri, Nov 18, 2016, Seppo Enarvi wrote: 
> > 
> > 
> > That's interesting, because this function is not supposed to update the 
> > bias. It just computes the cost and its gradient. Maybe that op is used 
> > to compute the gradient. 
> > 
> > My GPU is Quadro K2000. I don't think it's too old because the graph 
> > contains other instances of GpuAdvancedIncSubtensor1_dev20. 
> > 
> > Anyway, I started to wonder why I don't have this problem with the weight 
> > matrix. I'm selecting vectors from the weight matrix in the same manner. 
> > So I tried converting the bias vector into a matrix and selecting rows 
> > from the matrix (each of which contains only one element): 
> > 
> >      bias = bias[class_ids] 
> > => 
> >      bias = bias[:, None] 
> >      bias = bias[class_ids, 0] 
> > 
> > It's a lot faster this way. I updated to the latest version of Theano 
> > from Git and I still see the huge speed difference. 
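> > 
> > In full, the change is roughly this (just a sketch with made-up sizes; the 
> > real code builds the cost and its gradient on top of these expressions): 
> > 
> >      import numpy as np 
> >      import theano 
> >      import theano.tensor as T 
> > 
> >      bias = theano.shared(np.zeros(10001, dtype='float32'), name='bias') 
> >      class_ids = T.lvector('class_ids') 
> > 
> >      # slow on my GPU: 1-d advanced indexing straight into the bias vector 
> >      slow_bias = bias[class_ids] 
> > 
> >      # fast: view the bias as an (n, 1) matrix and select rows instead 
> >      bias_matrix = bias[:, None] 
> >      fast_bias = bias_matrix[class_ids, 0] 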
> > 
> > Seppo 
> > 
> > 
> > 
> > On Friday, November 18, 2016 at 6:49:56 PM UTC+2, Pascal Lamblin wrote: 
> > > 
> > > Hi, 
> > > 
> > > This operation is actually the _update_ of the selected elements of the 
> > > bias. 
> > > 
> > > There is a faster implementation (named GpuAdvancedIncSubtensor1_dev20 
> > > IIRC) that uses atomic addition to speed up that operation. It has the 
> > > downside of not yielding a deterministic order of summation if the same 
> > > element is updated more than once in the same operation. 
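> > > 
> > > The order matters because float32 addition is not associative; a quick 
> > > illustration: 
> > > 
> > >      import numpy as np 
> > >      a, b, c = np.float32(1e8), np.float32(-1e8), np.float32(1.0) 
> > >      print((a + b) + c)   # 1.0 
> > >      print(a + (c + b))   # 0.0, because 1.0 + (-1e8) rounds back to -1e8 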
> > > 
> > > One of the issues seems to be that this faster implementation is not 
> > > selected. Could it be that you have an old GPU? 
> > > 
> > > Another potential issue is that your graph seems to first apply updates 
> > > on a tensor of zeros, and then apply another update on the bias itself. 
> > > There may be a way of simplifying that. 
> > > 
> > > On Fri, Nov 18, 2016, Seppo Enarvi wrote: 
> > > > 
> > > > I'm implementing sampling-based softmax alternatives, where I compute 
> > > > the preactivations only for certain output classes. I get very bad 
> > > > performance due to a GpuAdvancedIncSubtensor1 op, which consumes 90 % 
> > > > of the processing time of the update function: 
> > > > 
> > > > <% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name> 
> > > >   89.0%   89.0%   725.413s   2.44e-01s   2968   115 
> > > >   GpuAdvancedIncSubtensor1{inplace,inc}(GpuAdvancedIncSubtensor1{inplace,inc}.0, GpuFromHost.0, Elemwise{Cast{int64}}.0) 
> > > >     input 0: dtype=float32, shape=(10001,), strides=(1,) 
> > > >     input 1: dtype=float32, shape=(25600,), strides=(1,) 
> > > >     input 2: dtype=int64, shape=(25600,), strides=c 
> > > >     output 0: dtype=float32, shape=(10001,), strides=(1,) 
> > > > 
> > > > Looking at the computation graph of that function, I noticed it's 
> > > > operating on the bias vector: 
> > > > 
> > > > GpuAdvancedIncSubtensor1{inplace,inc} [id FL] ''   115 
> > > >  |GpuAdvancedIncSubtensor1{inplace,inc} [id FM] ''   112 
> > > >  | |GpuAlloc{memset_0=True} [id FN] ''   17 
> > > >  | | |CudaNdarrayConstant{[ 0.]} [id FO] 
> > > >  | | |Shape_i{0} [id FP] ''   7 
> > > >  | |   |bias [id BU] 
> > > > 
> > > > More precisely, the performance hit seems to come from selecting from 
> > > > the bias vector those values that correspond to the output classes 
> > > > (bias = bias[class_ids]). Is that a particularly expensive operation? 
> > > > class_ids can be large (1,000 - 10,000). If I don't use the bias, my 
> > > > speed improves tenfold. Is there a way to circumvent that problem? 
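> > > > 
> > > > A minimal way to reproduce the op I'm seeing (made-up sizes, not my 
> > > > real code, just the same indexing pattern): 
> > > > 
> > > >      import numpy as np 
> > > >      import theano 
> > > >      import theano.tensor as T 
> > > > 
> > > >      bias = theano.shared(np.zeros(10001, dtype='float32'), name='bias') 
> > > >      class_ids = T.lvector('class_ids') 
> > > >      cost = bias[class_ids].sum() 
> > > >      grad = T.grad(cost, bias) 
> > > >      f = theano.function([class_ids], grad) 
> > > >      # should show (Gpu)AdvancedIncSubtensor1 in the compiled graph 
> > > >      theano.printing.debugprint(f) 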
> > > 
> > 
>
>
> -- 
> Pascal 
>
