That's interesting, because this function is not supposed to update the
bias; it only computes the cost and its gradient. Maybe that op is used
to accumulate the gradient.
My GPU is a Quadro K2000. I don't think it's too old, because the graph
contains other instances of GpuAdvancedIncSubtensor1_dev20.
Anyway, I started to wonder why I don't have this problem with the weight
matrix, since I'm selecting vectors from it in the same manner.
So I tried converting the bias vector into a matrix and selecting rows
from that matrix (each of which contains only one element):
bias = bias[class_ids]
=>
bias = bias[:, None]
bias = bias[class_ids, 0]
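As a quick sanity check outside Theano, the two formulations select exactly the same values. A NumPy sketch (the shapes are the ones from the profile quoted below; `class_ids` may contain repeats, as in a minibatch):

```python
import numpy as np

# Hypothetical sizes matching the quoted profile: a 10001-element bias
# (one per output class) and 25600 sampled class ids.
rng = np.random.default_rng(0)
bias = rng.standard_normal(10001).astype(np.float32)
class_ids = rng.integers(0, 10001, size=25600)

# Original formulation: fancy-index the vector directly.
selected_vec = bias[class_ids]

# Workaround: view the vector as a (10001, 1) matrix and select rows.
bias_mat = bias[:, None]
selected_mat = bias_mat[class_ids, 0]

assert np.array_equal(selected_vec, selected_mat)
```

So the workaround changes only which Theano op gets selected for the gradient, not the values computed.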
It's a lot faster this way. I updated to the latest version of Theano
from Git and I still see the huge speed difference.
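For what it's worth, the gradient of `bias[class_ids]` with respect to `bias` is a scatter-add into a bias-shaped array of zeros, which matches the GpuAdvancedIncSubtensor1 structure in the quoted profile. A NumPy sketch of that computation (shapes from the profile; `np.add.at` stands in for the Theano op):

```python
import numpy as np

# Hypothetical shapes from the profiled op: bias has 10001 elements,
# and 25600 (possibly repeated) class ids were sampled.
rng = np.random.default_rng(0)
class_ids = rng.integers(0, 10001, size=25600)
grad_wrt_selected = rng.standard_normal(25600).astype(np.float32)

# The gradient w.r.t. the full bias starts as zeros and accumulates the
# incoming gradient at each selected index. Repeated indices must be
# summed, so a plain grad_bias[class_ids] += ... would be wrong: it
# silently keeps only one contribution per duplicate index.
grad_bias = np.zeros(10001, dtype=np.float32)
np.add.at(grad_bias, class_ids, grad_wrt_selected)

# Verify against an independent per-index summation.
ref = np.bincount(class_ids,
                  weights=grad_wrt_selected.astype(np.float64),
                  minlength=10001)
assert np.allclose(grad_bias, ref, atol=1e-3)
```

The duplicate-index summation is why the fast GPU kernel mentioned below needs atomic addition, and why its summation order is not deterministic.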
Seppo
On Friday, November 18, 2016 at 6:49:56 PM UTC+2, Pascal Lamblin wrote:
>
> Hi,
>
> This operation is actually the _update_ of the selected elements of the
> bias.
>
> There is a faster implementation (named GpuAdvancedIncSubtensor1_dev20
> IIRC) that uses atomic addition to speed up that operation. It has the
> downside of not yielding a deterministic order of summation if the same
> element is updated more than once in the same operation.
>
> One of the issues seems to be that this faster implementation is not
> selected. Could it be that you have an old GPU?
>
> Another potential issue is that your graph seems to first apply updates
> on a tensor of zeros, and then apply another update on the bias itself.
> There may be a way of simplifying that.
>
> On Fri, Nov 18, 2016, Seppo Enarvi wrote:
> >
> > I'm implementing sampling-based softmax alternatives, where I compute
> > the preactivations only for certain output classes. I get very bad
> > performance due to a GpuAdvancedIncSubtensor1 op, which consumes 90% of
> > the processing time of the update function:
> >
> > <% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
> >   89.0%    89.0%     725.413s      2.44e-01s    2968   115
> > GpuAdvancedIncSubtensor1{inplace,inc}(GpuAdvancedIncSubtensor1{inplace,inc}.0, GpuFromHost.0, Elemwise{Cast{int64}}.0)
> >     input 0: dtype=float32, shape=(10001,), strides=(1,)
> >     input 1: dtype=float32, shape=(25600,), strides=(1,)
> >     input 2: dtype=int64, shape=(25600,), strides=c
> >     output 0: dtype=float32, shape=(10001,), strides=(1,)
> >
> > Looking at the computation graph of that function, I noticed it's
> > operating on the bias vector:
> >
> > GpuAdvancedIncSubtensor1{inplace,inc} [id FL] '' 115
> > |GpuAdvancedIncSubtensor1{inplace,inc} [id FM] '' 112
> > | |GpuAlloc{memset_0=True} [id FN] '' 17
> > | | |CudaNdarrayConstant{[ 0.]} [id FO]
> > | | |Shape_i{0} [id FP] '' 7
> > | | |bias [id BU]
> >
> > More precisely, the performance hit seems to come from selecting from
> > the bias vector those values that correspond to the output classes
> > (bias = bias[class_ids]). Is that a particularly expensive operation?
> > class_ids can be large (1,000 - 10,000). If I don't use the bias, my
> > speed improves tenfold. Is there a way to circumvent that problem?
>
--
---
You received this message because you are subscribed to the Google Groups
"theano-users" group.