Thank you, this will come in handy if and when I manage to find the time to 
look into this in more detail.

Regards,
Arvid Fahlström Myrman

On Friday, 2 September 2016 11:38:52 CEST Frédéric Bastien wrote:
> Some information: the op to modify, or to make a new version of, is GpuGemmBatch:
> 
> https://github.com/Theano/Theano/blob/master/theano/gpuarray/blas.py#L335
> 
> Here is the Theano tutorial on how to write new ops:
> 
> http://deeplearning.net/software/theano/extending/
> 
> For this one, the new op would take an extra parameter, the list of
> indices, and would use it when calling CUDA.
> 
> Fred
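
As a rough illustration of what such an op could look like, here is a
minimal pure-Python sketch along the lines of the extending tutorial. The
name IndexedBatchedDot and the exact interface are made up; a real version
would implement the C/CUDA path (as in GpuGemmBatch) rather than a Python
loop, and would also need a grad() method to be usable in training:

    import numpy as np
    import theano
    import theano.tensor as T
    from theano.gof import Op, Apply

    class IndexedBatchedDot(Op):
        """Computes T.dot(x[i], W[y[i]]) for each row i of x,
        without ever materializing W[y]."""
        __props__ = ()

        def make_node(self, x, y, W):
            x = T.as_tensor_variable(x)   # D x N
            y = T.as_tensor_variable(y)   # D, integer indices into W
            W = T.as_tensor_variable(W)   # K x N x M
            out = T.matrix(dtype=x.dtype)
            return Apply(self, [x, y, W], [out])

        def perform(self, node, inputs, output_storage):
            x, y, W = inputs
            out = np.empty((x.shape[0], W.shape[2]), dtype=x.dtype)
            for i in range(x.shape[0]):
                out[i] = np.dot(x[i], W[y[i]])
            output_storage[0][0] = out

    # usage: output = IndexedBatchedDot()(x, y, W)
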
> 
> On Thu, Sep 1, 2016 at 4:16 PM, Arvid Fahlström Myrman
> <[email protected]> wrote:
> > Unfortunately I don't really have time to delve into the C side of Theano
> > at present. For the time being I've settled on simply ensuring I don't
> > perform the operation on too much data at once. I'd appreciate any hints
> > (e.g. documentation) you might have for doing this in C, though, in case
> > I need it in the future for performance reasons; I'm not familiar with
> > what batchedGemm is or how to interface with it from Theano.
> > 
> > Regards,
> > Arvid Fahlström Myrman
> > 
> > On Thursday, 1 September 2016 13:00:55 CEST Frédéric Bastien wrote:
> > > Currently it is not possible. But I think the batchedGemm done for
> > > Douglas allows this. If you use the GPU, you could call batchedGemm
> > > directly and modify it to take y as input. You need to be a C
> > > programmer, but no CUDA programming is needed.
> > > 
> > > Are you interested? We won't have time to do it ourselves in the short
> > > term.
> > > 
> > > Fred
> > > 
> > > On 25 August 2016 at 13:23, "Arvid Fahlström Myrman" <[email protected]>
> > > wrote:
> > > > Hi,
> > > > 
> > > > I have an array W of size K×N×M which serves as a collection of K 2D
> > > > matrices. I also have an input matrix x of size D×N, and a
> > > > corresponding list of indices y of size D. The value of each element
> > > > in y indicates which matrix in W the corresponding vector in x should
> > > > be multiplied by. In other words, for each i, 0 <= i <= D - 1, I want
> > > > to calculate T.dot(x[i], W[y[i]]).
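
For concreteness, the computation being described, expressed in plain
NumPy (toy sizes, chosen only for illustration):

    import numpy as np

    K, N, M, D = 3, 4, 5, 10             # toy sizes
    W = np.random.randn(K, N, M)         # K matrices of shape N x M
    x = np.random.randn(D, N)            # D input vectors
    y = np.random.randint(0, K, size=D)  # which matrix each row of x uses

    # row i of the result is x[i] @ W[y[i]]
    out = np.stack([x[i].dot(W[y[i]]) for i in range(D)])  # shape (D, M)
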
> > > > 
> > > > I am currently doing this using batched_dot as follows: output =
> > > > T.batched_dot(x, W[y]). However, in general D >> K, and as such y
> > > > will tend to contain repeated indices. As a result, it seems that the
> > > > W[y] operation will cause a lot of memory to be duplicated, which
> > > > quickly leads to out-of-memory issues for large D.
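
One possible workaround, not mentioned in the thread: loop over the K
matrices instead of the D rows, gathering the rows of x that use matrix k,
multiplying them by W[k] in one ordinary dot, and writing the result back.
Each row of x is copied exactly once, and the scan runs for K steps rather
than D. A rough, untested sketch:

    import theano
    import theano.tensor as T

    x = T.matrix('x')    # D x N
    y = T.ivector('y')   # D, with values in [0, K)
    W = T.tensor3('W')   # K x N x M

    def accumulate(k, out, x, y, W):
        # indices of the rows that should be multiplied by W[k]
        rows = T.eq(y, k).nonzero()[0]
        return T.set_subtensor(out[rows], T.dot(x[rows], W[k]))

    results, _ = theano.scan(
        accumulate,
        sequences=T.arange(W.shape[0]),
        outputs_info=T.zeros((x.shape[0], W.shape[2])),
        non_sequences=[x, y, W])
    output = results[-1]  # D x M; row i equals T.dot(x[i], W[y[i]])
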
> > > > 
> > > > I tried implementing it using scan instead:
> > > > 
> > > > output, _ = theano.scan(
> > > >     fn=lambda x, y, W: T.dot(x, W[y]),
> > > >     sequences=[x, y],
> > > >     non_sequences=W)
> > > > 
> > > > but this turns out to be orders of magnitude slower:
> > > > 
> > > > Class
> > > > ---
> > > > <% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
> > > >   98.9%    98.9%      32.331s       6.94e-02s     Py     466       2   theano.scan_module.scan_op.Scan
> > > >    0.6%    99.5%       0.188s       2.44e-05s     C     7689      33   theano.sandbox.cuda.basic_ops.GpuElemwise
> > > > [...]
> > > > 
> > > > Ops
> > > > ---
> > > > <% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
> > > >   90.6%    90.6%      29.602s       1.27e-01s     Py     233        1   forall_inplace,gpu,grad_of_scan_fn}
> > > >    8.3%    98.9%       2.729s       1.17e-02s     Py     233        1   for{gpu,scan_fn}
> > > >    0.2%    99.1%       0.072s       5.12e-05s     C     1398        6   GpuCAReduce{add}{0,1}
> > > > [...]
> > > > 
> > > > I created a gist comparing different approaches here:
> > > > https://gist.github.com/arvidfm/4cff3e8d215e8d0c5629d968e355f0d9. On
> > > > my system this outputs:
> > > > 
> > > > Running batched_dot1
> > > > Took 4.75645505997818 seconds
> > > > Running batched_dot2
> > > > Took 4.3897430250654 seconds
> > > > Running batched_dot3
> > > > Took 26.59151006198954 seconds
> > > > 
> > > > Is there any way to perform this calculation efficiently without
> > > > duplicating memory?
> > > > 
> > > > I'm running Theano 0.9.0.dev2.
> > > > 
> > > > Regards,
> > > > Arvid Fahlström Myrman
> > > > 
> > 

