Currently it is not possible. But I think Douglas's batchedGemm would allow
this. If you use the GPU, you could call batchedGemm directly and modify it
to take y as an input. You would need to be a C programmer, but no CUDA
programming is required.
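In the meantime, the same result can be computed without materializing W[y] by grouping the rows of x by their index in y, so that each of the K matrices is used in a single GEMM no matter how often it repeats in y. A minimal NumPy sketch of the idea (the helper name indexed_batched_dot is just for illustration, not an existing Theano op):

```python
import numpy as np

def indexed_batched_dot(x, W, y):
    """Compute out[i] = x[i] @ W[y[i]] without materializing W[y].

    Groups the rows of x by their index in y, so each of the K
    matrices in W participates in one matrix product, regardless
    of how many times its index is repeated in y.
    """
    D, N = x.shape
    K, N2, M = W.shape
    assert N == N2, "inner dimensions of x and W must match"
    out = np.empty((D, M), dtype=x.dtype)
    for k in np.unique(y):
        rows = np.nonzero(y == k)[0]   # all i with y[i] == k
        out[rows] = x[rows] @ W[k]     # one GEMM per distinct matrix
    return out

# Small example: D=5 input rows, K=2 matrices, with repeated indices in y.
rng = np.random.RandomState(0)
x = rng.randn(5, 3)
W = rng.randn(2, 3, 4)
y = np.array([0, 1, 0, 0, 1])
ref = np.stack([x[i] @ W[y[i]] for i in range(5)])
assert np.allclose(indexed_batched_dot(x, W, y), ref)
```

Since D >> K, this does at most K matrix products over the grouped rows instead of D separate ones, and never allocates the K×N×M-per-repeat copy that W[y] implies.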

Are you interested? We won't have time to do it ourselves in the short term.

Fred

On 25 Aug 2016 at 13:23, "Arvid Fahlström Myrman" <[email protected]> wrote:

> Hi,
>
> I have an array W of size K×N×M which serves as a collection of K 2D
> matrices. I also have an input matrix x of size D×N, and a corresponding
> list of indices y of size D. The value of each element in y indicates which
> matrix in W the corresponding vector in x should be multiplied by. In
> other words, for each i, 0 <= i <= D - 1, I want to calculate T.dot(x[i],
> W[y[i]]).
>
> I am currently doing this using batched_dot as follows: output =
> T.batched_dot(x, W[y]). However, in general D >> K, and as such y will tend
> to contain repeated indices. As a result, it seems that the W[y] operation
> will cause a lot of memory to be duplicated, which quickly leads to out of
> memory issues for large D.
>
> I tried implementing it using scan instead:
>
> output, _ = theano.scan(
>     fn=lambda x, y, W: T.dot(x, W[y]),
>     sequences=[x, y],
>     non_sequences=W)
>
> but this turns out to be orders of magnitude slower:
>
> Class
> ---
> <% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
>   98.9%    98.9%      32.331s       6.94e-02s     Py     466       2   theano.scan_module.scan_op.Scan
>    0.6%    99.5%       0.188s       2.44e-05s     C     7689      33   theano.sandbox.cuda.basic_ops.GpuElemwise
> [...]
> Ops
> ---
> <% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
>   90.6%    90.6%      29.602s       1.27e-01s     Py     233       1   forall_inplace,gpu,grad_of_scan_fn}
>    8.3%    98.9%       2.729s       1.17e-02s     Py     233       1   for{gpu,scan_fn}
>    0.2%    99.1%       0.072s       5.12e-05s     C    1398       6   GpuCAReduce{add}{0,1}
> [...]
>
> I created a gist comparing different approaches here:
> https://gist.github.com/arvidfm/4cff3e8d215e8d0c5629d968e355f0d9. On my
> system this outputs:
>
> Running batched_dot1
> Took 4.75645505997818 seconds
> Running batched_dot2
> Took 4.3897430250654 seconds
> Running batched_dot3
> Took 26.59151006198954 seconds
>
> Is there any way to perform this calculation efficiently without
> duplicating memory?
>
> I'm running Theano 0.9.0.dev2.
>
> Regards,
> Arvid Fahlström Myrman
>
> --
>
> ---
> You received this message because you are subscribed to the Google Groups
> "theano-users" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> For more options, visit https://groups.google.com/d/optout.
>
