Thanks for the suggestion, I'll try to see what kind of overhead that results in compared to doing the batched_dot in one go. For training I'm already well off, as I'm training using minibatches; it's mostly that it's a pain to have to use minibatches for the validation/test sets too just to get around this issue.
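For concreteness, the chunking idea can be sketched in plain NumPy (Theano-free; the shapes, names, and chunk size below are made up for illustration, not taken from the gist):

```python
import numpy as np

# Illustrative shapes: K matrices of size N x M, D input rows, index vector y.
K, N, M, D = 4, 5, 6, 1000
rng = np.random.RandomState(0)
W = rng.randn(K, N, M)
x = rng.randn(D, N)
y = rng.randint(0, K, size=D)

def batched_dot_chunked(x, y, W, chunk=128):
    """Compute out[i] = x[i] @ W[y[i]] one chunk at a time, so only
    `chunk` copies of the gathered matrices W[y] are materialised at
    once, instead of all D of them."""
    out = np.empty((x.shape[0], W.shape[2]))
    for start in range(0, x.shape[0], chunk):
        stop = start + chunk
        # einsum performs the per-row matrix product for the whole chunk
        out[start:stop] = np.einsum('dn,dnm->dm',
                                    x[start:stop], W[y[start:stop]])
    return out

reference = np.array([x[i] @ W[y[i]] for i in range(D)])
assert np.allclose(batched_dot_chunked(x, y, W), reference)
```

The chunk size trades memory for speed exactly as described above: chunk=1 is the one-example-at-a-time scan, chunk=D is the single gathered batched_dot.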
Regards,
Arvid Fahlström Myrman

On Thursday, 1 September 2016 22:38:42 CEST Pascal Lamblin wrote:
> I guess that right now, your best bet may be to use scan on "chunks" of
> x and W[y].
>
> At one extreme, for chunks of length 1, it is equivalent to your
> approach with scan (one example at a time), with minimal memory usage,
> but maximum slowness. The other extreme, with only 1 chunk, would be
> equivalent to calling only one tensordot. Hopefully there is a sweet
> spot in the middle.
>
> On Thu, Sep 01, 2016, Arvid Fahlström Myrman wrote:
> > Unfortunately I don't really have time to delve into the C side of Theano
> > at present. For the time being I've settled on simply ensuring I don't
> > perform the operation on too much data at once. I'd appreciate any hints
> > (e.g. documentation) you might have for doing this in C, though, in case
> > I need it in the future for performance reasons; I'm not familiar with
> > what batchedGemm is, nor how to interface with it from Theano.
> >
> > Regards,
> > Arvid Fahlström Myrman
> >
> > On Thursday, 1 September 2016 13:00:55 CEST Frédéric Bastien wrote:
> > > Currently it is not possible, but I think batchedGemm would allow
> > > this. If you use the GPU, you could call batchedGemm directly and
> > > modify it to take y as input. You would need to be a C programmer,
> > > but no CUDA programming is required.
> > >
> > > Are you interested? We don't have time to do it shortly.
> > >
> > > Fred
> > >
> > > On 25 August 2016 at 13:23, "Arvid Fahlström Myrman" <[email protected]> wrote:
> > > > Hi,
> > > >
> > > > I have an array W of size K×N×M which serves as a collection of K
> > > > 2D matrices. I also have an input matrix x of size D×N, and a
> > > > corresponding list of indices y of size D. The value of each
> > > > element in y indicates which matrix in W the corresponding vector
> > > > in x should be multiplied by. In other words, for each i,
> > > > 0 <= i <= D - 1, I want to calculate T.dot(x[i], W[y[i]]).
> > > >
> > > > I am currently doing this using batched_dot as follows:
> > > >
> > > >     output = T.batched_dot(x, W[y])
> > > >
> > > > However, in general D >> K, and as such y will tend to contain
> > > > repeated indices. As a result, it seems that the W[y] operation
> > > > will cause a lot of memory to be duplicated, which quickly leads
> > > > to out-of-memory issues for large D.
> > > >
> > > > I tried implementing it using scan instead:
> > > >
> > > >     output, _ = theano.scan(
> > > >         fn=lambda x, y, W: T.dot(x, W[y]),
> > > >         sequences=[x, y],
> > > >         non_sequences=W)
> > > >
> > > > but this turns out to be orders of magnitude slower:
> > > >
> > > > Class
> > > > ---
> > > > <% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
> > > >   98.9%   98.9%  32.331s  6.94e-02s  Py   466   2  theano.scan_module.scan_op.Scan
> > > >    0.6%   99.5%   0.188s  2.44e-05s  C   7689  33  theano.sandbox.cuda.basic_ops.GpuElemwise
> > > > [...]
> > > > Ops
> > > > ---
> > > > <% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
> > > >   90.6%   90.6%  29.602s  1.27e-01s  Py   233   1  forall_inplace,gpu,grad_of_scan_fn}
> > > >    8.3%   98.9%   2.729s  1.17e-02s  Py   233   1  for{gpu,scan_fn}
> > > >    0.2%   99.1%   0.072s  5.12e-05s  C  1398   6  GpuCAReduce{add}{0,1}
> > > > [...]
> > > >
> > > > I created a gist comparing different approaches here:
> > > > https://gist.github.com/arvidfm/4cff3e8d215e8d0c5629d968e355f0d9
> > > > On my system this outputs:
> > > >
> > > >     Running batched_dot1
> > > >     Took 4.75645505997818 seconds
> > > >     Running batched_dot2
> > > >     Took 4.3897430250654 seconds
> > > >     Running batched_dot3
> > > >     Took 26.59151006198954 seconds
> > > >
> > > > Is there any way to perform this calculation efficiently without
> > > > duplicating memory?
> > > >
> > > > I'm running Theano 0.9.0.dev2.
> > > >
> > > > Regards,
> > > > Arvid Fahlström Myrman
> > > >
> > > > --
> > > > ---
> > > > You received this message because you are subscribed to the Google
> > > > Groups "theano-users" group.
> > > > To unsubscribe from this group and stop receiving emails from it,
> > > > send an email to [email protected].
> > > > For more options, visit https://groups.google.com/d/optout.
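An approach not raised in the thread, shown here only to make the memory argument concrete: because D >> K, the same result can be computed without materialising W[y] at all by grouping the rows of x that share an index and doing one ordinary dot per distinct matrix. A plain-NumPy sketch (shapes and names are illustrative, not from the gist):

```python
import numpy as np

# Illustrative shapes: K matrices of size N x M, D input rows, index vector y.
K, N, M, D = 4, 5, 6, 1000
rng = np.random.RandomState(1)
W = rng.randn(K, N, M)
x = rng.randn(D, N)
y = rng.randint(0, K, size=D)

def batched_dot_grouped(x, y, W):
    """Compute out[i] = x[i] @ W[y[i]] with K dots instead of D:
    all rows sharing an index k are multiplied by W[k] in one call,
    so W is never gathered/duplicated along the batch dimension."""
    out = np.empty((x.shape[0], W.shape[2]))
    for k in np.unique(y):
        mask = (y == k)
        out[mask] = x[mask] @ W[k]   # (d_k, N) @ (N, M) -> (d_k, M)
    return out

reference = np.array([x[i] @ W[y[i]] for i in range(D)])
assert np.allclose(batched_dot_grouped(x, y, W), reference)
```

The trade-off is K separate (and unevenly sized) matrix products instead of one batched call, which may or may not be faster on a GPU, but the peak memory no longer scales with the number of repeated indices.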
