Dear Theano community,
I am using PyMC3 to sample from a Bernoulli logistic regression model of
neural spike time data. I am using a custom likelihood function, which
essentially computes a number of vector dot products. This function is
embedded in the larger PyMC3 machinery (specifically, NUTS sampling). The
sampling is quite slow, and upon profiling I realized that it is not so much
the sampling machinery as my likelihood function that takes up most of the
computation time. This is the relevant code:
    # shapes:
    # theta             (ntrial, ncell)
    # beta              (ntrial, ncell, ncell)
    # xi                (ntrial, ncell, ncell)
    # _unit_influence   (ntrial, ncell, ncell, ntime)
    # _time_rate_prod   (ntrial, ncell, 1, ntime)
    # Y                 (ntrial, ncell, ntime)
    weighted_sum = (
        theta[:, :, np.newaxis] +
        T.sum(beta[:, :, :, np.newaxis] * _unit_influence, axis=1)
    )
    weighted_sum += T.sum(xi[:, :, :, np.newaxis] * _time_rate_prod, axis=1)
    retval = T.sum(Y.astype('float32') * weighted_sum -
                   T.log1p(T.exp(weighted_sum)))
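One incidental note on the last line: written naively, log1p(exp(x)) overflows
for large x, whereas the equivalent softplus formulation does not (judging from
the scalar_softplus ops in the profile below, Theano's graph optimizer already
rewrites it, but it is worth knowing when reading the code). A small NumPy
sketch of the stable form, with made-up inputs:

    import numpy as np

    def softplus(x):
        # Numerically stable log(1 + exp(x)):
        # log(1 + exp(x)) = max(x, 0) + log1p(exp(-|x|))
        return np.maximum(x, 0.0) + np.log1p(np.exp(-np.abs(x)))

    x = np.array([-5.0, 0.0, 1000.0])
    with np.errstate(over='ignore'):
        naive = np.log1p(np.exp(x))   # overflows to inf at x = 1000
    stable = softplus(x)              # finite everywhere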
Note that I am essentially computing a dot product along one dimension, twice,
but instead of using T.dot I do an elementwise multiplication followed by
T.sum. I do this because I could not find a way to express the dot products I
need in Theano (in plain NumPy, einsum would be the solution; see also my
question on StackOverflow:
<http://stackoverflow.com/questions/42766107/vector-dot-product-along-one-dimension-for-multidimensional-arrays>
).
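To make the intended contraction concrete: if I read the shapes correctly, the
first multiply-then-sum is exactly an einsum over the presynaptic cell index.
A small NumPy sketch with made-up sizes (ntrial=2, ncell=3, ntime=4), checking
that the two formulations agree:

    import numpy as np

    ntrial, ncell, ntime = 2, 3, 4
    rng = np.random.RandomState(0)
    beta = rng.randn(ntrial, ncell, ncell)
    unit_influence = rng.randn(ntrial, ncell, ncell, ntime)

    # Elementwise-multiply-then-sum, as in the likelihood code:
    via_sum = (beta[:, :, :, np.newaxis] * unit_influence).sum(axis=1)

    # The same contraction as an einsum: sum over the first cell
    # index i, keeping trial t, cell j, and time k.
    via_einsum = np.einsum('tij,tijk->tjk', beta, unit_influence)

    assert np.allclose(via_sum, via_einsum)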
Now, it could very well be that what I am doing is already the optimal way to
approach this, and my slow sampling is just a consequence of a rather complex
model. But I hope there is still some way to improve it, and I would be very
grateful for any tips anyone can offer.
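One concrete thought (hedged, since I may be misreading the shapes): because
_time_rate_prod has a broadcast dimension of size 1, the second reduction is a
plain batched matrix product, sum_i xi[t,i,j] * P[t,i,k], which in Theano
might be expressible with T.batched_dot and thus map to GEMM on the GPU. A
NumPy sketch with made-up sizes, checking the equivalence:

    import numpy as np

    ntrial, ncell, ntime = 2, 3, 4
    rng = np.random.RandomState(1)
    xi = rng.randn(ntrial, ncell, ncell)
    time_rate_prod = rng.randn(ntrial, ncell, 1, ntime)

    # Broadcast-multiply-then-sum, as in the likelihood code:
    via_sum = (xi[:, :, :, np.newaxis] * time_rate_prod).sum(axis=1)

    # The same contraction as a batched matrix product:
    # result[t, j, k] = sum_i xi[t, i, j] * P[t, i, k]
    via_matmul = np.matmul(xi.transpose(0, 2, 1),
                           time_rate_prod[:, :, 0, :])

    assert np.allclose(via_sum, via_matmul)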
The profiling summary is as follows:
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
  46.2%   46.2%  10.971s  2.74e-05s  C  400764  42  theano.sandbox.cuda.basic_ops.GpuElemwise
  29.9%   76.0%   7.098s  3.72e-05s  C  190840  20  theano.sandbox.cuda.basic_ops.GpuCAReduce
   7.2%   83.2%   1.699s  1.48e-05s  C  114504  12  theano.sandbox.cuda.blas.GpuDot22
   3.8%   87.0%   0.911s  4.78e-05s  C   19084   2  theano.sandbox.cuda.basic_ops.GpuJoin
   3.8%   90.9%   0.907s  5.59e-06s  C  162214  17  theano.sandbox.cuda.basic_ops.GpuFromHost
   2.9%   93.8%   0.700s  1.05e-05s  C   66794   7  theano.sandbox.cuda.basic_ops.HostFromGpu
   2.1%   95.9%   0.501s  1.14e-06s  C  438932  46  theano.sandbox.cuda.basic_ops.GpuReshape
   1.5%   97.4%   0.348s  1.46e-06s  C  238550  25  theano.tensor.elemwise.Elemwise
   1.4%   98.7%   0.327s  3.43e-05s  C    9542   1  theano.sandbox.cuda.blas.GpuGemv
   0.4%   99.2%   0.097s  9.28e-07s  C  104962  11  theano.sandbox.cuda.basic_ops.GpuDimShuffle
   0.3%   99.5%   0.081s  1.06e-06s  C   76336   8  theano.sandbox.cuda.basic_ops.GpuSubtensor
   0.2%   99.7%   0.042s  4.35e-06s  C    9542   1  theano.tensor.basic.Join
   0.1%   99.8%   0.033s  8.62e-07s  C   38168   4  theano.tensor.elemwise.DimShuffle
   0.1%   99.9%   0.019s  9.75e-07s  C   19084   2  theano.tensor.subtensor.Subtensor
   0.1%   99.9%   0.015s  1.54e-06s  C    9542   1  theano.sandbox.cuda.basic_ops.GpuAllocEmpty
   0.1%  100.0%   0.012s  6.46e-07s  C   19084   2  theano.compile.ops.ViewOp
   ... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)

Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
  24.7%  24.7%  5.860s  6.14e-05s  C   95420  10  GpuElemwise{mul,no_inplace}
  17.6%  42.2%  4.173s  1.09e-04s  C   38168   4  GpuCAReduce{add}{1,1,1}
   7.2%  49.4%  1.699s  1.48e-05s  C  114504  12  GpuDot22
   4.1%  53.5%  0.974s  2.55e-05s  C   38168   4  GpuCAReduce{add}{0,1,0}
   4.1%  57.6%  0.972s  2.55e-05s  C   38168   4  GpuCAReduce{add}{0,1}
   3.8%  61.4%  0.911s  4.78e-05s  C   19084   2  GpuJoin
   3.8%  65.2%  0.907s  5.59e-06s  C  162214  17  GpuFromHost
   2.9%  68.2%  0.700s  1.05e-05s  C   66794   7  HostFromGpu
   2.6%  70.7%  0.611s  6.40e-05s  C    9542   1  GpuElemwise{Composite{(i0 + (-scalar_sigmoid(((i1 + i2) + i3))))}}[(0, 2)]
   2.1%  72.9%  0.503s  5.28e-05s  C    9542   1  GpuElemwise{Composite{((i0 * i1) - scalar_softplus(i1))},no_inplace}
   2.0%  74.8%  0.468s  4.91e-05s  C    9542   1  GpuElemwise{Composite{(i0 + (-scalar_sigmoid(i1)))}}[(0, 1)]
   1.9%  76.7%  0.444s  1.16e-05s  C   38168   4  GpuCAReduce{add}{0,1,1}
   1.7%  78.4%  0.404s  4.24e-05s  C    9542   1  GpuElemwise{Composite{((i0 + i1) + i2)}}[(0, 1)]
   1.4%  79.8%  0.327s  3.43e-05s  C    9542   1  GpuGemv{inplace}
   1.4%  81.1%  0.322s  1.69e-05s  C   19084   2  GpuCAReduce{add}{0,0,1}
   1.3%  82.4%  0.313s  1.09e-05s  C   28626   3  GpuElemwise{Composite{((i0 * i1) + i2)}}[(0, 2)]
   1.0%  83.5%  0.246s  1.29e-05s  C   19084   2  GpuElemwise{scalar_sigmoid,no_inplace}
   0.9%  84.4%  0.221s  1.16e-06s  C  190840  20  GpuReshape{3}
   0.9%  85.3%  0.219s  1.15e-06s  C  190840  20  GpuReshape{2}
   0.9%  86.2%  0.214s  1.12e-05s  C   19084   2  GpuElemwise{Composite{(i0 + (i1 * sqr(i2)))},no_inplace}
   ... (remaining 49 Ops account for 13.76%(3.27s) of the runtime)

Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
  16.3%  16.3%  3.882s  4.07e-04s  9542  165  GpuCAReduce{add}{1,1,1}(GpuElemwise{Composite{((i0 * i1) - scalar_softplus(i1))},no_inplace}.0)
   3.4%  19.7%  0.810s  8.48e-05s  9542  169  GpuElemwise{mul,no_inplace}(GpuDimShuffle{0,x,1,2}.0, CudaNdarrayConstant{
   3.4%  23.1%  0.802s  8.40e-05s  9542   71  GpuElemwise{mul,no_inplace}(GpuDimShuffle{0,x,1,2}.0, CudaNdarrayConstant{
   3.1%  26.2%  0.730s  7.65e-05s  9542   70  GpuElemwise{mul,no_inplace}(GpuDimShuffle{0,x,1,2}.0, CudaNdarrayConstant{
   3.0%  29.2%  0.720s  7.55e-05s  9542  170  GpuElemwise{mul,no_inplace}(GpuDimShuffle{0,x,1,2}.0, CudaNdarrayConstant{
   2.9%  32.1%  0.692s  7.25e-05s  9542   47  GpuElemwise{mul,no_inplace}(GpuDimShuffle{0,1,2,x}.0, CudaNdarrayConstant{
   2.9%  35.0%  0.681s  7.13e-05s  9542  134  GpuElemwise{mul,no_inplace}(GpuDimShuffle{0,1,2,x}.0, CudaNdarrayConstant{
   2.6%  37.6%  0.611s  6.40e-05s  9542   63  GpuElemwise{Composite{(i0 + (-scalar_sigmoid(((i1 + i2) + i3))))}}[(0, 2)](CudaNdarrayConstant{
   2.6%  40.1%  0.608s  6.37e-05s  9542   46  GpuElemwise{mul,no_inplace}(GpuDimShuffle{0,1,2,x}.0, CudaNdarrayConstant{
   2.5%  42.7%  0.603s  6.32e-05s  9542  135  GpuElemwise{mul,no_inplace}(GpuDimShuffle{0,1,2,x}.0, CudaNdarrayConstant{
   2.1%  44.8%  0.503s  5.28e-05s  9542  161  GpuElemwise{Composite{((i0 * i1) - scalar_softplus(i1))},no_inplace}(CudaNdarrayConstant{
   2.0%  46.8%  0.468s  4.91e-05s  9542  163  GpuElemwise{Composite{(i0 + (-scalar_sigmoid(i1)))}}[(0, 1)](CudaNdarrayConstant{[[[ 0. 0. 0. ..., 0. 0. 0.] ...
   ... (remaining 181 Apply instances account for 42.13%(10.01s) of the runtime)
Thank you very much!
All the best,
Eelke