Dear Theano community,
I am using PyMC3 to sample from a Bernoulli logistic regression model of
neural spike time data. I am using a custom likelihood function, which
essentially computes a number of vector dot products. This function is
embedded in the larger PyMC3 machinery (specifically, NUTS sampling). The
sampling is quite slow, and upon profiling I realized that it is not so much
the sampling machinery as my likelihood function that takes up most of the
computation time. This is the relevant code:
    # shapes:
    # theta             (ntrial, ncell)
    # beta              (ntrial, ncell, ncell)
    # xi                (ntrial, ncell, ncell)
    # _unit_influence   (ntrial, ncell, ncell, ntime)
    # _time_rate_prod   (ntrial, ncell, 1, ntime)
    # Y                 (ntrial, ncell, ntime)
    weighted_sum = (
        theta[:, :, np.newaxis] +
        T.sum(beta[:, :, :, np.newaxis] * _unit_influence, axis=1)
    )
    weighted_sum += T.sum(xi[:, :, :, np.newaxis] * _time_rate_prod, axis=1)
    retval = T.sum(Y.astype('float32') * weighted_sum -
                   T.log1p(T.exp(weighted_sum)))
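One incidental note on the last line: written naively, log1p(exp(x)) overflows
for large x, whereas the equivalent softplus formulation does not (judging from
the scalar_softplus ops in the profile below, Theano's graph optimizer already
rewrites it, but it is worth knowing when reading the code). A small NumPy
sketch of the stable form, with made-up inputs:

    import numpy as np

    def softplus(x):
        # Numerically stable log(1 + exp(x)):
        # log(1 + exp(x)) = max(x, 0) + log1p(exp(-|x|))
        return np.maximum(x, 0.0) + np.log1p(np.exp(-np.abs(x)))

    x = np.array([-5.0, 0.0, 1000.0])
    with np.errstate(over='ignore'):
        naive = np.log1p(np.exp(x))   # overflows to inf at x = 1000
    stable = softplus(x)              # finite everywhere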
Note that I am essentially computing a dot product along one dimension, twice,
but instead of using T.dot I do an elementwise multiplication followed by
T.sum. I do this because I could not find a way to express the dot products I
need in Theano (in plain NumPy, einsum would be the solution; see also my
question on StackOverflow:
<http://stackoverflow.com/questions/42766107/vector-dot-product-along-one-dimension-for-multidimensional-arrays>
).
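To make the intended contraction concrete: if I read the shapes correctly, the
first multiply-then-sum is exactly an einsum over the presynaptic cell index.
A small NumPy sketch with made-up sizes (ntrial=2, ncell=3, ntime=4), checking
that the two formulations agree:

    import numpy as np

    ntrial, ncell, ntime = 2, 3, 4
    rng = np.random.RandomState(0)
    beta = rng.randn(ntrial, ncell, ncell)
    unit_influence = rng.randn(ntrial, ncell, ncell, ntime)

    # Elementwise-multiply-then-sum, as in the likelihood code:
    via_sum = (beta[:, :, :, np.newaxis] * unit_influence).sum(axis=1)

    # The same contraction as an einsum: sum over the first cell
    # index i, keeping trial t, cell j, and time k.
    via_einsum = np.einsum('tij,tijk->tjk', beta, unit_influence)

    assert np.allclose(via_sum, via_einsum)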
Now, it could very well be that what I am doing is already the optimal way to
approach this, and my slow sampling is just a consequence of a rather complex
model. But I hope there is still some way to improve it, and I would be very
grateful for any tips anyone can offer.
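One concrete thought (hedged, since I may be misreading the shapes): because
_time_rate_prod has a broadcast dimension of size 1, the second reduction is a
plain batched matrix product, sum_i xi[t,i,j] * P[t,i,k], which in Theano
might be expressible with T.batched_dot and thus map to GEMM on the GPU. A
NumPy sketch with made-up sizes, checking the equivalence:

    import numpy as np

    ntrial, ncell, ntime = 2, 3, 4
    rng = np.random.RandomState(1)
    xi = rng.randn(ntrial, ncell, ncell)
    time_rate_prod = rng.randn(ntrial, ncell, 1, ntime)

    # Broadcast-multiply-then-sum, as in the likelihood code:
    via_sum = (xi[:, :, :, np.newaxis] * time_rate_prod).sum(axis=1)

    # The same contraction as a batched matrix product:
    # result[t, j, k] = sum_i xi[t, i, j] * P[t, i, k]
    via_matmul = np.matmul(xi.transpose(0, 2, 1),
                           time_rate_prod[:, :, 0, :])

    assert np.allclose(via_sum, via_matmul)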
The profiling summary is as follows:
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
  46.2%   46.2%  10.971s  2.74e-05s  C  400764  42  theano.sandbox.cuda.basic_ops.GpuElemwise
  29.9%   76.0%   7.098s  3.72e-05s  C  190840  20  theano.sandbox.cuda.basic_ops.GpuCAReduce
   7.2%   83.2%   1.699s  1.48e-05s  C  114504  12  theano.sandbox.cuda.blas.GpuDot22
   3.8%   87.0%   0.911s  4.78e-05s  C   19084   2  theano.sandbox.cuda.basic_ops.GpuJoin
   3.8%   90.9%   0.907s  5.59e-06s  C  162214  17  theano.sandbox.cuda.basic_ops.GpuFromHost
   2.9%   93.8%   0.700s  1.05e-05s  C   66794   7  theano.sandbox.cuda.basic_ops.HostFromGpu
   2.1%   95.9%   0.501s  1.14e-06s  C  438932  46  theano.sandbox.cuda.basic_ops.GpuReshape
   1.5%   97.4%   0.348s  1.46e-06s  C  238550  25  theano.tensor.elemwise.Elemwise
   1.4%   98.7%   0.327s  3.43e-05s  C    9542   1  theano.sandbox.cuda.blas.GpuGemv
   0.4%   99.2%   0.097s  9.28e-07s  C  104962  11  theano.sandbox.cuda.basic_ops.GpuDimShuffle
   0.3%   99.5%   0.081s  1.06e-06s  C   76336   8  theano.sandbox.cuda.basic_ops.GpuSubtensor
   0.2%   99.7%   0.042s  4.35e-06s  C    9542   1  theano.tensor.basic.Join
   0.1%   99.8%   0.033s  8.62e-07s  C   38168   4  theano.tensor.elemwise.DimShuffle
   0.1%   99.9%   0.019s  9.75e-07s  C   19084   2  theano.tensor.subtensor.Subtensor
   0.1%   99.9%   0.015s  1.54e-06s  C    9542   1  theano.sandbox.cuda.basic_ops.GpuAllocEmpty
   0.1%  100.0%   0.012s  6.46e-07s  C   19084   2  theano.compile.ops.ViewOp
   ... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)

Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
  24.7%  24.7%  5.860s  6.14e-05s  C   95420  10  GpuElemwise{mul,no_inplace}
  17.6%  42.2%  4.173s  1.09e-04s  C   38168   4  GpuCAReduce{add}{1,1,1}
   7.2%  49.4%  1.699s  1.48e-05s  C  114504  12  GpuDot22
   4.1%  53.5%  0.974s  2.55e-05s  C   38168   4  GpuCAReduce{add}{0,1,0}
   4.1%  57.6%  0.972s  2.55e-05s  C   38168   4  GpuCAReduce{add}{0,1}
   3.8%  61.4%  0.911s  4.78e-05s  C   19084   2  GpuJoin
   3.8%  65.2%  0.907s  5.59e-06s  C  162214  17  GpuFromHost
   2.9%  68.2%  0.700s  1.05e-05s  C   66794   7  HostFromGpu
   2.6%  70.7%  0.611s  6.40e-05s  C    9542   1  GpuElemwise{Composite{(i0 + (-scalar_sigmoid(((i1 + i2) + i3))))}}[(0, 2)]
   2.1%  72.9%  0.503s  5.28e-05s  C    9542   1  GpuElemwise{Composite{((i0 * i1) - scalar_softplus(i1))},no_inplace}
   2.0%  74.8%  0.468s  4.91e-05s  C    9542   1  GpuElemwise{Composite{(i0 + (-scalar_sigmoid(i1)))}}[(0, 1)]
   1.9%  76.7%  0.444s  1.16e-05s  C   38168   4  GpuCAReduce{add}{0,1,1}
   1.7%  78.4%  0.404s  4.24e-05s  C    9542   1  GpuElemwise{Composite{((i0 + i1) + i2)}}[(0, 1)]
   1.4%  79.8%  0.327s  3.43e-05s  C    9542   1  GpuGemv{inplace}
   1.4%  81.1%  0.322s  1.69e-05s  C   19084   2  GpuCAReduce{add}{0,0,1}
   1.3%  82.4%  0.313s  1.09e-05s  C   28626   3  GpuElemwise{Composite{((i0 * i1) + i2)}}[(0, 2)]
   1.0%  83.5%  0.246s  1.29e-05s  C   19084   2  GpuElemwise{scalar_sigmoid,no_inplace}
   0.9%  84.4%  0.221s  1.16e-06s  C  190840  20  GpuReshape{3}
   0.9%  85.3%  0.219s  1.15e-06s  C  190840  20  GpuReshape{2}
   0.9%  86.2%  0.214s  1.12e-05s  C   19084   2  GpuElemwise{Composite{(i0 + (i1 * sqr(i2)))},no_inplace}
   ... (remaining 49 Ops account for 13.76%(3.27s) of the runtime)

Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
  16.3%  16.3%  3.882s  4.07e-04s  9542  165  GpuCAReduce{add}{1,1,1}(GpuElemwise{Composite{((i0 * i1) - scalar_softplus(i1))},no_inplace}.0)
   3.4%  19.7%  0.810s  8.48e-05s  9542  169  GpuElemwise{mul,no_inplace}(GpuDimShuffle{0,x,1,2}.0, CudaNdarrayConstant{
   3.4%  23.1%  0.802s  8.40e-05s  9542   71  GpuElemwise{mul,no_inplace}(GpuDimShuffle{0,x,1,2}.0, CudaNdarrayConstant{
   3.1%  26.2%  0.730s  7.65e-05s  9542   70  GpuElemwise{mul,no_inplace}(GpuDimShuffle{0,x,1,2}.0, CudaNdarrayConstant{
   3.0%  29.2%  0.720s  7.55e-05s  9542  170  GpuElemwise{mul,no_inplace}(GpuDimShuffle{0,x,1,2}.0, CudaNdarrayConstant{
   2.9%  32.1%  0.692s  7.25e-05s  9542   47  GpuElemwise{mul,no_inplace}(GpuDimShuffle{0,1,2,x}.0, CudaNdarrayConstant{
   2.9%  35.0%  0.681s  7.13e-05s  9542  134  GpuElemwise{mul,no_inplace}(GpuDimShuffle{0,1,2,x}.0, CudaNdarrayConstant{
   2.6%  37.6%  0.611s  6.40e-05s  9542   63  GpuElemwise{Composite{(i0 + (-scalar_sigmoid(((i1 + i2) + i3))))}}[(0, 2)](CudaNdarrayConstant{
   2.6%  40.1%  0.608s  6.37e-05s  9542   46  GpuElemwise{mul,no_inplace}(GpuDimShuffle{0,1,2,x}.0, CudaNdarrayConstant{
   2.5%  42.7%  0.603s  6.32e-05s  9542  135  GpuElemwise{mul,no_inplace}(GpuDimShuffle{0,1,2,x}.0, CudaNdarrayConstant{
   2.1%  44.8%  0.503s  5.28e-05s  9542  161  GpuElemwise{Composite{((i0 * i1) - scalar_softplus(i1))},no_inplace}(CudaNdarrayConstant{
   2.0%  46.8%  0.468s  4.91e-05s  9542  163  GpuElemwise{Composite{(i0 + (-scalar_sigmoid(i1)))}}[(0, 1)](CudaNdarrayConstant{[[[ 0. 0. 0. ..., 0. 0. 0.] ...
   ... (remaining 181 Apply instances account for 42.13%(10.01s) of the runtime)
Thank you very much!
All the best,
Eelke