[theano-users] Re: Some help optimizing a function involving 1D dot products for multidimensional tensors

2017-03-16 Thread Jesse Livezey
If I'm understanding your code correctly, you should be able to use tensordot
(http://deeplearning.net/software/theano/library/tensor/basic.html#theano.tensor.tensordot)
rather than doing the multiply and sum.
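For instance (a toy sketch with made-up shapes; NumPy's tensordot has the same
semantics as theano.tensor.tensordot, so the equivalence carries over directly):

```python
import numpy as np

# Made-up shapes for illustration: a (batch, n, d) tensor whose last
# axis is dotted against a length-d weight vector.
x = np.random.rand(8, 5, 3)
w = np.random.rand(3)

# multiply-and-sum: one elementwise product plus one reduction
# (in Theano this compiles to a GpuElemwise followed by a GpuCAReduce)
slow = (x * w).sum(axis=-1)

# tensordot contracts the same axis pair in a single call, which
# Theano can map to one BLAS/GPU op instead of two
fast = np.tensordot(x, w, axes=[[2], [0]])

assert np.allclose(slow, fast)  # both have shape (8, 5)
```

In Theano the second form would be T.tensordot(x, w, axes=[[2], [0]]) on
symbolic variables; the axes argument names the contracted axis of each operand.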

On Thursday, March 16, 2017 at 10:59:14 AM UTC-4, Eelke Spaak wrote:
>
> Apologies for the messed up profiling code, here is attempt 2:
>
> Class
> ---
> <% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
>   46.2%    46.2%   10.971s   2.74e-05s   C   400764   42   theano.sandbox.cuda.basic_ops.GpuElemwise
>   29.9%    76.0%    7.098s   3.72e-05s   C   190840   20   theano.sandbox.cuda.basic_ops.GpuCAReduce
>    7.2%    83.2%    1.699s   1.48e-05s   C   114504   12   theano.sandbox.cuda.blas.GpuDot22
>    3.8%    87.0%    0.911s   4.78e-05s   C    19084    2   theano.sandbox.cuda.basic_ops.GpuJoin
>    3.8%    90.9%    0.907s   5.59e-06s   C   162214   17   theano.sandbox.cuda.basic_ops.GpuFromHost
>    2.9%    93.8%    0.700s   1.05e-05s   C    66794    7   theano.sandbox.cuda.basic_ops.HostFromGpu
>    2.1%    95.9%    0.501s   1.14e-06s   C   438932   46   theano.sandbox.cuda.basic_ops.GpuReshape
>    1.5%    97.4%    0.348s   1.46e-06s   C   238550   25   theano.tensor.elemwise.Elemwise
>    1.4%    98.7%    0.327s   3.43e-05s   C     9542    1   theano.sandbox.cuda.blas.GpuGemv
>    0.4%    99.2%    0.097s   9.28e-07s   C   104962   11   theano.sandbox.cuda.basic_ops.GpuDimShuffle
>    0.3%    99.5%    0.081s   1.06e-06s   C    76336    8   theano.sandbox.cuda.basic_ops.GpuSubtensor
>    0.2%    99.7%    0.042s   4.35e-06s   C     9542    1   theano.tensor.basic.Join
>    0.1%    99.8%    0.033s   8.62e-07s   C    38168    4   theano.tensor.elemwise.DimShuffle
>    0.1%    99.9%    0.019s   9.75e-07s   C    19084    2   theano.tensor.subtensor.Subtensor
>    0.1%    99.9%    0.015s   1.54e-06s   C     9542    1   theano.sandbox.cuda.basic_ops.GpuAllocEmpty
>    0.1%   100.0%    0.012s   6.46e-07s   C    19084    2   theano.compile.ops.ViewOp
>    ... (remaining 0 Classes account for   0.00%(0.00s) of the runtime)
>
> Ops
> ---
> <% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
>   24.7%    24.7%    5.860s   6.14e-05s   C    95420   10   GpuElemwise{mul,no_inplace}
>   17.6%    42.2%    4.173s   1.09e-04s   C    38168    4   GpuCAReduce{add}{1,1,1}
>    7.2%    49.4%    1.699s   1.48e-05s   C   114504   12   GpuDot22
>    4.1%    53.5%    0.974s   2.55e-05s   C    38168    4   GpuCAReduce{add}{0,1,0}
>    4.1%    57.6%    0.972s   2.55e-05s   C    38168    4   GpuCAReduce{add}{0,1}
>    3.8%    61.4%    0.911s   4.78e-05s   C    19084    2   GpuJoin
>    3.8%    65.2%    0.907s   5.59e-06s   C   162214   17   GpuFromHost
>    2.9%    68.2%    0.700s   1.05e-05s   C    66794    7   HostFromGpu
>    2.6%    70.7%    0.611s   6.40e-05s   C     9542    1   GpuElemwise{Composite{(i0 + (-scalar_sigmoid(((i1 + i2) + i3}}[(0, 2)]
>    2.1%    72.9%    0.503s   5.28e-05s   C     9542    1   GpuElemwise{Composite{((i0 * i1) - scalar_softplus(i1))},no_inplace}
>    2.0%    74.8%    0.468s   4.91e-05s   C     9542    1   GpuElemwise{Composite{(i0 + (-scalar_sigmoid(i1)))}}[(0, 1)]
>    1.9%    76.7%    0.444s   1.16e-05s   C    38168    4   GpuCAReduce{add}{0,1,1}
>    1.7%    78.4%    0.404s   4.24e-05s   C     9542    1   GpuElemwise{Composite{((i0 + i1) + i2)}}[(0, 1)]
>    1.4%    79.8%    0.327s   3.43e-05s   C     9542    1   GpuGemv{inplace}
>    1.4%    81.1%    0.322s   1.69e-05s   C    19084    2   GpuCAReduce{add}{0,0,1}
>    1.3%    82.4%    0.313s   1.09e-05s   C    28626    3   GpuElemwise{Composite{((i0 * i1) + i2)}}[(0, 2)]
>    1.0%    83.5%    0.246s   1.29e-05s   C    19084    2   GpuElemwise{scalar_sigmoid,no_inplace}
>    0.9%    84.4%    0.221s   1.16e-06s   C   190840   20   GpuReshape{3}
>    0.9%    85.3%    0.219s   1.15e-06s   C   190840   20   GpuReshape{2}
>    0.9%    86.2%    0.214s   1.12e-05s   C    19084    2   GpuElemwise{Composite{(i0 + (i1 * sqr(i2)))},no_inplace}
>    ... (remaining 49 Ops account for  13.76%(3.27s) of the runtime)
>
> Apply
> ------
> <% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
>   16.3%    16.3%    3.882s   4.07e-04s   9542   165   GpuCAReduce{add}{1,1,1}(GpuElemwise{Composite{((i0 * i1) - scalar_softplus(i1))},no_inplace}.0)
>    3.4%    19.7%    0.810s   8.48e-05s   9542   169   GpuElemwise{mul,no_inplace}(GpuDimShuffle{0,x,1,2}.0, CudaNdarrayConstant{
>    3.4%    23.1%    0.802s   8.40e-05s   9542    71   GpuElemwise{mul,no_inplace}(GpuDimShuffle{0,x,1,2}.0, CudaNdarrayConstant{
>    3.1%    26.2%    0.730s   7.65e-05s   9542    70   GpuElemwise{mul,no_inplace}(GpuDimShuffle{0,x,1,2}.0, CudaNdarrayConstant{
>    3.0%    29.2%    0.720s   7.55e-05s   9542   170   GpuElemwise{mul,no_inplace}(GpuDimShuffle{0,x,1,2}.0, CudaNdarrayConstant{
>    2.9%    32.1%    0.692s   7.25e-05s   9542    47
