Hi,

For the GPU, integer types are only supported in the new GPU back-end
(device=cuda*). In the old back-end, integer computations would end up
running on the CPU. This is why, in many places, the advice is to avoid int
on the GPU. That advice no longer applies with the new back-end.
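
For example, with the new back-end (run with something like
THEANO_FLAGS=device=cuda0,floatX=float32), the int8 output of a comparison
stays on the GPU. Here is a small untested sketch (the shapes are just an
example, not taken from your code):

    import numpy as np
    import theano

    floatX = theano.config.floatX
    X = theano.shared(np.random.uniform(0, 100, size=(10000, 1000)).astype(floatX))
    X2 = theano.shared(np.random.uniform(0, 100, size=(10000, 1000)).astype(floatX))

    mask = X <= X2   # elemwise comparison with an int8 output; runs on the GPU with device=cuda*
    Y = X * mask     # the multiplication upcasts the result back to floatX
    # A common workaround for the old back-end was to cast the mask explicitly:
    # Y = X * theano.tensor.cast(X <= X2, floatX)
    f = theano.function([], Y.sum(axis=1))
    print(f().shape)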

As for the reduction being slow: we didn't parallelize it on the CPU. It
wasn't a bottleneck on the CPU, and we don't have much time to spend
optimizing the CPU path. So I would recommend timing your real model on the
CPU before spending much time on a parallel CPU reduction, as it is
probably not the problem.
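
If you want to check that on your side, Theano's profiler will show how
much time the reduction really takes. A minimal untested sketch (the toy
graph below just stands in for your real model):

    import numpy as np
    import theano

    floatX = theano.config.floatX
    X = theano.shared(np.random.uniform(0, 100, size=(10000, 1000)).astype(floatX))
    X2 = theano.shared(np.random.uniform(0, 100, size=(10000, 1000)).astype(floatX))

    # Stand-in for the real computation graph.
    Y = X
    for _ in range(100):
        Y = Y * (Y - X2)

    f = theano.function([], Y.sum(axis=1), profile=True)
    for _ in range(10):
        f()
    # Print the per-Op breakdown and look at the share taken by the Sum/CAReduce node.
    f.profile.summary()

If the reduction does not dominate that breakdown, it is not worth
optimizing.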

Fred

On Mon, Feb 6, 2017 at 8:11 PM Kiuhnm Mnhuik <[email protected]> wrote:

> Reductions are quite slow. Without the final reduction I get a 100x
> speed-up.
> Why is Y.sum(axis=1) so slow? I think that if each core handled a single
> row, it would be about 10 times faster for matrices with many rows, as in
> this case. Theano is probably using an O(log n) algorithm, which is only
> useful when one needs to reduce a single long vector.
> Can you confirm?
>
>
> On Tuesday, February 7, 2017 at 12:37:02 AM UTC+1, Kiuhnm Mnhuik wrote:
>
> I tried the following code:
>
>     import numpy as np
>     import theano
>
>     floatX = theano.config.floatX
>
>     def test_speed():
>         print('Computing X and X2...', end='', flush=True)
>         X_np = np.random.uniform(0, 100, size=(10000, 1000)).astype(floatX)
>         X2_np = np.random.uniform(0, 100, size=(10000, 1000)).astype(floatX)
>         print('done!', flush=True)
>
>         print('Moving X and X2 to the GPU...', end='', flush=True)
>         X = theano.shared(X_np)
>         X2 = theano.shared(X2_np)
>         print('done!', flush=True)
>
>         print('Building the graph...', end='', flush=True)
>         Y = X
>         for _ in range(100):
>             # Y = Y * (Y <= X2)
>             Y = Y * (Y - X2)
>         Y.sum(axis=1)  # NOTE: not assigned, so the sum is not part of the compiled function below
>         print('done!', flush=True)
>
>         print('compiling...', end='', flush=True)
>         f = theano.function([], Y)
>         print('done!', flush=True)
>
>         import time
>         t = time.clock()
>         f()
>         print(time.clock() - t)
>
> Note that the loop contains one line with '<=' and another with '-'.
> They are mutually exclusive (only one is active; the other is commented
> out). Here are the timings in seconds:
>
>             CPU      GPU
>     '-'     0.21    0.016
>     '<='    0.39    0.019
>
> I'd say I don't need to worry about using comparisons.
>
> On Monday, February 6, 2017 at 1:20:13 PM UTC+1, Kiuhnm Mnhuik wrote:
>
> I'm using Theano 0.9.0b1 with the new back-end.
> Should I use float32 for everything (even for bool masks) for maximum
> speed on GPU (GTX 970)?
>
