That's very useful, thank you!

On Wednesday, February 15, 2017 at 6:04:44 PM UTC+1, nouiz wrote:
>
> I have no idea whether what you propose would work well. You can make a new 
> Op that uses pycuda for the computation. We do that for our fft Op in the 
> new back-end:
>
> https://github.com/Theano/Theano/blob/master/theano/gpuarray/fft.py
>
> On Sat, Feb 11, 2017 at 6:45 AM Kiuhnm Mnhuik <[email protected]> wrote:
>
>> What do you mean by "reusing a row"? If each core does one and only one 
>> reduction on a single row, then there shouldn't be any reuse.
>> I mean that one core of the GPU accesses and reduces one and only one 
>> specific row:
>>
>>        *** row1 ***        <----- just core 1 
>>        *** row2 ***        <----- just core 2
>>        *** row3 ***        <----- just core 3
>>        *** row4 ***        <----- just core 4
>>        *** row5 ***        <----- just core 5
>>
>> This makes sense because there are so many rows that all cores can run in 
>> parallel, each one working on its own row.
>>
>> Reductions aren't usually a bottleneck, but I'm doing something quite 
>> unusual.
>>
>> Can I use pyCuda to work *directly* on Theano data already allocated on 
>> the GPU? This might be my only option. I can't copy or move the data back 
>> to the CPU or it'll kill performance.
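>>
>> For reference, a minimal standalone pycuda sketch of the one-core-per-row 
>> idea (names and sizes are illustrative; it allocates its own data rather 
>> than reusing Theano's buffers, so working on Theano's GPU data directly 
>> would still need an Op along the lines of the fft one linked above):
>>
>>     import numpy as np
>>     import pycuda.autoinit  # noqa: F401 -- creates a CUDA context
>>     import pycuda.gpuarray as gpuarray
>>     from pycuda.compiler import SourceModule
>>
>>     mod = SourceModule("""
>>     // One thread reduces one row sequentially.
>>     __global__ void row_sums(const float *X, float *out,
>>                              int n_rows, int n_cols)
>>     {
>>         int row = blockIdx.x * blockDim.x + threadIdx.x;
>>         if (row < n_rows) {
>>             float acc = 0.0f;
>>             for (int j = 0; j < n_cols; ++j)
>>                 acc += X[row * n_cols + j];
>>             out[row] = acc;
>>         }
>>     }
>>     """)
>>     row_sums = mod.get_function("row_sums")
>>
>>     n_rows, n_cols = 10000, 1000
>>     X = gpuarray.to_gpu(np.random.rand(n_rows, n_cols).astype(np.float32))
>>     out = gpuarray.empty(n_rows, dtype=np.float32)
>>
>>     threads = 256
>>     blocks = (n_rows + threads - 1) // threads
>>     row_sums(X.gpudata, out.gpudata, np.int32(n_rows), np.int32(n_cols),
>>              block=(threads, 1, 1), grid=(blocks, 1))
>>
>>     # sanity check against numpy (float32 accumulation, so loose tolerance)
>>     assert np.allclose(out.get(), X.get().sum(axis=1), rtol=1e-3)
>>
>> One caveat with this layout: consecutive threads read addresses n_cols 
>> floats apart, so the global loads are not coalesced; that may be why a 
>> tree reduction can still win in practice.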
>>
>>
>> On Friday, February 10, 2017 at 7:37:57 PM UTC+1, nouiz wrote:
>>
>>> X + Y is trivially parallelisable, but not X.sum(axis=1). I'm pretty sure 
>>> we do something sensible; I checked the code and it is the case.
>>>
>>> Reduction isn't trivially parallelisable. This is why it gets less of a 
>>> speedup. When we reuse a row, we can't parallelize it as much as when 
>>> adding 2 matrices.
>>> But in any case, in a real model it shouldn't make a difference; 
>>> reductions normally aren't a bottleneck. If you have such a case, I would 
>>> like to see a profile that shows it.
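>>>
>>> For reference, one way to produce such a profile is Theano's built-in 
>>> profiler (a sketch; Y stands for whatever graph output is being tested). 
>>> The per-Op summary shows whether the reduction actually dominates:
>>>
>>>     import theano
>>>
>>>     f = theano.function([], Y, profile=True)
>>>     for _ in range(100):
>>>         f()
>>>     f.profile.summary()  # prints the time spent per Op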
>>>
>>> Fred
>>>
>>> On Tue, Feb 7, 2017 at 6:28 PM Kiuhnm Mnhuik <[email protected]> wrote:
>>>
>>>> Hi Fred,
>>>>
>>>> I'm talking about the GPU. With a 10000x1000 matrix X, X.sum(axis=1) is 
>>>> 10 times slower than X + Y, where Y is another matrix of the same shape, 
>>>> according to my tests.
>>>> I suspect that you're reducing each row with some O(log n) algorithm, 
>>>> which makes sense when one needs to reduce a single long vector. But in 
>>>> this case, shouldn't we assign each row to a single core of the GPU and 
>>>> reduce the row as we would on the CPU? The parallelism would come from 
>>>> having so many rows.
>>>> Of course, if the matrix had just 10 rows this algorithm would be very 
>>>> slow, but with 10000 rows it should be faster than what you're doing right 
>>>> now. It might be almost as fast as doing X + Y.
>>>> I'm speculating since I've never looked into CUDA programming (it's on 
>>>> my TODO list!).
>>>>
>>>>
>>>> On Tuesday, February 7, 2017 at 10:49:47 PM UTC+1, nouiz wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> For the GPU, ints are only supported in the new GPU back-end 
>>>>> (device=cuda*). In the old back-end, they would end up on the CPU. This 
>>>>> is why in many places you are told not to use ints on the GPU. But that 
>>>>> isn't true with the new back-end.
>>>>>
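>>>>> For example, the masks produced by comparisons have dtype int8, which 
>>>>> the new back-end can keep on the GPU (a quick check):
>>>>>
>>>>>     import theano.tensor as T
>>>>>
>>>>>     x = T.matrix('x')
>>>>>     print((x <= x).dtype)  # prints 'int8': comparison Ops build int8 masks
>>>>>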
>>>>> As for the reduction being slow: we didn't parallelize it on the CPU. 
>>>>> It wasn't a bottleneck there, and we don't have much time to optimize 
>>>>> the CPU code. So I would recommend timing your real model on the CPU 
>>>>> before spending much time thinking about parallel reductions on the 
>>>>> CPU, as they are probably not the problem.
>>>>>
>>>>>
>>>>> Fred
>>>>>
>>>>> On Mon, Feb 6, 2017 at 8:11 PM Kiuhnm Mnhuik <[email protected]> 
>>>>> wrote:
>>>>>
>>>>>> Reductions are quite slow. Without the final reduction I get a 100x 
>>>>>> speedup.
>>>>>> Why is Y.sum(axis=1) so slow? I think that if each core handled a 
>>>>>> single row, it'd be 10 times faster for matrices with many rows, as in 
>>>>>> this case.
>>>>>> Theano is probably using an O(log n) algorithm, which is only useful 
>>>>>> when one needs to reduce a single but long vector.
>>>>>> Can you confirm?
>>>>>>
>>>>>>
>>>>>> On Tuesday, February 7, 2017 at 12:37:02 AM UTC+1, Kiuhnm Mnhuik 
>>>>>> wrote:
>>>>>>>
>>>>>>> I tried the following code:
>>>>>>>
>>>>>>>     import time
>>>>>>>
>>>>>>>     import numpy as np
>>>>>>>     import theano
>>>>>>>
>>>>>>>     floatX = theano.config.floatX
>>>>>>>
>>>>>>>     def test_speed():
>>>>>>>         print('Computing X and X2...', end='', flush=True)
>>>>>>>         X_np = np.random.uniform(0, 100, size=(10000, 1000)).astype(floatX)
>>>>>>>         X2_np = np.random.uniform(0, 100, size=(10000, 1000)).astype(floatX)
>>>>>>>         print('done!', flush=True)
>>>>>>>
>>>>>>>         print('Moving X and X2 to the GPU...', end='', flush=True)
>>>>>>>         X = theano.shared(X_np)
>>>>>>>         X2 = theano.shared(X2_np)
>>>>>>>         print('done!', flush=True)
>>>>>>>
>>>>>>>         print('Building the graph...', end='', flush=True)
>>>>>>>         Y = X
>>>>>>>         for _ in range(100):
>>>>>>>             # Y = Y * (Y <= X2)
>>>>>>>             Y = Y * (Y - X2)
>>>>>>>         Y = Y.sum(axis=1)  # assign, so the reduction stays in the graph
>>>>>>>         print('done!', flush=True)
>>>>>>>
>>>>>>>         print('compiling...', end='', flush=True)
>>>>>>>         f = theano.function([], Y)
>>>>>>>         print('done!', flush=True)
>>>>>>>
>>>>>>>         t = time.clock()
>>>>>>>         f()
>>>>>>>         print(time.clock() - t)
>>>>>>>
>>>>>>> Note that there is a line with '<=' and another with '-' in the 
>>>>>>> loop. They're mutually exclusive: comment one out and use the other. 
>>>>>>> Here are the timings in seconds:
>>>>>>>
>>>>>>>              CPU      GPU
>>>>>>>     '-'      0.21     0.016
>>>>>>>     '<='     0.39     0.019
>>>>>>>
>>>>>>> I'd say I don't need to worry about using comparisons.
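>>>>>>>
>>>>>>> (A steadier way to time this, assuming the same f as above: the first 
>>>>>>> call to a freshly compiled function can include one-time setup, so 
>>>>>>> warm up once and average over several calls.)
>>>>>>>
>>>>>>>     f()  # warm-up call
>>>>>>>     t = time.clock()
>>>>>>>     for _ in range(100):
>>>>>>>         f()
>>>>>>>     print((time.clock() - t) / 100)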
>>>>>>>
>>>>>>> On Monday, February 6, 2017 at 1:20:13 PM UTC+1, Kiuhnm Mnhuik wrote:
>>>>>>>>
>>>>>>>> I'm using Theano 0.9.0b1 with the new back-end.
>>>>>>>> Should I use float32 for everything (even for bool masks) for 
>>>>>>>> maximum speed on GPU (GTX 970)?
>>>>>>>>
