That's very useful, thank you!

On Wednesday, February 15, 2017 at 6:04:44 PM UTC+1, nouiz wrote:
>
> I have no idea if what you propose would work well. You can make a new Op
> that uses PyCUDA for the computation. We do that for our FFT op in the new
> back-end:
>
> https://github.com/Theano/Theano/blob/master/theano/gpuarray/fft.py
>
> On Sat, Feb 11, 2017 at 6:45 AM Kiuhnm Mnhuik <[email protected]> wrote:
>
>> What do you mean by "reusing a row"? If each core does one and only one
>> reduction on a single row, then there shouldn't be any reuse.
>> I mean that one core of the GPU accesses and reduces one and only one
>> specific row:
>>
>> *** row1 *** <----- just core 1
>> *** row2 *** <----- just core 2
>> *** row3 *** <----- just core 3
>> *** row4 *** <----- just core 4
>> *** row5 *** <----- just core 5
>>
>> This makes sense because there are so many rows that all cores can run in
>> parallel, each one working on its own row.
>>
>> Reductions aren't usually a bottleneck, but I'm doing something quite
>> unusual.
>>
>> Can I use PyCUDA to work *directly* on Theano data already allocated on
>> the GPU? This might be my only option. I can't copy or move the data back
>> to the CPU or it'll kill performance.
>>
>> On Friday, February 10, 2017 at 7:37:57 PM UTC+1, nouiz wrote:
>>
>>> X + Y is trivially parallelisable, but X.sum(axis=1) is not. I'm pretty
>>> sure we do something sensible; I checked the code and it is the case.
>>>
>>> A reduction isn't trivially parallelisable. This is why it gets less
>>> speed-up. When we reuse a row, we can't parallelize it as much as when
>>> adding two matrices.
>>> In any case, in a real model it shouldn't make a difference: reductions
>>> normally aren't a bottleneck. If you have such a case, I would like to
>>> see a profile that shows it.
>>>
>>> Fred
>>>
>>> On Tue, Feb 7, 2017 at 6:28 PM Kiuhnm Mnhuik <[email protected]> wrote:
>>>
>>>> Hi Fred,
>>>>
>>>> I'm talking about the GPU. With a 10000x1000 matrix X, X.sum(axis=1) is
>>>> 10 times slower than X + Y, where Y is another matrix of the same shape,
>>>> according to my tests.
>>>> I suspect that you're reducing each row with some O(log n) algorithm,
>>>> which makes sense when one needs to reduce a single long vector. But in
>>>> this case, shouldn't we assign each row to a single core of the GPU and
>>>> reduce the row as we would do on the CPU? The parallelism would result
>>>> from having so many rows.
>>>> Of course, if the matrix had just 10 rows this algorithm would be very
>>>> slow, but with 10000 rows it should be faster than what you're doing
>>>> right now. It might be almost as fast as doing X + Y.
>>>> I'm speculating, since I've never looked into CUDA programming (it's on
>>>> my TODO list!).
>>>>
>>>> On Tuesday, February 7, 2017 at 10:49:47 PM UTC+1, nouiz wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> For the GPU, ints are only supported in the new GPU back-end
>>>>> (device=cuda*). In the old back-end, they would end up on the CPU.
>>>>> This is why in many places it is said not to use ints on the GPU, but
>>>>> that isn't true with the new back-end.
>>>>>
>>>>> As for the reduction being slow: we didn't parallelize it on the CPU.
>>>>> It wasn't a bottleneck on the CPU and we don't have much time to
>>>>> optimize the CPU. So I would recommend timing your real model on the
>>>>> CPU before spending much time thinking about the parallel reduction on
>>>>> CPU, as it is probably not a problem.
>>>>>
>>>>> Fred
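For concreteness, the "one core per row" scheme discussed above could look
roughly like the PyCUDA sketch below. Everything in it is illustrative: the
kernel, names and launch parameters are assumptions, not Theano's actual
reduction code, and production kernels typically assign one *block* per row
with a shared-memory tree reduction instead, because the per-thread loop here
reads memory in an uncoalesced pattern.

import numpy as np
import pycuda.autoinit                      # creates a CUDA context
import pycuda.gpuarray as gpuarray
from pycuda.compiler import SourceModule

mod = SourceModule("""
__global__ void row_sum(const float *X, float *out, int n_rows, int n_cols)
{
    /* One thread reduces one entire row sequentially, like a CPU loop. */
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n_rows) return;
    float acc = 0.0f;
    for (int j = 0; j < n_cols; ++j)
        acc += X[row * n_cols + j];
    out[row] = acc;
}
""")
row_sum = mod.get_function("row_sum")

n_rows, n_cols = 10000, 1000
X = gpuarray.to_gpu(
    np.random.uniform(0, 100, (n_rows, n_cols)).astype(np.float32))
out = gpuarray.empty((n_rows,), dtype=np.float32)

threads = 256
blocks = (n_rows + threads - 1) // threads
row_sum(X.gpudata, out.gpudata, np.int32(n_rows), np.int32(n_cols),
        block=(threads, 1, 1), grid=(blocks, 1))

# Check the GPU result against NumPy.
assert np.allclose(out.get(), X.get().sum(axis=1), rtol=1e-3)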
>>>>> On Mon, Feb 6, 2017 at 8:11 PM Kiuhnm Mnhuik <[email protected]> wrote:
>>>>>
>>>>>> Reductions are quite slow. Without the final reduction I get a 100x
>>>>>> speed-up.
>>>>>> Why is Y.sum(axis=1) so slow? I think that if each core handled a
>>>>>> single row, it would be 10 times faster for matrices with many rows,
>>>>>> like in this case.
>>>>>> Theano is probably using an O(log n) algorithm, which is only useful
>>>>>> when one needs to reduce a single, long vector.
>>>>>> Can you confirm?
>>>>>>
>>>>>> On Tuesday, February 7, 2017 at 12:37:02 AM UTC+1, Kiuhnm Mnhuik wrote:
>>>>>>>
>>>>>>> I tried the following code:
>>>>>>>
>>>>>>> import numpy as np
>>>>>>> import theano
>>>>>>>
>>>>>>> floatX = theano.config.floatX
>>>>>>>
>>>>>>> def test_speed():
>>>>>>>     print('Computing X and X2...', end='', flush=True)
>>>>>>>     X_np = np.random.uniform(0, 100, size=(10000, 1000)).astype(floatX)
>>>>>>>     X2_np = np.random.uniform(0, 100, size=(10000, 1000)).astype(floatX)
>>>>>>>     print('done!', flush=True)
>>>>>>>
>>>>>>>     print('Moving X and X2 to the GPU...', end='', flush=True)
>>>>>>>     X = theano.shared(X_np)
>>>>>>>     X2 = theano.shared(X2_np)
>>>>>>>     print('done!', flush=True)
>>>>>>>
>>>>>>>     print('Building the graph...', end='', flush=True)
>>>>>>>     Y = X
>>>>>>>     for _ in range(100):
>>>>>>>         # Y = Y * (Y <= X2)
>>>>>>>         Y = Y * (Y - X2)
>>>>>>>     # Assign the result so the reduction is part of the compiled graph.
>>>>>>>     Y = Y.sum(axis=1)
>>>>>>>     print('done!', flush=True)
>>>>>>>
>>>>>>>     print('compiling...', end='', flush=True)
>>>>>>>     f = theano.function([], Y)
>>>>>>>     print('done!', flush=True)
>>>>>>>
>>>>>>>     import time
>>>>>>>     t = time.clock()
>>>>>>>     f()
>>>>>>>     print(time.clock() - t)
>>>>>>>
>>>>>>> Note that there is a line with '<=' and another with '-' in the loop.
>>>>>>> They're exclusive. Here are the timings in seconds:
>>>>>>>
>>>>>>>        CPU     GPU
>>>>>>> '-'    0.21    0.016
>>>>>>> '<='   0.39    0.019
>>>>>>>
>>>>>>> I'd say I don't need to worry about using comparisons.
>>>>>>>
>>>>>>> On Monday, February 6, 2017 at 1:20:13 PM UTC+1, Kiuhnm Mnhuik wrote:
>>>>>>>>
>>>>>>>> I'm using Theano 0.9.0b1 with the new back-end.
>>>>>>>> Should I use float32 for everything (even for bool masks) for
>>>>>>>> maximum speed on the GPU (GTX 970)?
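As a quick sanity check on the bool-mask question above, here is a minimal
sketch (variable names are illustrative; it assumes floatX=float32): in
Theano, comparison ops return int8 tensors, and multiplying a float32 tensor
by the int8 mask upcasts the result back to float32, so an explicit cast is
optional.

import theano
import theano.tensor as T

X = T.matrix('X')                  # dtype is floatX (float32 here)
X2 = T.matrix('X2')

mask = X <= X2                     # comparison ops yield an int8 tensor
masked = X * mask                  # float32 * int8 upcasts to float32
masked_cast = X * T.cast(mask, theano.config.floatX)   # explicit-cast variant

print(mask.dtype, masked.dtype, masked_cast.dtype)     # int8 float32 float32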
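Finally, a sketch of how the elementwise op and the reduction might be timed
in isolation (assuming device=cuda and floatX=float32; names are
illustrative). Note that each call's time includes copying the output back to
the host, which forces the GPU work to finish before the clock stops, but
also means the elementwise case transfers a full matrix while the reduction
only transfers a vector, so the comparison is not perfectly apples-to-apples.

import time
import numpy as np
import theano

floatX = theano.config.floatX
X = theano.shared(np.random.uniform(0, 100, (10000, 1000)).astype(floatX))
X2 = theano.shared(np.random.uniform(0, 100, (10000, 1000)).astype(floatX))

f_add = theano.function([], X + X2)           # elementwise op
f_sum = theano.function([], X.sum(axis=1))    # row-wise reduction

def bench(f, n=100):
    f()                                       # warm-up call
    t = time.perf_counter()
    for _ in range(n):
        f()                                   # host transfer syncs the GPU
    return (time.perf_counter() - t) / n

print('X + X2       :', bench(f_add))
print('X.sum(axis=1):', bench(f_sum))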
