Hi there, 

So I've recently started getting into Theano, and wanted to try taking my 
existing code and putting it on a GPU. I tried the simple example on the 
website and got similar improvements to those quoted, so I was quite hopeful 
going in! (The example I mean is essentially the exp() benchmark from the 
"Using the GPU" tutorial page; I've pasted a rough version below for 
reference.) The code I'm actually evaluating follows that, hopefully 
commented sufficiently to make it clear what's going on.
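
For reference then, the website example roughly as I ran it (reproduced from 
memory, so treat details as approximate):

from theano import function, config, shared
import theano.tensor as T
import numpy
import time

vlen = 10 * 30 * 768   # roughly the vector size the tutorial uses
iters = 1000

rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f = function([], T.exp(x))          # elementwise exp over a large shared vector

t0 = time.time()
for i in range(iters):
    r = f()
print "Looping %d times took %f seconds" % (iters, time.time() - t0)

And my own code: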

amps   = tt.vector('amps', dtype=theano.config.floatX)
offs   = tt.vector('offs', dtype=theano.config.floatX)
sigs   = tt.vector('sigs', dtype=theano.config.floatX)
phase  = tt.scalar('phase', dtype=theano.config.floatX)

#TFlatTimes is a float32 shared vector of length 1024*useToAs that contains the
#observed times of the useToAs light curves, each of which is sampled with 1024
#bins. useToAs is set to 100 in this case but will eventually be tens of thousands.

#ReferencePeriod is a float32 shared scalar
#Tg1width, Tg2amp, Tg2width are float32 shared scalars that define a double
#Gaussian model
#phase is a single free parameter that defines where to evaluate the Gaussian
#model jointly for each light curve
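
For completeness, these shared variables get set up along the following lines 
(the real values come from my data; everything here is a placeholder):

import numpy as np
import theano
import theano.tensor as tt

useToAs = 100                                  # number of light curves
Nbins = 1024*np.ones(useToAs)                  # bins per curve

#double-Gaussian shape parameters (placeholder values)
ReferencePeriod = theano.shared(np.float32(0.005757))
Tg1width = theano.shared(np.float32(1.0e-4))
Tg2amp   = theano.shared(np.float32(0.5))
Tg2width = theano.shared(np.float32(1.0e-4))
gsep     = np.float32(1.0e-3)                  # separation of the two Gaussians

#observed times and data, flattened into one long float32 vector (placeholders)
TFlatTimes = theano.shared(np.random.rand(1024*useToAs).astype(np.float32))
TFlatData  = theano.shared(np.random.randn(1024*useToAs).astype(np.float32))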


#first shift TFlatTimes by phase, then wrap into [-ReferencePeriod/2,
#+ReferencePeriod/2) and store as x. Then evaluate the first Gaussian as y.
#repeat for the position of the second Gaussian and evaluate it as y2

x = (TFlatTimes - phase + ReferencePeriod/2) % ReferencePeriod - ReferencePeriod/2
y = tt.exp(-0.5*x**2/Tg1width**2)
x2 = (TFlatTimes - phase - gsep + ReferencePeriod/2) % ReferencePeriod - ReferencePeriod/2
y2 = Tg2amp*tt.exp(-0.5*x2**2/Tg2width**2)
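
(As a quick sanity check of the wrapping trick, with ReferencePeriod = 1.0 and 
phase = 0, a time of 0.75 folds to -0.25:

>>> (0.75 - 0.0 + 0.5) % 1.0 - 0.5
-0.25

i.e. everything lands in [-ReferencePeriod/2, +ReferencePeriod/2) as intended.)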

#AmpVec, OffVec, and SigVec contain the overall amplitude of each curve, an
#offset, and the noise level. Each is 1024*useToAs in length and consists of a
#single number repeated 1024 times (i.e. amps[0] 1024 times, then amps[1] 1024
#times, etc.)


AmpVec = theano.tensor.extra_ops.repeat(amps, 1024)
OffVec = theano.tensor.extra_ops.repeat(offs, 1024)
SigVec = theano.tensor.extra_ops.repeat(sigs, 1024)
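
(The layout this produces is the obvious blockwise one; the numpy equivalent 
makes it concrete:

import numpy as np
print np.repeat(np.array([1.0, 2.0], dtype=np.float32), 3)
# [ 1.  1.  1.  2.  2.  2.]

so element k of amps covers samples 1024*k through 1024*(k+1) - 1.)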

#Nbins is the per-curve bin count (1024 for each curve); store it as a shared int vector
Nbins = Nbins.astype(int)
TNbins = theano.shared(Nbins)

#construct the final signal vector: the sum of the two Gaussians multiplied by
#the overall amplitude for that curve, plus the offset

s = AmpVec*(y+y2) + OffVec


#calculate log likelihood

like = 0.5*tt.sum(((TFlatData - s)/SigVec)**2) + 0.5*tt.sum(TNbins[:useToAs]*tt.log(sigs**2))
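
(Written out, this is the negative log-likelihood up to an additive constant,

-\log L = \frac{1}{2} \sum_i \left( \frac{d_i - s_i}{\sigma_i} \right)^2
        + \frac{1}{2} \sum_k N_k \log \sigma_k^2

where d is TFlatData, s the signal vector, sigma_i the per-sample noise from 
SigVec, and N_k = 1024 the number of bins in curve k.)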


#calculate gradient with respect to the parameters


glike = tt.grad(like, [phase, amps, offs, sigs])


#define functions to return likelihood, gradient, and the signal vector

getS = theano.function([phase, amps, offs], s)
getX = theano.function([phase, amps, offs, sigs], like)    
getG = theano.function([phase, amps, offs, sigs], glike)


#Wrap these in a single function that is passed vectors of parameters
def TheanoFunc2(phaseval, ampvec, offvec, sigvec):
    l = getX(phaseval, ampvec, offvec, sigvec)*1   # the *1 just forces a copy of the result
    g = getG(phaseval, ampvec, offvec, sigvec)
    return l, g


I then wanted to test this by evaluating TheanoFunc2 20000 times using 
random numbers as the input:


pval = np.float32(0.00288206)
Tpval = theano.shared(pval)


ltot = 0

#define random number functions

from theano.sandbox.rng_mrg import MRG_RandomStreams as RandomStreams

theano_rng = RandomStreams(189)


avals = theano.function([], theano_rng.normal(size=(useToAs,), avg=0.0, std=1.0, dtype=theano.config.floatX))
ovals = theano.function([], theano_rng.normal(size=(useToAs,), avg=0.0, std=1.0, dtype=theano.config.floatX))
nvals = theano.function([], theano_rng.normal(size=(useToAs,), avg=0.0, std=1.0, dtype=theano.config.floatX)**2)
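
(Each of these returns a fresh host-side numpy array per call, e.g.

a = avals()
print a.shape, a.dtype    # (100,) float32

so new random draws are generated for every one of the 20000 iterations below.)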


start = time.clock()


for i in range(20000):
    if i % 100 == 0:
        print i

    l, g = TheanoFunc2(pval, avals(), ovals(), nvals())
    ltot += l

end = time.clock()

print "time", end - start



I then timed this on the CPU and on the GPU using: 

setenv THEANO_FLAGS 'mode=FAST_RUN,device=cpu,floatX=float32' 

and 

setenv THEANO_FLAGS 'mode=FAST_RUN,device=gpu,floatX=float32'


and got times of 469.33s on the CPU and 561.29s on the GPU, i.e. the GPU run 
is actually slower.


Unfortunately I have no idea why the GPU version comes out slower. Is there 
any way to see how much/when data is being copied to and from the GPU? In 
principle all I need to do is copy my initial vector of parameters to the 
GPU and then return the likelihood and gradient; everything else can be 
created and kept on the GPU.
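
(The closest thing I've found so far is the built-in profiler; I assume 
running with something like

setenv THEANO_FLAGS 'mode=FAST_RUN,device=gpu,floatX=float32,profile=True'

should print a per-op timing breakdown at exit, where time spent in the 
GpuFromHost/HostFromGpu ops would presumably correspond to the transfers, 
but I haven't managed to pin it down yet.)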

If anyone is able to look through this and shed some light, I would greatly 
appreciate it!

Thanks
