Hi Wonghang,
Good news. I thought about the issue you pointed out: scan runs serially,
so a truly parallel GPU implementation would have to be written. After
looking at some CUDA tutorials, I was able to get Cholesky to run as a
batch, with each batch sample running on an individual core.
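
As a rough illustration of the batching idea (a plain-Python sketch, not the
CUDA code described in this message), each matrix in the batch is factored
independently, which is why one sample can be handed to one core. The
function names `cholesky_lower` and `batched_cholesky` are made up for this
example and are not part of Magma or Theano.

```python
import math


def cholesky_lower(a):
    """Return lower-triangular L with L * L^T == a for one SPD matrix."""
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # Diagonal entry: subtract the squares already placed in row j.
        d = a[j][j] - sum(L[j][k] ** 2 for k in range(j))
        if d <= 0.0:
            raise ValueError("matrix is not positive definite")
        L[j][j] = math.sqrt(d)
        # Entries below the diagonal in column j.
        for i in range(j + 1, n):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = (a[i][j] - s) / L[j][j]
    return L


def batched_cholesky(batch):
    # Each sample is independent of the others. On a GPU this loop is the
    # part that would be replaced by one thread (or core) per sample.
    return [cholesky_lower(a) for a in batch]


if __name__ == "__main__":
    batch = [
        [[4.0, 2.0], [2.0, 5.0]],
        [[9.0, 3.0], [3.0, 5.0]],
    ]
    for L in batched_cholesky(batch):
        print(L)
```

Because no sample reads another sample's data, the serial loop in
`batched_cholesky` can be replaced by parallel execution without changing
the per-matrix routine.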

To interface it to Theano as a new class, I had to add it as a new function
to Magma and link it into libmagma.so. It is working, produces the correct
answer, and is even faster. I don't know yet how much faster, because I have
not yet excluded the back-substitution time from the measurement, but the
overall execution time went from 2.9 sec to 2.0 sec. Once I isolate just the
Cholesky routine, I think the speedup will be huge. And once I get
back-substitution parallelized as well, solving the symmetric matrix problem
should be super-fast. I think that the method I used (linking the CUDA code
into libmagma.so) is just temporary. Once I get some routines finished, it
should be moved to a separate library.
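
For reference, the solve step that follows the factorization can be sketched
as below. This is a minimal plain-Python illustration under the assumption
that the lower-triangular factor L of A = L * L^T is already available; it is
not the Magma or CUDA routine, and the names `forward_substitution`,
`back_substitution_transposed`, and `cholesky_solve` are hypothetical.

```python
def forward_substitution(L, b):
    """Solve L y = b for y, where L is lower triangular."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        s = sum(L[i][k] * y[k] for k in range(i))
        y[i] = (b[i] - s) / L[i][i]
    return y


def back_substitution_transposed(L, y):
    """Solve L^T x = y for x, reading L^T[i][k] as L[k][i]."""
    n = len(y)
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(L[k][i] * x[k] for k in range(i + 1, n))
        x[i] = (y[i] - s) / L[i][i]
    return x


def cholesky_solve(L, b):
    # A x = b with A = L L^T becomes two triangular solves:
    # first L y = b, then L^T x = y.
    return back_substitution_transposed(L, forward_substitution(L, b))


if __name__ == "__main__":
    # L is the Cholesky factor of A = [[4, 2], [2, 5]].
    L = [[2.0, 0.0], [1.0, 2.0]]
    print(cholesky_solve(L, [8.0, 12.0]))
```

Within one system the two triangular solves are sequential in the row index,
but separate systems in a batch do not depend on each other, so they can be
solved side by side in the same way as the factorization.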

I hope to get some help with that, because my knowledge of the inner
workings of Theano is limited.

Let me know if you (or anyone else) are interested in this project.

Paul

