Hi Wonghang,
Good news. I thought about the issue you pointed out: scan runs serially,
so a truly parallel GPU implementation would be necessary. After looking
at some CUDA tutorials, I was able to get Cholesky to run as a batch, with
each sample in the batch running on an individual core.
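
To make the batching idea concrete, here is a minimal sketch in plain
Python. It is not the CUDA code itself: the names cholesky_one and
cholesky_batch are made up for illustration, and matrices are plain lists
of lists. The point is only that each sample is factored independently, so
the loop over the batch is what a GPU can spread across cores.

```python
import math

def cholesky_one(a):
    """Factor one symmetric positive-definite matrix a as L * L^T.

    Returns the lower-triangular factor L as a list of lists.
    """
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # Diagonal entry: subtract the squares already placed in row j.
        d = a[j][j] - sum(L[j][k] ** 2 for k in range(j))
        if d <= 0.0:
            raise ValueError("matrix is not positive definite")
        L[j][j] = math.sqrt(d)
        # Entries below the diagonal in column j.
        for i in range(j + 1, n):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = (a[i][j] - s) / L[j][j]
    return L

def cholesky_batch(batch):
    # No iteration depends on any other. In a GPU version (an assumption
    # about the layout, not the actual kernel) the index into `batch`
    # would be the thread index instead of a serial loop counter.
    return [cholesky_one(a) for a in batch]
```

For example, cholesky_batch([[[4.0, 2.0], [2.0, 2.0]]]) gives one factor,
[[2.0, 0.0], [1.0, 1.0]].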
To interface it to Theano as a new class, I had to add it as a new
function to Magma and link it into libmagma.so. It works, produces the
correct answer, and is even faster. I don't know yet how much faster,
because I have not yet separated out the time spent in back-substitution,
but the total execution time went from 2.9 sec to 2.0 sec. Once I isolate
just the Cholesky routine, I think the speedup will be huge. And once I
get back-substitution parallelized as well, solving the symmetric matrix
problem should be very fast. I think the method I used (linking the CUDA
code into libmagma.so) is just temporary. Once I get some routines
finished, the code should be moved to a separate library.
I hope to get some help with that, because my knowledge of the inner
workings of Theano is limited.
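
For reference, the back-substitution step mentioned above can be sketched
the same way. This is again plain Python rather than the CUDA code, with
hypothetical names (solve_with_cholesky, solve_batch). Given the Cholesky
factor L of a symmetric positive-definite A, it solves A x = b in two
triangular passes, and each sample in the batch is again independent.

```python
def solve_with_cholesky(L, b):
    """Solve (L * L^T) x = b given the lower-triangular factor L."""
    n = len(L)
    # Forward substitution: solve L y = b from the top row down.
    y = [0.0] * n
    for i in range(n):
        s = sum(L[i][k] * y[k] for k in range(i))
        y[i] = (b[i] - s) / L[i][i]
    # Back substitution: solve L^T x = y from the bottom row up.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(L[k][i] * x[k] for k in range(i + 1, n))
        x[i] = (y[i] - s) / L[i][i]
    return x

def solve_batch(factors, rhs):
    # One independent solve per sample; like the factorization, this
    # outer loop is the part that could be mapped to parallel threads.
    return [solve_with_cholesky(L, b) for L, b in zip(factors, rhs)]
```

For example, with L = [[2.0, 0.0], [1.0, 1.0]] (the factor of
[[4, 2], [2, 2]]) and b = [8.0, 6.0], solve_with_cholesky returns
[1.0, 2.0].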
Let me know if you (or anyone else) are interested in this project.
Paul