Hello all,

I fixed some bugs in my pycudafft module and added PyOpenCL support,
so it is now called simply pyfft (which sort of resolves the question
of including it in the PyCUDA distribution).

At the moment, the most annoying (to me, at least) things are:
1. OpenCL performance tests show speeds up to 6 times slower than
CUDA. Unfortunately, I still can't find the reason.
(Interestingly, PyOpenCL is still noticeably faster than Apple's
original C program with the same FFT algorithm.)
2. I tried to support different ways of using plans, including
precreated contexts, streams/queues, and asynchronous execution. This
resulted in a rather messy interface. Any suggestions for making it
clearer are welcome.
3. Currently, the only criterion for choosing a kernel's block size is
the maximum allowed by the number of registers used. The resulting
occupancy of the CUDA kernels is 0.25-0.33 most of the time. But when
I try to recompile the kernels with different block sizes in order to
maximize occupancy, it actually makes them slower.
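
For what it's worth, the occupancy numbers in item 3 can be reproduced with a rough back-of-the-envelope model in the spirit of NVIDIA's occupancy calculator. This is only a sketch: the per-SM limits below (8192 registers, 24 resident warps, 8 resident blocks, 32 threads per warp) are my assumptions for a compute 1.x card, and the register counts are illustrative, not values taken from the pyfft kernels.

```python
# Assumed per-SM limits for a compute 1.x device (not from pyfft itself).
REGS_PER_SM = 8192
MAX_WARPS_PER_SM = 24
MAX_BLOCKS_PER_SM = 8
WARP_SIZE = 32

def occupancy(block_size, regs_per_thread):
    """Fraction of the SM's warp slots kept busy by this configuration."""
    warps_per_block = -(-block_size // WARP_SIZE)  # ceiling division
    regs_per_block = regs_per_thread * warps_per_block * WARP_SIZE
    blocks_per_sm = min(
        REGS_PER_SM // regs_per_block,        # limited by registers
        MAX_WARPS_PER_SM // warps_per_block,  # limited by warp slots
        MAX_BLOCKS_PER_SM,                    # limited by block slots
    )
    return blocks_per_sm * warps_per_block / MAX_WARPS_PER_SM

# A kernel using 32 registers/thread at block size 64 lands at ~0.33,
# and 40 registers/thread at 0.25 -- the range mentioned above.
print(occupancy(64, 32))  # 0.333...
print(occupancy(64, 40))  # 0.25
```

The point of the sketch: once the register file is the binding limit, raising the block size just shrinks the number of resident blocks, so the occupancy barely moves, which may be why retuning block sizes alone doesn't help.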

Best regards,
Bogdan

_______________________________________________
PyCUDA mailing list
pyc...@host304.hostmonster.com
http://host304.hostmonster.com/mailman/listinfo/pycuda_tiker.net
