Hello all, I fixed some bugs in my pycudafft module and added PyOpenCL support, so it is now called simply pyfft (which sort of resolves the question of including it in the PyCUDA distribution).
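As background for the occupancy question below, here is a rough, pure-Python sketch of how register pressure caps the number of resident blocks and hence occupancy. The hardware limits used as defaults (16384 registers, 1024 threads, and 8 blocks per multiprocessor, warp size 32) are assumptions roughly matching a compute-capability-1.3 GPU, not values taken from pyfft; the model is a simplification of CUDA's actual occupancy calculator.

```python
def occupancy(block_size, regs_per_thread,
              regs_per_sm=16384, max_threads=1024,
              max_blocks=8, warp_size=32):
    """Fraction of the SM's warp slots occupied by resident blocks.

    Simplified model: ignores shared-memory limits and register
    allocation granularity; hardware numbers above are assumptions.
    """
    if block_size * regs_per_thread > regs_per_sm:
        return 0.0  # a single block already exceeds the register file
    blocks_by_regs = regs_per_sm // (block_size * regs_per_thread)
    blocks_by_threads = max_threads // block_size
    resident = min(blocks_by_regs, blocks_by_threads, max_blocks)
    warps_per_block = -(-block_size // warp_size)  # ceiling division
    max_warps = max_threads // warp_size
    return min(1.0, resident * warps_per_block / max_warps)

# A register-heavy kernel (hypothetical numbers): 64 threads per block
# at 64 registers per thread leaves only 4 resident blocks.
print(occupancy(64, 64))   # -> 0.25
# A lighter kernel can fill the machine:
print(occupancy(256, 16))  # -> 1.0
```

In a model like this, shrinking the block size frees registers but also reduces warps per block, which is one way a higher-occupancy configuration can still end up slower in practice.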
At the moment, the most annoying things (to me, at least) are:

1. The OpenCL performance tests run up to 6 times slower than the CUDA ones. Unfortunately, I still can't find the reason. (Interestingly, the PyOpenCL version is still noticeably faster than Apple's original C program using the same FFT algorithm.)

2. I tried to support different ways of using plans, including pre-created contexts, streams/queues, and asynchronous execution. This resulted in a rather messy interface. Any suggestions for making it clearer are welcome.

3. Currently, the only criterion for a kernel's block size is the maximum allowed by the number of registers it uses. The resulting occupancy of the CUDA kernels is 0.25-0.33 most of the time, but when I recompile the kernels with different block sizes to search for maximum occupancy, they actually get slower.

Best regards,
Bogdan

_______________________________________________
PyCUDA mailing list
pyc...@host304.hostmonster.com
http://host304.hostmonster.com/mailman/listinfo/pycuda_tiker.net