Re: [PyCUDA] Pycudafft becomes Pyfft
Hello Imran,

kernel.py requires patching too:

- from .kernel_helpers import *
+ from .kernel_helpers import log2, getRadixArray, getGlobalRadixInfo, getPadding, getSharedMemorySize

I hope this will be enough. Sorry for the inconvenience; I'm going to commit the fix to the repository. I need to add a version check too, because there will definitely be other bugs on Python 2.4, which is still used by some Linux distros.

Best regards,
Bogdan

On Thu, Mar 25, 2010 at 11:36 AM, Bogdan Opanchuk manti...@gmail.com wrote:

> Hello Imran,
>
> I tested it only on 2.6, so that may be the case. Thanks for the bug
> report, though - this sort of compatibility is easy to add. Can you
> please put
>
>     from .kernel import GlobalFFTKernel, LocalFFTKernel, X_DIRECTION, Y_DIRECTION, Z_DIRECTION
>
> instead of that line?
>
> Best regards,
> Bogdan
>
> On Thu, Mar 25, 2010 at 11:19 AM, Imran Haque iha...@stanford.edu wrote:
>
>> Didn't work - does it require something newer than Python 2.5?
>>
>> $ python test_performance.py
>> Running performance tests...
>> Traceback (most recent call last):
>>   File "test_performance.py", line 57, in <module>
>>     run(isCudaAvailable(), isCLAvailable(), DEFAULT_BUFFER_SIZE)
>>   File "test_performance.py", line 52, in run
>>     testPerformance(ctx, shape, buffer_size)
>>   File "test_performance.py", line 22, in testPerformance
>>     plan = ctx.getPlan(shape, context=ctx.context, wait_for_finish=True)
>>   File "/home/ihaque/pyfft-0.3/pyfft_test/helpers.py", line 116, in getPlan
>>     import pyfft.cl
>>   File "/usr/lib/python2.5/site-packages/pyfft-0.3-py2.5.egg/pyfft/cl.py", line 9, in <module>
>>     from .plan import FFTPlan
>>   File "/usr/lib/python2.5/site-packages/pyfft-0.3-py2.5.egg/pyfft/plan.py", line 3
>>     from .kernel import *
>> SyntaxError: 'import *' not allowed with 'from .'
>>
>> Bogdan Opanchuk wrote:
>>
>>> Hello Imran,
>>>
>>> (Sorry, forgot to add the mailing list to CC.)
>>>
>>> Thank you for the prompt reply - results from a 5870 are interesting
>>> too. If you have pyopencl installed, just run test_performance.py from
>>> the pyfft_test folder in the pyfft package. It will print the results
>>> to stdout.
>>>
>>> Best regards,
>>> Bogdan
>>>
>>> On Thu, Mar 25, 2010 at 11:11 AM, Imran Haque iha...@stanford.edu wrote:
>>>
>>>> Hi Bogdan,
>>>>
>>>> I have access to a Radeon 5870, but it's installed in a slow host
>>>> machine (a 2.8 GHz dual-core Pentium 4). If this is still useful, I
>>>> could run a test for you if you can send along a quick test case.
>>>>
>>>> Cheers,
>>>> Imran
>>>>
>>>> Bogdan Opanchuk wrote:
>>>>
>>>>> By the way, if it is not too much to ask: if anybody has access to
>>>>> an ATI 59** series card and/or a GTX 295, could you please run the
>>>>> performance tests from the module (pyfft_test/test_performance.py)
>>>>> and post the results here? I suspect that the poor OpenCL
>>>>> performance may be (partially) caused by nVidia drivers. Thank you
>>>>> in advance.
>>>>>
>>>>> On Sat, Mar 20, 2010 at 10:36 PM, Bogdan Opanchuk manti...@gmail.com wrote:
>>>>>
>>>>>> Hello all,
>>>>>>
>>>>>> I fixed some bugs in my pycudafft module and added PyOpenCL
>>>>>> support, so it is now called just pyfft (which sort of resolves the
>>>>>> question about including it in the PyCUDA distribution). At the
>>>>>> moment, the most annoying things (to me, at least) are:
>>>>>>
>>>>>> 1. OpenCL performance tests show up to 6 times slower speed
>>>>>> compared to CUDA. Unfortunately, I still can't find the reason.
>>>>>> (The interesting thing is that PyOpenCL is still noticeably faster
>>>>>> than Apple's original C program with the same FFT algorithm.)
>>>>>> 2. I tried to support different ways of using plans, including
>>>>>> pre-created contexts, streams/queues, and asynchronous execution.
>>>>>> This resulted in quite a messy interface. Any suggestions for
>>>>>> making it clearer are welcome.
>>>>>> 3. Currently, the only criterion for a kernel's block size is the
>>>>>> maximum allowed by the number of registers used. The resulting
>>>>>> occupancy of the CUDA kernels is 0.25-0.33 most of the time, but
>>>>>> when I try to recompile kernels with different block sizes to find
>>>>>> the maximum occupancy, the kernels get even slower.
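Bogdan mentions adding a version check so that Python 2.4/2.5 users get a clear error instead of a SyntaxError deep inside the package (the `from .kernel import *` and `except ... as e` forms both require 2.6+). A minimal sketch of such a guard - this is a hypothetical illustration, not pyfft's actual code:

```python
import sys

# Both "from . import *" and "except E as e" need Python 2.6+,
# so refuse to import on older interpreters with a clear message
# (hypothetical guard, not pyfft's actual code).
MINIMUM_VERSION = (2, 6)

def check_version(version_info=None):
    """Return True if the interpreter is new enough for the package."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= MINIMUM_VERSION

if not check_version():
    raise ImportError(
        "pyfft requires Python %d.%d or newer" % MINIMUM_VERSION)
```

Raising `ImportError` at the top of the package's `__init__` means the failure happens at import time with a readable message, rather than as a confusing `SyntaxError` pointing into an egg path.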
Best regards,
Bogdan

___
PyCUDA mailing list
pyc...@host304.hostmonster.com
http://host304.hostmonster.com/mailman/listinfo/pycuda_tiker.net
Re: [PyCUDA] Pycudafft becomes Pyfft
Hi Imran,

Thank you for the info - I'll fix the code; Python 2.5 is still widely
used. As for the ATI drivers, I thought the latest release version of
Stream (2.01) supports OpenCL. I wonder if the terrible performance
(these tests run faster on my GF9600) and this deadlock issue are really
caused by the drivers you use... I was actually going to order a server
with an ATI GPU for my simulations (because of their advertised Gflops
numbers for both single and double precision), but I am starting to
reconsider that decision now.

Best regards,
Bogdan

On Thu, Mar 25, 2010 at 12:13 PM, Imran Haque iha...@stanford.edu wrote:

> Hi Bogdan,
>
> I also had to do the following to get the test to run:
>
> - kernel.py:45: change "except AssertionError as e:" to "except AssertionError:"
> - plan.py:4: add getRadixArray to the import list from .kernel_helpers
>
> I was able to get the following pair of results, but then the test
> hung. The machine has prerelease ATI drivers installed, so that might
> be the issue. However, in my own work I've also encountered cases with
> code that is formally incorrect (e.g., barriers that are not uniformly
> executed) on which the Nvidia runtime does not deadlock but the ATI
> runtime does, so it might be worth checking whether you have any
> situations like that.
>
> $ python test_performance.py
> Running performance tests...
> * cl, (16,), batch 131072: 1.85770988464 ms, 22.5778203296 GFLOPS
> * cl, (1024,), batch 2048: 13.0976915359 ms, 8.00580771903 GFLOPS
>
> Cheers,
> Imran
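GFLOPS figures in FFT benchmark reports like Imran's are conventionally derived from the 5·N·log2(N) flop estimate for a complex transform. Whether pyfft uses exactly this formula is an assumption on my part, but it reproduces both numbers above. A minimal sketch:

```python
import math

def fft_gflops(shape, batch, time_ms):
    """Estimate FFT throughput from the conventional 5*N*log2(N)
    flop count for a complex transform (assumption - the exact
    formula pyfft uses is not shown in the thread)."""
    n = 1
    for dim in shape:
        n *= dim
    flops = 5.0 * n * math.log(n, 2) * batch  # flops for the whole batch
    return flops / (time_ms * 1e-3) / 1e9     # elapsed seconds -> GFLOPS

# Reproduces the figures from the report above:
print(fft_gflops((16,), 131072, 1.85770988464))   # ~22.58
print(fft_gflops((1024,), 2048, 13.0976915359))   # ~8.01
```

The fact that the small-transform case (N=16) reaches much higher throughput than the N=1024 case is consistent with the large batch keeping the device saturated with purely shared-memory (local) kernels.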
[PyCUDA] Pycudafft becomes Pyfft
Hello all,

I fixed some bugs in my pycudafft module and added PyOpenCL support, so
it is now called just pyfft (which sort of resolves the question about
including it in the PyCUDA distribution). At the moment, the most
annoying things (to me, at least) are:

1. OpenCL performance tests show up to 6 times slower speed compared to
CUDA. Unfortunately, I still can't find the reason. (The interesting
thing is that PyOpenCL is still noticeably faster than Apple's original
C program with the same FFT algorithm.)
2. I tried to support different ways of using plans, including
pre-created contexts, streams/queues, and asynchronous execution. This
resulted in quite a messy interface. Any suggestions for making it
clearer are welcome.
3. Currently, the only criterion for a kernel's block size is the
maximum allowed by the number of registers used. The resulting occupancy
of the CUDA kernels is 0.25-0.33 most of the time, but when I try to
recompile kernels with different block sizes to find the maximum
occupancy, the kernels get even slower.

Best regards,
Bogdan
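The occupancy range in point 3 can be sketched with simple arithmetic: when register pressure is the limit, the number of resident blocks per multiprocessor is the register file size divided by a block's register footprint. The sketch below assumes compute-capability 1.3 hardware limits (16384 registers and at most 32 warps and 8 blocks per SM) and ignores shared-memory limits; it is an illustration, not pyfft's actual block-size selection code:

```python
def register_limited_occupancy(threads_per_block, regs_per_thread,
                               regs_per_sm=16384, max_warps_per_sm=32,
                               max_blocks_per_sm=8, warp_size=32):
    """Estimate occupancy when registers are the limiting resource.

    Default limits are assumed compute-capability 1.3 values (e.g.
    GTX 2xx); other GPUs differ. Shared memory is ignored for
    simplicity, so this is an upper bound on the real occupancy.
    """
    regs_per_block = regs_per_thread * threads_per_block
    blocks = min(regs_per_sm // regs_per_block, max_blocks_per_sm)
    warps_per_block = (threads_per_block + warp_size - 1) // warp_size
    active_warps = min(blocks * warps_per_block, max_warps_per_sm)
    return active_warps / float(max_warps_per_sm)

# e.g. a 128-thread block at 50 registers/thread: only 2 blocks fit
# per SM, giving 8 active warps out of 32 -> occupancy 0.25
print(register_limited_occupancy(128, 50))
```

This also illustrates why forcing larger blocks for higher occupancy can backfire, as Bogdan observes: fewer resident blocks means less ability to overlap one block's synchronization stalls with another block's work, and spilled registers cost more than the extra warps gain.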