Re: [PyCUDA] Pycudafft becomes Pyfft

2010-03-24 Thread Bogdan Opanchuk
Hello Imran,

kernel.py requires patching too:
- from .kernel_helpers import *
+ from .kernel_helpers import log2, getRadixArray, getGlobalRadixInfo, getPadding, getSharedMemorySize

I hope this will be enough. Sorry for the inconvenience; I'm going to
commit the fix to the repository. I need to add a version check too,
because there will definitely be other bugs on Python 2.4, which is
still used by some Linux distros.
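
Something like this minimal sketch, placed for illustration at the top
of pyfft/__init__.py (the exact location, version bound and message are
just my current plan):

    import sys

    # Illustrative guard: pyfft relies on relative imports and other
    # constructs that need at least Python 2.5.
    if sys.version_info < (2, 5):
        raise ImportError("pyfft requires Python 2.5 or newer")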

Best regards,
Bogdan

On Thu, Mar 25, 2010 at 11:36 AM, Bogdan Opanchuk manti...@gmail.com wrote:
 Hello Imran,

 I tested it only on 2.6, so that may be the case. Thanks for the bug
 report, though; this sort of compatibility is easy to add. Could you
 please put "from .kernel import GlobalFFTKernel, LocalFFTKernel,
 X_DIRECTION, Y_DIRECTION, Z_DIRECTION" instead of this line?

 Best regards,
 Bogdan

 On Thu, Mar 25, 2010 at 11:19 AM, Imran Haque iha...@stanford.edu wrote:
 Didn't work - does it require something newer than Python 2.5?

 $ python test_performance.py
 Running performance tests...
 Traceback (most recent call last):
   File "test_performance.py", line 57, in <module>
     run(isCudaAvailable(), isCLAvailable(), DEFAULT_BUFFER_SIZE)
   File "test_performance.py", line 52, in run
     testPerformance(ctx, shape, buffer_size)
   File "test_performance.py", line 22, in testPerformance
     plan = ctx.getPlan(shape, context=ctx.context, wait_for_finish=True)
   File "/home/ihaque/pyfft-0.3/pyfft_test/helpers.py", line 116, in getPlan
     import pyfft.cl
   File "/usr/lib/python2.5/site-packages/pyfft-0.3-py2.5.egg/pyfft/cl.py", line 9, in <module>
     from .plan import FFTPlan
   File "/usr/lib/python2.5/site-packages/pyfft-0.3-py2.5.egg/pyfft/plan.py", line 3
     from .kernel import *
 SyntaxError: 'import *' not allowed with 'from .'


 Bogdan Opanchuk wrote:

 Hello Imran,

 (sorry, forgot to add maillist to CC)

 Thank you for the prompt reply; results from a 5870 are interesting
 too. If you have pyopencl installed, just run test_performance.py from
 the pyfft_test folder in the pyfft package. It will print the results
 to stdout.

 Best regards,
 Bogdan.

 On Thu, Mar 25, 2010 at 11:11 AM, Imran Haque iha...@stanford.edu wrote:


 Hi Bogdan,

 I have access to a Radeon 5870, but it's installed in a slow host
 machine (2.8 GHz dual-core Pentium 4). If this is still useful, I
 could run a test for you if you send along a quick test case.

 Cheers,

 Imran

 Bogdan Opanchuk wrote:


 By the way, if it is not too much to ask: if anybody has access to an
 ATI 59** series card and/or a GTX 295, could you please run the
 performance tests from the module (pyfft_test/test_performance.py) and
 post the results here? I suspect that the poor OpenCL performance may
 be (partially) caused by nVidia's drivers.

 Thank you in advance.

 On Sat, Mar 20, 2010 at 10:36 PM, Bogdan Opanchuk manti...@gmail.com
 wrote:

 Hello all,

 I fixed some bugs in my pycudafft module and added PyOpenCL support,
 so it is now called just pyfft (which sort of resolves the question
 about including it in the PyCuda distribution).

 At the moment, the most annoying (to me, at least) things are:
 1. OpenCL performance tests show speeds up to 6 times slower than
 Cuda. Unfortunately, I still can't find the reason. (The interesting
 thing is that PyOpenCL is still noticeably faster than Apple's
 original C program with the same FFT algorithm.)
 2. I tried to support different ways of using plans, including
 pre-created contexts, streams/queues and asynchronous execution. This
 resulted in a rather messy interface. Any suggestions for making it
 clearer are welcome.
 3. Currently, the only criterion for choosing a kernel's block size is
 the maximum allowed by the number of registers used. The resulting
 occupancy of the Cuda kernels is 0.25 - 0.33 most of the time, but
 when I try to recompile kernels with different block sizes to maximize
 occupancy, the kernels get even slower.

 Best regards,
 Bogdan

Re: [PyCUDA] Pycudafft becomes Pyfft

2010-03-24 Thread Bogdan Opanchuk
Hi Imran,

Thank you for the info; I'll fix the code - Python 2.5 is still widely
used. As for the ATI drivers, I thought the latest release version of
Stream (2.01) supported OpenCL. I wonder whether the terrible
performance (these tests run faster on my GF9600) and this deadlock
issue are really caused by the drivers you are using... I was actually
going to order a server with an ATI GPU for my simulations (because of
their advertised GFLOPS numbers for both single and double precision),
but I am starting to reconsider that decision now.

Best regards,
Bogdan

On Thu, Mar 25, 2010 at 12:13 PM, Imran Haque iha...@stanford.edu wrote:
 Hi Bogdan,

 I also had to do the following to get the test to run:

   - kernel.py:45: change "except AssertionError as e:" to
     "except AssertionError:"
   - plan.py:4: add getRadixArray to the import list from .kernel_helpers
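
 For reference, the "except ... as e" form only became valid syntax in
 Python 2.6; on 2.5 you can either drop the binding (as above) or use
 the old 2.x-only spelling. A minimal stand-alone sketch, not pyfft
 code:

     try:
         assert False
     except AssertionError, e:   # 2.x-only syntax that still binds e
         print e

     try:
         assert False
     except AssertionError:      # works everywhere; nothing is bound
         pass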

 I was able to get the following pair of results, but then the test hung. The
 machine has prerelease ATI drivers installed, so that might be the issue.
 However, I've also encountered cases in my own work where code that is
 formally incorrect (e.g., barriers that are not uniformly executed)
 does not deadlock on the Nvidia runtime but does on the ATI runtime,
 so it might be worth checking whether you have any situations like
 that; see the sketch below.
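
 To make that pattern concrete, here is a hypothetical pyopencl snippet
 (not taken from pyfft) with a barrier inside a divergent branch.
 Building it is fine; running it is undefined behaviour that one
 runtime may tolerate and another may deadlock on:

     import pyopencl as cl

     src = """
     __kernel void bad_barrier(__global float *data)
     {
         int gid = get_global_id(0);
         if (gid < 32) {                   /* divergent branch */
             barrier(CLK_LOCAL_MEM_FENCE); /* not reached by all work-items */
             data[gid] *= 2.0f;
         }
     }
     """

     ctx = cl.create_some_context()
     prog = cl.Program(ctx, src).build()  # compiles fine; the bug is dynamic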

 $ python test_performance.py
 Running performance tests...
 * cl, (16,), batch 131072: 1.85770988464 ms, 22.5778203296 GFLOPS
 * cl, (1024,), batch 2048: 13.0976915359 ms, 8.00580771903 GFLOPS
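
 For reference, these GFLOPS figures match the usual 5*N*log2(N)
 flop-count convention for a complex FFT. A quick stand-alone check
 (my own throwaway helper, not part of the test script):

     import math

     def fft_gflops(n, batch, time_ms):
         # 5 * N * log2(N) floating-point operations per transform
         flops = 5.0 * n * math.log(n, 2) * batch
         return flops / (time_ms * 1e-3) / 1e9

     print fft_gflops(16, 131072, 1.85770988464)   # ~22.58, as above
     print fft_gflops(1024, 2048, 13.0976915359)   # ~8.01, as above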

 Cheers,

 Imran

 Bogdan Opanchuk wrote:

 Hello Imran,

 kernel.py requires patching too:
 - from .kernel_helpers import *
 + from .kernel_helpers import log2, getRadixArray, getGlobalRadixInfo, getPadding, getSharedMemorySize

 I hope this will be enough. Sorry for the inconvenience; I'm going to
 commit the fix to the repository. I need to add a version check too,
 because there will definitely be other bugs on Python 2.4, which is
 still used by some Linux distros.

 Best regards,
 Bogdan

 On Thu, Mar 25, 2010 at 11:36 AM, Bogdan Opanchuk manti...@gmail.com
 wrote:


 Hello Imran,

 I tested it only on 2.6, so that may be the case. Thanks for the bug
 report, though; this sort of compatibility is easy to add. Could you
 please put "from .kernel import GlobalFFTKernel, LocalFFTKernel,
 X_DIRECTION, Y_DIRECTION, Z_DIRECTION" instead of this line?

 Best regards,
 Bogdan

 On Thu, Mar 25, 2010 at 11:19 AM, Imran Haque iha...@stanford.edu
 wrote:


 Didn't work - does it require something newer than Python 2.5?

 $ python test_performance.py
 Running performance tests...
 Traceback (most recent call last):
   File "test_performance.py", line 57, in <module>
     run(isCudaAvailable(), isCLAvailable(), DEFAULT_BUFFER_SIZE)
   File "test_performance.py", line 52, in run
     testPerformance(ctx, shape, buffer_size)
   File "test_performance.py", line 22, in testPerformance
     plan = ctx.getPlan(shape, context=ctx.context, wait_for_finish=True)
   File "/home/ihaque/pyfft-0.3/pyfft_test/helpers.py", line 116, in getPlan
     import pyfft.cl
   File "/usr/lib/python2.5/site-packages/pyfft-0.3-py2.5.egg/pyfft/cl.py", line 9, in <module>
     from .plan import FFTPlan
   File "/usr/lib/python2.5/site-packages/pyfft-0.3-py2.5.egg/pyfft/plan.py", line 3
     from .kernel import *
 SyntaxError: 'import *' not allowed with 'from .'


 Bogdan Opanchuk wrote:


 Hello Imran,

 (sorry, forgot to add maillist to CC)

 Thank you for the prompt reply; results from a 5870 are interesting
 too. If you have pyopencl installed, just run test_performance.py from
 the pyfft_test folder in the pyfft package. It will print the results
 to stdout.

 Best regards,
 Bogdan.

 On Thu, Mar 25, 2010 at 11:11 AM, Imran Haque iha...@stanford.edu
 wrote:

 Hi Bogdan,

 I have access to a Radeon 5870, but it's installed in a slow host
 machine (2.8 GHz dual-core Pentium 4). If this is still useful, I
 could run a test for you if you send along a quick test case.

 Cheers,

 Imran

 Bogdan Opanchuk wrote:

 By the way, if it is not too much to ask: if anybody has access to an
 ATI 59** series card and/or a GTX 295, could you please run the
 performance tests from the module (pyfft_test/test_performance.py) and
 post the results here? I suspect that the poor OpenCL performance may
 be (partially) caused by nVidia's drivers.

 Thank you in advance.

 On Sat, Mar 20, 2010 at 10:36 PM, Bogdan Opanchuk
 manti...@gmail.com
 wrote:

 Hello all,

 I fixed some bugs in my pycudafft module and added PyOpenCL support,
 so it is now called just pyfft (which sort of resolves the question
 about including it in the PyCuda distribution).

 At the moment, the most annoying (to me, at least) things are:
 1. OpenCL performance tests show speeds up to 6 times slower than
 Cuda. Unfortunately, I still can't find the reason. (The interesting
 thing is that PyOpenCL is still noticeably faster than Apple's
 original C program with the same FFT algorithm.)
 2. I tried to support different ways of using plans, including
 pre-created contexts, streams/queues and asynchronous execution. This
 resulted in a rather messy interface. Any suggestions for making it
 clearer are welcome.

[PyCUDA] Pycudafft becomes Pyfft

2010-03-20 Thread Bogdan Opanchuk
Hello all,

I fixed some bugs in my pycudafft module and added PyOpenCL support,
so it is now called just pyfft (which sort of resolves the question
about including it in the PyCuda distribution).

At the moment, the most annoying (to me, at least) things are:
1. OpenCL performance tests show speeds up to 6 times slower than
Cuda. Unfortunately, I still can't find the reason. (The interesting
thing is that PyOpenCL is still noticeably faster than Apple's
original C program with the same FFT algorithm.)
2. I tried to support different ways of using plans, including
pre-created contexts, streams/queues and asynchronous execution. This
resulted in a rather messy interface (a usage sketch follows this
list). Any suggestions for making it clearer are welcome.
3. Currently, the only criterion for choosing a kernel's block size is
the maximum allowed by the number of registers used. The resulting
occupancy of the Cuda kernels is 0.25 - 0.33 most of the time, but
when I try to recompile kernels with different block sizes to maximize
occupancy, the kernels get even slower (a rough occupancy sketch
follows as well).
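
To illustrate point 2, a purely hypothetical usage sketch for the
OpenCL backend: FFTPlan and the context/wait_for_finish keywords appear
in the test code quoted in the replies, but the constructor shape and
the execute() call here are illustrative only, and the real signature
may differ:

    import numpy
    import pyopencl as cl
    from pyfft.cl import FFTPlan

    ctx = cl.create_some_context()

    # Hypothetical constructor call, mirroring getPlan() in helpers.py
    plan = FFTPlan((1024,), context=ctx, wait_for_finish=True)

    data = numpy.ones(1024, dtype=numpy.complex64)
    mf = cl.mem_flags
    buf = cl.Buffer(ctx, mf.READ_WRITE | mf.COPY_HOST_PTR, hostbuf=data)

    plan.execute(buf)                 # forward transform (assumed API)
    plan.execute(buf, inverse=True)   # and back

And for point 3, a rough occupancy model under stated assumptions (a
compute capability 1.1 part such as a GF9600: 8192 registers and 16 KB
of shared memory per multiprocessor, at most 24 resident warps;
register allocation granularity and block-count limits are ignored):

    def occupancy(threads_per_block, regs_per_thread, smem_per_block):
        # Rough model only; real hardware rounds register allocation.
        regs_per_sm, smem_per_sm, max_warps = 8192, 16384, 24
        blocks_by_regs = regs_per_sm // (regs_per_thread * threads_per_block)
        blocks_by_smem = smem_per_sm // max(smem_per_block, 1)
        warps = min(blocks_by_regs, blocks_by_smem) * (threads_per_block // 32)
        return min(warps, max_warps) / float(max_warps)

    print occupancy(64, 32, 4096)   # -> 0.333, the range mentioned above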

Best regards,
Bogdan
