Hello,
I noticed something strange recently. Consider the following kernel:
__global__ void test(float *out)
{
float a[2] = {0,0};
a[0] = 1;
a[1] = 2;
out[0] = a[0];
out[1] = a[1];
}
As far as I understand, a[2] should go into registers. According to
PTX
Hello,
Yet another stupid question. Most probably I missed something
obvious, but can someone explain why I get some NaNs in the
output of the program listed below? Surprisingly, the bug disappears if
I pass '1' instead of '-1' as the third parameter to the function (or
remove the 'int' parameters
Hello,
The project I am working on relies heavily on batched 3D FFTs. You all
know about the situation with CUFFT and PyCuda, and I decided to put
some effort into it. So, I ported Apple's OpenCL implementation
of FFT to PyCuda. You can see the result at
Hi Daniel,
(a sort of awkward situation: I do not know which name I should use as
your first name)
Thank you for telling me about Parret; I did not know that your CUFFT
wrapper code could be found outside this mailing list. Nevertheless, I'll
stick to the version I'm currently using (and remove it from
2010, Bogdan Opanchuk wrote:
Hello,
Yet another stupid question. Most probably, I missed something
obvious, but anyway - can someone explain why I get some NaN's in
output for the program (listed below)? Surprisingly, bug disappears if
I send '1' instead of '-1' as a third parameter to function
I have no other complaints about PyCuda. It just works!
Best regards,
Bogdan
On Tue, Mar 2, 2010 at 6:51 AM, Andreas Klöckner
li...@informa.tiker.net wrote:
Hi Bogdan,
On Sonntag 14 Februar 2010, Bogdan Opanchuk wrote:
The project I am working on relies heavily on batched 3D FFTs. You all
know
Hello all,
I fixed some bugs in my pycudafft module and added PyOpenCL support,
so it is now called just pyfft (which sort of resolves the question
about including it in the PyCuda distribution).
At the moment, the things that annoy me most are:
1. OpenCL performance tests show up to 6
some version check too,
because there will definitely be other bugs on Python 2.4, which is
still used by some Linux distros )
Best regards,
Bogdan
On Thu, Mar 25, 2010 at 11:36 AM, Bogdan Opanchuk manti...@gmail.com wrote:
Hello Imran,
I only tested it on 2.6, so that may be the case. Thanks
,
Imran
Bogdan Opanchuk wrote:
Hello Imran,
kernel.py requires patching too:
- from .kernel_helpers import *
+ from .kernel_helpers import log2, getRadixArray, getGlobalRadixInfo,
getPadding, getSharedMemorySize
I hope this will be enough. Sorry for the inconvenience, I'm going to
commit
Hi Gerald,
I can watch the memory pointers of the gpuarrays increase until I get a
launch error... presumably due to lack of memory.
Are you sure that the failure is caused by a lack of memory? I would
expect that to result in an error during memory allocation, not
during kernel execution.
Hi Michael,
The error message is sort of self-explanatory. You need to make 'nvcc'
(the CUDA compiler) available to the installer. There are two ways to do
that: either add its location (usually /usr/local/cuda/bin) to your $PATH
variable (by modifying your bash profile, for example), or pass the path to
CUDA
Hi all,
I'm observing the following behavior with latest (git-fetched today)
pycuda and opencl versions on Snow Leopard 10.6.4:
$ python
import pycuda.driver
import pyopencl
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File
On Fri, Sep 10, 2010 at 12:18 AM, Andreas Kloeckner
li...@informa.tiker.net wrote:
Are you using the shipped version of Boost in both libraries? If so,
that might present an issue.
Yep, in both. Does it behave in the same way on your system?
Best regards,
Bogdan
Hi Andreas,
On Fri, Sep 10, 2010 at 1:09 AM, Andreas Kloeckner
li...@informa.tiker.net wrote:
Nope, it seems fine on my machine. I guess that means if you'd like to
work with both PyCUDA and PyOpenCL at the same time, you have to build
with external (non-shipped) Boost.
You were right, I
Hi Javier,
It would probably help if you attach the source of the expon_them()
function (since something is definitely happening there).
I'll try to do some psychic debugging though. I find these lines suspicious:
self.weights_lateral = gpuarray.to_gpu(self.weight_matrixLateral())
Kloeckner
li...@informa.tiker.net wrote:
On Sun, 3 Oct 2010 01:44:35 +1000, Bogdan Opanchuk manti...@gmail.com wrote:
Hello all,
I am getting LogicError from gpuarray.to_gpu_async() for some reason. Code:
Any host memory involved in *_async() must be page-locked. FTFY:
import pycuda.autoinit
Hi all,
Consider the following program, which is supposed to check the
randomness of pycuda random number generator:
import pycuda.autoinit
import pycuda.curandom as curandom
import numpy
def test(size, dtype):
a = curandom.rand((size,), dtype=dtype).get()
return numpy.sum(a) /
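The return expression above is cut off in the archive, but the check presumably divides the sum by the number of samples to get the mean. Here is a hedged, CPU-only sketch of the same idea, using numpy.random as a stand-in for pycuda.curandom (the function name and the division by size are assumptions, not the original code):

```python
import numpy

def mean_of_uniforms(size, dtype):
    # curandom.rand produces uniforms on [0, 1); numpy.random is used
    # here as a CPU stand-in so the check runs without a GPU.
    a = numpy.random.rand(size).astype(dtype)
    # Accumulate in float64 to keep the reduction itself out of the picture.
    return numpy.sum(a, dtype=numpy.float64) / size

m = mean_of_uniforms(1000000, numpy.float32)
# For a good uniform generator the mean should be close to 0.5.
```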
Hi Vincent,
On Tue, Oct 19, 2010 at 2:28 AM, Vincent Favre-Nicolin
vincent.favre-nico...@cea.fr wrote:
I'm not sure what is happening exactly, but there is no indication that the
random numbers are *repeating themselves*
However you seem to hit a floating point issue *when computing the
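The kind of floating-point issue alluded to here can be reproduced with plain numpy, no GPU needed: float32 has a 24-bit significand, so small contributions to a large single-precision accumulator get silently dropped.

```python
import numpy

# float32 has a 24-bit significand: 2**24 + 1 is not representable,
# so the increment below is lost entirely.
big = numpy.float32(2 ** 24)
print(big + numpy.float32(1) == big)  # True

# When summing many float32 values, an explicit float64 accumulator
# sidesteps this kind of silent loss:
a = numpy.full(10 ** 6, 0.1, dtype=numpy.float32)
s64 = numpy.sum(a, dtype=numpy.float64)  # close to 100000
```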
Hi Saigopal,
What pyfft version do you use? Can you please post the full testing
code which can be executed to reproduce the bug? Because the code I
composed (basically, added comparison with CPU to your code) works
normally on my desktop with Tesla C2050 (Ubuntu 10.04 x64, Cuda 3.2,
PyCuda
Hi Saigopal,
On Tue, Jan 18, 2011 at 5:14 PM, Saigopal Nelaturi saigo...@gmail.com wrote:
Thanks for the quick response. My operating specs are exactly the same as
yours, and when I run your test I get an error of ~3e-7. But I think that
number may have to do with dividing by the norm of the
Hi Saigopal,
Try adding fast_math=False option when creating plans. It will give
small precision increase (for the cost of performance, of course),
which may be enough for your purposes. This option works only for
single precision.
Best regards,
Bogdan
Hello,
There is some problem with current PyCuda version (most recent commit
from repo). On my Ubuntu 10.04 x64, Python 2.6, Cuda 4.0 after
'submodule update', compilation and installation, _curand cannot be
imported:
import pycuda._curand
Traceback (most recent call last):
File "<stdin>", line
Hello Andreas,
On Sun, Jun 5, 2011 at 5:43 PM, Andreas Kloeckner
li...@informa.tiker.net wrote:
If worst comes to worst, we'll just shove the _curand wrappers back into
the main PyCUDA wrapper binary.
I've done just that, for lack of better ideas.
Scott, Bogdan--can you check whether this
Hello Irwin,
On Mon, Jun 6, 2011 at 2:16 PM, Irwin Zaid irwin.z...@physics.ox.ac.uk wrote:
Anyway, I was wondering if there is a better way to provide this
functionality? In normal CUDA code, this could be done with templates, but
that doesn't seem to be an option here. I know metaprogramming
Hello,
I created the pull request (https://github.com/inducer/pycuda/pull/5)
which fixes this issue for me. People with macs, could you please
check it on your systems?
Best regards,
Bogdan
On Sun, Jun 5, 2011 at 10:16 PM, Bogdan Opanchuk manti...@gmail.com wrote:
Hello Andreas,
On Sun, Jun
Hello,
How about this (very drafty draft, just to illustrate an idea)?
2011/6/7 Andreas Kloeckner li...@informa.tiker.net:
On Tue, 7 Jun 2011 14:16:31 -0400, Frédéric Bastien no...@nouiz.org wrote:
Hi,
I'm preparing a Tutorial about Theano and PyCUDA. Is there any PyCUDA
logo that I can put
And, speaking of parallel snakes, was it something like this (attached)?
2011/6/7 Bogdan Opanchuk manti...@gmail.com:
Hello,
How about this (very drafty draft, just to illustrate an idea)?
2011/6/7 Andreas Kloeckner li...@informa.tiker.net:
On Tue, 7 Jun 2011 14:16:31 -0400, Frédéric
Hello,
I shamelessly stole David's hollowness idea and Andreas' parallel
snakes design and made snakes look more like ones from Python logo -
see variant1.pdf. In addition, there's variant2.pdf inspired by the
Little Prince. These are drafts, of course; neither the shapes nor the
colors are final.
On
Hello,
I finally have the time to contribute something to compyte, so I had a
look at its sources. As far as I understand, at the moment it has:
- sources for GPU platform-dependent memory operations (malloc()/free()/...)
- sources for array class, which uses abstract API of these operations
-
Hello Andreas, Frederic,
2011/6/21 Andreas Kloeckner li...@informa.tiker.net:
On Mon, 20 Jun 2011 09:40:02 -0400, Frédéric Bastien no...@nouiz.org wrote:
Currently there is not a good compilation system for this project as
you saw. What I currently have in mind is that it should
Hello Andreas,
Is there some way to change the shape of a GPUArray object, the same as
can be done with numpy.ndarray? The following naive code raises an
exception on the last line:
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
import numpy
arr = gpuarray.GPUArray((64, 64), numpy.float64)
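The failing line itself is cut off in the archive, but the numpy behavior being asked about is easy to pin down with a CPU-only sketch (reshape returns a view, provided the element count is unchanged):

```python
import numpy

# The numpy.ndarray behavior the question refers to: reshape returns
# a view with a new shape, as long as the total element count matches.
arr = numpy.zeros((64, 64), numpy.float64)
flat = arr.reshape(4096)     # 64 * 64 == 4096, so this is legal
print(flat.shape)            # (4096,)
print(flat.base is arr)      # True: no data was copied, it is a view
```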
Hello,
I just bumped into a certain problem with copying numpy arrays to GPU.
Consider the following code:
---
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
from pycuda.elementwise import ElementwiseKernel
import numpy
arr = numpy.random.randn(50, 50).astype(numpy.float32)
arr_tr =
Hello Andreas,
On Wed, Jul 6, 2011 at 12:04 AM, Andreas Kloeckner
li...@informa.tiker.net wrote:
Ok, we should introduce a warning when to_gpu'ing arrays that are not in C
order. And probably also add a function gpuarray.i_know_about_strides() to
turn that warning off.
Yep, that'll work too.
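The "not in C order" situation discussed above is easy to demonstrate with plain numpy: a transpose is a view that only swaps strides, so its buffer is no longer laid out in row-major order, which is what a raw linear copy to the GPU assumes.

```python
import numpy

arr = numpy.random.randn(50, 50).astype(numpy.float32)
arr_tr = arr.T                       # a view: strides swapped, same buffer
print(arr.flags['C_CONTIGUOUS'])     # True
print(arr_tr.flags['C_CONTIGUOUS'])  # False: unsafe to copy as raw C-order

# ascontiguousarray() materializes a C-order copy that is safe to
# transfer with APIs assuming contiguous row-major data:
arr_c = numpy.ascontiguousarray(arr_tr)
print(arr_c.flags['C_CONTIGUOUS'])   # True
```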
Hello Mikhail,
This program worked without any changes on a Tesla C2050. Such an error
message usually points to an insufficient number of registers on the
device, so try choosing a block size of at most MAX_REGISTERS_PER_BLOCK
(a device attribute) / func.num_regs.
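The bound above amounts to a couple of lines of arithmetic. The numbers below are hypothetical, just to show the shape of the calculation; on a real device they come from the driver (the MAX_REGISTERS_PER_BLOCK device attribute and func.num_regs):

```python
# Hypothetical values, for illustration only:
max_registers_per_block = 16384  # e.g. a compute capability 1.x card
num_regs = 40                    # registers per thread used by the kernel

max_threads = max_registers_per_block // num_regs  # 409
# Round down to a multiple of the warp size (32) for a usable block size:
block_size = (max_threads // 32) * 32
print(block_size)  # 384
```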
Best regards,
Bogdan
On Fri, Jul 8, 2011 at 1:46
Hello Andreas,
Currently CURAND wrapper cannot fill_normal() or fill_uniform() the
array of complex64 or complex128. I can add this functionality, but
first I'd like to clarify some details:
1. Should I add this to XORWOW RNG only? In CURAND *2 functions were
not implemented for Sobol
Hello Tomasz,
Against which commit have you diffed your patch? I was going to run it
on Tesla 2050 (test_gpuarray.py is enough, right?) but I am having
problems trying to apply it.
Best regards,
Bogdan
On Sat, Aug 13, 2011 at 8:41 PM, Tomasz Rybak bogom...@post.pl wrote:
Hello.
I have been
Hello Алексей,
As far as I can see, there are two things you may try.
1. ElementwiseKernel.__call__ calculates necessary grid and block
sizes every time, along with doing some other stuff, which can be
significant if the kernel execution time is of the order of tens of
microseconds. So you can
Hello,
In your example the condition is necessary: if N is some large prime
number, you cannot create a grid/block pair which contains exactly N
threads in total, so you have to skip the excess ones somehow. Moreover,
the if statement is not expensive by itself; it becomes expensive if
it causes
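The sizing logic described above can be sketched in a few lines: round the grid up with a ceiling division, then let each excess thread exit early via the guard (on the device side, something like "if (idx < N) ...;").

```python
def grid_size(n, block_size):
    # Ceiling division: enough blocks to cover all n work items.
    return (n + block_size - 1) // block_size

N = 1009           # a prime: no block size divides it exactly
block = 256
grid = grid_size(N, block)
print(grid)                 # 4 blocks, i.e. 1024 threads in total
print(grid * block - N)     # 15 excess threads skipped by the guard
```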
Hello Apostolis,
There are two errors:
1. You are trying to use a 32x32 block, but this size is only supported
by compute capability 2.0 devices (Teslas and probably other new
cards; look it up in the programming guide). Older cards (such as
mine) only allow a maximum of 512 threads per block, so I
On Wed, Apr 4, 2012 at 9:15 PM, Michiel Bruinink
michiel.bruin...@mapperlithography.com wrote:
First of all, I made a typo in my sample program. The value of 10 should
be 169. That makes those array declarations less problematic, I think.
Much less. This now amounts to ~6kb per thread,
Hi Andrea,
On Tue, Jul 10, 2012 at 11:55 PM, Andrea Cesari
andrea_ces...@hotmail.it wrote:
But if I modify the kernel in this way:
const int i = threadIdx.x + 2;
dest[i] = i;
the result is: [1 0 2 3 4 5 6 7 8 9]
while, in my opinion, it should be [0,0,2,3,4,5,6,7,8,9] (confirmed by C code).
why?
On Wed, Jul 11, 2012 at 12:15 AM, Andrea Cesari
andrea_ces...@hotmail.it wrote:
so, the first two elements of a vector are always garbage?
can i solve it allocating manually the memory? but should be the same of
drv.Out() i think..or no?
The first two elements are garbage because:
1) you have
Hi Andrea,
On Wed, Jul 11, 2012 at 10:25 PM, Andrea Cesari
andrea_ces...@hotmail.it wrote:
__global__ void gpu_kernel(int *corrGpu,int *aMod,int *b,int *kernelSize_h)
{
int j, step1 = kernelSize_h[0]/2; // <---
...
)
When I remove /2 where the arrow points, I get results identical with
the
Hi all,
Some of you may remember compyte discussions last year when I made the
suggestion of creating a library with a compilation of GPGPU
algorithms, working both with PyOpenCL and PyCuda. Long story short, I
have finally found some time and created a prototype. The preliminary
tutorial can be
Hi Frédéric,
On Thu, Jul 19, 2012 at 8:58 AM, Frédéric Bastien no...@nouiz.org wrote:
How useful is it to abstract between PyCUDA
and PyOpenCL? Personally, I probably won't use that part, but I want
to abstract between CUDA and OpenCL.
It was either that or to write almost identical
Hi Andrea,
On Thu, Jul 19, 2012 at 2:39 AM, Andrea Cesari andrea_ces...@hotmail.it wrote:
Hi, this is my code: it takes a 3D array and, for each pixel of the matrix,
finds the minimum and puts it into the corresponding pixel of a matrix b.
Then it compares the result with the CPU.
Obviously, with
Hi Andrea,
On Thu, Jul 19, 2012 at 4:26 PM, Andrea Cesari andrea_ces...@hotmail.it wrote:
The problem is that the results match the CPU only for dim_x and dim_y
smaller than 32.
For higher dimensions the CPU and GPU results are different.
When you change dim_x and dim_y values, do you also
Hi Andrea,
On Thu, Jul 19, 2012 at 4:37 PM, Andrea Cesari andrea_ces...@hotmail.it wrote:
yes.. for example if I do:
dim_x = 33
dim_y = 33
then I change grid and block to this: (32,32,1) and (2,1),
because I get 33*33 = 1089 threads, so grid = 1089/1024 = 1.063, rounded up to 2.
When you do this, you read values
Hi Cédric
On Fri, Aug 17, 2012 at 5:15 PM, Cédric LACZNY cedric.lac...@uni.lu wrote:
Thanks for the suggestion, but it's causing other errors, all of the same
kind, e.g. the following:
kernel.cu(142): error: calling a host function(NVMatrix::eltWiseDivide)
from a __device__/__global__
Hi Cédric,
On Fri, Aug 17, 2012 at 4:49 PM, Cédric LACZNY cedric.lac...@uni.lu wrote:
extern "C"
{
void main_kernel(float* inp_P, unsigned int N, float* mappedX, unsigned int
no_dims) {
// … Some code …
}
}
You have to prefix your exported kernel definition with '__global__'.
See the code
Hi Cédric,
On Fri, Aug 17, 2012 at 6:04 PM, Cédric LACZNY cedric.lac...@uni.lu wrote:
Executing the python script now, gives me the following error:
pycuda.driver.CompileError: nvcc compilation of /tmp/tmpe1ZS7Z/kernel.cu
failed
[command: nvcc --cubin -arch sm_20
Hi Eelco,
On Sun, Aug 26, 2012 at 3:00 AM, Eelco Hoogendoorn e.hoogendo...@uva.nl wrote:
I have some code that I would like to contribute to pycuda. What would the
preferred way of doing so be? Create a branch in git?
Yep. Perhaps the easiest way to do it is by forking PyCuda repo on
github
Hi Mohsen,
On Wed, Sep 5, 2012 at 9:53 PM, mohsen jadidi mohsen.jad...@gmail.com wrote:
pycuda._driver.LaunchError: cuMemcpyDtoH failed: launch failed
PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
cuMemFree failed: launch failed
what would be the reason ?
It means that
Hi Mohsen,
On Thu, Sep 6, 2012 at 3:31 AM, mohsen jadidi mohsen.jad...@gmail.com wrote:
File
/usr/local/lib/python2.7/dist-packages/scikits.cuda-0.042-py2.7.egg/scikits/cuda/linalg.py,
line 323, in dot
c_gpu = gpuarray.empty((n, ldc), x_gpu.dtype)
File
Hi Rui,
On Mon, Nov 5, 2012 at 8:51 AM, Rui Lopes rmlo...@dei.uc.pt wrote:
I've written a kernel to perform a custom dot operation that would work
perfectly if there was not an issue with the memory allocation. Maybe I am
missing something in the mapping process?
From what I understood
Hi Rui,
On Mon, Nov 5, 2012 at 2:36 PM, Rui Lopes rmlo...@dei.uc.pt wrote:
I have built a benchmark for my custom dot kernel, pasted below. It only
outperforms the CPU dot for big sizes, which is to be expected, in my educated guess.
Yes, it is to be expected for your kernel, especially on slow video
cards.
Hi Alex,
Maybe I am misunderstanding, I am not so familiar with the buffer terminology
(having not dealt much with opencl),
It is not really OpenCL-specific; basically it's just a wrapper on top
of pointer arithmetic.
Would the following be sufficient?
a =
Moreover, I do not need the actual data to be copied. I just need a
view to the middle of an existing array.
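What is being asked for here corresponds to numpy's slicing: a slice returns a view that shares the original buffer rather than copying it, and the GPU-side analogue would wrap a pointer offset the same way. A small CPU-only illustration:

```python
import numpy

arr = numpy.arange(8, dtype=numpy.float32)
middle = arr[2:6]    # a view into the middle: no data is copied
middle[:] = -1       # writes go through to the original buffer
print(arr)           # elements 2..5 of arr are now -1
print(middle.base is arr)  # True: middle shares arr's memory
```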
On Sun, Feb 10, 2013 at 10:53 AM, Bogdan Opanchuk manti...@gmail.com wrote:
Hi Alex,
Maybe I am misunderstanding, I am not so familiar with the buffer
terminology (having not dealt
'pycuda.gpuarray.GPUArray'
[ 5. 5.]
[-4. -4.]
[ 5. 5. 0. 0. 0. 0. -4. -4.]
On Sat, Feb 9, 2013 at 7:01 PM, Bogdan Opanchuk manti...@gmail.com wrote:
Moreover, I do not need the actual data to be copied. I just need a
view to the middle of an existing array.
On Sun, Feb 10, 2013 at 10:53 AM, Bogdan
Hi Giuseppe,
It seems that the problem is in these lines:
w, h = src.shape
result = gpuarray.empty((h, w), dtype=src.dtype, order='C')
The order of numpy arrays is row-major, so you should write instead:
h, w = src.shape
result = gpuarray.empty((w, h), dtype=src.dtype,
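The shape convention behind this fix can be checked with plain numpy: shapes are (rows, columns) in row-major (C) order, so unpacking src.shape as (w, h) silently swaps the dimensions. The concrete sizes below are made up for illustration:

```python
import numpy

# numpy shapes are (rows, columns) in row-major (C) order, so for a
# 2D array of height h and width w:
src = numpy.zeros((480, 640), dtype=numpy.float32)  # h = 480, w = 640
h, w = src.shape
print((h, w))            # (480, 640)

# A transposed result therefore has shape (w, h):
result = src.T
print(result.shape)      # (640, 480)
```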
Hi David,
What libraries do you have in cuda_installation_dir/lib?
(cuda_installation_dir is /usr/local/cuda by default). I have both
libcuda.dylib and libcudart.dylib there.
On Mon, Aug 19, 2013 at 1:39 PM, David P. Sanders
dpsand...@ciencias.unam.mx wrote:
Hi,
I am trying to install PyCUDA
Hi Isaac,
You can try my package Reikna (http://reikna.publicfields.net). The
FFT there is somewhat slower than the CUFFT one, but it works with
Python 3.
On Thu, Oct 31, 2013 at 11:54 PM, Isaac Gerg isaac.g...@gergltd.com wrote:
They have no support for python 3.2 64 bit :(
On Oct 31, 2013
Hi Ahmed,
On Fri, Dec 6, 2013 at 12:27 PM, Ahmed Fasih wuzzyv...@gmail.com wrote:
I ran into a similar issue:
http://stackoverflow.com/questions/13187443/nvidia-cufft-limit-on-sizes-and-batches-for-fft-with-scikits-cuda
Batch 1 of 64x1024 complex64 arrays amounts to 5Gb of data, which
Hi Jayanth,
I can run an 8192x8192 transform on a Tesla C2050 without problems. I
think you are limited by the available video memory; see my previous
message in this thread: an 8192x4096 buffer takes 250Mb, and you
have to factor in the temporary buffers PyFFT creates.
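The buffer size quoted above is a one-line computation: an 8192x4096 array of complex64 values (8 bytes each) comes out to 256 MiB, which matches the rough 250Mb figure.

```python
# 8192 x 4096 complex64 elements, 8 bytes each:
nbytes = 8192 * 4096 * 8
print(nbytes)              # 268435456 bytes
print(nbytes / 1024 ** 2)  # 256.0 MiB
# An out-of-place FFT needs at least one extra buffer of the same size,
# so the temporary buffers can easily double the footprint.
```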
By the way, I would
Hi oyster,
I have fixed two things in order to make your program runnable:
- replaced 'numPoint.x' and 'numPoint.y' with 'numPointX' and 'numPointY',
- added 'startTime = time.time()' line before the kernel call
There are the following problems with the code:
- The shape of 'iter' is incorrect:
Hi 金陆,
The \n\n in your code corresponds to two actual newlines in the .cu
file being compiled, not to a literal \n\n string, because escape
sequences are resolved by the Python interpreter at the parsing stage.
See the kernel.cu you quoted for the result: you have 'CUPRINTF('
commented out and an unmatched quote, and
Hello,
Does PyCUDA support struct arguments to kernels? From the Python side
it means an element of an array with a struct dtype (a numpy.void
object), e.g.
dtype = numpy.dtype([('first', numpy.int32), ('second', numpy.int32)])
pair = numpy.empty(1, dtype)[0]
See
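The numpy side of the snippet above can be extended a little to show what such a struct argument looks like on the host: a record dtype matching a C struct of two 32-bit ints, 8 bytes in total.

```python
import numpy

# A record dtype matching a C struct { int first; int second; }:
dtype = numpy.dtype([('first', numpy.int32), ('second', numpy.int32)])
print(dtype.itemsize)   # 8 bytes, matching the C struct

pair = numpy.empty(1, dtype)[0]   # a numpy.void scalar (a view)
pair['first'] = 3
pair['second'] = 4
print(pair)             # (3, 4)
```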
format
On Tue, May 27, 2014 at 2:45 PM, Andreas Kloeckner
li...@informa.tiker.net wrote:
Hi Bogdan,
Bogdan Opanchuk manti...@gmail.com writes:
Thank you for the correction. Just curious, how come in PyOpenCL it
works with rank-0 numpy arrays (which, in my opinion, is more
intuitive than
at 9:39 PM, Bogdan Opanchuk manti...@gmail.com
wrote:
Hi Bruce,
Seems to be a typo in the Wiki. If you look at
http://wiki.tiker.net/PyCuda/Examples/MatrixTranspose (where
MatrixTranspose.py originally comes from), you can see in line 24 two
#defines in one line. Incidentally, if someone has
Hi Thomas,
Does PyCUDA have any support for non-contiguous arrays at all?
> (I've tried implementing my own version, but was unable to figure out how
to map the thread IDs to valid memory addresses in a general way. Any
pointers/)
I have support for custom strides in my Reikna library, and it
ry shouldn't leave
> the GPU).
>
> On 31. Jul 2018, at 09:16, Bogdan Opanchuk wrote:
>
> First of all, are you using multiple contexts or a single one? If you only
> have one context, `Thread(pycuda.autoinit.context)` should be enough for
> Reikna (don't know about scikit-cuda, tho
First of all, are you using multiple contexts or a single one? If you only
have one context, `Thread(pycuda.autoinit.context)` should be enough for
Reikna (don't know about scikit-cuda, though).
Now if you have several contexts, things become more complicated. CUDA
maintains a global context