Re: [Numpy-discussion] SWIGed function does not like my boolean array
On Nov 7, 2007 6:46 PM, Timothy Hochberg [EMAIL PROTECTED] wrote:
On Nov 7, 2007 10:35 AM, Sebastian Haase [EMAIL PROTECTED] wrote:
On Nov 7, 2007 5:23 PM, Matthieu Brucher [EMAIL PROTECTED] wrote:

I don't understand. I'm thinking of most math functions in the C-library. In C a boolean is just an integer of 0 or 1 (quasi, by definition). Could you explain what you mean?

In C++, bool is a new type that has two values, true and false. If you add true and true, it is still true, and not 2. In C, everything that is not 0 is true; that is not so in C++.

Yes, I know this. But my situation is the other way around. Let's say I want to count foreground pixels in an image: I would want to sum all the true values, i.e. a true *is* a 1 and a false *is* a 0. In other words, I'm really thinking of (the older kind of) C, where there *was* no bool. I assume this thinking still applies to the internal arithmetic of CPUs today. Also, the bit values of a boolean array (in memory) are set this way already anyway! How can I simply call my functions looking at these bit values? (essentially interpreting a boolean true as 1 and false as 0)

I'm not sure how well this would work, but could you change the dtype before passing the array to your function? If you wanted a copy, you could just do the equivalent of a.astype(uint8). However, if you didn't want a copy, you could set the dtype to uint8, operate on the array and then reset it to bool:

>>> a = np.array([True, True, False, True])
>>> a
array([ True,  True, False,  True], dtype=bool)
>>> a.dtype = np.uint8
>>> a
array([1, 1, 0, 1], dtype=uint8)
>>> # do something with 'a' here
>>> a.dtype = bool
>>> a
array([ True,  True, False,  True], dtype=bool)

This assumes everything is single threaded. If you have multiple threads accessing 'a', this could be a problem... And, you probably want to do this in C, so translate as appropriate. -tim

Thanks Tim, this sounds like a good idea. How about creating a view with a = a.view() before changing the dtype? This should make the proposed solution thread safe again. How expensive is the creation of a view (performance-wise, e.g. compared to calling a trivial C function)?

Thanks, Sebastian
___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
[Numpy-discussion] Unnecessarily bad performance of elementwise operators with Fortran-arrays
Hi! I wonder why simple elementwise operations like a * 2 or a + 1 are not performed in order of increasing memory addresses in order to exploit CPU caches etc. - as it is now, their speed drops by a factor of around 3 simply by transpose()ing. Similarly (but even less logical), copy() and even the constructor are affected (yes, I understand that copy() creates contiguous arrays, but shouldn't it respect/retain the order nevertheless?):

### constructor ###
In [89]: %timeit -r 10 -n 100 numpy.ndarray((3,3,3))
100 loops, best of 10: 1.19 µs per loop
In [90]: %timeit -r 10 -n 100 numpy.ndarray((3,3,3), order='f')
100 loops, best of 10: 2.19 µs per loop

### copy 3x3x3 array ###
In [85]: a = numpy.ndarray((3,3,3))
In [86]: %timeit -r 10 a.copy()
100 loops, best of 10: 1.14 µs per loop
In [87]: a = numpy.ndarray((3,3,3), order='f')
In [88]: %timeit -r 10 -n 100 a.copy()
100 loops, best of 10: 3.39 µs per loop

### copy 256x256x256 array ###
In [74]: a = numpy.ndarray((256,256,256))
In [75]: %timeit -r 10 a.copy()
10 loops, best of 10: 119 ms per loop
In [76]: a = numpy.ndarray((256,256,256), order='f')
In [77]: %timeit -r 10 a.copy()
10 loops, best of 10: 274 ms per loop

### fill ###
In [79]: a = numpy.ndarray((256,256,256))
In [80]: %timeit -r 10 a.fill(0)
10 loops, best of 10: 60.2 ms per loop
In [81]: a = numpy.ndarray((256,256,256), order='f')
In [82]: %timeit -r 10 a.fill(0)
10 loops, best of 10: 60.2 ms per loop

### power ###
In [151]: a = numpy.ndarray((256,256,256))
In [152]: %timeit -r 10 a ** 2
10 loops, best of 10: 124 ms per loop
In [153]: a = numpy.asfortranarray(a)
In [154]: %timeit -r 10 a ** 2
10 loops, best of 10: 458 ms per loop

### addition ###
In [160]: a = numpy.ndarray((256,256,256))
In [161]: %timeit -r 10 a + 1
10 loops, best of 10: 139 ms per loop
In [162]: a = numpy.asfortranarray(a)
In [163]: %timeit -r 10 a + 1
10 loops, best of 10: 465 ms per loop

### fft ###
In [146]: %timeit -r 10 numpy.fft.fft(vol, axis=0)
10 loops, best of 10: 1.16 s per loop
In [148]: %timeit -r 10 numpy.fft.fft(vol0, axis=2)
10 loops, best of 10: 1.16 s per loop
In [149]: vol.flags
Out[149]:
  C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  UPDATEIFCOPY : False
In [150]: vol0.flags
Out[150]:
  C_CONTIGUOUS : False
  F_CONTIGUOUS : True
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  UPDATEIFCOPY : False
In [9]: %timeit -r 10 numpy.fft.fft(vol0, axis=0)
10 loops, best of 10: 939 ms per loop

### mean ###
In [173]: %timeit -r 10 vol.mean()
10 loops, best of 10: 272 ms per loop
In [174]: %timeit -r 10 vol0.mean()
10 loops, best of 10: 683 ms per loop

### max ###
In [175]: %timeit -r 10 vol.max()
10 loops, best of 10: 63.8 ms per loop
In [176]: %timeit -r 10 vol0.max()
10 loops, best of 10: 475 ms per loop

### min ###
In [177]: %timeit -r 10 vol.min()
10 loops, best of 10: 63.8 ms per loop
In [178]: %timeit -r 10 vol0.min()
10 loops, best of 10: 476 ms per loop

### rot90 ###
In [10]: %timeit -r 10 numpy.rot90(vol)
10 loops, best of 10: 6.97 µs per loop
In [12]: %timeit -r 10 numpy.rot90(vol0)
10 loops, best of 10: 6.92 µs per loop

-- Ciao, / / /--/ / / ANS
___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
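[Editorial note: as a user-side stopgap (not the library-level fix being asked for here), the fast path can often be recovered by hand: for a Fortran-order array, a.T is a C-contiguous view, so operating on it walks memory in address order, and transposing the result back restores the original axis order. A minimal sketch of that workaround, my own illustration rather than anything from the original post:]

import numpy

a = numpy.asfortranarray(numpy.zeros((256, 256, 256)))

# slow path in numpy 1.0.x: the Fortran-ordered data is walked in C index order
slow = a + 1

# workaround: a.T is a C-contiguous *view* (no copy), so the elementwise loop
# runs in memory order; transposing back gives the original axis order again
fast = (a.T + 1).T

assert numpy.array_equal(slow, fast)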
Re: [Numpy-discussion] Unnecessarily bad performance of elementwise operators with Fortran-arrays
Hans Meine wrote:
Hi! I wonder why simple elementwise operations like a * 2 or a + 1 are not performed in order of increasing memory addresses in order to exploit CPU caches etc. - as it is now, their speed drops by a factor of around 3 simply by transpose()ing.

Because it is not trivial to do so in all cases, I guess. It is a problem which comes back from time to time on the ML, but AFAIK nobody has a fix for it. Fundamentally, for many element-wise operations, either you have to implement the thing for every possible case, or you abstract it through an iterator, which gives you a decrease in performance in some cases. There are also cases where the current implementation is far from optimal, for lack of manpower I guess (taking a look at PyArray_Mean, for example, shows that it uses PyArray_GenericReduceFunction, which is really slow compared to a straight C implementation).

Similarly (but even less logical), copy() and even the constructor are affected (yes, I understand that copy() creates contiguous arrays, but shouldn't it respect/retain the order nevertheless?):

I don't see why it is illogical: when you do a copy, you don't preserve memory layout, and so a simple memcpy of the whole buffer is not possible.

cheers, David

[...]
___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Unnecessarily bad performance of elementwise operators with Fortran-arrays
Am Donnerstag, 08. November 2007 13:44:59 schrieb David Cournapeau:
Hans Meine wrote:
I wonder why simple elementwise operations like a * 2 or a + 1 are not performed in order of increasing memory addresses in order to exploit CPU caches etc. - as it is now, their speed drops by a factor of around 3 simply by transpose()ing.

Because it is not trivial to do so in all cases, I guess. It is a problem which comes back from time to time on the ML, but AFAIK nobody has a fix for it. Fundamentally, for many element-wise operations, either you have to implement the thing for every possible case, or you abstract it through an iterator, which gives you a decrease in performance in some cases.

OK, so far I have not looked at the NumPy internals, but I would expect something like the following:

[elementwise operation on array 'a']
1) ii = [i for i, s in sorted(enumerate(a.strides), key = lambda (i, s): -s)]
2) aprime = a.transpose(ii)  # aprime will be as C-contiguous as it gets
3) bprime = perform_operation(aprime)  # fast elementwise operation
4) jj = [ii.index(i) for i in range(bprime.ndim)]  # inverse of ii
5) b = bprime.transpose(jj)  # let the result have the same order as the input

Apart from the fact that this is more or less pseudo-code and that the last step brings a behaviour change (which is intended, but probably not very welcome), am I overlooking a problem?

Similarly (but even less logical), copy() and even the constructor are affected (yes, I understand that copy() creates contiguous arrays, but shouldn't it respect/retain the order nevertheless?):

I don't see why it is illogical: when you do a copy, you don't preserve memory layout, and so a simple memcpy of the whole buffer is not possible.

As I wrote above, I don't think this is good. A fortran-order-contiguous array is still contiguous, and not inferior in any way to C-order arrays. So I actually expect copy() to return an array of unchanged order. The numpy book only says "copy() - Return a copy of the array (which is always single-segment, and ALIGNED).", which would also be fulfilled if the memory order were preserved, as by my proposed method above.

-- Ciao, / / /--/ / / ANS
___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
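[Editorial note: for concreteness, a runnable rendering of the five steps above. This is my own sketch; perform_operation stands for whatever elementwise kernel gets applied, and the wrapper name is made up.]

import numpy

def apply_elementwise(a, perform_operation):
    # 1) axis indices sorted by decreasing stride, so the transposed view
    #    is "as C-contiguous as it gets"
    ii = [i for i, s in sorted(enumerate(a.strides), key=lambda t: -t[1])]
    # 2) reorder the axes without copying
    aprime = a.transpose(ii)
    # 3) fast elementwise operation over the memory-ordered view
    bprime = perform_operation(aprime)
    # 4) invert the axis permutation
    jj = [ii.index(i) for i in range(bprime.ndim)]
    # 5) let the result be indexed like the input
    return bprime.transpose(jj)

a = numpy.asfortranarray(numpy.arange(24.0).reshape(2, 3, 4))
b = apply_elementwise(a, lambda x: x + 1)
assert numpy.array_equal(b, a + 1)
assert b.flags['F_CONTIGUOUS']   # the result keeps the input's memory order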
Re: [Numpy-discussion] SWIGed function does not like my boolean array
On Nov 8, 2007 3:28 AM, Sebastian Haase [EMAIL PROTECTED] wrote:
[...]
Thanks Tim, this sounds like a good idea. How about creating a view with a = a.view() before changing the dtype?

Yeah. That's a better idea. You should be able to just use a.view(np.uint8).

This should make the proposed solution thread safe again. How expensive is the creation of a view (performance-wise, e.g. compared to calling a trivial C function)?

It should be pretty cheap in theory, but I have no idea really.

-- . __ . |-\ . . [EMAIL PROTECTED]
___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
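[Editorial note: a small sketch of the view approach for the use case Sebastian mentions (counting foreground pixels with a wrapped C routine that expects uint8 data); count_nonzero_uint8 is only a stand-in for the actual SWIGed function.]

import numpy as np

def count_nonzero_uint8(arr):
    # stand-in for the SWIGed C function that only accepts uint8 arrays
    return int(arr.sum())

mask = np.array([[True, False], [True, True]])

# reinterpret the boolean buffer as uint8 without copying; the original
# array keeps its bool dtype, so other threads reading it are unaffected
mask_u8 = mask.view(np.uint8)

print(count_nonzero_uint8(mask_u8))   # -> 3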
Re: [Numpy-discussion] Unnecessarily bad performance of elementwise operators with Fortran-arrays
Am Donnerstag, 08. November 2007 16:37:06 schrieb David Cournapeau: The problem is not F vs C storage: for element-wise operation, it does not matter at all; you just apply the same function (perform_operation) over and over on every element of the array. The order does not matter at all. Yet Fortran order leads to several times slower operations, see the figures in my original post. :-( But what if you have segmented buffers, non aligned, etc... arrays ? The code I posted should deal with it - by sorting the indices by decreasing stride, I simply ensure that all (source and target) segments are traversed in order of increasing memory addresses. It does not affect segments or alignment. All this has to be taken care of, Right - my perform_operation(aprime) step would need to apply the operation on every memory segment. and this means handling reference count and other things which are always delicate to handle well... I am not sure that I understand where refcounting comes into play here. Or you just use the current situation, which let python handle it (through PyObject_Call and a callable, at a C level). I need to look at the code to see what you mean here. Probably, I have a wrong picture of where the slowness comes from (I thought that the order of the loops was wrong). As I wrote above, I don't think this is good. A fortran-order-contiguous array is still contiguous, and not inferior in any way to C-order arrays. So I actually expect copy() to return an array of unchanged order. Maybe this behaviour was kept for compatibility with numarray ? If you look at the docstring, it is said that copy may not return the same order than its input. It kind of make sense to me: C order is the default of many numpy operations. That is very sad, because Fortran order is much more natural for handling images, where you're absolutely used to index with [x,y], and x being the faster-changing index in memory. -- Ciao, / / /--/ / / ANS ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Unnecessarily bad performance of elementwise operators with Fortran-arrays
On Nov 8, 2007 10:13 PM, Hans Meine [EMAIL PROTECTED] wrote:
Am Donnerstag, 08. November 2007 13:44:59 schrieb David Cournapeau:
Hans Meine wrote:
I wonder why simple elementwise operations like a * 2 or a + 1 are not performed in order of increasing memory addresses in order to exploit CPU caches etc. - as it is now, their speed drops by a factor of around 3 simply by transpose()ing.

Because it is not trivial to do so in all cases, I guess. It is a problem which comes back from time to time on the ML, but AFAIK nobody has a fix for it. Fundamentally, for many element-wise operations, either you have to implement the thing for every possible case, or you abstract it through an iterator, which gives you a decrease in performance in some cases.

OK, so far I have not looked at the NumPy internals, but I would expect something like the following:

[elementwise operation on array 'a']
1) ii = [i for i, s in sorted(enumerate(a.strides), key = lambda (i, s): -s)]
2) aprime = a.transpose(ii)  # aprime will be as C-contiguous as it gets
3) bprime = perform_operation(aprime)  # fast elementwise operation
4) jj = [ii.index(i) for i in range(bprime.ndim)]  # inverse of ii
5) b = bprime.transpose(jj)  # let the result have the same order as the input

The problem is not F vs C storage: for element-wise operation, it does not matter at all; you just apply the same function (perform_operation) over and over on every element of the array. The order does not matter at all. But what if you have segmented buffers, non aligned, etc... arrays ? All this has to be taken care of, and this means handling reference count and other things which are always delicate to handle well... Or you just use the current situation, which let python handle it (through PyObject_Call and a callable, at a C level).

I agree that in some cases, it would be useful to have better performance: in particular, having fast code paths for common cases (aligned, and single segmented) would be nice. One example is the PyArray_Clip function (in multiarray.c), whose speed was too slow for my taste in my use cases: the general idea is to get to the point where you have a single C buffer, at which point you can call a trivial C function (fast_clip). As you can see, this is not trivial (~150 lines of C code): since the point is to go faster, you really want to avoid copying things, which means you have to be pretty careful. Speed-wise, it is definitely worth it, though: I don't remember the exact numbers, but in some cases (which are very common), it was several times faster.

Apart from the fact that this is more or less pseudo-code and that the last step brings a behaviour change (which is intended, but probably not very welcome), am I overlooking a problem?

Similarly (but even less logical), copy() and even the constructor are affected (yes, I understand that copy() creates contiguous arrays, but shouldn't it respect/retain the order nevertheless?):

I don't see why it is illogical: when you do a copy, you don't preserve memory layout, and so a simple memcpy of the whole buffer is not possible.

As I wrote above, I don't think this is good. A fortran-order-contiguous array is still contiguous, and not inferior in any way to C-order arrays. So I actually expect copy() to return an array of unchanged order.

Maybe this behaviour was kept for compatibility with numarray? If you look at the docstring, it is said that copy may not return the same order as its input. It kind of makes sense to me: C order is the default of many numpy operations.
David ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
[Numpy-discussion] saving and loading as PNG
Hi all, I am working with numpy and mostly 16 bit matrices, and so I was looking for a standard way of saving them. I found that PNG actually supports that, but I could not find a way to use this feature from within python. I thought that it actually is a cool way of storing matrices and thus I wrote a little pyrex program that does the job. I figured this might be interesting to a broader audience, so I posted it here, maybe for inclusion in numpy?

Some more thoughts:
* Other implementations: There are other people who have done such a thing. PIL knows how to read and write PNG, but only 8 bit. The same holds for matplotlib.
* Problems: until now, numpy does not link against libpng. This should not be a problem on linux-like systems, as those normally all have libpng installed. Windows, however, I don't know. Same thing with pyrex; numpy doesn't use pyrex so far.
* Other places to put it: numpy could be considered too basic to have such functionality; one could put it in scipy. Well, but png is not really a _scientific_ application (even though I personally use it to do science). Matplotlib would be another nice place to put this feature; a big advantage is that matplotlib already links against libpng.

So, now I shared my thoughts with you, comments? Greetings, Martin

And here the code (188 lines are hopefully ok for a mailinglist?):

cimport c_numpy
import numpy

cdef extern from "stdlib.h":
    cdef void *malloc(int)
    cdef void free(void *)

cdef extern from "stdio.h":
    cdef struct file:
        pass
    ctypedef file FILE
    cdef FILE *fopen(char *, char *)
    cdef int fclose(FILE *)

cdef extern from "errno.h":
    cdef int errno

cdef extern from "png.h":
    cdef struct jmp_buf:
        pass
    cdef int setjmp(jmp_buf)
    cdef struct png_struct:
        jmp_buf jmpbuf
    ctypedef png_struct *png_structp
    cdef struct png_info:
        int width
        int height
        int bit_depth
        int color_type
        int channels
    ctypedef png_info *png_infop
    cdef struct png_text:
        int compression
        char *key
        char *text
        int text_length
    ctypedef png_text *png_textp
    cdef png_structp png_create_read_struct(char *, void *, void *, void *)
    cdef void png_init_io(png_structp, FILE *)
    cdef void png_set_sig_bytes(png_structp, int)
    cdef png_infop png_create_info_struct(png_structp)
    cdef void png_read_info(png_structp, png_infop)
    cdef int png_set_interlace_handling(png_structp)
    cdef void png_read_update_info(png_structp, png_infop)
    cdef void png_read_image(png_structp, unsigned char **)
    cdef void png_read_png(png_structp, png_infop, int, void *)
    cdef void png_read_end(png_structp, png_infop)
    cdef void png_destroy_read_struct(png_structp *, png_infop *, void *)
    cdef png_structp png_create_write_struct(char *, void *, void *, void *)
    cdef void png_set_IHDR(png_structp, png_infop, unsigned int, unsigned int, int, int, int, int, int)
    cdef void png_write_info(png_structp, png_infop)
    cdef void png_write_image(png_structp, unsigned char **)
    cdef void png_write_end(png_structp, png_infop)
    cdef void png_destroy_write_struct(png_structp *, png_infop *)
    cdef void png_set_swap(png_structp)
    cdef void png_set_rows(png_structp, png_infop, unsigned char **)
    cdef void png_write_png(png_structp, png_infop, int, void *)
    cdef void png_set_text(png_structp, png_infop, png_textp, int)
    cdef int PNG_COLOR_TYPE_GRAY
    cdef int PNG_COLOR_TYPE_PALETTE
    cdef int PNG_COLOR_TYPE_RGB
    cdef int PNG_COLOR_TYPE_RGB_ALPHA
    cdef int PNG_COLOR_TYPE_GRAY_ALPHA
    cdef int PNG_INTERLACE_NONE
    cdef int PNG_COMPRESSION_TYPE_BASE
    cdef int PNG_FILTER_TYPE_BASE
    cdef int PNG_TRANSFORM_IDENTITY
    cdef int PNG_TRANSFORM_SWAP_ENDIAN
    cdef char *PNG_LIBPNG_VER_STRING

def load(fn):
    """Load a PNG file into a NumPy array.

    This function can read all kinds of PNG images in a senseful way:
    depending on the number of bits per pixel, we choose numpy.bool,
    numpy.uint8 or numpy.uint16 as base type. For gray scale images or
    palettes, this returns a 2-D array, for everything else a 3-D array,
    where the third dimension has a size of 2 for gray+alpha images,
    3 for RGB and 4 for RGBA images."""
    cdef FILE *f
    f = fopen(fn, "rb")
    if f == NULL:
        raise IOError(errno)
    cdef png_structp p
    p = png_create_read_struct(PNG_LIBPNG_VER_STRING, NULL, NULL, NULL)
    if setjmp(p.jmpbuf):
        fclose(f)
        raise RuntimeError("libpng reported error")
    cdef png_infop i
    i = png_create_info_struct(p)
    png_init_io(p, f)
    png_read_info(p, i)
    if i.color_type == PNG_COLOR_TYPE_GRAY:
        shape = (i.height, i.width)
    elif i.color_type == PNG_COLOR_TYPE_GRAY_ALPHA:
        shape = (i.height, i.width, 2)
    elif i.color_type == PNG_COLOR_TYPE_PALETTE:
        shape = (i.height, i.width)
    elif i.color_type == PNG_COLOR_TYPE_RGB:
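[Editorial note: the posted code breaks off at this point in the archive. For completeness, a usage sketch of the proposed reader, assuming the finished Pyrex module is compiled under the hypothetical name png16:]

import png16   # hypothetical name for the compiled Pyrex module above

img = png16.load("some_16bit_gray.png")
# for a 16-bit grayscale PNG this should be a 2-D array with dtype uint16,
# e.g. shape (512, 512)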
Re: [Numpy-discussion] saving and loading as PNG
Am Donnerstag, 08. November 2007 16:25:57 schrieb Martin Teichmann: Some more thoughts: * Other implementations: There is other people who have done such a thing. the PIL knows how to read and write PNG, but only 8 bit. The same holds for matplotlib. Our VIGRA imaging library can read and save PNGs with 8/16 bit data, as well as several kinds of TIFF files (including float data). We are currently trying to re-model our python bindings (which have not yet been officially released) using NumPy containers for the images. Unfortunately, we seem to get a performance penalty by the common convention that the order of the indices is x,y, e.g. colorImage[x,y] == [r,g,b]. See the other current thread. -- Ciao, / / /--/ / / ANS ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
[Numpy-discussion] numpy 1.04 numpy.test() hang
Good morning. I just installed the Windows binary of numpy 1.04. When I ran numpy.test() in IDLE (the Python shell that comes with Python), the program hangs (or at least has been running for half an hour). I am using Windows XP with a dual core Intel CPU. Does anyone know what is going on? Thanks, Geoffrey ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Unnecessarily bad performance of elementwise operators with Fortran-arrays
On Nov 9, 2007 1:55 AM, Hans Meine [EMAIL PROTECTED] wrote: Am Donnerstag, 08. November 2007 17:31:40 schrieb David Cournapeau: This is because the current implementation for at least some of the operations you are talking about are using PyArray_GenericReduce and other similar functions, which are really high level (they use python callable, etc..). This is easier, because you don't have to care about anything (type, etc...), but this means that python runtime is handling all the thing. I suspected that after your last post, but that's really bad for pointwise operations on a contiguous, aligned array. A simple transpose() should really not make any difference here. It should not, but in practise, it is not so easy to do. AFAIK, even matlab has the same problem, with less difference, though. And they have much more ressources than numpy. Not to say that we cannot do better, but this takes time. Instead, you should use a pure C implementation (by pure C, I mean a C function totally independant of python, only dealing with standard C types). This would already lead a significant performance leap. AFAICS, it would be much more elegant and easier to implement this using C++ templates. We have a lot of experience with such a design from our VIGRA library ( http://kogs-www.informatik.uni-hamburg.de/~koethe/vigra/ ), which is an imaging library based on the STL concepts (and some necessary and convenient extensions for higher-dimensional arrays and a more flexible API). I am not very keen on writing hundreds of lines of C code for things that can easily be handled with C++ functors. But I don't think that I am the first to propose this, and I know that C has some advantages (faster compilation; are there more? ;-) ) The advantages of C over C++ are much more than compilation speed: it is actually portable, is simple, and easily callable from other languages, 3 significant points that C++ does not meet at all. Generally, I don't see much benefit for using low level C++ in numerical code ; I consider most numerical-based container a total failure (the fact that after 20 years, there is still nothing close to a standard for even a matrix concept in C++ is for me quite striking). I think C++ 's aspect which would be more useful in numpy is RAII (I must confess that I personnally don't like C++ very much, in particular when template are involved). This is only my opinion, I don't know the opinion on other people on this. Yes, I think it does. It probably depends on the sizes of the segments though. If you have a multi-segment box-sub-range of a large dataset (3D volume or even only 2D), processing each contiguous row (column/...) at once within the inner loop definitely makes a difference. I.e. as long as one dimension is not strided (and the data's extent in this dimension is not too small), it should be handled in the inner loop. The other loops probably don't make a big difference. If you have contiguous segments in subranges, then this is already handled through the ufunc mechanism, but I don't know anything about their actual implementation. Other people much more knowledgable than me can give you more info here. David ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Unnecessarily bad performance of elementwise operators with Fortran-arrays
On Nov 9, 2007 12:50 AM, Hans Meine [EMAIL PROTECTED] wrote: Am Donnerstag, 08. November 2007 16:37:06 schrieb David Cournapeau: The problem is not F vs C storage: for element-wise operation, it does not matter at all; you just apply the same function (perform_operation) over and over on every element of the array. The order does not matter at all. Yet Fortran order leads to several times slower operations, see the figures in my original post. :-( This is because the current implementation for at least some of the operations you are talking about are using PyArray_GenericReduce and other similar functions, which are really high level (they use python callable, etc..). This is easier, because you don't have to care about anything (type, etc...), but this means that python runtime is handling all the thing. Instead, you should use a pure C implementation (by pure C, I mean a C function totally independant of python, only dealing with standard C types). This would already lead a significant performance leap. Once you do that, you should not see difference between F and C storage, normally. But what if you have segmented buffers, non aligned, etc... arrays ? The code I posted should deal with it - by sorting the indices by decreasing stride, I simply ensure that all (source and target) segments are traversed in order of increasing memory addresses. It does not affect segments or alignment. If you have segmented addresses, I don't think the ordering matters much anymore, for memory access, no ? For example, All this has to be taken care of, Right - my perform_operation(aprime) step would need to apply the operation on every memory segment. and this means handling reference count and other things which are always delicate to handle well... I am not sure that I understand where refcounting comes into play here. Or you just use the current situation, which let python handle it (through PyObject_Call and a callable, at a C level). I need to look at the code to see what you mean here. Probably, I have a wrong picture of where the slowness comes from (I thought that the order of the loops was wrong). As I wrote above, I don't think this is good. A fortran-order-contiguous array is still contiguous, and not inferior in any way to C-order arrays. So I actually expect copy() to return an array of unchanged order. Maybe this behaviour was kept for compatibility with numarray ? If you look at the docstring, it is said that copy may not return the same order than its input. It kind of make sense to me: C order is the default of many numpy operations. That is very sad, because Fortran order is much more natural for handling images, where you're absolutely used to index with [x,y], and x being the faster-changing index in memory. -- Ciao, / / /--/ / / ANS ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Unnecessarily bad performance of elementwise operators with Fortran-arrays
This discussion makes me wonder if the basic element-wise operations could (should?) be special cased for contiguous arrays, reducing them to simple pointer incrementing from the start to the finish of the data block. The same code would work for C and Fortran order arrays, and be pretty simple. This would address Hans' issue, no? It's a special case but a common one. -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/ORR(206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception [EMAIL PROTECTED] ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
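[Editorial note: in Python terms, the observation is that a single-segment array can be handed to one flat start-to-finish loop regardless of axis order. A rough sketch of that dispatch, purely illustrative — the real change would live in numpy's C loops:]

import numpy

def flat_memory_view(a):
    # return a 1-D view over a's buffer in memory order,
    # or None if a is not one contiguous segment
    if a.flags['C_CONTIGUOUS']:
        return a.reshape(-1)
    if a.flags['F_CONTIGUOUS']:
        return a.T.reshape(-1)   # a.T of an F-contiguous array is C-contiguous
    return None

a = numpy.asfortranarray(numpy.zeros((256, 256, 256)))
flat = flat_memory_view(a)
if flat is not None:
    flat += 1.0    # one simple pass over the data block, same code for C and F order
else:
    a += 1.0       # fall back to the general strided path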
Re: [Numpy-discussion] Unnecessarily bad performance of elementwise operators with Fortran-arrays
Am Donnerstag, 08. November 2007 17:31:40 schrieb David Cournapeau: This is because the current implementation for at least some of the operations you are talking about are using PyArray_GenericReduce and other similar functions, which are really high level (they use python callable, etc..). This is easier, because you don't have to care about anything (type, etc...), but this means that python runtime is handling all the thing. I suspected that after your last post, but that's really bad for pointwise operations on a contiguous, aligned array. A simple transpose() should really not make any difference here. Instead, you should use a pure C implementation (by pure C, I mean a C function totally independant of python, only dealing with standard C types). This would already lead a significant performance leap. AFAICS, it would be much more elegant and easier to implement this using C++ templates. We have a lot of experience with such a design from our VIGRA library ( http://kogs-www.informatik.uni-hamburg.de/~koethe/vigra/ ), which is an imaging library based on the STL concepts (and some necessary and convenient extensions for higher-dimensional arrays and a more flexible API). I am not very keen on writing hundreds of lines of C code for things that can easily be handled with C++ functors. But I don't think that I am the first to propose this, and I know that C has some advantages (faster compilation; are there more? ;-) ) - what is the opinion on this in the SciPy community? If you have segmented addresses, I don't think the ordering matters much anymore, for memory access, no ? Yes, I think it does. It probably depends on the sizes of the segments though. If you have a multi-segment box-sub-range of a large dataset (3D volume or even only 2D), processing each contiguous row (column/...) at once within the inner loop definitely makes a difference. I.e. as long as one dimension is not strided (and the data's extent in this dimension is not too small), it should be handled in the inner loop. The other loops probably don't make a big difference. -- Ciao, / / /--/ / / ANS ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
[Numpy-discussion] Warning or error on conversion from complex to float.
Hi, is it possible or easy to add a warning and/or error when array assignments are made that lose information? I just got caught with the following type of code:

def f():
    return numpy.array([1j, 2.0])

x = numpy.empty((2,), dtype=float)
x[:] = f()

I am pre-allocating arrays for speed, but made a change to f() resulting in complex results. My code now silently ignores the imaginary part. It would have been very helpful if the assignment had checked that the conversion was not going to lose information (presumably with an assert so that performance is not affected outside of debugging). Any simple suggestions on how to avoid this problem in the future? How hard would it be to add such a check in numpy?

Thanks, Michael.
___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
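[Editorial note: not an answer to the numpy-internals question, but a debug-time workaround is to funnel such assignments through a small helper. A sketch only; the helper name is made up, and the assert vanishes under python -O, as Michael asks for.]

import numpy

def debug_assign(dst, values):
    # assign into dst[:], asserting that no imaginary part is silently dropped
    values = numpy.asarray(values)
    if numpy.iscomplexobj(values) and not numpy.iscomplexobj(dst):
        assert not numpy.any(values.imag), \
            "assignment would silently discard a nonzero imaginary part"
    dst[:] = values

def f():
    return numpy.array([1j, 2.0])

x = numpy.empty((2,), dtype=float)
debug_assign(x, f())   # raises AssertionError instead of losing data silently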
Re: [Numpy-discussion] numpy 1.04 numpy.test() hang
Geoffrey Zhu wrote: Good morning. I just installed the Windows binary of numpy 1.04. When I ran numpy.test() in IDLE (the Python shell that comes with Python), the program hang (or at least is running for half an hour). I am using Windows XP, duel core intel CPU. Does anyone know what is going on? No. Run numpy.test(verbosity=2) so it will print out each test name before running the test. Then we might have some idea about where the hang is coming from. -- Robert Kern I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth. -- Umberto Eco ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Unnecessarily bad performance of elementwise operators with Fortran-arrays
On Nov 9, 2007 3:28 AM, Christopher Barker [EMAIL PROTECTED] wrote:
This discussion makes me wonder if the basic element-wise operations could (should?) be special cased for contiguous arrays, reducing them to simple pointer incrementing from the start to the finish of the data block.

I don't see why it could not. I think that most of the implementation could be common to most functions, because they share a lot (handling the out argument, the axis argument, and sometimes the dtype argument). The only different part would be the actual computation by a C function, which is trivial as you pointed out.

cheers, David
___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] numpy 1.04 numpy.test() hang
On Thu, 08 Nov 2007 12:12:42 -0600 Robert Kern [EMAIL PROTECTED] wrote: Geoffrey Zhu wrote: Good morning. I just installed the Windows binary of numpy 1.04. When I ran numpy.test() in IDLE (the Python shell that comes with Python), the program hang (or at least is running for half an hour). I am using Windows XP, duel core intel CPU. Does anyone know what is going on? No. Run numpy.test(verbosity=2) so it will print out each test name before running the test. Then we might have some idea about where the hang is coming from. -- Robert Kern I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth. -- Umberto Eco ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion I can confirm the problem on an AMD x86_64 with python2.5.1. The test hangs at check_cdouble (numpy.tests.test_linalg.test_det) Nils ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] numpy 1.04 numpy.test() hang
On Nov 8, 2007 12:12 PM, Robert Kern [EMAIL PROTECTED] wrote: Geoffrey Zhu wrote: Good morning. I just installed the Windows binary of numpy 1.04. When I ran numpy.test() in IDLE (the Python shell that comes with Python), the program hang (or at least is running for half an hour). I am using Windows XP, duel core intel CPU. Does anyone know what is going on? No. Run numpy.test(verbosity=2) so it will print out each test name before running the test. Then we might have some idea about where the hang is coming from. -- Robert Kern I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth. -- Umberto Eco ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion Thanks for the hint, Robert. It hangs on check_cdouble (numpy.tests.test_linalg.test_det). Also testip_not_allclose() had three warnings. I guess that's probably okay. testip_not_allclose (numpy.core.tests.test_numeric.test_allclose_inf) ... Warning: invalid value encountered in absolute Warning: invalid value encountered in absolute Warning: invalid value encountered in less_equal ok Thanks, Geoffrey ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Warning or error on conversion from complex to float.
On Nov 8, 2007 11:32 AM, Michael McNeil Forbes [EMAIL PROTECTED] wrote:
[...]
Any simple suggestions on how to avoid this problem in the future? How hard would it be to add such a check in numpy?

This seems to be a generic problem.

In [1]: a = empty((10,), dtype=float32)
In [2]: a[:] = 10.0
In [3]: a[:] = float32(1)
In [4]: b = empty((10,), dtype=float64)
In [5]: a[:] = b[:]

In the case of the scalar assignments downcasting is a convenience because having to explicitly use the correct data type in an assignment gets to be a hassle when the type isn't native to Python. It might have been more correct to require matching data types, but convenience won out in that case. That said, it might be useful to warn on such an unexpected conversion as complex to float, and maybe for float to integer conversions also.

Chuck
___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
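[Editorial note: at the Python level, numpy's can_cast function already knows which conversions are lossy (it rejects complex-to-float and float64-to-float32 under its default "safe" rule), so a user-side helper can provide the warning Chuck describes. A sketch only; the helper name is made up, and this is not the C-level fix being discussed.]

import warnings
import numpy

def checked_assign(dst, values):
    # warn when the values' dtype cannot be cast safely into dst's dtype
    values = numpy.asarray(values)
    if not numpy.can_cast(values.dtype, dst.dtype):
        warnings.warn("assigning %s values into a %s array may lose information"
                      % (values.dtype, dst.dtype), stacklevel=2)
    dst[...] = values

a = numpy.empty((10,), dtype=numpy.float32)
checked_assign(a, numpy.zeros(10, dtype=numpy.float64))   # warns: float64 -> float32 is not a safe cast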
Re: [Numpy-discussion] Unnecessarily bad performance of elementwise operators with Fortran-arrays
Hans Meine wrote: Hi! I wonder why simple elementwise operations like a * 2 or a + 1 are not performed in order of increasing memory addresses in order to exploit CPU caches etc. C-order is special in NumPy due to the history. I agree that it doesn't need to be and we have taken significant steps in that direction. Right now, the fundamental problem is probably due to the fact that the output arrays are being created as C-order arrays when the input is a Fortran order array. Once that is changed then we can cause Fortran-order inputs to use the simplest path through the code. Fortran order arrays can be preserved but it takes a little extra work because backward compatible expectations had to be met. See for example the order argument to the copy method of arrays. - as it is now, their speed drops by a factor of around 3 simply by transpose()ing. Similarly (but even less logical), copy() and even the constructor are affected (yes, I understand that copy() creates contiguous arrays, but shouldn't it respect/retain the order nevertheless?): As mentioned, it can preserve order with the 'order' argument a.copy('A') ### constructor ### In [89]: %timeit -r 10 -n 100 numpy.ndarray((3,3,3)) 100 loops, best of 10: 1.19 s per loop In [90]: %timeit -r 10 -n 100 numpy.ndarray((3,3,3), order=f) 100 loops, best of 10: 2.19 s per loop I bet what you are seeing here is simply the overhead of processing the order argument. Try the first one with order='c' to see what I mean. ### copy 3x3x3 array ### In [85]: a = numpy.ndarray((3,3,3)) In [86]: %timeit -r 10 a.copy() 100 loops, best of 10: 1.14 s per loop In [87]: a = numpy.ndarray((3,3,3), order=f) In [88]: %timeit -r 10 -n 100 a.copy() 100 loops, best of 10: 3.39 s per loop Use the 'a' argument to allow copying in fortran order. ### copy 256x256x256 array ### In [74]: a = numpy.ndarray((256,256,256)) In [75]: %timeit -r 10 a.copy() 10 loops, best of 10: 119 ms per loop In [76]: a = numpy.ndarray((256,256,256), order=f) In [77]: %timeit -r 10 a.copy() 10 loops, best of 10: 274 ms per loop Same thing here. Nobody is opposed to having faster code as long as we don't in the process break code bases. There is also the matter of implementation -Travis O. ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
[Numpy-discussion] Improving numpy.interp
I often need to make a linear interpolation for a single scalar value, but this is not very simple with numpy.interp:

>>> import numpy as n
>>> n.__version__
'1.0.5.dev4420'
>>> xp = n.arange(10)
>>> yp = 2.5 + xp**2 - x
>>> x = 3.2
>>> n.interp(x, xp, yp)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: object of too small depth for desired array

So I use the following trick (which is really ugly):

>>> n.interp([x], xp, yp)[0]
7.7011

Did I miss an obvious way to do it? If not, I'd be tempted to patch interp to let it accept scalar values as the first argument, as follows:

>>> def interp(x, *args, **kwds):
...     if type(x) in (float, int):
...         return n.interp([x], *args, **kwds).item()
...     else:
...         return n.interp(x, *args, **kwds)
...
>>> interp(x, xp, yp)
7.7011
>>> interp([x, 2*x], xp, yp)
array([  7.7,  38.5])

-- LB
___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
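[Editorial note: a slightly more general variant of that wrapper, as a sketch. Checking with numpy's ndim instead of type() also catches numpy scalar types and 0-d arrays, which `type(x) in (float, int)` misses.]

import numpy as n

def interp(x, xp, fp, left=None, right=None):
    # scalars (including numpy scalars and 0-d arrays): wrap, interpolate, unwrap
    if n.ndim(x) == 0:
        return n.interp([x], xp, fp, left, right).item()
    return n.interp(x, xp, fp, left, right)

print(interp(3.2, n.arange(10), 2.5 + n.arange(10)**2))   # a plain Python float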
Re: [Numpy-discussion] Unnecessarily bad performance of elementwise operators with Fortran-arrays
On Donnerstag 08 November 2007, Christopher Barker wrote: This discussion makes me wonder if the basic element-wise operations could (should?) be special cased for contiguous arrays, reducing them to simple pointer incrementing from the start to the finish of the data block. The same code would work for C and Fortran order arrays, and be pretty simple.

This is exactly what I am thinking and talking about. I did not want to imply that NumPy is bad because it does not yet do so - it's an impressive piece of well-designed software AFAICS - but I do think that this is a really common and important use case that should not stay as inefficiently handled as it is now.

Ciao, / /.o. /--/ ..o / / ANS ooo
___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Unnecessarily bad performance of elementwise operators with Fortran-arrays
Christopher Barker wrote: This discussion makes me wonder if the basic element-wise operations could (should?) be special cased for contiguous arrays, reducing them to simple pointer incrementing from the start to the finish of the data block. The same code would work for C and Fortran order arrays, and be pretty simple. This would address Hans' issue, no? It's a special case but a common one. There is a special case for this already. It's just that the specific operations he is addressing requires creation of output arrays that by default are in C-order.This would need to change in order to take advantage of the special case. -Travis O. ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Unnecessarily bad performance of elementwise operators with Fortran-arrays
Travis E. Oliphant wrote:
Christopher Barker wrote: This discussion makes me wonder if the basic element-wise operations could (should?) be special cased for contiguous arrays, reducing them to simple pointer incrementing from the start to the finish of the data block. The same code would work for C and Fortran order arrays, and be pretty simple. This would address Hans' issue, no? It's a special case but a common one.

There is a special case for this already. It's just that the specific operations he is addressing requires creation of output arrays that by default are in C-order. This would need to change in order to take advantage of the special case.

For copy and array creation, I understand this, but for element-wise operations (mean, min, and max), this is not enough to explain the difference, no ? For example, I can understand a 50 % or 100 % time increase for simple operations (by simple, I mean one element operation taking only a few CPU cycles), because of copies, but a 5-fold time increase seems too big, no (maybe a cache problem, though) ? Also, the fact that mean is slower than min/max for both cases (F vs C) seems a bit counterintuitive (maybe cache effects are involved somehow ?). Again, I see huge differences between my Xeon PIV @ 3.2 GHz and my Pentium M @ 1.2 GHz for those operations: the Pentium M gives more intuitive results (and is almost as fast, and sometimes even faster than my Xeon for arrays which can stay in cache).

cheers, David
___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Unnecessarily bad performance of elementwise operators with Fortran-arrays
David Cournapeau wrote:
Travis E. Oliphant wrote: I wasn't talking about the min, mean, and max methods specifically. These are all implemented with the reduce method of a ufunc.

Ah, my mistake, I wrongly understood that only some of them were implemented through ufuncs. But the ufunc machinery has nothing to do with output array creation, right ? So is the 5-fold time increase mainly happening inside the ufunc ?

I'm not sure. Ufuncs don't do everything in NumPy as you have noted. There are less well publicized (and less structured) array functions for each data-type that also implement things. Not all of these places have optimized paths, but there was some thought given to it in several places of the code. One area that could be improved, and that may be exposed in some of the timing tests, is that iterator-based algorithms always use a C-contiguous iterator. There is no Fortran-contiguous iterator (although there could be).

-Travis O.
___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Unnecessarily bad performance of elementwise operators with Fortran-arrays
On 08/11/2007, David Cournapeau [EMAIL PROTECTED] wrote: For copy and array creation, I understand this, but for element-wise operations (mean, min, and max), this is not enough to explain the difference, no ? For example, I can understand a 50 % or 100 % time increase for simple operations (by simple, I mean one element operation taking only a few CPU cycles), because of copies, but a 5 fold time increase seems too big, no (mayb a cache problem, though) ? Also, the fact that mean is slower than min/max for both cases (F vs C) seems a bit counterintuitive (maybe cache effects are involved somehow ?). I have no doubt at all that cache effects are involved: for an int array, each data element is four bytes, but typical CPUs need to load 64 bytes at a time into cache. If you read (or write) the rest of those bytes in the next iterations through the loop, the (large) cost of a memory read is amortized. If you jump to the next row of the array, some large number of bytes away, those 64 bytes basically need to be purged to make room for another 64 bytes, of which you'll use 4. If you're reading from a FORTRAN-order array and writing to a C-order one, there's no way around doing this on one end or another: you're effectively doing a transpose, which is pretty much always expensive. Is there any reason not to let ufuncs pick whichever order for newly-allocated arrays they want? The natural choice would be the same as the bigger input array, if it's a contiguous block of memory (whether or not the contiguous flags are set). Failing that, the same as the other input array (if any); failing that, C order's as good a default as any. How difficult would this be to implement? Anne ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
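[Editorial note: a rough Python rendering of the rule Anne proposes for choosing the order of newly allocated outputs, simplified to arrays whose contiguity flags are set. The helper is purely illustrative — the real decision would have to be made inside the ufunc machinery.]

import numpy

def proposed_output_order(*inputs):
    # prefer the memory layout of the largest contiguous input, default to C order
    for a in sorted(inputs, key=lambda a: -a.size):
        if a.flags['F_CONTIGUOUS'] and not a.flags['C_CONTIGUOUS']:
            return 'F'
        if a.flags['C_CONTIGUOUS']:
            return 'C'
    return 'C'

a = numpy.asfortranarray(numpy.ones((100, 200)))
out = numpy.empty(a.shape, order=proposed_output_order(a))   # F-order result for an F-order input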
Re: [Numpy-discussion] Unnecessarily bad performance of elementwise operators with Fortran-arrays
Anne Archibald wrote:
On 08/11/2007, David Cournapeau [EMAIL PROTECTED] wrote:
For copy and array creation, I understand this, but for element-wise operations (mean, min, and max), this is not enough to explain the difference, no ? For example, I can understand a 50 % or 100 % time increase for simple operations (by simple, I mean one element operation taking only a few CPU cycles), because of copies, but a 5-fold time increase seems too big, no (maybe a cache problem, though) ? Also, the fact that mean is slower than min/max for both cases (F vs C) seems a bit counterintuitive (maybe cache effects are involved somehow ?).

I have no doubt at all that cache effects are involved: for an int array, each data element is four bytes, but typical CPUs need to load 64 bytes at a time into cache. If you read (or write) the rest of those bytes in the next iterations through the loop, the (large) cost of a memory read is amortized. If you jump to the next row of the array, some large number of bytes away, those 64 bytes basically need to be purged to make room for another 64 bytes, of which you'll use 4. If you're reading from a FORTRAN-order array and writing to a C-order one, there's no way around doing this on one end or another: you're effectively doing a transpose, which is pretty much always expensive.

This is obviously part of the picture, and there is indeed no way around the simple fact that non-sequential memory access is extremely slow on modern hardware. The fact that the array iterator does not (yet) support Fortran order would largely explain this (I wrongly assumed the iterator already did that).

Is there any reason not to let ufuncs pick whichever order for newly-allocated arrays they want? The natural choice would be the same as the bigger input array, if it's a contiguous block of memory (whether or not the contiguous flags are set). Failing that, the same as the other input array (if any); failing that, C order's as good a default as any. How difficult would this be to implement?

I think part of the problem is that it is difficult to know which is faster a priori (is copying multi-segmented parts into a single buffer for processing faster than direct processing ?). As mentioned previously, on the same platform (same OS/compiler combination), there are already huge differences between two CPUs (Pentium 4 vs Pentium M, the former being noticeably pig-slow when the SSE FPU is not used). Do ufuncs use buffers for segmented memory, or do they only use them for misbehaved arrays ? Also, from my very limited experience, I found array iterators to be significantly slower than simple array indexing on contiguous, single-segment arrays. Do ufuncs always use the array iterator, or do they use simple array indexing when possible ?

When I implemented the fast clip function (which does not use ufuncs), I noticed that the following matter:
- avoiding unnecessary copies
- quickly determining which cases can be optimized w.r.t. the operations you are using (memmove vs memcpy, etc...)
- fast memory copying (numpy does not have those yet: by using optimized memory copying, you can often gain a 50 % speed increase on recent archs; also, having information about alignment could help, too, and we go back to the aligned allocators discussion :) )

The problem is to find a good API to put those optimizations in one place. I unfortunately do not know much about ufuncs, maybe they already implement most of those.
cheers, David ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion