Re: [Numpy-discussion] numpy.mean still broken for large float32 arrays
To rephrase my most pressing question: may np.ones((N,2)).mean(0) and np.ones((2,N)).mean(1) produce different results with the implementation in the current master? If so, I think that would be very much regrettable; and if this is a minority opinion, I do hope that at least this gets documented in a most explicit manner. On Sun, Jul 27, 2014 at 8:26 PM, Sturla Molden sturla.mol...@gmail.com wrote: Nathaniel Smith n...@pobox.com wrote: The problem here is that when summing up the values, the sum gets large enough that after rounding, x + 1 = x and the sum stops increasing. Interesting. That explains why the divide-and-conquer reduction is much more robust. Thanks :) Sturla ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
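The rounding stall Nathaniel describes is easy to reproduce directly (a minimal sketch; 2**24 is the point where consecutive integers stop being representable in float32):

```python
import numpy as np

# In float32 the significand has 24 bits, so integers are exact only up to
# 2**24 = 16777216; adding 1 beyond that rounds straight back down, and a
# naively accumulated running sum of ones stops increasing there.
acc = np.float32(2**24)
assert acc + np.float32(1) == acc           # 16777216 + 1 rounds back to 16777216
assert np.float64(acc) + 1.0 == 2**24 + 1   # float64 still has headroom
```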
Re: [Numpy-discussion] numpy.mean still broken for large float32 arrays
On Mo, 2014-07-28 at 14:37 +0200, Eelco Hoogendoorn wrote: To rephrase my most pressing question: may np.ones((N,2)).mean(0) and np.ones((2,N)).mean(1) produce different results with the implementation in the current master? If so, I think that would be very much regrettable; and if this is a minority opinion, I do hope that at least this gets documented in a most explicit manner. This will always give you different results. Though in master, the difference is more likely to be large, since one of them (often the second one) may be less likely to run into bigger numerical issues. On Sun, Jul 27, 2014 at 8:26 PM, Sturla Molden sturla.mol...@gmail.com wrote: Nathaniel Smith n...@pobox.com wrote: The problem here is that when summing up the values, the sum gets large enough that after rounding, x + 1 = x and the sum stops increasing. Interesting. That explains why the divide-and-conquer reduction is much more robust. Thanks :) Sturla
Re: [Numpy-discussion] numpy.mean still broken for large float32 arrays
On Mon, Jul 28, 2014 at 8:46 AM, Sebastian Berg sebast...@sipsolutions.net wrote: On Mo, 2014-07-28 at 14:37 +0200, Eelco Hoogendoorn wrote: To rephrase my most pressing question: may np.ones((N,2)).mean(0) and np.ones((2,N)).mean(1) produce different results with the implementation in the current master? If so, I think that would be very much regrettable; and if this is a minority opinion, I do hope that at least this gets documented in a most explicit manner. This will always give you different results. Though in master, the difference is more likely to be large, since one of them (often the second one) may be less likely to run into bigger numerical issues. Are you sure they always give different results? Notice that np.ones((N,2)).mean(0) and np.ones((2,N)).mean(1) compute means of different axes on transposed arrays, so these differences 'cancel out'. My understanding of the question is to clarify how numpy reduction algorithms are special-cased for the fast axis vs. other axes.
Re: [Numpy-discussion] 64-bit windows numpy / scipy wheels for testing
Hi, on https://bitbucket.org/carlkl/mingw-w64-for-python/downloads I uploaded 7z-archives for mingw-w64 and for OpenBLAS-0.2.10, for 32 bit and for 64 bit. To use mingw-w64 for Python >= 3.3 you have to manually tweak the so-called specs file - see readme.txt in the archive. Regards Carl 2014-07-28 4:32 GMT+02:00 Robert McGibbon rmcgi...@gmail.com: I forked Olivier's example project to use the same infrastructure for building conda binaries and deploying them to binstar, which might also be useful for some projects. https://github.com/rmcgibbo/python-appveyor-conda-example -Robert On Wed, Jul 9, 2014 at 3:53 PM, Robert McGibbon rmcgi...@gmail.com wrote: This is an awesome resource for tons of projects. Thanks Olivier! -Robert On Wed, Jul 9, 2014 at 7:00 AM, Olivier Grisel olivier.gri...@ensta.org wrote: Feodor updated the AppVeyor nodes to have the Windows SDK matching MSVC 2008 Express for Python 2. I have updated my sample scripts and we now have a working example of a free CI system for Python 2 and 3, both for 32 and 64 bit architectures. https://github.com/ogrisel/python-appveyor-demo Best, -- Olivier
Re: [Numpy-discussion] numpy.mean still broken for large float32 arrays
On 28 July 2014 14:46, Sebastian Berg sebast...@sipsolutions.net wrote: To rephrase my most pressing question: may np.ones((N,2)).mean(0) and np.ones((2,N)).mean(1) produce different results with the implementation in the current master? If so, I think that would be very much regrettable; and if this is a minority opinion, I do hope that at least this gets documented in a most explicit manner. This will always give you different results. Though in master, the difference is more likely to be large, since one of them (often the second one) may be less likely to run into bigger numerical issues.

An example using float16 on Numpy 1.8.1 (I haven't seen differences with float32):

import numpy as np

for N in np.logspace(2, 6):
    N = int(N)
    print N, (np.ones((N, 2), dtype=np.float16).mean(0),
              np.ones((2, N), dtype=np.float16).mean(1))

The first one gives correct results up to 2049, from where the values start to fall. The second one, on the other hand, gives correct results up to 65519, where it blows up to infinity.

Interestingly, in the second case there are fluctuations. For example, for N = 65424, the mean is 0.99951172, but 1 for the next and previous numbers. But I think they are just an effect of the rounding, as:

In [33]: np.ones(N+1, dtype=np.float16).sum() - N
Out[33]: 16.0

In [35]: np.ones(N+1, dtype=np.float16).sum() - (N + 1)
Out[35]: 15.0

In [36]: np.ones(N-1, dtype=np.float16).sum() - (N - 1)
Out[36]: -15.0

In [37]: N = 65519 - 20

In [38]: np.ones(N, dtype=np.float16).sum() - N
Out[38]: 5.0

In [39]: np.ones(N+1, dtype=np.float16).sum() - (N + 1)
Out[39]: 4.0

In [40]: np.ones(N-1, dtype=np.float16).sum() - (N - 1)
Out[40]: 6.0
Re: [Numpy-discussion] numpy.mean still broken for large float32 arrays
On 28/07/14 15:21, alex wrote: Are you sure they always give different results? Notice that np.ones((N,2)).mean(0) and np.ones((2,N)).mean(1) compute means of different axes on transposed arrays, so these differences 'cancel out'. They will be different if different algorithms are used. np.ones((N,2)).mean(0) will have larger accumulated rounding error than np.ones((2,N)).mean(1), if only the latter uses the divide-and-conquer summation. I would suggest that in the first case we try to copy the array to a temporary contiguous buffer and use the same divide-and-conquer algorithm, unless some heuristics on memory usage fails. Sturla
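The divide-and-conquer (pairwise) summation Sturla refers to can be sketched in a few lines (an illustration only, not numpy's actual C implementation; `naive_sum` and `pairwise_sum` are hypothetical names, and float16 is used only so the effect shows up at small, cheap sizes):

```python
import numpy as np

def naive_sum(a):
    # plain left-to-right accumulation in the array's own precision;
    # rounding error grows like O(n)
    acc = a.dtype.type(0)
    for x in a:
        acc = a.dtype.type(acc + x)
    return acc

def pairwise_sum(a, block=8):
    # divide and conquer: recursively sum each half, then add the two
    # partial sums; rounding error grows only like O(log n)
    if a.size <= block:
        return naive_sum(a)
    mid = a.size // 2
    return a.dtype.type(pairwise_sum(a[:mid], block) + pairwise_sum(a[mid:], block))

ones = np.ones(4096, dtype=np.float16)
print(naive_sum(ones))     # stalls at 2048.0, float16's integer-precision limit
print(pairwise_sum(ones))  # 4096.0, exact: no partial sum ever loses a +1
```

The same effect appears in float32 at 2**24 elements; float16 just makes it cheap to demonstrate.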
Re: [Numpy-discussion] numpy.mean still broken for large float32 arrays
On 28.07.2014 15:30, Daπid wrote: An example using float16 on Numpy 1.8.1 (I haven't seen differences with float32): Why aren't there differences between float16 and float32? Could this be related to my earlier post in this thread where I mentioned summation problems occurring much earlier in numpy than in IDL? Fabien
Re: [Numpy-discussion] numpy.mean still broken for large float32 arrays
On Mo, 2014-07-28 at 15:35 +0200, Sturla Molden wrote: On 28/07/14 15:21, alex wrote: Are you sure they always give different results? Notice that np.ones((N,2)).mean(0) and np.ones((2,N)).mean(1) compute means of different axes on transposed arrays, so these differences 'cancel out'. They will be different if different algorithms are used. np.ones((N,2)).mean(0) will have larger accumulated rounding error than np.ones((2,N)).mean(1), if only the latter uses the divide-and-conquer summation. What I wanted to point out is that to some extent the algorithm does not matter. You will not necessarily get identical results already if you use a different iteration order, and we have been doing that for years for speed reasons. All libs like BLAS do the same. Yes, the new changes make this much more dramatic, but they only make some paths much better, never worse. It might be dangerous, but only in the sense that you test it with the good path and it works well enough, but later (also) use the other one in some lib. I am not even sure if I I would suggest that in the first case we try to copy the array to a temporary contiguous buffer and use the same divide-and-conquer algorithm, unless some heuristics on memory usage fails. Sure, but you have to make major changes to the buffered iterator to do that without larger speed implications. It might be a good idea, but it requires someone who knows this stuff to spend a lot of time and care in the depths of numpy. Sturla
Re: [Numpy-discussion] numpy.mean still broken for large float32 arrays
On Mo, 2014-07-28 at 15:50 +0200, Fabien wrote: On 28.07.2014 15:30, Daπid wrote: An example using float16 on Numpy 1.8.1 (I haven't seen differences with float32): Why aren't there differences between float16 and float32? float16 calculations are actually float32 calculations. If done along the fast axis they will not get rounded in between (within those 8192-element chunks). Basically, something like the difference we are talking about for float32 and float64 has for years existed in float16. Could this be related to my earlier post in this thread where I mentioned summation problems occurring much earlier in numpy than in IDL? Fabien
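Sebastian's point can be seen directly (a sketch; the exact chunking is an internal numpy detail and may vary by version, but the float32 accumulation is what lets the result pass float16's own precision limit):

```python
import numpy as np

# float16 integers are exact only up to 2048, yet summing 20000 ones works,
# because numpy's float16 loops accumulate in float32 internally and only
# round back to float16 at the end. 20000 = 1250 * 2**4 is itself exactly
# representable in float16, so the final cast loses nothing here.
a = np.ones(20000, dtype=np.float16)
print(a.sum())  # 20000.0
```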
Re: [Numpy-discussion] numpy.mean still broken for large float32 arrays
Sebastian: Those are good points. Indeed iteration order may already produce different results, even though the semantics of numpy suggest identical operations. Still, I feel this different behavior without any semantical clues is something to be minimized. Indeed copying might have large speed implications. But on second thought, does it? Either the data is already aligned and no copy is required, or it isn't aligned, and we need one pass of cache-inefficient access to the data anyway. In fact, if we had one low level function which does cache-intelligent transposition of numpy arrays (using some block strategy), this might be faster even than the current simple reduction operations when forced to work on awkwardly aligned data. Ideally, this intelligent access and intelligent reduction would be part of a single pass of course; but that wouldn't really fit within the numpy design, and merely such an intelligent transpose would provide most of the benefit I think. Or is the mechanism behind ascontiguousarray already intelligent in this sense? On Mon, Jul 28, 2014 at 4:06 PM, Sebastian Berg sebast...@sipsolutions.net wrote: On Mo, 2014-07-28 at 15:35 +0200, Sturla Molden wrote: On 28/07/14 15:21, alex wrote: Are you sure they always give different results? Notice that np.ones((N,2)).mean(0) and np.ones((2,N)).mean(1) compute means of different axes on transposed arrays, so these differences 'cancel out'. They will be different if different algorithms are used. np.ones((N,2)).mean(0) will have larger accumulated rounding error than np.ones((2,N)).mean(1), if only the latter uses the divide-and-conquer summation. What I wanted to point out is that to some extent the algorithm does not matter. You will not necessarily get identical results already if you use a different iteration order, and we have been doing that for years for speed reasons. All libs like BLAS do the same. Yes, the new changes make this much more dramatic, but they only make some paths much better, never worse.
It might be dangerous, but only in the sense that you test it with the good path and it works good enough, but later (also) use the other one in some lib. I am not even sure if I I would suggest that in the first case we try to copy the array to a temporary contiguous buffer and use the same divide-and-conquer algorithm, unless some heuristics on memory usage fails. Sure, but you have to make major changes to the buffered iterator to do that without larger speed implications. It might be a good idea, but it requires someone who knows this stuff to spend a lot of time and care in the depths of numpy. Sturla
Re: [Numpy-discussion] 64-bit windows numpy / scipy wheels for testing
2014-07-28 15:25 GMT+02:00 Carl Kleffner cmkleff...@gmail.com: Hi, on https://bitbucket.org/carlkl/mingw-w64-for-python/downloads I uploaded 7z-archives for mingw-w64 and for OpenBLAS-0.2.10, for 32 bit and for 64 bit. To use mingw-w64 for Python >= 3.3 you have to manually tweak the so-called specs file - see readme.txt in the archive. Have the patches to build numpy and scipy with mingw-w64 been merged in the master branches of those projects? -- Olivier http://twitter.com/ogrisel - http://github.com/ogrisel
Re: [Numpy-discussion] 64-bit windows numpy / scipy wheels for testing
I had to move my development environment to a different Windows box recently (still in progress). On this box I don't have full access, unfortunately. The patch for the scipy build was merged into scipy master some time ago, see https://github.com/scipy/scipy/pull/3484 . I have some additional patches for scipy.test. The pull request for the numpy build has not yet been made for the reasons I mentioned. Cheers, Carl 2014-07-28 16:46 GMT+02:00 Olivier Grisel olivier.gri...@ensta.org: 2014-07-28 15:25 GMT+02:00 Carl Kleffner cmkleff...@gmail.com: Hi, on https://bitbucket.org/carlkl/mingw-w64-for-python/downloads I uploaded 7z-archives for mingw-w64 and for OpenBLAS-0.2.10, for 32 bit and for 64 bit. To use mingw-w64 for Python >= 3.3 you have to manually tweak the so-called specs file - see readme.txt in the archive. Have the patches to build numpy and scipy with mingw-w64 been merged in the master branches of those projects? -- Olivier http://twitter.com/ogrisel - http://github.com/ogrisel
Re: [Numpy-discussion] numpy.mean still broken for large float32 arrays
On Mo, 2014-07-28 at 16:31 +0200, Eelco Hoogendoorn wrote: Sebastian: Those are good points. Indeed iteration order may already produce different results, even though the semantics of numpy suggest identical operations. Still, I feel this different behavior without any semantical clues is something to be minimized. Indeed copying might have large speed implications. But on second thought, does it? Either the data is already aligned and no copy is required, or it isn't aligned, and we need one pass of cache-inefficient access to the data anyway. In fact, if we had one low level function which does cache-intelligent transposition of numpy arrays (using some block strategy), this might be faster even than the current simple reduction operations when forced to work on awkwardly aligned data. Ideally, this intelligent access and intelligent reduction would be part of a single pass of course; but that wouldn't really fit within the numpy design, and merely such an intelligent transpose would provide most of the benefit I think. Or is the mechanism behind ascontiguousarray already intelligent in this sense? The iterator is currently smart in the sense that it will (obviously low level) do something like [1]. Most things in numpy use that iterator; ascontiguousarray does so as well. Such a blocked cache-aware iterator is what I mean by "someone who knows it would have to spend a lot of time on it" :).

[1] Appendix:

arr = np.ones((100, 100))

# arr.sum(0) is equivalent (iteration order wise) to:
res = np.zeros(100)
for i in range(100):
    res += arr[i, :]

# while arr.sum(1) would be:
for i in range(100):
    res[i] = arr[i, :].sum()

On Mon, Jul 28, 2014 at 4:06 PM, Sebastian Berg sebast...@sipsolutions.net wrote: On Mo, 2014-07-28 at 15:35 +0200, Sturla Molden wrote: On 28/07/14 15:21, alex wrote: Are you sure they always give different results?
Notice that np.ones((N,2)).mean(0) np.ones((2,N)).mean(1) compute means of different axes on transposed arrays so these differences 'cancel out'. They will be if different algorithms are used. np.ones((N,2)).mean(0) will have larger accumulated rounding error than np.ones((2,N)).mean(1), if only the latter uses the divide-and-conquer summation. What I wanted to point out is that to some extend the algorithm does not matter. You will not necessarily get identical results already if you use a different iteration order, and we have been doing that for years for speed reasons. All libs like BLAS do the same. Yes, the new changes make this much more dramatic, but they only make some paths much better, never worse. It might be dangerous, but only in the sense that you test it with the good path and it works good enough, but later (also) use the other one in some lib. I am not even sure if I I would suggest that in the first case we try to copy the array to a temporary contiguous buffer and use the same divide-and-conquer algorithm, unless some heuristics on memory usage fails. Sure, but you have to make major changes to the buffered iterator to do that without larger speed implications. It might be a good idea, but it requires someone who knows this stuff to spend a lot of time and care in the depths of numpy. Sturla
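The two iteration orders Sebastian describes can be written out as a runnable check (a small sketch: accumulating across rows into one buffer gives the axis-0 result, while reducing each row on its own gives the axis-1 result):

```python
import numpy as np

arr = np.arange(9.0).reshape(3, 3)

# axis=0 reduction: one running accumulator per column, updated row by row
res0 = np.zeros(3)
for i in range(3):
    res0 += arr[i, :]

# axis=1 reduction: each row is reduced independently
res1 = np.array([arr[i, :].sum() for i in range(3)])

assert np.array_equal(res0, arr.sum(0))
assert np.array_equal(res1, arr.sum(1))
```

The two loops visit the same elements but combine them in different orders, which is exactly why bitwise-identical results across axes were never guaranteed.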
Re: [Numpy-discussion] numpy.mean still broken for large float32 arrays
I see, thanks for the clarification. Just for the sake of argument, since unfortunately I don't have the time to go dig in the guts of numpy myself: a design which always produces results of the same (high) accuracy, but only optimizes the common access patterns in a hacky way, and may be inefficient in case it needs to fall back on dumb iteration or array copying, is the best compromise between features and the ever limiting amount of time available, I would argue, no? It's preferable if your code works, but may be hacked to work more efficiently, than that it works efficiently, but may need hacking to work correctly under all circumstances. But fun as it is to think about what ought to be, I suppose the people who do actually pour in the effort have thought about these things already. A numpy 2.0 could probably borrow/integrate a lot from numexpr, I suppose. By the way, the hierarchical summation would make it fairly easy to erase (and in any case would minimize) summation differences due to differences between logical and actual ordering in memory of the data, no? On Mon, Jul 28, 2014 at 5:22 PM, Sebastian Berg sebast...@sipsolutions.net wrote: On Mo, 2014-07-28 at 16:31 +0200, Eelco Hoogendoorn wrote: Sebastian: Those are good points. Indeed iteration order may already produce different results, even though the semantics of numpy suggest identical operations. Still, I feel this different behavior without any semantical clues is something to be minimized. Indeed copying might have large speed implications. But on second thought, does it? Either the data is already aligned and no copy is required, or it isn't aligned, and we need one pass of cache-inefficient access to the data anyway. In fact, if we had one low level function which does cache-intelligent transposition of numpy arrays (using some block strategy), this might be faster even than the current simple reduction operations when forced to work on awkwardly aligned data.
Ideally, this intelligent access and intelligent reduction would be part of a single pass of course; but that wouldn't really fit within the numpy design, and merely such an intelligent transpose would provide most of the benefit I think. Or is the mechanism behind ascontiguousarray already intelligent in this sense? The iterator is currently smart in the sense that it will (obviously low level), do something like [1]. Most things in numpy use that iterator, ascontiguousarray does so as well. Such a blocked cache aware iterator is what I mean by, someone who knows it would have to spend a lot of time on it :). [1] Appendix: arr = np.ones((100, 100)) arr.sum(1) # being equivalent (iteration order wise) to: res = np.zeros(100) for i in range(100): res += arr[i, :] # while arr.sum(0) would be: for i in range(100): res[i] = arr[i, :].sum() On Mon, Jul 28, 2014 at 4:06 PM, Sebastian Berg sebast...@sipsolutions.net wrote: On Mo, 2014-07-28 at 15:35 +0200, Sturla Molden wrote: On 28/07/14 15:21, alex wrote: Are you sure they always give different results? Notice that np.ones((N,2)).mean(0) np.ones((2,N)).mean(1) compute means of different axes on transposed arrays so these differences 'cancel out'. They will be if different algorithms are used. np.ones((N,2)).mean(0) will have larger accumulated rounding error than np.ones((2,N)).mean(1), if only the latter uses the divide-and-conquer summation. What I wanted to point out is that to some extend the algorithm does not matter. You will not necessarily get identical results already if you use a different iteration order, and we have been doing that for years for speed reasons. All libs like BLAS do the same. Yes, the new changes make this much more dramatic, but they only make some paths much better, never worse. It might be dangerous, but only in the sense that you test it with the good path and it works good enough, but later (also) use the other one in some lib.
I am not even sure if I I would suggest that in the first case we try to copy the array to a temporary contiguous buffer and use the same divide-and-conquer algorithm, unless some heuristics on memory usage fails. Sure, but you have to make major changes to the buffered iterator to do that without larger speed implications. It might be a good idea, but it requires someone who knows this stuff to spend a lot of time and care in the depths of numpy. Sturla
Re: [Numpy-discussion] numpy.mean still broken for large float32 arrays
On 28.07.2014 23:32, Eelco Hoogendoorn wrote: I see, thanks for the clarification. Just for the sake of argument, since unfortunately I don't have the time to go dig in the guts of numpy myself: a design which always produces results of the same (high) accuracy, but only optimizes the common access patterns in a hacky way, and may be inefficient in case it needs to fall back on dumb iteration or array copying, is the best compromise between features and the ever limiting amount of time available, I would argue, no? It's preferable if your code works, but may be hacked to work more efficiently, than that it works efficiently, but may need hacking to work correctly under all circumstances. I don't see the inconsistency as such a big problem. If applications are so sensitive to accurate summations over large uniform datasets, they will most likely implement their own algorithm instead of relying on the black box in numpy (which, as far as I know, never documented any accuracy bounds or the algorithms used for summation). If they do, they should add testsuites that will detect accidental use of the less accurate path in numpy and fix it before they even release. General purpose libraries that may not be able to test every input third-party users may give them usually don't have the luxury of only supporting the latest version of numpy to have pairwise summation guaranteed in the first place, so they would just have to implement their own algorithms anyway. But fun as it is to think about what ought to be, I suppose the people who do actually pour in the effort have thought about these things already. A numpy 2.0 could probably borrow/integrate a lot from numexpr, I suppose. By the way, the hierarchical summation would make it fairly easy to erase (and in any case would minimize) summation differences due to differences between logical and actual ordering in memory of the data, no?
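For applications that do need tighter guarantees than numpy documents, compensated (Kahan) summation is one such roll-your-own algorithm (a sketch; `kahan_sum` is a hypothetical name, and float16 is used only so the precision limit is cheap to reach):

```python
import numpy as np

def kahan_sum(a):
    # compensated summation: carry the rounding error of each addition
    # forward in c so it is not lost in the running total
    ft = a.dtype.type
    s, c = ft(0), ft(0)
    for x in a:
        y = ft(x - c)           # apply the stored correction
        t = ft(s + y)
        c = ft(ft(t - s) - y)   # what s + y rounded away
        s = t
    return s

ones = np.ones(4096, dtype=np.float16)
print(kahan_sum(ones))  # 4096.0, while plain left-to-right float16
                        # accumulation stalls at 2048.0
```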