Re: [Numpy-discussion] please change mean to use dtype=float
Sebastian Haase wrote: On Thursday 21 September 2006 15:28, Tim Hochberg wrote: David M. Cooke wrote: On Thu, 21 Sep 2006 11:34:42 -0700 Tim Hochberg [EMAIL PROTECTED] wrote: Tim Hochberg wrote: Robert Kern wrote: David M. Cooke wrote: On Wed, Sep 20, 2006 at 03:01:18AM -0500, Robert Kern wrote: Let me offer a third path: the algorithms used for .mean() and .var() are substandard. There are much better incremental algorithms that entirely avoid the need to accumulate such large (and therefore precision-losing) intermediate values. The algorithms look like the following for 1D arrays in Python: def mean(a): m = a[0] for i in range(1, len(a)): m += (a[i] - m) / (i + 1) return m This isn't really going to be any better than using a simple sum. It'll also be slower (a division per iteration). With one exception, every test that I've thrown at it shows that it's better for float32. That exception is uniformly spaced arrays, like linspace(). You do avoid accumulating large sums, but then doing the division a[i]/len(a) and adding that will do the same. Okay, this is true. Now, if you want to avoid losing precision, you want to use a better summation technique, like compensated (or Kahan) summation: def mean(a): s = e = a.dtype.type(0) for i in range(0, len(a)): temp = s y = a[i] + e s = temp + y e = (temp - s) + y return s / len(a) def var(a): m = a[0] t = a.dtype.type(0) for i in range(1, len(a)): q = a[i] - m r = q / (i+1) m += r t += i * q * r t /= len(a) return t Alternatively, from Knuth: def var_knuth(a): m = a.dtype.type(0) variance = a.dtype.type(0) for i in range(len(a)): delta = a[i] - m m += delta / (i+1) variance += delta * (a[i] - m) variance /= len(a) return variance I'm going to go ahead and attach a module containing the versions of mean, var, etc that I've been playing with in case someone wants to mess with them. Some were stolen from traffic on this list, for others I grabbed the algorithms from wikipedia or equivalent. I looked into this a bit more. I checked float32 (single precision) and float64 (double precision), using long doubles (float96) for the exact results. This is based on your code. Results are compared using abs(exact_stat - computed_stat) / max(abs(values)), with 1 values in the range of [-100, 900] First, the mean. In float32, the Kahan summation in single precision is better by about 2 orders of magnitude than simple summation. However, accumulating the sum in double precision is better by about 9 orders of magnitude than simple summation (7 orders more than Kahan). In float64, Kahan summation is the way to go, by 2 orders of magnitude. For the variance, in float32, Knuth's method is *no better* than the two-pass method. Tim's code does an implicit conversion of intermediate results to float64, which is why he saw a much better result. Doh! And I fixed that same problem in the mean implementation earlier too. I was astounded by how good knuth was doing, but not astounded enough apparently. Does it seem weird to anyone else that in: numpy_scalar op python_scalar the precision ends up being controlled by the python scalar? I would expect the numpy_scalar to control the resulting precision just like numpy arrays do in similar circumstances. Perhaps the egg on my face is just clouding my vision though. The two-pass method using Kahan summation (again, in single precision), is better by about 2 orders of magnitude. There is practically no difference when using a double-precision accumulator amongst the techniques: they're all about 9 orders of magnitude better than single-precision two-pass. In float64, Kahan summation is again better than the rest, by about 2 orders of magnitude. I've put my adaptation of Tim's code, and box-and-whisker plots of the results, at http://arbutus.mcmaster.ca/dmc/numpy/variance/ Conclusions: - If you're going to calculate everything in single precision, use Kahan summation. Using it in double-precision also helps. - If you can use a double-precision accumulator, it's much better than any of the techniques in single-precision only. - for speed+precision in the variance, either use Kahan summation in single precision with the two-pass method, or use double precision with simple summation with the two-pass method. Knuth buys you nothing, except slower code :-) The two pass methods are definitely more accurate. I won't be convinced on the speed front till I see comparable C implementations slug it out. That may well mean never in practice. However, I expect that
Re: [Numpy-discussion] please change mean to use dtype=float
Tim Hochberg wrote: It would probably be nice to expose the Kahan sum and maybe even the raw_kahan_sum somewhere. What about using it for .sum() by default? What is the speed hit anyway? In any case, having it available would be nice. I'm on the fence on using the array dtype for the accumulator dtype versus always using at least double precision for the accumulator. The former is easier to explain and is probably faster, but the latter is a lot more accuracy for basically free. I don't think the difficulty of explanation is a big issue at all -- I'm having a really hard time imagining someone getting confused and/or disappointed that their single precision calculation didn't overflow or was more accurate than expected. Anyone that did, would know enough to understand the explanation. In general, users expect things to just work. They only dig into the details when something goes wrong, like the mean of a bunch of positive integers coming out as negative (wasn't that the OP's example?). The fewer such instance we have, the fewer questions we have. speeds shake out I suppose. If the speed of using float64 is comparable to that of using float32, we might as well. Only testing will tell, but my experience is that with double precision FPUs, doubles are just as fast as floats unless you're dealing with enough memory to make a difference in caching. In this case, only the accumulator is double, so that wouldn't be an issue. I suppose the float to double conversion could take some time though... One thing I'm not on the fence about is the return type: it should always match the input type, or dtype if that is specified. +1 Since numpy-scalars are generally the results of indexing operations, not literals, I think that they should behave like arrays for purposes of determining the resulting precision, not like python-scalars. +1 Of course the accuracy is pretty bad at single precision, so the possible, theoretical speed advantage at large sizes probably doesn't matter. good point. the larger the array -- the more accuracy matters. -Chris -- Christopher Barker, Ph.D. Oceanographer NOAA/ORR/HAZMAT (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception [EMAIL PROTECTED] - Take Surveys. Earn Cash. Influence the Future of IT Join SourceForge.net's Techsay panel and you'll get the chance to share your opinions on IT business topics through brief surveys -- and earn cash http://www.techsay.com/default.php?page=join.phpp=sourceforgeCID=DEVDEV ___ Numpy-discussion mailing list Numpy-discussion@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/numpy-discussion
Re: [Numpy-discussion] please change mean to use dtype=float
On 9/22/06, Tim Hochberg [EMAIL PROTECTED] wrote: Sebastian Haase wrote: On Thursday 21 September 2006 15:28, Tim Hochberg wrote: David M. Cooke wrote: On Thu, 21 Sep 2006 11:34:42 -0700 Tim Hochberg [EMAIL PROTECTED] wrote: Tim Hochberg wrote: Robert Kern wrote: David M. Cooke wrote: On Wed, Sep 20, 2006 at 03:01:18AM -0500, Robert Kern wrote: Let me offer a third path: the algorithms used for .mean() and .var() are substandard. There are much better incremental algorithms that entirely avoid the need to accumulate such large (and therefore precision-losing) intermediate values. The algorithms look like the following for 1D arrays in Python: def mean(a):m = a[0]for i in range(1, len(a)): m += (a[i] - m) / (i + 1)return m This isn't really going to be any better than using a simple sum. It'll also be slower (a division per iteration). With one exception, every test that I've thrown at it shows that it's better for float32. That exception is uniformly spaced arrays, like linspace(). You do avoid accumulating large sums, but then doing the division a[i]/len(a) and adding that will do the same. Okay, this is true. Now, if you want to avoid losing precision, you want to use a better summation technique, like compensated (or Kahan) summation: def mean(a): s = e = a.dtype.type (0) for i in range(0, len(a)): temp = s y = a[i] + e s = temp + y e = (temp - s) + y return s / len(a) def var(a): m = a[0]t = a.dtype.type(0)for i in range(1, len(a)):q = a[i] - m r = q / (i+1)m += rt += i * q * rt /= len(a) return t Alternatively, from Knuth: def var_knuth(a): m = a.dtype.type(0)variance = a.dtype.type(0)for i in range(len(a)):delta = a[i] - m m += delta / (i+1)variance += delta * (a[i] - m)variance /= len(a)return variance I'm going to go ahead and attach a module containing the versions of mean, var, etc that I've been playing with in case someone wants to mess with them. Some were stolen from traffic on this list, for others I grabbed the algorithms from wikipedia or equivalent. I looked into this a bit more. I checked float32 (single precision) and float64 (double precision), using long doubles (float96) for the exact results. This is based on your code. Results are compared using abs(exact_stat - computed_stat) / max(abs(values)), with 1 values in the range of [-100, 900] First, the mean. In float32, the Kahan summation in single precision is better by about 2 orders of magnitude than simple summation. However, accumulating the sum in double precision is better by about 9 orders of magnitude than simple summation (7 orders more than Kahan). In float64, Kahan summation is the way to go, by 2 orders of magnitude. For the variance, in float32, Knuth's method is *no better* than the two-pass method. Tim's code does an implicit conversion of intermediate results to float64, which is why he saw a much better result. Doh! And I fixed that same problem in the mean implementation earlier too. I was astounded by how good knuth was doing, but not astounded enough apparently. Does it seem weird to anyone else that in: numpy_scalar op python_scalar the precision ends up being controlled by the python scalar? I would expect the numpy_scalar to control the resulting precision just like numpy arrays do in similar circumstances. Perhaps the egg on my face is just clouding my vision though. The two-pass method using Kahan summation (again, in single precision), is better by about 2 orders of magnitude. There is practically no difference when using a double-precision accumulator amongst the techniques: they're all about 9 orders of magnitude better than single-precision two-pass. In float64, Kahan summation is again better than the rest, by about 2 orders of magnitude. I've put my adaptation of Tim's code, and box-and-whisker plots of the results, at http://arbutus.mcmaster.ca/dmc/numpy/variance/ Conclusions: - If you're going to calculate everything in single precision, use Kahan summation. Using it in double-precision also helps. - If you can use a double-precision accumulator, it's much better than any of the techniques in single-precision only. - for speed+precision in the variance, either use Kahan summation in single precision with the two-pass method, or use double precision with simple summation with the two-pass method. Knuth buys you nothing, except slower code :-) The two pass methods are definitely more accurate. I won't be convinced on the speed front till I see comparable C implementations slug it out. That may well mean never in practice. However, I expect that somewhere around 10,000 items, the cache will overflow and memory bandwidth will become the bottleneck. At that point the extra operations of Knuth won't matter as much as making two passes through the array and Knuth will win on speed. Of course the accuracy is pretty bad at single precision, so the possible, theoretical speed advantage at large sizes probably doesn't matter. -tim After 1.0 is out, we should look at doing one of
Re: [Numpy-discussion] please change mean to use dtype=float
On Thu, 21 Sep 2006 11:34:42 -0700 Tim Hochberg [EMAIL PROTECTED] wrote: Tim Hochberg wrote: Robert Kern wrote: David M. Cooke wrote: On Wed, Sep 20, 2006 at 03:01:18AM -0500, Robert Kern wrote: Let me offer a third path: the algorithms used for .mean() and .var() are substandard. There are much better incremental algorithms that entirely avoid the need to accumulate such large (and therefore precision-losing) intermediate values. The algorithms look like the following for 1D arrays in Python: def mean(a): m = a[0] for i in range(1, len(a)): m += (a[i] - m) / (i + 1) return m This isn't really going to be any better than using a simple sum. It'll also be slower (a division per iteration). With one exception, every test that I've thrown at it shows that it's better for float32. That exception is uniformly spaced arrays, like linspace(). You do avoid accumulating large sums, but then doing the division a[i]/len(a) and adding that will do the same. Okay, this is true. Now, if you want to avoid losing precision, you want to use a better summation technique, like compensated (or Kahan) summation: def mean(a): s = e = a.dtype.type(0) for i in range(0, len(a)): temp = s y = a[i] + e s = temp + y e = (temp - s) + y return s / len(a) def var(a): m = a[0] t = a.dtype.type(0) for i in range(1, len(a)): q = a[i] - m r = q / (i+1) m += r t += i * q * r t /= len(a) return t Alternatively, from Knuth: def var_knuth(a): m = a.dtype.type(0) variance = a.dtype.type(0) for i in range(len(a)): delta = a[i] - m m += delta / (i+1) variance += delta * (a[i] - m) variance /= len(a) return variance I'm going to go ahead and attach a module containing the versions of mean, var, etc that I've been playing with in case someone wants to mess with them. Some were stolen from traffic on this list, for others I grabbed the algorithms from wikipedia or equivalent. I looked into this a bit more. I checked float32 (single precision) and float64 (double precision), using long doubles (float96) for the exact results. This is based on your code. Results are compared using abs(exact_stat - computed_stat) / max(abs(values)), with 1 values in the range of [-100, 900] First, the mean. In float32, the Kahan summation in single precision is better by about 2 orders of magnitude than simple summation. However, accumulating the sum in double precision is better by about 9 orders of magnitude than simple summation (7 orders more than Kahan). In float64, Kahan summation is the way to go, by 2 orders of magnitude. For the variance, in float32, Knuth's method is *no better* than the two-pass method. Tim's code does an implicit conversion of intermediate results to float64, which is why he saw a much better result. The two-pass method using Kahan summation (again, in single precision), is better by about 2 orders of magnitude. There is practically no difference when using a double-precision accumulator amongst the techniques: they're all about 9 orders of magnitude better than single-precision two-pass. In float64, Kahan summation is again better than the rest, by about 2 orders of magnitude. I've put my adaptation of Tim's code, and box-and-whisker plots of the results, at http://arbutus.mcmaster.ca/dmc/numpy/variance/ Conclusions: - If you're going to calculate everything in single precision, use Kahan summation. Using it in double-precision also helps. - If you can use a double-precision accumulator, it's much better than any of the techniques in single-precision only. - for speed+precision in the variance, either use Kahan summation in single precision with the two-pass method, or use double precision with simple summation with the two-pass method. Knuth buys you nothing, except slower code :-) After 1.0 is out, we should look at doing one of the above. -- ||\/| /--\ |David M. Cooke http://arbutus.physics.mcmaster.ca/dmc/ |[EMAIL PROTECTED] - Take Surveys. Earn Cash. Influence the Future of IT Join SourceForge.net's Techsay panel and you'll get the chance to share your opinions on IT business topics through brief surveys -- and earn cash http://www.techsay.com/default.php?page=join.phpp=sourceforgeCID=DEVDEV ___ Numpy-discussion mailing list Numpy-discussion@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/numpy-discussion
Re: [Numpy-discussion] please change mean to use dtype=float
David M. Cooke wrote: Conclusions: - If you're going to calculate everything in single precision, use Kahan summation. Using it in double-precision also helps. - If you can use a double-precision accumulator, it's much better than any of the techniques in single-precision only. - for speed+precision in the variance, either use Kahan summation in single precision with the two-pass method, or use double precision with simple summation with the two-pass method. Knuth buys you nothing, except slower code :-) After 1.0 is out, we should look at doing one of the above. +1 - Take Surveys. Earn Cash. Influence the Future of IT Join SourceForge.net's Techsay panel and you'll get the chance to share your opinions on IT business topics through brief surveys -- and earn cash http://www.techsay.com/default.php?page=join.phpp=sourceforgeCID=DEVDEV ___ Numpy-discussion mailing list Numpy-discussion@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/numpy-discussion
Re: [Numpy-discussion] please change mean to use dtype=float
David M. Cooke wrote: On Thu, 21 Sep 2006 11:34:42 -0700 Tim Hochberg [EMAIL PROTECTED] wrote: Tim Hochberg wrote: Robert Kern wrote: David M. Cooke wrote: On Wed, Sep 20, 2006 at 03:01:18AM -0500, Robert Kern wrote: Let me offer a third path: the algorithms used for .mean() and .var() are substandard. There are much better incremental algorithms that entirely avoid the need to accumulate such large (and therefore precision-losing) intermediate values. The algorithms look like the following for 1D arrays in Python: def mean(a): m = a[0] for i in range(1, len(a)): m += (a[i] - m) / (i + 1) return m This isn't really going to be any better than using a simple sum. It'll also be slower (a division per iteration). With one exception, every test that I've thrown at it shows that it's better for float32. That exception is uniformly spaced arrays, like linspace(). You do avoid accumulating large sums, but then doing the division a[i]/len(a) and adding that will do the same. Okay, this is true. Now, if you want to avoid losing precision, you want to use a better summation technique, like compensated (or Kahan) summation: def mean(a): s = e = a.dtype.type(0) for i in range(0, len(a)): temp = s y = a[i] + e s = temp + y e = (temp - s) + y return s / len(a) def var(a): m = a[0] t = a.dtype.type(0) for i in range(1, len(a)): q = a[i] - m r = q / (i+1) m += r t += i * q * r t /= len(a) return t Alternatively, from Knuth: def var_knuth(a): m = a.dtype.type(0) variance = a.dtype.type(0) for i in range(len(a)): delta = a[i] - m m += delta / (i+1) variance += delta * (a[i] - m) variance /= len(a) return variance I'm going to go ahead and attach a module containing the versions of mean, var, etc that I've been playing with in case someone wants to mess with them. Some were stolen from traffic on this list, for others I grabbed the algorithms from wikipedia or equivalent. I looked into this a bit more. I checked float32 (single precision) and float64 (double precision), using long doubles (float96) for the exact results. This is based on your code. Results are compared using abs(exact_stat - computed_stat) / max(abs(values)), with 1 values in the range of [-100, 900] First, the mean. In float32, the Kahan summation in single precision is better by about 2 orders of magnitude than simple summation. However, accumulating the sum in double precision is better by about 9 orders of magnitude than simple summation (7 orders more than Kahan). In float64, Kahan summation is the way to go, by 2 orders of magnitude. For the variance, in float32, Knuth's method is *no better* than the two-pass method. Tim's code does an implicit conversion of intermediate results to float64, which is why he saw a much better result. The two-pass method using Kahan summation (again, in single precision), is better by about 2 orders of magnitude. There is practically no difference when using a double-precision accumulator amongst the techniques: they're all about 9 orders of magnitude better than single-precision two-pass. In float64, Kahan summation is again better than the rest, by about 2 orders of magnitude. I've put my adaptation of Tim's code, and box-and-whisker plots of the results, at http://arbutus.mcmaster.ca/dmc/numpy/variance/ Conclusions: - If you're going to calculate everything in single precision, use Kahan summation. Using it in double-precision also helps. - If you can use a double-precision accumulator, it's much better than any of the techniques in single-precision only. - for speed+precision in the variance, either use Kahan summation in single precision with the two-pass method, or use double precision with simple summation with the two-pass method. Knuth buys you nothing, except slower code :-) After 1.0 is out, we should look at doing one of the above. One more little tidbit; it appears possible to fix up Knuth's algorithm so that it's comparable in accuracy to the two pass Kahan version by doing Kahan summation while accumulating the variance. Testing on this was far from thorough, but in the tests I did it nearly always produced identical results to kahan. Of course this is even slower than the original Knuth version, but it's interesting anyway: # This is probably messier than it need to be # but I'm out of time for today... def var_knuth2(values, dtype=default_prec): var(values) - variance of values computed using Knuth's one pass algorithm m = variance = mc = vc = dtype(0) for i, x in enumerate(values):
Re: [Numpy-discussion] please change mean to use dtype=float
On Thursday 21 September 2006 15:28, Tim Hochberg wrote: David M. Cooke wrote: On Thu, 21 Sep 2006 11:34:42 -0700 Tim Hochberg [EMAIL PROTECTED] wrote: Tim Hochberg wrote: Robert Kern wrote: David M. Cooke wrote: On Wed, Sep 20, 2006 at 03:01:18AM -0500, Robert Kern wrote: Let me offer a third path: the algorithms used for .mean() and .var() are substandard. There are much better incremental algorithms that entirely avoid the need to accumulate such large (and therefore precision-losing) intermediate values. The algorithms look like the following for 1D arrays in Python: def mean(a): m = a[0] for i in range(1, len(a)): m += (a[i] - m) / (i + 1) return m This isn't really going to be any better than using a simple sum. It'll also be slower (a division per iteration). With one exception, every test that I've thrown at it shows that it's better for float32. That exception is uniformly spaced arrays, like linspace(). You do avoid accumulating large sums, but then doing the division a[i]/len(a) and adding that will do the same. Okay, this is true. Now, if you want to avoid losing precision, you want to use a better summation technique, like compensated (or Kahan) summation: def mean(a): s = e = a.dtype.type(0) for i in range(0, len(a)): temp = s y = a[i] + e s = temp + y e = (temp - s) + y return s / len(a) def var(a): m = a[0] t = a.dtype.type(0) for i in range(1, len(a)): q = a[i] - m r = q / (i+1) m += r t += i * q * r t /= len(a) return t Alternatively, from Knuth: def var_knuth(a): m = a.dtype.type(0) variance = a.dtype.type(0) for i in range(len(a)): delta = a[i] - m m += delta / (i+1) variance += delta * (a[i] - m) variance /= len(a) return variance I'm going to go ahead and attach a module containing the versions of mean, var, etc that I've been playing with in case someone wants to mess with them. Some were stolen from traffic on this list, for others I grabbed the algorithms from wikipedia or equivalent. I looked into this a bit more. I checked float32 (single precision) and float64 (double precision), using long doubles (float96) for the exact results. This is based on your code. Results are compared using abs(exact_stat - computed_stat) / max(abs(values)), with 1 values in the range of [-100, 900] First, the mean. In float32, the Kahan summation in single precision is better by about 2 orders of magnitude than simple summation. However, accumulating the sum in double precision is better by about 9 orders of magnitude than simple summation (7 orders more than Kahan). In float64, Kahan summation is the way to go, by 2 orders of magnitude. For the variance, in float32, Knuth's method is *no better* than the two-pass method. Tim's code does an implicit conversion of intermediate results to float64, which is why he saw a much better result. Doh! And I fixed that same problem in the mean implementation earlier too. I was astounded by how good knuth was doing, but not astounded enough apparently. Does it seem weird to anyone else that in: numpy_scalar op python_scalar the precision ends up being controlled by the python scalar? I would expect the numpy_scalar to control the resulting precision just like numpy arrays do in similar circumstances. Perhaps the egg on my face is just clouding my vision though. The two-pass method using Kahan summation (again, in single precision), is better by about 2 orders of magnitude. There is practically no difference when using a double-precision accumulator amongst the techniques: they're all about 9 orders of magnitude better than single-precision two-pass. In float64, Kahan summation is again better than the rest, by about 2 orders of magnitude. I've put my adaptation of Tim's code, and box-and-whisker plots of the results, at http://arbutus.mcmaster.ca/dmc/numpy/variance/ Conclusions: - If you're going to calculate everything in single precision, use Kahan summation. Using it in double-precision also helps. - If you can use a double-precision accumulator, it's much better than any of the techniques in single-precision only. - for speed+precision in the variance, either use Kahan summation in single precision with the two-pass method, or use double precision with simple summation with the two-pass method. Knuth buys you nothing, except slower code :-) The two pass methods are definitely more accurate. I won't be convinced on the speed front till I see comparable C implementations slug it out. That may well mean never in practice. However, I expect that somewhere around 10,000 items, the cache will overflow and memory bandwidth will become
Re: [Numpy-discussion] please change mean to use dtype=float
Sebastian Haase wrote: Robert Kern wrote: Sebastian Haase wrote: I know that having too much knowledge of the details often makes one forget what the newcomers will do and expect. Please be more careful with such accusations. Repeated frequently, they can become quite insulting. I did not mean to insult anyone - what I meant was, that I'm for numpy becoming an easy platform to use. I have spend and enjoyed part the last four years developing and evangelizing Python as an alternative to Matlab and C/Fortran based image analysis environment. I often find myself arguing for good support of the single precision data format. So I find it actually somewhat ironic to see myself arguing now for wanting float64 over float32 ;-) No one is doubting that you want numpy to be easy to use. Please don't doubt that the rest of us want otherwise. However, the fact that you *want* numpy to be easy to use does not mean that your suggestions *will* make numpy easy to use. We haven't forgotten what newcomers will do; to the contrary, we are quite aware that new users need consistent behavior in order to learn how to use a system. Adding another special case in how dtypes implicitly convert to one another will impede new users being able to understand the whole system. See A. M. Archibald's question in the thread ufunc.reduce and conversion for an example. In our judgement this is a worse outcome than notational convenience for float32 users, who already need to be aware of the effects of their precision choice. Each of us can come to different conclusions in good faith without one of us forgetting the new user experience. Let me offer a third path: the algorithms used for .mean() and .var() are substandard. There are much better incremental algorithms that entirely avoid the need to accumulate such large (and therefore precision-losing) intermediate values. The algorithms look like the following for 1D arrays in Python: def mean(a): m = a[0] for i in range(1, len(a)): m += (a[i] - m) / (i + 1) return m def var(a): m = a[0] t = a.dtype.type(0) for i in range(1, len(a)): q = a[i] - m r = q / (i+1) m += r t += i * q * r t /= len(a) return t Alternatively, from Knuth: def var_knuth(a): m = a.dtype.type(0) variance = a.dtype.type(0) for i in range(len(a)): delta = a[i] - m m += delta / (i+1) variance += delta * (a[i] - m) variance /= len(a) return variance If you will code up implementations of these for ndarray.mean() and ndarray.var(), I will check them in and then float32 arrays will avoid most of the catastrophes that the current implementations run into. We are only talking about people that will a) work with single-precision data (e.g. large scale-image analysis) and who b) will tend to just use the default (dtype) --- How else can I say this: these people will just assume that arr.mean() *is* the mean of arr. I don't understand what you mean, here. arr.mean() is almost never *the* mean of arr. Double precision can't fix that. This was not supposed to be a scientific statement -- I'm (again) thinking of our students that not always appreciate the full complexity of computational numerics and data types and such. They need to appreciate the complexity of computational numerics if they are going to do numerical computation. Double precision does not make it any simpler. -- Robert Kern I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth. -- Umberto Eco - Take Surveys. Earn Cash. Influence the Future of IT Join SourceForge.net's Techsay panel and you'll get the chance to share your opinions on IT business topics through brief surveys -- and earn cash http://www.techsay.com/default.php?page=join.phpp=sourceforgeCID=DEVDEV ___ Numpy-discussion mailing list Numpy-discussion@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/numpy-discussion
Re: [Numpy-discussion] please change mean to use dtype=float
Robert Kern wrote: This was not supposed to be a scientific statement -- I'm (again) thinking of our students that not always appreciate the full complexity of computational numerics and data types and such. They need to appreciate the complexity of computational numerics if they are going to do numerical computation. Double precision does not make it any simpler. This is were we differ. We haven't forgotten what newcomers will do; to the contrary, we are quite aware that new users need consistent behavior in order to learn how to use a system. Adding another special case in how dtypes implicitly convert to one another will impede new users being able to understand the whole system. All I'm proposing could be summarized in: mean(), sum(), var() ... produce output of dtype float64 (except for input float96 which produces float96) A comment on this is also that for these operations the input type/precision is almost not related to the resulting output precision -- the int case makes that already clear. (This is different for e.g. min() or max() ) The proposed alternative implementations seem to have one or more multiplication (or division) for each value -- this might be noticeably slower ... Regards, Sebastian - Take Surveys. Earn Cash. Influence the Future of IT Join SourceForge.net's Techsay panel and you'll get the chance to share your opinions on IT business topics through brief surveys -- and earn cash http://www.techsay.com/default.php?page=join.phpp=sourceforgeCID=DEVDEV ___ Numpy-discussion mailing list Numpy-discussion@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/numpy-discussion
Re: [Numpy-discussion] please change mean to use dtype=float
Robert Kern wrote: Sebastian Haase wrote: Robert Kern wrote: Sebastian Haase wrote: I know that having too much knowledge of the details often makes one forget what the newcomers will do and expect. Please be more careful with such accusations. Repeated frequently, they can become quite insulting. I did not mean to insult anyone - what I meant was, that I'm for numpy becoming an easy platform to use. I have spend and enjoyed part the last four years developing and evangelizing Python as an alternative to Matlab and C/Fortran based image analysis environment. I often find myself arguing for good support of the single precision data format. So I find it actually somewhat ironic to see myself arguing now for wanting float64 over float32 ;-) No one is doubting that you want numpy to be easy to use. Please don't doubt that the rest of us want otherwise. However, the fact that you *want* numpy to be easy to use does not mean that your suggestions *will* make numpy easy to use. We haven't forgotten what newcomers will do; to the contrary, we are quite aware that new users need consistent behavior in order to learn how to use a system. Adding another special case in how dtypes implicitly convert to one another will impede new users being able to understand the whole system. See A. M. Archibald's question in the thread ufunc.reduce and conversion for an example. In our judgement this is a worse outcome than notational convenience for float32 users, who already need to be aware of the effects of their precision choice. Each of us can come to different conclusions in good faith without one of us forgetting the new user experience. Let me offer a third path: the algorithms used for .mean() and .var() are substandard. There are much better incremental algorithms that entirely avoid the need to accumulate such large (and therefore precision-losing) intermediate values. The algorithms look like the following for 1D arrays in Python: def mean(a): m = a[0] for i in range(1, len(a)): m += (a[i] - m) / (i + 1) return m def var(a): m = a[0] t = a.dtype.type(0) for i in range(1, len(a)): q = a[i] - m r = q / (i+1) m += r t += i * q * r t /= len(a) return t Alternatively, from Knuth: def var_knuth(a): m = a.dtype.type(0) variance = a.dtype.type(0) for i in range(len(a)): delta = a[i] - m m += delta / (i+1) variance += delta * (a[i] - m) variance /= len(a) return variance If you will code up implementations of these for ndarray.mean() and ndarray.var(), I will check them in and then float32 arrays will avoid most of the catastrophes that the current implementations run into. +1 We are only talking about people that will a) work with single-precision data (e.g. large scale-image analysis) and who b) will tend to just use the default (dtype) --- How else can I say this: these people will just assume that arr.mean() *is* the mean of arr. I don't understand what you mean, here. arr.mean() is almost never *the* mean of arr. Double precision can't fix that. This was not supposed to be a scientific statement -- I'm (again) thinking of our students that not always appreciate the full complexity of computational numerics and data types and such. They need to appreciate the complexity of computational numerics if they are going to do numerical computation. Double precision does not make it any simpler. - Take Surveys. Earn Cash. Influence the Future of IT Join SourceForge.net's Techsay panel and you'll get the chance to share your opinions on IT business topics through brief surveys -- and earn cash http://www.techsay.com/default.php?page=join.phpp=sourceforgeCID=DEVDEV ___ Numpy-discussion mailing list Numpy-discussion@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/numpy-discussion
Re: [Numpy-discussion] please change mean to use dtype=float
On Wed, Sep 20, 2006 at 03:01:18AM -0500, Robert Kern wrote: Let me offer a third path: the algorithms used for .mean() and .var() are substandard. There are much better incremental algorithms that entirely avoid the need to accumulate such large (and therefore precision-losing) intermediate values. The algorithms look like the following for 1D arrays in Python: def mean(a): m = a[0] for i in range(1, len(a)): m += (a[i] - m) / (i + 1) return m This isn't really going to be any better than using a simple sum. It'll also be slower (a division per iteration). You do avoid accumulating large sums, but then doing the division a[i]/len(a) and adding that will do the same. Now, if you want to avoid losing precision, you want to use a better summation technique, like compensated (or Kahan) summation: def mean(a): s = e = a.dtype.type(0) for i in range(0, len(a)): temp = s y = a[i] + e s = temp + y e = (temp - s) + y return s / len(a) Some numerical experiments in Maple using 5-digit precision show that your mean is maybe a bit better in some cases, but can also be much worse, than sum(a)/len(a), but both are quite poor in comparision to the Kahan summation. (We could probably use a fast implementation of Kahan summation in addition to a.sum()) def var(a): m = a[0] t = a.dtype.type(0) for i in range(1, len(a)): q = a[i] - m r = q / (i+1) m += r t += i * q * r t /= len(a) return t Alternatively, from Knuth: def var_knuth(a): m = a.dtype.type(0) variance = a.dtype.type(0) for i in range(len(a)): delta = a[i] - m m += delta / (i+1) variance += delta * (a[i] - m) variance /= len(a) return variance These formulas are good when you can only do one pass over the data (like in a calculator where you don't store all the data points), but are slightly worse than doing two passes. Kahan summation would probably also be good here too. -- ||\/| /--\ |David M. Cooke http://arbutus.physics.mcmaster.ca/dmc/ |[EMAIL PROTECTED] - Take Surveys. Earn Cash. Influence the Future of IT Join SourceForge.net's Techsay panel and you'll get the chance to share your opinions on IT business topics through brief surveys -- and earn cash http://www.techsay.com/default.php?page=join.phpp=sourceforgeCID=DEVDEV ___ Numpy-discussion mailing list Numpy-discussion@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/numpy-discussion
Re: [Numpy-discussion] please change mean to use dtype=float
Sebastian Haase wrote: The best I can hope for is a sound default for most (practical) cases... I still think that 80bit vs. 128bit vs 96bit is rather academic for most people ... most people seem to only use float64 and then there are some that use float32 (like us) ... I fully agree with Sebastian here. As Travis pointed out, all we are talking about is the default. The default should follow the principle of least surprise for the less-knowledgeable users. Personally, I try to always use doubles, unless I have a real reason not to. The recent change of default types for zeros et al. will help. clearly, there is a problem to say the default accumulator for *all* types is double, as you wouldn't want to downcast float128s, even if they are obscure. However, is it that hard to say that the default accumulator will have *at least* the precision of double? Robert Kern wrote: Let me offer a third path: the algorithms used for .mean() and .var() are substandard. There are much better incremental algorithms that entirely avoid the need to accumulate such large (and therefore precision-losing) intermediate values. This, of course, is an even better solution, unless there are substantial performance impacts. -Chris -- Christopher Barker, Ph.D. Oceanographer NOAA/ORR/HAZMAT (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception [EMAIL PROTECTED] - Take Surveys. Earn Cash. Influence the Future of IT Join SourceForge.net's Techsay panel and you'll get the chance to share your opinions on IT business topics through brief surveys -- and earn cash http://www.techsay.com/default.php?page=join.phpp=sourceforgeCID=DEVDEV ___ Numpy-discussion mailing list Numpy-discussion@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/numpy-discussion
Re: [Numpy-discussion] please change mean to use dtype=float
On 9/20/06, David M. Cooke [EMAIL PROTECTED] wrote: On Wed, Sep 20, 2006 at 03:01:18AM -0500, Robert Kern wrote: Let me offer a third path: the algorithms used for .mean() and .var() are substandard. There are much better incremental algorithms that entirely avoid the need to accumulate such large (and therefore precision-losing) intermediate values. The algorithms look like the following for 1D arrays in Python: def mean(a):m = a[0] for i in range(1, len(a)):m += (a[i] - m) / (i + 1)return mThis isn't really going to be any better than using a simple sum.It'll also be slower (a division per iteration). You do avoid accumulating large sums, but then doing the division a[i]/len(a) andadding that will do the same.Now, if you want to avoid losing precision, you want to use a bettersummation technique, like compensated (or Kahan) summation: def mean(a):s = e = a.dtype.type(0)for i in range(0, len(a)):temp = sy = a[i] + es = temp + ye = (temp - s) + yreturn s / len(a) A variant of this can also be used to generate the sin/cos for fourier transforms using repeated complex multiplication. It works amazingly well. But I suspect accumulating in higher precision would be just as good and faster if the hardware supports it. Divide and conquer, like an fft where only the constant coefficient is computed, is also a good bet but requires lg(n) passes over the data and extra storage. Some numerical experiments in Maple using 5-digit precision show thatyour mean is maybe a bit better in some cases, but can also be muchworse, than sum(a)/len(a), but both are quite poor in comparision to theKahan summation. (We could probably use a fast implementation of Kahan summation inaddition to a.sum()) def var(a):m = a[0]t = a.dtype.type(0)for i in range(1, len(a)): q = a[i] - mr = q / (i+1)m += rt += i * q * rt /= len(a)return t Alternatively, from Knuth: def var_knuth(a): m = a.dtype.type(0)variance = a.dtype.type(0)for i in range(len(a)):delta = a[i] - mm += delta / (i+1)variance += delta * (a[i] - m) variance /= len(a)return varianceThese formulas are good when you can only do one pass over the data(like in a calculator where you don't store all the data points), butare slightly worse than doing two passes. Kahan summation would probably also be good here too.I think we should leave things as they are. Choosing the right precision for accumulating sums is absolutely fundamental, I always think about it when programming such loops because it is always a potential gotcha. Making the defaults higher precision just kicks the can down the road where the unwary are still likely to trip over it at some point. Perhaps these special tricks for special cases could be included in scipy somewhere.Chuck - Take Surveys. Earn Cash. Influence the Future of IT Join SourceForge.net's Techsay panel and you'll get the chance to share your opinions on IT business topics through brief surveys -- and earn cash http://www.techsay.com/default.php?page=join.phpp=sourceforgeCID=DEVDEV___ Numpy-discussion mailing list Numpy-discussion@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/numpy-discussion
Re: [Numpy-discussion] please change mean to use dtype=float
David M. Cooke wrote: On Wed, Sep 20, 2006 at 03:01:18AM -0500, Robert Kern wrote: Let me offer a third path: the algorithms used for .mean() and .var() are substandard. There are much better incremental algorithms that entirely avoid the need to accumulate such large (and therefore precision-losing) intermediate values. The algorithms look like the following for 1D arrays in Python: def mean(a): m = a[0] for i in range(1, len(a)): m += (a[i] - m) / (i + 1) return m This isn't really going to be any better than using a simple sum. It'll also be slower (a division per iteration). With one exception, every test that I've thrown at it shows that it's better for float32. That exception is uniformly spaced arrays, like linspace(). You do avoid accumulating large sums, but then doing the division a[i]/len(a) and adding that will do the same. Okay, this is true. Now, if you want to avoid losing precision, you want to use a better summation technique, like compensated (or Kahan) summation: def mean(a): s = e = a.dtype.type(0) for i in range(0, len(a)): temp = s y = a[i] + e s = temp + y e = (temp - s) + y return s / len(a) Some numerical experiments in Maple using 5-digit precision show that your mean is maybe a bit better in some cases, but can also be much worse, than sum(a)/len(a), but both are quite poor in comparision to the Kahan summation. (We could probably use a fast implementation of Kahan summation in addition to a.sum()) +1 def var(a): m = a[0] t = a.dtype.type(0) for i in range(1, len(a)): q = a[i] - m r = q / (i+1) m += r t += i * q * r t /= len(a) return t Alternatively, from Knuth: def var_knuth(a): m = a.dtype.type(0) variance = a.dtype.type(0) for i in range(len(a)): delta = a[i] - m m += delta / (i+1) variance += delta * (a[i] - m) variance /= len(a) return variance These formulas are good when you can only do one pass over the data (like in a calculator where you don't store all the data points), but are slightly worse than doing two passes. Kahan summation would probably also be good here too. Again, my tests show otherwise for float32. I'll condense my ipython log into a module for everyone's perusal. It's possible that the Kahan summation of the squared residuals will work better than the current two-pass algorithm and the implementations I give above. -- Robert Kern I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth. -- Umberto Eco - Take Surveys. Earn Cash. Influence the Future of IT Join SourceForge.net's Techsay panel and you'll get the chance to share your opinions on IT business topics through brief surveys -- and earn cash http://www.techsay.com/default.php?page=join.phpp=sourceforgeCID=DEVDEV ___ Numpy-discussion mailing list Numpy-discussion@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/numpy-discussion
Re: [Numpy-discussion] please change mean to use dtype=float
Robert Kern wrote: David M. Cooke wrote: On Wed, Sep 20, 2006 at 03:01:18AM -0500, Robert Kern wrote: Let me offer a third path: the algorithms used for .mean() and .var() are substandard. There are much better incremental algorithms that entirely avoid the need to accumulate such large (and therefore precision-losing) intermediate values. The algorithms look like the following for 1D arrays in Python: def mean(a): m = a[0] for i in range(1, len(a)): m += (a[i] - m) / (i + 1) return m This isn't really going to be any better than using a simple sum. It'll also be slower (a division per iteration). With one exception, every test that I've thrown at it shows that it's better for float32. That exception is uniformly spaced arrays, like linspace(). Here's version of mean based on the Kahan sum, including the compensation term that seems to be consistently much better than builtin mean It seems to be typically 4 orders or magnitude closer to the correct answer than the builtin mean and 2 orders of magnitude better than just naively using the Kahan sum. I'm using the sum performed with dtype=int64 as the correct value. def rawKahanSum(values): # Stolen from mathworld if not input: return 0.0 total = values[0] c = type(total)(0.0)# A running compensation for lost low-order bits. for x in values[1:]: y = x - c # So far, so good: c is zero. t = total + y # Alas, total is big, y small, so low-order digits of y are lost. c = (t - total) - y# (t - total) recovers the high-order part of y; subtracting y recovers -(low part of y) total = t # Algebriacally, c should always be zero. Beware eagerly optimising compilers! # Next time around, the lost low part will be added to y in a fresh attempt. return total, c def kahanSum(values): total, c = rawKahanSum(values) return total def mean(values): total, c = rawKahanSum(values) n = float(len(values)) return total / n - c / n for i in range(100): data = random.random([1]).astype(float32) baseline = data.mean(dtype=float64) delta_builtin_mean = baseline - data.mean() delta_compensated_mean = baseline - mean(data) delta_kahan_mean = baseline - (kahanSum(data) / len(data)) if not abs(delta_builtin_mean) = abs(delta_kahan_mean) = abs(delta_compensated_mean): print , print delta_builtin_mean, delta_kahan_mean, delta_compensated_mean -tim You do avoid accumulating large sums, but then doing the division a[i]/len(a) and adding that will do the same. Okay, this is true. Now, if you want to avoid losing precision, you want to use a better summation technique, like compensated (or Kahan) summation: def mean(a): s = e = a.dtype.type(0) for i in range(0, len(a)): temp = s y = a[i] + e s = temp + y e = (temp - s) + y return s / len(a) Some numerical experiments in Maple using 5-digit precision show that your mean is maybe a bit better in some cases, but can also be much worse, than sum(a)/len(a), but both are quite poor in comparision to the Kahan summation. (We could probably use a fast implementation of Kahan summation in addition to a.sum()) +1 def var(a): m = a[0] t = a.dtype.type(0) for i in range(1, len(a)): q = a[i] - m r = q / (i+1) m += r t += i * q * r t /= len(a) return t Alternatively, from Knuth: def var_knuth(a): m = a.dtype.type(0) variance = a.dtype.type(0) for i in range(len(a)): delta = a[i] - m m += delta / (i+1) variance += delta * (a[i] - m) variance /= len(a) return variance These formulas are good when you can only do one pass over the data (like in a calculator where you don't store all the data points), but are slightly worse than doing two passes. Kahan summation would probably also be good here too. Again, my tests show otherwise for float32. I'll condense my ipython log into a module for everyone's perusal. It's possible that the Kahan summation of the squared residuals will work better than the current two-pass algorithm and the implementations I give above. - Take Surveys. Earn Cash. Influence the Future of IT Join SourceForge.net's Techsay panel and you'll get the chance to share your opinions on IT business topics through brief surveys -- and earn cash http://www.techsay.com/default.php?page=join.phpp=sourceforgeCID=DEVDEV ___ Numpy-discussion mailing list Numpy-discussion@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/numpy-discussion
Re: [Numpy-discussion] please change mean to use dtype=float
[Sorry, this version should have less munged formatting since I clipped the comments. Oh, and the Kahan sum algorithm was grabbed from wikipedia, not mathworld] Tim Hochberg wrote: Robert Kern wrote: David M. Cooke wrote: On Wed, Sep 20, 2006 at 03:01:18AM -0500, Robert Kern wrote: Let me offer a third path: the algorithms used for .mean() and .var() are substandard. There are much better incremental algorithms that entirely avoid the need to accumulate such large (and therefore precision-losing) intermediate values. The algorithms look like the following for 1D arrays in Python: def mean(a): m = a[0] for i in range(1, len(a)): m += (a[i] - m) / (i + 1) return m This isn't really going to be any better than using a simple sum. It'll also be slower (a division per iteration). With one exception, every test that I've thrown at it shows that it's better for float32. That exception is uniformly spaced arrays, like linspace(). Here's version of mean based on the Kahan sum, including the compensation term that seems to be consistently much better than builtin mean It seems to be typically 4 orders or magnitude closer to the correct answer than the builtin mean and 2 orders of magnitude better than just naively using the Kahan sum. I'm using the sum performed with dtype=int64 as the correct value. def rawKahanSum(values): if not input: return 0.0 total = values[0] c = type(total)(0.0) for x in values[1:]: y = x - c t = total + y c = (t - total) - y total = t return total, c def kahanSum(values): total, c = rawKahanSum(values) return total def mean(values): total, c = rawKahanSum(values) n = float(len(values)) return total / n - c / n for i in range(100): data = random.random([1]).astype(float32) baseline = data.mean(dtype=float64) delta_builtin_mean = baseline - data.mean() delta_compensated_mean = baseline - mean(data) delta_kahan_mean = baseline - (kahanSum(data) / len(data)) if not abs(delta_builtin_mean) = abs(delta_kahan_mean) = abs(delta_compensated_mean): print , print delta_builtin_mean, delta_kahan_mean, delta_compensated_mean -tim You do avoid accumulating large sums, but then doing the division a[i]/len(a) and adding that will do the same. Okay, this is true. Now, if you want to avoid losing precision, you want to use a better summation technique, like compensated (or Kahan) summation: def mean(a): s = e = a.dtype.type(0) for i in range(0, len(a)): temp = s y = a[i] + e s = temp + y e = (temp - s) + y return s / len(a) Some numerical experiments in Maple using 5-digit precision show that your mean is maybe a bit better in some cases, but can also be much worse, than sum(a)/len(a), but both are quite poor in comparision to the Kahan summation. (We could probably use a fast implementation of Kahan summation in addition to a.sum()) +1 def var(a): m = a[0] t = a.dtype.type(0) for i in range(1, len(a)): q = a[i] - m r = q / (i+1) m += r t += i * q * r t /= len(a) return t Alternatively, from Knuth: def var_knuth(a): m = a.dtype.type(0) variance = a.dtype.type(0) for i in range(len(a)): delta = a[i] - m m += delta / (i+1) variance += delta * (a[i] - m) variance /= len(a) return variance These formulas are good when you can only do one pass over the data (like in a calculator where you don't store all the data points), but are slightly worse than doing two passes. Kahan summation would probably also be good here too. Again, my tests show otherwise for float32. I'll condense my ipython log into a module for everyone's perusal. It's possible that the Kahan summation of the squared residuals will work better than the current two-pass algorithm and the implementations I give above. - Take Surveys. Earn Cash. Influence the Future of IT Join SourceForge.net's Techsay panel and you'll get the chance to share your opinions on IT business topics through brief surveys -- and earn cash http://www.techsay.com/default.php?page=join.phpp=sourceforgeCID=DEVDEV ___ Numpy-discussion mailing list Numpy-discussion@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/numpy-discussion - Take Surveys. Earn Cash. Influence the Future of IT Join
Re: [Numpy-discussion] please change mean to use dtype=float
Robert Kern wrote: David M. Cooke wrote: On Wed, Sep 20, 2006 at 03:01:18AM -0500, Robert Kern wrote: Let me offer a third path: the algorithms used for .mean() and .var() are substandard. There are much better incremental algorithms that entirely avoid the need to accumulate such large (and therefore precision-losing) intermediate values. The algorithms look like the following for 1D arrays in Python: def mean(a): m = a[0] for i in range(1, len(a)): m += (a[i] - m) / (i + 1) return m This isn't really going to be any better than using a simple sum. It'll also be slower (a division per iteration). With one exception, every test that I've thrown at it shows that it's better for float32. That exception is uniformly spaced arrays, like linspace(). You do avoid accumulating large sums, but then doing the division a[i]/len(a) and adding that will do the same. Okay, this is true. Now, if you want to avoid losing precision, you want to use a better summation technique, like compensated (or Kahan) summation: def mean(a): s = e = a.dtype.type(0) for i in range(0, len(a)): temp = s y = a[i] + e s = temp + y e = (temp - s) + y return s / len(a) Some numerical experiments in Maple using 5-digit precision show that your mean is maybe a bit better in some cases, but can also be much worse, than sum(a)/len(a), but both are quite poor in comparision to the Kahan summation. (We could probably use a fast implementation of Kahan summation in addition to a.sum()) +1 def var(a): m = a[0] t = a.dtype.type(0) for i in range(1, len(a)): q = a[i] - m r = q / (i+1) m += r t += i * q * r t /= len(a) return t Alternatively, from Knuth: def var_knuth(a): m = a.dtype.type(0) variance = a.dtype.type(0) for i in range(len(a)): delta = a[i] - m m += delta / (i+1) variance += delta * (a[i] - m) variance /= len(a) return variance These formulas are good when you can only do one pass over the data (like in a calculator where you don't store all the data points), but are slightly worse than doing two passes. Kahan summation would probably also be good here too. On the flip side, it doesn't take a very large array (somewhere in the vicinity of 10,000 values in my experience) before memory bandwidth becomes a limiting factor. In that region a two pass algorithm could well be twice as slow as a single pass algorithm even if the former is somewhat simpler in terms of numeric operations. -tim Again, my tests show otherwise for float32. I'll condense my ipython log into a module for everyone's perusal. It's possible that the Kahan summation of the squared residuals will work better than the current two-pass algorithm and the implementations I give above. - Take Surveys. Earn Cash. Influence the Future of IT Join SourceForge.net's Techsay panel and you'll get the chance to share your opinions on IT business topics through brief surveys -- and earn cash http://www.techsay.com/default.php?page=join.phpp=sourceforgeCID=DEVDEV ___ Numpy-discussion mailing list Numpy-discussion@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/numpy-discussion
Re: [Numpy-discussion] please change mean to use dtype=float
Sebastian Haase wrote: Hello all, I just had someone from my lab coming to my desk saying: My god - SciPy is really stupid An array with only positive numbers claims to have a negative mean !! ? I was asking about this before ... the reason was of course that her array was of dtype int32 and had many large values to cause an overflow (wrap around) . Now that the default for axis is None (for all functions having an axis argument), can we please change dtype to default to float64 !? The default is float64 now (as long as you are not using numpy.oldnumeric). I suppose more appropriately, we could reduce over float for integer data-types when calculating the mean as well (since a floating point is returned anyway). -Travis - Take Surveys. Earn Cash. Influence the Future of IT Join SourceForge.net's Techsay panel and you'll get the chance to share your opinions on IT business topics through brief surveys -- and earn cash http://www.techsay.com/default.php?page=join.phpp=sourceforgeCID=DEVDEV ___ Numpy-discussion mailing list Numpy-discussion@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/numpy-discussion
Re: [Numpy-discussion] please change mean to use dtype=float
On Tuesday 19 September 2006 15:48, Travis Oliphant wrote: Sebastian Haase wrote: snip can we please change dtype to default to float64 !? The default is float64 now (as long as you are not using numpy.oldnumeric). I suppose more appropriately, we could reduce over float for integer data-types when calculating the mean as well (since a floating point is returned anyway). Is now mean() always reducing over float64 ? The svn note Log: Fix mean, std, and var methods so that they reduce over double data-type with integer inputs. makes it sound that a float32 input is stays float32 ? For mean calculation this might introduce large errors - I usually would require double-precision for *any* input type ... (don't know how to say this for complex types !? Are here real and imag treated separately / independently ?) Thanks, Sebastian Haase - Take Surveys. Earn Cash. Influence the Future of IT Join SourceForge.net's Techsay panel and you'll get the chance to share your opinions on IT business topics through brief surveys -- and earn cash http://www.techsay.com/default.php?page=join.phpp=sourceforgeCID=DEVDEV ___ Numpy-discussion mailing list Numpy-discussion@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/numpy-discussion
Re: [Numpy-discussion] please change mean to use dtype=float
Sebastian Haase wrote: On Tuesday 19 September 2006 15:48, Travis Oliphant wrote: Sebastian Haase wrote: snip can we please change dtype to default to float64 !? The default is float64 now (as long as you are not using numpy.oldnumeric). I suppose more appropriately, we could reduce over float for integer data-types when calculating the mean as well (since a floating point is returned anyway). Is now mean() always reducing over float64 ? The svn note Log: Fix mean, std, and var methods so that they reduce over double data-type with integer inputs. makes it sound that a float32 input is stays float32 ? Yes, that is true. Only integer inputs are changed because you are going to get a floating point output anyway. For mean calculation this might introduce large errors - I usually would require double-precision for *any* input type ... Of course. The system is not fool-proof. I hesitate to arbitrarily change this. The advantage of using single-precision calculation is that it is faster. We do rely on the user who expressly requests these things to be aware of the difficulties. (don't know how to say this for complex types !? Are here real and imag treated separately / independently ?) There is a complex add performed at a low-level as two separate adds. The addition is performed in the precision requested. -Travis - Take Surveys. Earn Cash. Influence the Future of IT Join SourceForge.net's Techsay panel and you'll get the chance to share your opinions on IT business topics through brief surveys -- and earn cash http://www.techsay.com/default.php?page=join.phpp=sourceforgeCID=DEVDEV ___ Numpy-discussion mailing list Numpy-discussion@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/numpy-discussion
Re: [Numpy-discussion] please change mean to use dtype=float
Sebastian Haase wrote: (don't know how to say this for complex types !? Are here real and imag treated separately / independently ?) Yes. For mean(), there's really no alternative. Scalar variance is not a well-defined concept for complex numbers, but treating the real and imaginary parts separately is a sensible and (partially) informative thing to do. Simply applying the formula for estimating variance for real numbers to complex numbers (i.e. change x to z) is a meaningless operation. -- Robert Kern I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth. -- Umberto Eco - Take Surveys. Earn Cash. Influence the Future of IT Join SourceForge.net's Techsay panel and you'll get the chance to share your opinions on IT business topics through brief surveys -- and earn cash http://www.techsay.com/default.php?page=join.phpp=sourceforgeCID=DEVDEV ___ Numpy-discussion mailing list Numpy-discussion@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/numpy-discussion
Re: [Numpy-discussion] please change mean to use dtype=float
On 9/19/06, Travis Oliphant [EMAIL PROTECTED] wrote: Sebastian Haase wrote:On Tuesday 19 September 2006 15:48, Travis Oliphant wrote:Sebastian Haase wrote:snipcan we please change dtype to default to float64 !? The default is float64 now (as long as you are not usingnumpy.oldnumeric).I suppose more appropriately, we could reduce over float for integer data-types when calculating the mean as well (since a floating point isreturned anyway).Is now mean() always reducing over float64 ? The svn note Log:Fix mean, std, and var methods so that they reduce over double data-type withinteger inputs.makes it sound that a float32 input is stays float32 ? Yes, that is true.Only integer inputs are changed because you aregoing to get a floating point output anyway.For mean calculation this might introduce large errors - I usually would require double-precision for *any*input type ...Of course.The system is not fool-proof.I hesitate to arbitrarilychange this.The advantage of using single-precision calculation isthat it is faster.We do rely on the user who expressly requests these things to be aware of the difficulties.Speed depends on the achitecture. Float is a trifle slower than double on my Athlon64, but faster on PPC750. I don't know about other machines. I think there is a good argument for accumlating in double and converting to float for output if required. Chuck - Take Surveys. Earn Cash. Influence the Future of IT Join SourceForge.net's Techsay panel and you'll get the chance to share your opinions on IT business topics through brief surveys -- and earn cash http://www.techsay.com/default.php?page=join.phpp=sourceforgeCID=DEVDEV___ Numpy-discussion mailing list Numpy-discussion@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/numpy-discussion
Re: [Numpy-discussion] please change mean to use dtype=float
On Tuesday 19 September 2006 17:17, Travis Oliphant wrote: Sebastian Haase wrote: On Tuesday 19 September 2006 15:48, Travis Oliphant wrote: Sebastian Haase wrote: snip can we please change dtype to default to float64 !? The default is float64 now (as long as you are not using numpy.oldnumeric). I suppose more appropriately, we could reduce over float for integer data-types when calculating the mean as well (since a floating point is returned anyway). Is now mean() always reducing over float64 ? The svn note Log: Fix mean, std, and var methods so that they reduce over double data-type with integer inputs. makes it sound that a float32 input is stays float32 ? Yes, that is true. Only integer inputs are changed because you are going to get a floating point output anyway. For mean calculation this might introduce large errors - I usually would require double-precision for *any* input type ... Of course. The system is not fool-proof. I hesitate to arbitrarily change this. The advantage of using single-precision calculation is that it is faster. We do rely on the user who expressly requests these things to be aware of the difficulties. I still would argue that getting a good (smaller rounding errors) answer should be the default -- if speed is wanted, then *that* could be still specified by explicitly using dtype=float32 (which would also be a possible choice for int32 input) . In image processing we always want means to be calculated in float64 even though input data is always float32 (if not uint16). Also it is simpler to say float64 is the default (full stop.) - instead float64 is the default unless you have float32 Thanks, Sebastian Haase - Take Surveys. Earn Cash. Influence the Future of IT Join SourceForge.net's Techsay panel and you'll get the chance to share your opinions on IT business topics through brief surveys -- and earn cash http://www.techsay.com/default.php?page=join.phpp=sourceforgeCID=DEVDEV ___ Numpy-discussion mailing list Numpy-discussion@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/numpy-discussion
Re: [Numpy-discussion] please change mean to use dtype=float
Charles R Harris wrote: Speed depends on the achitecture. Float is a trifle slower than double on my Athlon64, but faster on PPC750. I don't know about other machines. I think there is a good argument for accumlating in double and converting to float for output if required. Yes there is. It's just not what NumPy ever does so it would be an exception in this case and would need a more convincing argument in my opinion.You can always specify the accumulation type yourself with the dtype argument. We are only talking about what the default should be. -Travis - Take Surveys. Earn Cash. Influence the Future of IT Join SourceForge.net's Techsay panel and you'll get the chance to share your opinions on IT business topics through brief surveys -- and earn cash http://www.techsay.com/default.php?page=join.phpp=sourceforgeCID=DEVDEV ___ Numpy-discussion mailing list Numpy-discussion@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/numpy-discussion
Re: [Numpy-discussion] please change mean to use dtype=float
Sebastian Haase wrote: I still would argue that getting a good (smaller rounding errors) answer should be the default -- if speed is wanted, then *that* could be still specified by explicitly using dtype=float32 (which would also be a possible choice for int32 input) . So you are arguing for using long double then ;-) In image processing we always want means to be calculated in float64 even though input data is always float32 (if not uint16). Also it is simpler to say float64 is the default (full stop.) - instead float64 is the default unless you have float32 the type you have is the default except for integers. Do you really want float64 to be the default for float96? Unless we are going to use long double as the default, then I'm not convinced that we should special-case the double type. -Travis - Take Surveys. Earn Cash. Influence the Future of IT Join SourceForge.net's Techsay panel and you'll get the chance to share your opinions on IT business topics through brief surveys -- and earn cash http://www.techsay.com/default.php?page=join.phpp=sourceforgeCID=DEVDEV ___ Numpy-discussion mailing list Numpy-discussion@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/numpy-discussion
Re: [Numpy-discussion] please change mean to use dtype=float
Travis Oliphant wrote: Sebastian Haase wrote: I still would argue that getting a good (smaller rounding errors) answer should be the default -- if speed is wanted, then *that* could be still specified by explicitly using dtype=float32 (which would also be a possible choice for int32 input) . So you are arguing for using long double then ;-) In image processing we always want means to be calculated in float64 even though input data is always float32 (if not uint16). Also it is simpler to say float64 is the default (full stop.) - instead float64 is the default unless you have float32 the type you have is the default except for integers. Do you really want float64 to be the default for float96? Unless we are going to use long double as the default, then I'm not convinced that we should special-case the double type. I guess I'm not really aware of the float96 type ... Is that a machine type on any system ? I always thought that -- e.g. coming from C -- double is as good as it gets... Who uses float96 ? I heard somewhere that (some) CPUs use 80bits internally when calculating 64bit double-precision... Is this not going into some academic argument !? For all I know, calculating mean()s (and sum()s, ...) is always done in double precision -- never in single precision, even when the data is in float32. Having float32 be the default for float32 data is just requiring more typing, and more explaining ... it would compromise numpy usability as a day-to-day replacement for other systems. Sorry, if I'm being ignorant ... - Sebastian - Take Surveys. Earn Cash. Influence the Future of IT Join SourceForge.net's Techsay panel and you'll get the chance to share your opinions on IT business topics through brief surveys -- and earn cash http://www.techsay.com/default.php?page=join.phpp=sourceforgeCID=DEVDEV ___ Numpy-discussion mailing list Numpy-discussion@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/numpy-discussion
Re: [Numpy-discussion] please change mean to use dtype=float
On 9/19/06, Charles R Harris [EMAIL PROTECTED] wrote: On 9/19/06, Sebastian Haase [EMAIL PROTECTED] wrote: Travis Oliphant wrote: Sebastian Haase wrote: I still would argue that getting a good (smaller rounding errors) answer should be the default -- if speed is wanted, then *that* could be still specified by explicitly using dtype=float32(which would also be a possible choice for int32 input) . So you are arguing for using long double then ;-) In image processing we always want means to be calculated in float64 even though input data is always float32 (if not uint16). Also it is simpler to say float64 is the default (full stop.) - instead float64 is the default unless you have float32 the type you have is the default except for integers.Do you really want float64 to be the default for float96? Unless we are going to use long double as the default, then I'm not convinced that we should special-case the double type.I guess I'm not really aware of the float96 type ...Is that a machine type on any system ?I always thought that -- e.g .coming from C -- double is as good as it gets...Who uses float96 ?I heard somewhere that (some) CPUs use 80bitsinternally when calculating 64bit double-precision...Is this not going into some academic argument !? For all I know, calculating mean()s (and sum()s, ...) is always done indouble precision -- never in single precision, even when the data is infloat32.Having float32 be the default for float32 data is just requiring more typing, and more explaining ...it would compromise numpy usability asa day-to-day replacement for other systems.Sorry, if I'm being ignorant ...I'm going to side with Travis here. It is only a default and easily overridden. And yes, there are precisions greater than double. I was using quad precision back in the eighties on a VAX for some inherently ill conditioned problems. And on my machine long double is 12 bytes. Here is the 754r (revision) spec: http://en.wikipedia.org/wiki/IEEE_754rIt includes quads (128 bits) and half precision (16 bits) floats. I believe the latter are used for some imaging stuff, radar for instance, and are also available in some high end GPUs from Nvidia and other companies. The 80 bit numbers you refer to were defined as extended precision in the original 754 spec and were mostly intended for temporaries in internal FPU computations. They have various alignment requirements for efficient use, which is why they show up as 96 bits (4 byte alignment) and sometimes 128 bits (8 byte alignment). So actually, float128 would not always distinquish between extended precision and quad precision. I see more work for Travis in the future ;) Chuck - Take Surveys. Earn Cash. Influence the Future of IT Join SourceForge.net's Techsay panel and you'll get the chance to share your opinions on IT business topics through brief surveys -- and earn cash http://www.techsay.com/default.php?page=join.phpp=sourceforgeCID=DEVDEV___ Numpy-discussion mailing list Numpy-discussion@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/numpy-discussion
Re: [Numpy-discussion] please change mean to use dtype=float
Sebastian Haase wrote: I know that having too much knowledge of the details often makes one forget what the newcomers will do and expect. Please be more careful with such accusations. Repeated frequently, they can become quite insulting. We are only talking about people that will a) work with single-precision data (e.g. large scale-image analysis) and who b) will tend to just use the default (dtype) --- How else can I say this: these people will just assume that arr.mean() *is* the mean of arr. I don't understand what you mean, here. arr.mean() is almost never *the* mean of arr. Double precision can't fix that. -- Robert Kern I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth. -- Umberto Eco - Take Surveys. Earn Cash. Influence the Future of IT Join SourceForge.net's Techsay panel and you'll get the chance to share your opinions on IT business topics through brief surveys -- and earn cash http://www.techsay.com/default.php?page=join.phpp=sourceforgeCID=DEVDEV ___ Numpy-discussion mailing list Numpy-discussion@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/numpy-discussion
Re: [Numpy-discussion] please change mean to use dtype=float
Robert Kern wrote: Sebastian Haase wrote: I know that having too much knowledge of the details often makes one forget what the newcomers will do and expect. Please be more careful with such accusations. Repeated frequently, they can become quite insulting. I did not mean to insult anyone - what I meant was, that I'm for numpy becoming an easy platform to use. I have spend and enjoyed part the last four years developing and evangelizing Python as an alternative to Matlab and C/Fortran based image analysis environment. I often find myself arguing for good support of the single precision data format. So I find it actually somewhat ironic to see myself arguing now for wanting float64 over float32 ;-) We are only talking about people that will a) work with single-precision data (e.g. large scale-image analysis) and who b) will tend to just use the default (dtype) --- How else can I say this: these people will just assume that arr.mean() *is* the mean of arr. I don't understand what you mean, here. arr.mean() is almost never *the* mean of arr. Double precision can't fix that. This was not supposed to be a scientific statement -- I'm (again) thinking of our students that not always appreciate the full complexity of computational numerics and data types and such. The best I can hope for is a sound default for most (practical) cases... I still think that 80bit vs. 128bit vs 96bit is rather academic for most people ... most people seem to only use float64 and then there are some that use float32 (like us) ... Cheers, Sebastian - Take Surveys. Earn Cash. Influence the Future of IT Join SourceForge.net's Techsay panel and you'll get the chance to share your opinions on IT business topics through brief surveys -- and earn cash http://www.techsay.com/default.php?page=join.phpp=sourceforgeCID=DEVDEV ___ Numpy-discussion mailing list Numpy-discussion@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/numpy-discussion