Re: [Numpy-discussion] Modulus (remainder) function corner cases
2016-02-13 17:42 GMT+01:00 Charles R Harris: > The Fortran modulo function, which is the same basic function as in my >> branch, does not specify any bounds on the result for floating numbers, but >> gives only the formula, modulus(a, b) = a - b*floor(a/b), which has the >> advantage of being simple and well defined ;) >> > > In the light of the libm-discussion I spent some time looking at floating point functions and their accuracy. I would vote in favor of keeping an implementation that uses the fmod-function of the system library and bends it to adhere to the python convention (sign of divisor). There is probably a reason why the fmod-implementation is not as simple as "a - b*floor(a/b)" [1]. One obvious problem with the simple expression arises when a/b = 0.0 in floating point. E.g. In [43]: np.__version__ Out[43]: '1.10.4' In [44]: x = np.float64(1e-320) In [45]: y = np.float64(-1e10) In [46]: x % y # this uses libm's fmod on my system Out[46]: -100.0 # == y, correctly rounded result in round-to-nearest In [47]: x - y*np.floor(x/y) # this here is the naive expression Out[47]: 9.9998886718268301e-321 # == x, wrong sign There are other problems (a/b = inf in floating point). As I do not understand the implementation of fmod (for example in openlibm) in detail I cannot give a full list of corner cases. Unfortunately, I did not follow the (many different) bug reports on this issue originally and am confused why there was a need to change the implementation in the first place. numpy's "%" operator seems to work quite well on my system. Therefore, this mail may be rather unproductive as I am missing some information. Concerning your original question: Many elementary functions loose their mathematical properties when they are calculated correctly-rounded in floating point numbers [2]. We do not fix this for other functions, I would not fix it here. Cheers Nils [1] https://github.com/JuliaLang/openlibm/blob/master/src/e_fmod.c [2] np.exp(np.nextafter(1.0, 0.0)) < np.e -> False (Monotonicity lost in exp(x)). ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org https://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Linking other libm-Implementation
2016-02-09 18:02 GMT+01:00 Gregor Thalhammer: >> It is not suitable as a standard for numpy. > > Why should numpy not provide fast transcendental math functions? For linear algebra it supports fast implementations, even non-free (MKL). Wouldn’t it be nice if numpy outperforms C? Floating point operations that make use of vector extensions of modern processors may behave subtly different. This especially concerns floating point exceptions, where sometimes silent infinities are generated instead of raising a divide-by-zero exception (best description I could find on the spot: https://randomascii.wordpress.com/2012/04/21/exceptional-floating-point/, also see the notes on C99-compliance of the new vector expressions in glibc: https://sourceware.org/glibc/wiki/libmvec). I think the default in numpy should strive to be mostly standard compliant. But of course an option to activate vector math operations would be nice - if that is necessary with packages like numexpr is another question. One other point is the extended/long double type which is normally not supported by those libraries (as vector extensions cannot handle them). > Intel publishes accuracy/performance charts for VML/MKL: > https://software.intel.com/sites/products/documentation/doclib/mkl/vm/functions/_accuracyall.html > > For GNU libc it is more difficult to find similarly precise data, I only could find: > http://www.gnu.org/software/libc/manual/html_node/Errors-in-Math-Functions.html On Tue, Feb 9, 2016 at 7:06 AM, Daπid wrote: > I did some digging, and I found this: > > http://julia-programming-language.2336112.n4.nabble.com/Is-the-accuracy-of-Julia-s-elementary-functions-exp-sin-known-td32736.html Thank you for looking that up! I did not knew about the stuff published by Intel yet. 2016-02-09 20:13 GMT+01:00 Matthew Brett : > So GNU libm has max error <= 0.5 ULP, openlibm has <= 1 ULP, and OSX > is (almost always) somewhere in-between. > > So, is <= 1 ULP good enough? Calculating transcendental functions correctly rounded is very, very hard and to my knowledge there is no complete libm implementation that guarantees the necessary accuracy for all possible inputs. One effort was/is the Correctly Rounded LibM (crlibm [1]) which tried to prove the accuracy of their algorithms. However, the performance impact to achieve that last ulp in all rounding modes can be severe. Assessing accuracy of a function implementation is hard. Testing all possible inputs is not feasible (2^32/64 for single/double) and proving accuracy bounds may be even harder. Most of the time one samples accuracy with random numbers from a certain range. This generates tables like the ones for GNU libm or Intel. This is a kind of "faithful" accuracy as you believe that the accuracy you tested on a sample extends to the whole argument range. The error in worst case may be (much) bigger. That being said, I believe the values given by GNU libm for example are very trustworthy. libm is not always correctly rounded (which would be <= 0.5ulp in round-to-nearest), however, the error bounds given in the table seem to cover all worst cases. Common single-argument functions (sin, cos) are correctly rounded and even complex two-argument functions (cpow) are at most 5ulp off. I do not think that other implementations are more accurate. So libm is definitely good enough, accuracy-wise. In any case I would like to build a testing framework to compare some libms and check accuracy/performance (at least Intel has a history of underestimating their error bounds in transcendental functions [2]). crlibm offers worst-case arguments for some functions which could be used to complement randomized sampling. Maybe I have some time in the next weeks... [1] http://lipforge.ens-lyon.fr/www/crlibm/ [2] https://randomascii.wordpress.com/2014/10/09/intel-underestimates-error-bounds-by-1-3-quintillion/ ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org https://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Linking other libm-Implementation
2016-02-08 18:54 GMT+01:00 Julian Taylor: > which version of glibm was used here? There are significant difference > in performance between versions. > Also the input ranges are very important for these functions, depending > on input the speed of these functions can vary by factors of 1000. > > glibm now includes vectorized versions of most math functions, does > openlibm have vectorized math? > Thats where most speed can be gained, a lot more than 25%. glibc 2.22 was used running on archlinux. As far as I know openlibm does not include special vectorized functions. (for reference vectorized operations in glibc: https://sourceware.org/glibc/wiki/libmvec). 2016-02-08 23:32 GMT+01:00 Gregor Thalhammer : > Years ago I made the vectorized math functions from Intels Vector Math > Library (VML), part of MKL, available for numpy, see > https://github.com/geggo/uvml > Not particularly difficult, you not even have to change numpy. For some > cases (e.g., exp) I have seen speedups up to 5x-10x. Unfortunately MKL is > not free, and free vector math libraries like yeppp implement much fewer > functions or do not support the required strided memory layout. But to > improve performance, numexpr, numba or theano are much better. > > Gregor > > Thank you very much for the link! I did not know about numpy.set_numeric_ops. You are right, vectorized operations can push down calculation time per element by factors. The benchmarks done for the yeppp-project also indicate that (however far you would trust them: http://www.yeppp.info/benchmarks.html). But I would agree that this domain should be left to specialized tools like numexpr as fully exploiting the speedup depends on the expression, that should be calculated. It is not suitable as a standard for numpy. Still, I think it would be good to give the possibility to choose the libm numpy links against. And be it simply to allow to choose or guarantee a specific accuracy/performance on different platforms and systems. Maybe maintaining a de-facto libm in npy_math could be replaced with a dependency on e.g. openlibm. But such a decision would require a thorough benchmark/testing of the available solutions. Especially with respect to the accuracy-performance-tradeoff that was mentioned. Cheers Nils ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org https://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Linking other libm-Implementation
> The npy_math functions are used if otherwise unavailable OR if someone > has at some point noticed that say glibc 2.4-2.10 has a bad quality > tan (or whatever) and added a special case hack that checks for those > particular library versions and uses our built-in version instead. > It's not the most convenient setup to maintain, so there's been some > discussion of trying openlibm instead [1], but AFAIK you're the first > person to find the time to actually sit down and try doing it :-). > > You should be able to tell what math library you're linked to by > running ldd (on linux) or otool (on OS X) against the .so / .dylib > files inside your built copy of numpy -- e.g. > > ldd numpy/core/umath.cpython-34m.so > > (exact filename and command will vary depending on python version and > platform). > > -n > > [1] > https://github.com/numpy/numpy/search?q=openlibm=Issues=%E2%9C%93 > > Ok, I with a little help from someone, at least I got it to work somehow. Apparently linking to openlibm is not a problem, MATHLIB=openlibm does the job. The resulting .so-files are linked to openlibm AND libm. I do not know why, maybe you would have to call gcc with -nostdlib and explicitly include everything you need. When running such a build of numpy, however, only the functions in libm are called. What did the job was to export LD_PRELOAD=/usr/lib/libopenlibm.so. In that case the functions from openlibm are used. This works with any build of numpy and needs no rebuilding. Of course its hacky and not a solution but at the moment it seems by far the easiest way to use a different libm implementation. This does also work with intels libimf. It does not work with amdlibm as they use the prefix amd_ in function names which would require real changes to the build system. Very superficial benchmarks (see below) seem devastating for gnu libm. It seems that openlibm (compiled with gcc -mtune=native -O3) performs really well and intels libm implementation is the best (on my intel CPU). I did not check the accuracy of the functions, though. My own code uses a lot of trigonometric and complex functions (optics calculations). I'd guess it could go 25% faster by just using a better libm implementation. Therefore, I have an interest in getting sane linking to a defined libm implementation to work. Apparently openlibm seems quite a good choice for numpy, at least performance wise. However, I did not find any documentation or tests of the accuracy of its functions. A benchmarking and testing (for accuracy) code for libms would probably be a good starting point for a discussion. I could maybe help with that - but apparently not with any linking/building stuff (I just don't get it). Benchmark: gnu libm.so 3000 x sin(double[10]): 6.68215647800389 s 3000 x log(double[10]): 8.86350397899514 s 3000 x exp(double[10]): 6.560557693999726 s openlibm.so 3000 x sin(double[10]): 4.5058218560006935 s 3000 x log(double[10]): 4.106520485998772 s 3000 x exp(double[10]): 4.597905882001214 s Intel libimf.so 3000 x sin(double[10]): 4.282402812998043 s 3000 x log(double[10]): 4.008453270995233 s 3000 x exp(double[10]): 3.30127963848 s ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org https://mail.scipy.org/mailman/listinfo/numpy-discussion
[Numpy-discussion] Linking other libm-Implementation
Hi all, I wanted to know if there is any sane way to build numpy while linking to a different implementation of libm? A drop-in replacement for libm (e.g. openlibm) should in principle work, I guess, but I did not manage to actually make it work. As far as I understand the build code, setting MATHLIB=openlibm should suffice, but it did not. The build works fine, but in the end when running numpy apparently the functions of the system libm.so are used. I could not verify this directly (as I do not know how) but noticed that there is no performance difference between the builds - while there is one with pure C programs linked against libm and openlibm. Using amdlibm would require some work as the functions are prefixed with "_amd", I guess? Using intels libimf should work when using intels compiler, but I did not try this. With gcc I did not get it to work. A quite general question: At the moment the performance and the accuracy of the base mathematical functions depends on the platform and libm-Implementation of the system. Although there are functions defined in npy_math, they are only used as fall-backs, if they are not provided by a library. (correct me if I am wrong here) Is there some plan to change this in the future and provide defined behaviour (specified accuracy and/or speed) across platforms? As I understood it Julia started openlibm for this reason (which is based on fdlibm/msun, same as npy_math). Cheers Nils ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org https://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] method to calculate the magnitude squared
Hey, I use complex numbers a lot and obviously need the modulus a lot. However, I am not sure if we need a special function for _performance_ reasons. At 10:01 AM 9/20/2015, you wrote: It is, but since that involves taking sqrt, it is *much* slower. Even now, ``` In [32]: r = np.arange(1)*(1+1j) In [33]: %timeit np.abs(r)**2 1000 loops, best of 3: 213 µs per loop In [34]: %timeit r.real**2 + r.imag**2 1 loops, best of 3: 47.5 µs per loop This benchmark is not quite fair as the first example needs a python function call and the second doesn't. If you benchmark a modulus function against np.abs(x)**2 the performance gain is ca. 30% on my machine. This means that for such a basic operation most of the time is spent in the function call. In my opinion if you want to have speed you write the modulus explicitly in your expression (3-4x speedup on my machine). If you don't need speed you can afford the function call (be it to abs2 or to abs). By not providing abs2 in numpy, however, people do not loose out on a lot of performance... There may be reasons to provide abs2 related to accuracy. If people (for not knowing it better) use np.abs(x)**2 they lose significant digits I think (may be wrong on that...). I did not look into it, though. Cheers Nils ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org https://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] FFTS for numpy's FFTs (was: Re: Choosing between NumPy and SciPy functions)
I think that numpy.fft should be left there in its current state (although perhaps as deprecated). Now scipy.fft should have a good generic algorithm as default, and easily allow for different implementations to be accessed through the same interface. I also agree with the above. But I want to add that in this case it would be wise to include some (sophisticated) testing suite to ensure that all possible libraries implement the DFT with high accuracy. numpy's fftpack (or scipy's) has the advantage that it is old and well tested. FFTW also provides extensive benchmarks of speed and accuracy. Other libraries do not. Most users just want an fft function that works and not bother with details like numerical accuracy. When I encountered such an issue (https://github.com/hgomersall/pyFFTW/issues/51) it took me really a long time to track it down to the fft function. One remark to FFTS: does it implement double precision yet? The corresponding issue (https://github.com/anthonix/ffts/issues/24) seems to be still open but I did not look into it. If it does not it is not suited as a fftpack replacement (I hope I did not overlook some comment about that in the thread). Cheers Nils PS: although I am a long time user of numpy, I am fairly new to the list. So hello! ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Dropping support for, Accelerate/veclib?
I think for Scipy homebrew uses the Gfortran ABI: https://trac.macports.org/browser/trunk/dports/python/py-scipy/Portfile fwiw, homebrew is not macports. it's a more recent replacement that seems to be taking over gradually. ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] histogram2d and histogramdd return counts as floats while histogram returns ints
is this intended? np.histogramdd([[1,2],[3,4]],bins=2) (array([[ 1., 0.], [ 0., 1.]]), [array([ 1. , 1.5, 2. ]), array([ 3. , 3.5, 4. ])]) np.histogram2d([1,2],[3,4],bins=2) (array([[ 1., 0.], [ 0., 1.]]), array([ 1. , 1.5, 2. ]), array([ 3. , 3.5, 4. ])) np.histogram([1,2],bins=2) (array([1, 1]), array([ 1. , 1.5, 2. ])) maybe i should have been more explicit. what i meant to say is that 1. the counts in a histogram are integers. whenever no normalization is used i would expect that i get an integer array when i call a histogram function. 2. now it might be intended that the data type is always the same so that float has to be used to accomodate the normalized histograms. 3. in any case, the 1d histogram function handles this differently from the 2d and dd ones. this seems inconsistent and might be considered a bug. n. ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
[Numpy-discussion] histogram2d and histogramdd return counts as floats while histogram returns ints
hi, is this intended? np.histogramdd([[1,2],[3,4]],bins=2) (array([[ 1., 0.], [ 0., 1.]]), [array([ 1. , 1.5, 2. ]), array([ 3. , 3.5, 4. ])]) np.histogram2d([1,2],[3,4],bins=2) (array([[ 1., 0.], [ 0., 1.]]), array([ 1. , 1.5, 2. ]), array([ 3. , 3.5, 4. ])) np.histogram([1,2],bins=2) (array([1, 1]), array([ 1. , 1.5, 2. ])) n. ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] fast numpy i/o
Hi, Finally, the former Scientific.IO NetCDF interface is now part of scipy.io, but I assume it only supports netCDF 3 (the documentation is not specific about that). This might be the easiest option for a portable data format (if Matlab supports it). Yes, it is NetCDF 3. In recent versions of Scientific, NetCDF 4 can be written and read, see http://dirac.cnrs-orleans.fr/ScientificPython/ScientificPythonManual/Scientific.IO.NetCDF.NetCDFFile-class.html N. ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] np.histogramdd of empty data
Hi Ralf, I cloned numpy/master and played around a little. when giving the bins explicitely, now histogram2d and histogramdd work as expected in all tests i tried. However, some of the cases with missing bin specification appear somewhat inconsistent. The first question is if creating arbitrary bins for empty data and empty bin specification is better than raising an Exception: Specifically: numpy.histogram2d([],[],bins=[0,0]) (array([ 0., 0.]), array([ 0.]), array([ 0.])) numpy.histogram([],bins=0) ValueError: zero-size array to minimum.reduce without identity so 1-d and 2-d behave not quite the same. also, these work (although with arbitrary bin edges): numpy.histogram2d([],[],bins=[1,1]) (array([ 0., 0.]), array([ 0., 1.]), array([ 0., 1.])) numpy.histogram2d([],[],bins=[0,1]) (array([ 0., 0.]), array([ 0.]), array([ 0., 1.])) while this raises an error: numpy.histogram([],bins=1) ValueError: zero-size array to minimum.reduce without identity another thing with non-empty data: numpy.histogram([1],bins=1) (array([1]), array([ 0.5, 1.5])) numpy.histogram([1],bins=0) (array([], dtype=int64), array([ 0.5])) while numpy.histogram2d([1],[1],bins=A) ValueError: zero-size array to minimum.reduce without identity (here A==[0,0] or A==[0,1] but not A==[1,1] which gives a result) Nils ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
[Numpy-discussion] np.histogramdd of empty data
Hi, I was wondering why histogram2d and histogramdd raise a ValueError when fed with empty data of the correct dimensions. I came across this as a corner case when calling histogram2d from my own specialized histogram function. In comparison, histogram does handle this case correctly when bins are specified explicitely (obviously automatic binning cannot work without data...) np.histogram([], bins=([0,1])) (array([0]), array([0, 1])) np.histogram2d([],[], bins=([0,1],[0,1])) ValueError: zero-size array to ufunc.reduce without identity np.histogramdd([[],[]], bins=([0,1],[0,1])) ValueError: zero-size array to ufunc.reduce without identity cheers, nils ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
[Numpy-discussion] indexing of rank-0 structured arrays: why not?
Hi, I noticed that I can index into a dtype when I take an element of a rank-1 array but not if I make a rank-0 array directly. This seems inconsistent. A bug? Nils In [76]: np.version.version Out[76]: '1.5.1' In [78]: dt = np.dtype([('x', 'f8'), ('y', 'f8')]) In [80]: a_rank_1 = np.zeros((1,), dtype=dt) In [81]: a_rank_0 = np.zeros((), dtype=dt) In [83]: a_rank_1[0] Out[83]: (0.0, 0.0) In [84]: a_rank_1[0] == a_rank_0 Out[84]: True In [85]: a_rank_1[0][0] Out[85]: 0.0 In [86]: a_rank_0[0] --- IndexErrorTraceback (most recent call last) /home/pabra/ramp/testramp/ipython console in module() IndexError: 0-d arrays can't be indexed In [87]: a_rank_0['x'] Out[87]: array(0.0) ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] indexing of rank-0 structured arrays: why not?
Robert, your answer does work: after indexing with () I can then further index into the datatype. In [115]: a_rank_0[()][0] Out[115]: 0.0 I guess I just found the fact confusing that a_rank_1[0] and a_rank_0 compare and print equal but behave differently under indexing. More precisely if I do In [117]: b = a_rank_1[0] then In [118]: b.shape Out[118]: () and In [120]: a_rank_0 == b Out[120]: True but In [119]: b[0] Out[119]: 0.0 works but a_rank_0[0] doesn't. I thought b is a rank-0 array which it apparently is not since it can be indexed. So maybe b[0] should fail for consistency? N. ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
[Numpy-discussion] truth value of dtypes
Hi, why is bool(np.dtype(np.float)) False ? I came across this when using this python idiom: def f(dtype=None): if not dtype: print 'using default dtype' If there is no good reason to have a False truth value, I would vote for making it True since that is what one would expect (no?) N. ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Schedule for 1.5.1?
Hi, what about the normed=True bug in numpy.histogram? It was discussed here a while ago, and fixed (although i did not find it on the tracker), but the message aanlktikwbsrxq0ynf3u3jo3ekrikszmwqy30pnsfg...@mail.gmail.com suggests it just missed 1.5.0? I don't have 1.5 installed, so I can't check right now. sorry to mention my personal pet bug, but maybe it's also a candidate? Nils ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] numpy histogram normed=True (bug / confusing behavior)
I think (a corrected) density histogram is core functionality for unequal bin lengths. The graph with raw count in the case of unequal bin sizes would be quite misleading when plotted and interpreted on the real line and not on discrete points (shaded areas instead of vertical lines). And as the origin of this thread showed, it's not trivial to figure out what the correct normalization is. So, I think, if we drop the density normalization, we just need a new function that does it. My 2c, Josef (Not a dev, but ) I agree with Josef. Throwing out normalization completely would in addition have to destroy backwards compatibility. It's another question whether the frequency ( not density! ) normalization is so simple that it is better left out, since everyone can be expected to be able to do a raw histogram and divide by the number of data points without a headache. Nils ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] numpy histogram normed=True (bug / confusing behavior)
On Sat, Aug 28, 2010 at 04:12, Zbyszek Szmek zbys...@in.waw.pl wrote: Hi, On Fri, Aug 27, 2010 at 06:43:26PM -0600, Charles R Harris wrote: ? ?On Fri, Aug 27, 2010 at 2:47 PM, Robert Kern robert.k...@gmail.com ? ?wrote: ? ? ?On Fri, Aug 27, 2010 at 15:32, David Huard david.hu...@gmail.com ? ? ?wrote: ? ? ? Nils and Joseph, ? ? ? Thanks for the bug report, this is now fixed in SVN (r8672). ? ? ?While we're at it, can we change the name of the argument? normed ? ? ?has caused so much confusion over the years. We could deprecate ? ? ?normed=True in favor of pdf=True or density=True. I think it might be a good moment to also include a different type of normalization: ? ? ? n = n / n.sum() i.e. the frequency of counts in each bin. This one is of course very simple to calculate by hand, but very common. I think it would be useful to have this normalization available too. [http://www.itl.nist.gov/div898/handbook/eda/section3/histogra.htm] My feeling is that this is trivial to do by hand. I do not see a reason to add an option to histogram() to do this. Hi, +1 for not silently changing the behavior of normed=True. (I'm one of the people who have worked around it). One argument in favor of putting both normalizing styles 'frequency' and 'density' may be that the documentation will automatically become very clear. A user sees all options and there is little chance of a misunderstanding. Of course, a sentence like If you want frequency normalization, use histogram(data, normalized=False)/sum(data) would also make things clear, without adding the frequency option. Nils ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
[Numpy-discussion] numpy histogram normed=True (bug / confusing behavior)
Hi, I found what looks like a bug in histogram, when the option normed=True is used together with non-uniform bins. Consider this example: import numpy as np data = np.array([1, 2, 3, 4]) bins = np.array([.5, 1.5, 4.5]) bin_widths = np.diff(bins) (counts, dummy) = np.histogram(data, bins) (densities, dummy) = np.histogram(data, bins, normed=True) What this gives is: bin_widths array([ 1., 3.]) counts array([1, 3]) densities array([ 0.1, 0.3]) The documentation claims that histogram with normed=True gives a density, which integrates to 1. In this example, it is true that (densities * bin_widths).sum() is 1. However, clearly the data are equally spaced, so their density should be uniform and equal to 0.25. Note that (0.25 * bin_widths).sum() is also 1. I believe np.histogram(data, bins, normed=True) effectively does : np.histogram(data, bins, normed=False) / (bins[-1] - bins[0]). However, it _should_ do np.histogram(data, bins, normed=False) / bins_widths to get a true density over the data coordinate as a result. It's easy to fix by hand, but I think the documentation is at least misleading?! sorry if this has been discussed before; I did not find it anyway (numpy 1.3) ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] numpy histogram normed=True (bug / confusing behavior)
Hi again, first a correction: I posted I believe np.histogram(data, bins, normed=True) effectively does : np.histogram(data, bins, normed=False) / (bins[-1] - bins[0]). However, it _should_ do np.histogram(data, bins, normed=False) / bins_widths but there is a normalization missing; it should read I believe np.histogram(data, bins, normed=True) effectively does np.histogram(data, bins, normed=False) / (bins[-1] - bins[0]) / data.sum() However, it _should_ do np.histogram(data, bins, normed=False) / bins_widths / data.sum() Bruce Southey replied: As I recall, there as issues with this aspect. Please search the discussion regarding histogram especially David Huard's reply in this thread: http://thread.gmane.org/gmane.comp.python.numeric.general/22445 I think this discussion pertains to a switch in calling conventions which happened at the time. The last reply of D. Huard (to me) seems to say that they did not fix anything in the _old_ semantics, but that the new semantics is expected to work properly. I tried with an infinite bin: counts, dmy = np.histogram([1,2,3,4], [0.5,1.5,np.inf]) counts array([1,3]) ncounts, dmy = np.histogram([1,2,3,4], [0.5,1.5,np.inf], normed=1) ncounts array([0.,0.]) this also does not make a lot of sense to me. A better result would be array([0.25, 0.]), since 25% of the points fall in the first bin; 75% fall in the second but are spread out over an infinite interval, giving 0. This is what my second proposal would give. I cannot find anything wrong with it so far... Cheers, Nils ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion