Re: [Numpy-discussion] Modulus (remainder) function corner cases

2016-02-14 Thread Nils Becker
2016-02-13 17:42 GMT+01:00 Charles R Harris:

> The Fortran modulo function, which is the same basic function as in my
>> branch, does not specify any bounds on the result for floating numbers, but
>> gives only the formula,  modulus(a, b) = a - b*floor(a/b), which has the
>> advantage of being simple and well defined ;)
>>
>
>
In light of the libm discussion I spent some time looking at floating-point
functions and their accuracy. I would vote in favor of keeping an
implementation that uses the fmod function of the system library and bends
it to adhere to the Python convention (sign of the divisor). There is probably
a reason why the fmod implementation is not as simple as "a - b*floor(a/b)"
[1].

One obvious problem with the simple expression arises when a/b underflows
to 0.0 in floating point, e.g.:

In [43]: np.__version__
Out[43]: '1.10.4'
In [44]: x = np.float64(1e-320)
In [45]: y = np.float64(-1e10)
In [46]: x % y # this uses libm's fmod on my system
Out[46]: -10000000000.0  # == y, correctly rounded result in round-to-nearest
In [47]: x - y*np.floor(x/y) # this here is the naive expression
Out[47]: 9.9998886718268301e-321  # == x, wrong sign

There are other problems (a/b = inf in floating point). As I do not
understand the implementation of fmod (for example in openlibm) in detail I
cannot give a full list of corner cases.
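
To make the "bending" concrete, here is a minimal sketch (my own
illustration, not numpy's or the branch's actual code) of how fmod can be
adjusted to Python's sign-of-the-divisor convention, together with the
corner case from above:

import math

def pymod(a, b):
    # Sketch only: fmod returns a result with the sign of the dividend;
    # shift it by the divisor when the signs differ, so the result takes
    # the sign of b, as Python's % does.
    r = math.fmod(a, b)
    if r != 0.0 and (r < 0.0) != (b < 0.0):
        r += b
    return r

# The corner case from above: the naive formula returns x with the wrong
# sign, while the fmod-based version agrees with Python's % operator.
x, y = 1e-320, -1e10
print("%r  %r" % (pymod(x, y), x % y))     # both -10000000000.0
print("%r" % (x - y * math.floor(x / y)))  # ~1e-320, wrong sign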

Unfortunately, I did not follow the (many different) bug reports on this
issue originally and am confused about why there was a need to change the
implementation in the first place; numpy's "%" operator seems to work quite
well on my system. Therefore, this mail may be rather unproductive, as I am
missing some information.

Concerning your original question: many elementary functions lose
mathematical properties (such as strict monotonicity) when they are
calculated, even correctly rounded, in floating-point numbers [2]. We do not
fix this for other functions, and I would not fix it here.

Cheers
Nils

[1] https://github.com/JuliaLang/openlibm/blob/master/src/e_fmod.c
[2] np.exp(np.nextafter(1.0, 0.0)) < np.e -> False (Monotonicity lost in
exp(x)).


Re: [Numpy-discussion] Linking other libm-Implementation

2016-02-10 Thread Nils Becker
2016-02-09 18:02 GMT+01:00 Gregor Thalhammer:
>> It is not suitable as a standard for numpy.
>
> Why should numpy not provide fast transcendental math functions? For
> linear algebra it supports fast implementations, even non-free (MKL).
> Wouldn’t it be nice if numpy outperforms C?

Floating point operations that make use of the vector extensions of modern
processors may behave subtly differently. This especially concerns floating
point exceptions, where sometimes silent infinities are generated instead
of raising a divide-by-zero exception (best description I could find on the
spot:
https://randomascii.wordpress.com/2012/04/21/exceptional-floating-point/,
also see the notes on C99 compliance of the new vectorized functions in
glibc: https://sourceware.org/glibc/wiki/libmvec).
I think the default in numpy should strive to be mostly standard compliant.
But of course an option to activate vectorized math operations would be nice
- whether that is necessary given packages like numexpr is another question.
One other point is the extended/long double type, which is normally not
supported by those libraries (as the vector extensions cannot handle it).
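
As a small illustration of what "standard compliant" buys here, numpy's own
error-state machinery can make the divide-by-zero case visible instead of
silently producing inf (a minimal check of the ordinary ufunc path only,
unrelated to libmvec itself):

import numpy as np

# With divide-by-zero configured to raise, the exception is reported rather
# than silently returning inf.
with np.errstate(divide='raise'):
    try:
        np.ones(4) / np.zeros(4)
    except FloatingPointError as exc:
        print("divide-by-zero detected: %s" % exc)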

> Intel publishes accuracy/performance charts for VML/MKL:
> https://software.intel.com/sites/products/documentation/doclib/mkl/vm/functions/_accuracyall.html
>
> For GNU libc it is more difficult to find similarly precise data, I only
> could find:
> http://www.gnu.org/software/libc/manual/html_node/Errors-in-Math-Functions.html

On Tue, Feb 9, 2016 at 7:06 AM, Daπid wrote:
> I did some digging, and I found this:
>
> http://julia-programming-language.2336112.n4.nabble.com/Is-the-accuracy-of-Julia-s-elementary-functions-exp-sin-known-td32736.html

Thank you for looking that up! I did not know about the material published
by Intel yet.

2016-02-09 20:13 GMT+01:00 Matthew Brett:
> So GNU libm has max error <= 0.5 ULP, openlibm has <= 1 ULP, and OSX
> is (almost always) somewhere in-between.
>
> So, is <= 1 ULP good enough?

Calculating transcendental functions correctly rounded is very, very hard,
and to my knowledge there is no complete libm implementation that
guarantees the necessary accuracy for all possible inputs. One effort
was/is the Correctly Rounded LibM (crlibm [1]), which tried to prove the
accuracy of its algorithms. However, the performance impact of achieving
that last ulp in all rounding modes can be severe.
Assessing the accuracy of a function implementation is hard. Testing all
possible inputs is not feasible (2^32 values for single precision, 2^64 for
double) and proving accuracy bounds may be even harder.
Most of the time one samples the accuracy with random numbers from a certain
range. This is what generates tables like the ones for GNU libm or Intel.
It is a kind of "faithful" accuracy, in the sense that you trust that the
accuracy you measured on a sample extends to the whole argument range. The
worst-case error may be (much) bigger.

That being said, I believe the values given for GNU libm, for example, are
very trustworthy. libm is not always correctly rounded (which would be <=
0.5 ulp in round-to-nearest); however, the error bounds given in the table
seem to cover the worst cases.
Common single-argument functions (sin, cos) are correctly rounded, and even
complex two-argument functions (cpow) are at most 5 ulp off. I do not think
that other implementations are more accurate.
So libm is definitely good enough, accuracy-wise.

In any case, I would like to build a testing framework to compare some libms
and check accuracy/performance (Intel, at least, has a history of
underestimating their error bounds in transcendental functions [2]). crlibm
offers worst-case arguments for some functions, which could be used to
complement randomized sampling. Maybe I have some time in the next weeks...
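
To make the sampling idea concrete, a minimal sketch of such a randomized
ulp check (my own illustration; the reference here is the decimal module at
50 digits, and a real framework would also need crlibm's worst-case inputs):

import numpy as np
from decimal import Decimal, getcontext

getcontext().prec = 50   # high-precision reference via the decimal module

def exp_ulp_error(x):
    # Error of np.exp at x in units in the last place (ulp), measured against
    # a 50-digit reference rounded back to double. This quantizes the error,
    # so it is a rough sampling estimate, not a rigorous bound.
    got = np.exp(np.float64(x))
    ref = float(Decimal(float(x)).exp())
    return abs(got - ref) / np.spacing(got)

# Randomized sampling over a range, in the spirit of the glibc/Intel tables.
samples = np.random.RandomState(0).uniform(-700.0, 700.0, size=1000)
print("max observed error: %.2f ulp" % max(exp_ulp_error(x) for x in samples))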

[1] http://lipforge.ens-lyon.fr/www/crlibm/
[2]
https://randomascii.wordpress.com/2014/10/09/intel-underestimates-error-bounds-by-1-3-quintillion/


Re: [Numpy-discussion] Linking other libm-Implementation

2016-02-09 Thread Nils Becker
2016-02-08 18:54 GMT+01:00 Julian Taylor:
> which version of glibm was used here? There are significant difference
> in performance between versions.
> Also the input ranges are very important for these functions, depending
> on input the speed of these functions can vary by factors of 1000.
>
> glibm now includes vectorized versions of most math functions, does
> openlibm have vectorized math?
> Thats where most speed can be gained, a lot more than 25%.

glibc 2.22 was used, running on Arch Linux. As far as I know, openlibm does
not include special vectorized functions (for reference, the vectorized
operations in glibc: https://sourceware.org/glibc/wiki/libmvec).

2016-02-08 23:32 GMT+01:00 Gregor Thalhammer:

> Years ago I made the vectorized math functions from Intels Vector Math
> Library (VML), part of MKL, available for numpy, see
> https://github.com/geggo/uvml
> Not particularly difficult, you not even have to change numpy. For some
> cases (e.g., exp) I have seen speedups up to 5x-10x. Unfortunately MKL is
> not free, and free vector math libraries like yeppp implement much fewer
> functions or do not support the required strided memory layout. But to
> improve performance, numexpr, numba or theano are much better.
>
> Gregor
>
>
Thank you very much for the link! I did not know about
numpy.set_numeric_ops.
You are right, vectorized operations can push down the calculation time per
element by factors. The benchmarks done for the yeppp project also indicate
that (however far you trust them: http://www.yeppp.info/benchmarks.html).
But I would agree that this domain should be left to specialized tools like
numexpr, as fully exploiting the speedup depends on the expression that is
being evaluated. It is not suitable as a standard for numpy.

Still, I think it would be good to offer the possibility to choose the libm
numpy links against, if only to allow choosing or guaranteeing a specific
accuracy/performance on different platforms and systems.
Maybe maintaining a de-facto libm in npy_math could be replaced with a
dependency on e.g. openlibm. But such a decision would require thorough
benchmarking/testing of the available solutions, especially with respect to
the accuracy-performance trade-off that was mentioned.

Cheers
Nils


Re: [Numpy-discussion] Linking other libm-Implementation

2016-02-08 Thread Nils Becker
> The npy_math functions are used if otherwise unavailable OR if someone
> has at some point noticed that say glibc 2.4-2.10 has a bad quality
> tan (or whatever) and added a special case hack that checks for those
> particular library versions and uses our built-in version instead.
> It's not the most convenient setup to maintain, so there's been some
> discussion of trying openlibm instead [1], but AFAIK you're the first
> person to find the time to actually sit down and try doing it :-).
>
> You should be able to tell what math library you're linked to by
> running ldd (on linux) or otool (on OS X) against the .so / .dylib
> files inside your built copy of numpy -- e.g.
>
>   ldd numpy/core/umath.cpython-34m.so
>
> (exact filename and command will vary depending on python version and
> platform).
>
> -n
>
> [1]
> https://github.com/numpy/numpy/search?q=openlibm&type=Issues&utf8=%E2%9C%93
>
>
OK, with a little help from someone, I at least got it to work somehow.
Apparently linking to openlibm is not a problem; MATHLIB=openlibm does the
job. The resulting .so files are linked to openlibm AND libm. I do not know
why; maybe you would have to call gcc with -nostdlib and explicitly include
everything you need.
When running such a build of numpy, however, only the functions in libm are
called.

What did the job was to export LD_PRELOAD=/usr/lib/libopenlibm.so. In that
case the functions from openlibm are used. This works with any build of
numpy and needs no rebuilding. Of course it's hacky and not a solution, but
at the moment it seems by far the easiest way to use a different libm
implementation. This also works with Intel's libimf. It does not work
with amdlibm, as they use the prefix amd_ in function names, which would
require real changes to the build system.

Very superficial benchmarks (see below) look devastating for GNU libm. It
seems that openlibm (compiled with gcc -mtune=native -O3) performs really
well, and Intel's libm implementation is the best (on my Intel CPU). I did
not check the accuracy of the functions, though.

My own code uses a lot of trigonometric and complex functions (optics
calculations). I'd guess it could go 25% faster by just using a better libm
implementation. Therefore, I have an interest in getting sane linking to a
defined libm implementation to work.

Apparently openlibm seems quite a good choice for numpy, at least
performance-wise. However, I did not find any documentation or tests of the
accuracy of its functions. A benchmarking and accuracy-testing code for
libms would probably be a good starting point for a discussion. I could
maybe help with that - but apparently not with any linking/building stuff
(I just don't get it).

Benchmark:

gnu libm.so
3000 x sin(double[10]):  6.68215647800389 s
3000 x log(double[10]):  8.86350397899514 s
3000 x exp(double[10]):  6.560557693999726 s

openlibm.so
3000 x sin(double[10]):  4.5058218560006935 s
3000 x log(double[10]):  4.106520485998772 s
3000 x exp(double[10]):  4.597905882001214 s

Intel libimf.so
3000 x sin(double[10]):  4.282402812998043 s
3000 x log(double[10]):  4.008453270995233 s
3000 x exp(double[10]):  3.30127963848 s
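
For reference, a rough sketch of the kind of micro-benchmark that could
produce numbers like these (my own guess at the setup; the array length used
in the original runs is not given in full):

import timeit
import numpy as np

# Time repeated ufunc calls on a large array of doubles. Run e.g. as
#   LD_PRELOAD=/usr/lib/libopenlibm.so python bench.py
# to compare the system libm against openlibm, as described above.
x = np.random.rand(100000)
for name, fn in (("sin", np.sin), ("log", np.log), ("exp", np.exp)):
    t = timeit.timeit(lambda: fn(x), number=3000)
    print("3000 x %s(double[%d]): %s s" % (name, x.size, t))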


[Numpy-discussion] Linking other libm-Implementation

2016-02-07 Thread Nils Becker
Hi all,

I wanted to know if there is any sane way to build numpy while linking to a
different implementation of libm.
A drop-in replacement for libm (e.g. openlibm) should in principle work, I
guess, but I did not manage to actually make it work. As far as I
understand the build code, setting MATHLIB=openlibm should suffice, but it
did not. The build works fine, but in the end, when running numpy, apparently
the functions of the system libm.so are used. I could not verify this
directly (as I do not know how), but I noticed that there is no performance
difference between the builds - while there is one with pure C programs
linked against libm and openlibm.
Using amdlibm would require some work as the functions are prefixed with
"_amd", I guess? Using Intel's libimf should work when using Intel's
compiler, but I did not try this. With gcc I did not get it to work.
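
One way to check directly which libm a given build actually picks up is to
run ldd on the compiled extension modules (a small Linux-only sketch; on
OS X, otool -L would be the equivalent):

import glob
import os
import subprocess
import numpy as np

# Run ldd on every extension module inside the built numpy/core directory
# and print what it links against (libm, libopenlibm, ...).
core_dir = os.path.join(os.path.dirname(np.__file__), "core")
for so in glob.glob(os.path.join(core_dir, "*.so")):
    print(so)
    print(subprocess.check_output(["ldd", so]).decode())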

A quite general question: at the moment the performance and the accuracy of
the basic mathematical functions depend on the platform and the libm
implementation of the system. Although there are functions defined in
npy_math, they are only used as fall-backs if they are not provided by a
library (correct me if I am wrong here).
Is there some plan to change this in the future and provide defined
behaviour (specified accuracy and/or speed) across platforms? As I
understood it, Julia started openlibm for this reason (it is based on
fdlibm/msun, same as npy_math).

Cheers
Nils


Re: [Numpy-discussion] method to calculate the magnitude squared

2015-10-11 Thread Nils Becker
Hey,

I use complex numbers a lot and obviously need the modulus a lot. However,
I am not sure if we need a special function for _performance_ reasons.

At 10:01 AM 9/20/2015, you wrote:

It is, but since that involves taking sqrt, it is *much* slower. Even now,
```
In [32]: r = np.arange(1)*(1+1j)

In [33]: %timeit np.abs(r)**2
1000 loops, best of 3: 213 µs per loop

In [34]: %timeit r.real**2 + r.imag**2
1 loops, best of 3: 47.5 µs per loop
```

This benchmark is not quite fair, as the first example needs a Python
function call and the second doesn't. If you benchmark a modulus function
against np.abs(x)**2 the performance gain is ca. 30% on my machine. This
means that for such a basic operation most of the time is spent in the
function call.
In my opinion, if you want speed you write the squared modulus explicitly in
your expression (3-4x speedup on my machine). If you don't need speed you
can afford the function call (be it to abs2 or to abs).
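
For reference, a plain Python sketch of such an abs2 helper (my own
illustration, not a numpy ufunc, so it still pays the function-call overhead
discussed above):

import numpy as np

def abs2(z):
    # Squared magnitude without the sqrt that np.abs performs.
    z = np.asarray(z)
    if np.iscomplexobj(z):
        return z.real**2 + z.imag**2
    return z * z

r = np.arange(10000) * (1 + 1j)
assert np.allclose(abs2(r), np.abs(r)**2)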

By not providing abs2 in numpy, however, people do not lose out on a lot
of performance...

There may be reasons to provide abs2 related to accuracy. If people (not
knowing better) use np.abs(x)**2 they lose significant digits, I think
(I may be wrong on that...). I did not look into it, though.

Cheers
Nils


Re: [Numpy-discussion] FFTS for numpy's FFTs (was: Re: Choosing between NumPy and SciPy functions)

2014-10-30 Thread Nils Becker
> I think that numpy.fft should be left there in its current state
> (although perhaps as deprecated). Now scipy.fft should have a good generic
> algorithm as default, and easily allow for different implementations to be
> accessed through the same interface.

I also agree with the above. But I want to add that in this case it would
be wise to include a (sophisticated) testing suite to ensure that all
candidate libraries implement the DFT with high accuracy.
numpy's fftpack (or scipy's) has the advantage that it is old and well
tested. FFTW also provides extensive benchmarks of speed and accuracy.
Other libraries do not. Most users just want an fft function that works and
do not want to bother with details like numerical accuracy. When I
encountered such an issue (https://github.com/hgomersall/pyFFTW/issues/51)
it took me a really long time to track it down to the fft function.
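
A very small example of the kind of check such a suite could contain (my
sketch; it only compares against a naive O(n^2) DFT in double precision, so
agreement shows consistency to roundoff rather than absolute accuracy - a
real suite would need a high-precision reference):

import numpy as np

rng = np.random.RandomState(0)
x = rng.standard_normal(256) + 1j * rng.standard_normal(256)

# Naive O(n^2) DFT with the same sign convention as np.fft.fft.
n = np.arange(x.size)
dft_matrix = np.exp(-2j * np.pi * np.outer(n, n) / x.size)
naive = dft_matrix.dot(x)

err = np.max(np.abs(np.fft.fft(x) - naive)) / np.max(np.abs(naive))
print("max relative deviation from naive DFT: %.2e" % err)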

One remark on FFTS: does it implement double precision yet? The
corresponding issue (https://github.com/anthonix/ffts/issues/24) seems to
be still open, but I did not look into it. If it does not, it is not suited
as an fftpack replacement (I hope I did not overlook some comment about that
in the thread).

Cheers
Nils

PS: although I am a long time user of numpy, I am fairly new to the list.
So hello!


Re: [Numpy-discussion] Dropping support for, Accelerate/veclib?

2013-06-11 Thread Nils Becker

> I think for Scipy homebrew uses the Gfortran ABI:
> https://trac.macports.org/browser/trunk/dports/python/py-scipy/Portfile

FWIW, Homebrew is not MacPorts; it's a more recent replacement that
seems to be taking over gradually.


Re: [Numpy-discussion] histogram2d and histogramdd return counts as floats while histogram returns ints

2012-05-14 Thread Nils Becker
> is this intended?
>
> np.histogramdd([[1,2],[3,4]],bins=2)
>
> (array([[ 1.,  0.],
>         [ 0.,  1.]]),
>  [array([ 1. ,  1.5,  2. ]), array([ 3. ,  3.5,  4. ])])
>
> np.histogram2d([1,2],[3,4],bins=2)
>
> (array([[ 1.,  0.],
>         [ 0.,  1.]]),
>  array([ 1. ,  1.5,  2. ]),
>  array([ 3. ,  3.5,  4. ]))
>
> np.histogram([1,2],bins=2)
>
> (array([1, 1]), array([ 1. ,  1.5,  2. ]))

Maybe I should have been more explicit. What I meant to say is that:

1. The counts in a histogram are integers. Whenever no normalization is
used, I would expect to get an integer array when I call a histogram
function.
2. Now it might be intended that the data type is always the same, so
that float has to be used to accommodate the normalized histograms.
3. In any case, the 1d histogram function handles this differently from
the 2d and dd ones. This seems inconsistent and might be considered a bug.

n.



[Numpy-discussion] histogram2d and histogramdd return counts as floats while histogram returns ints

2012-05-11 Thread Nils Becker
hi,

is this intended?

np.histogramdd([[1,2],[3,4]],bins=2)

(array([[ 1.,  0.],
   [ 0.,  1.]]),
 [array([ 1. ,  1.5,  2. ]), array([ 3. ,  3.5,  4. ])])

np.histogram2d([1,2],[3,4],bins=2)

(array([[ 1.,  0.],
   [ 0.,  1.]]),
 array([ 1. ,  1.5,  2. ]),
 array([ 3. ,  3.5,  4. ]))

np.histogram([1,2],bins=2)

(array([1, 1]), array([ 1. ,  1.5,  2. ]))


n.


Re: [Numpy-discussion] fast numpy i/o

2011-06-27 Thread Nils Becker
Hi,

>> Finally, the former Scientific.IO NetCDF interface is now part of
>> scipy.io, but I assume it only supports netCDF 3 (the documentation
>> is not specific about that). This might be the easiest option for a
>> portable data format (if Matlab supports it).
>
> Yes, it is NetCDF 3.

In recent versions of Scientific, NetCDF 4 can be written and read, see

http://dirac.cnrs-orleans.fr/ScientificPython/ScientificPythonManual/Scientific.IO.NetCDF.NetCDFFile-class.html

N.


Re: [Numpy-discussion] np.histogramdd of empty data

2011-03-31 Thread Nils Becker
Hi Ralf,

I cloned numpy/master and played around a little.

When the bins are given explicitly, histogram2d and histogramdd now work
as expected in all the tests I tried.


However, some of the cases with missing bin specification appear
somewhat inconsistent.

The first question is whether creating arbitrary bins for empty data and an
empty bin specification is better than raising an exception:

Specifically:

numpy.histogram2d([],[],bins=[0,0])
 (array([ 0.,  0.]), array([ 0.]), array([ 0.]))

numpy.histogram([],bins=0)
 ValueError: zero-size array to minimum.reduce without identity

so the 1-d and 2-d cases do not behave quite the same.

also, these work (although with arbitrary bin edges):

numpy.histogram2d([],[],bins=[1,1])
 (array([ 0.,  0.]), array([ 0.,  1.]), array([ 0.,  1.]))

numpy.histogram2d([],[],bins=[0,1])
 (array([ 0.,  0.]), array([ 0.]), array([ 0.,  1.]))

while this raises an error:

numpy.histogram([],bins=1)
 ValueError: zero-size array to minimum.reduce without identity

another thing with non-empty data:

numpy.histogram([1],bins=1)
 (array([1]), array([ 0.5,  1.5]))

numpy.histogram([1],bins=0)
 (array([], dtype=int64), array([ 0.5]))

while

numpy.histogram2d([1],[1],bins=A)
 ValueError: zero-size array to minimum.reduce without identity

(here A==[0,0] or A==[0,1] but not A==[1,1] which gives a result)

Nils


[Numpy-discussion] np.histogramdd of empty data

2011-03-22 Thread Nils Becker
Hi,

I was wondering why histogram2d and histogramdd raise a ValueError when
fed with empty data of the correct dimensions. I came across this as a
corner case when calling histogram2d from my own specialized histogram
function.

In comparison, histogram does handle this case correctly when bins are
specified explicitly (obviously, automatic binning cannot work without
data...)

np.histogram([], bins=([0,1]))
(array([0]), array([0, 1]))

np.histogram2d([],[], bins=([0,1],[0,1]))
ValueError: zero-size array to ufunc.reduce without identity

np.histogramdd([[],[]], bins=([0,1],[0,1]))
ValueError: zero-size array to ufunc.reduce without identity

cheers, nils


[Numpy-discussion] indexing of rank-0 structured arrays: why not?

2011-01-10 Thread Nils Becker
Hi,

I noticed that I can index into a dtype when I take an element
of a rank-1 array but not if I make a rank-0 array directly. This seems
inconsistent. A bug?

Nils


In [76]: np.version.version
Out[76]: '1.5.1'

In [78]: dt = np.dtype([('x', 'f8'), ('y', 'f8')])

In [80]: a_rank_1 = np.zeros((1,), dtype=dt)

In [81]: a_rank_0 = np.zeros((), dtype=dt)

In [83]: a_rank_1[0]
Out[83]: (0.0, 0.0)

In [84]: a_rank_1[0] == a_rank_0
Out[84]: True

In [85]: a_rank_1[0][0]
Out[85]: 0.0

In [86]: a_rank_0[0]
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)

/home/pabra/ramp/testramp/<ipython console> in <module>()

IndexError: 0-d arrays can't be indexed

In [87]: a_rank_0['x']
Out[87]: array(0.0)


Re: [Numpy-discussion] indexing of rank-0 structured arrays: why not?

2011-01-10 Thread Nils Becker
Robert,

your answer does work: after indexing with () I can then further index
into the datatype.

In [115]: a_rank_0[()][0]
Out[115]: 0.0

I guess I just found the fact confusing that a_rank_1[0] and a_rank_0
compare and print equal but behave differently under indexing.

More precisely if I do
In [117]: b = a_rank_1[0]

then

In [118]: b.shape
Out[118]: ()

and

In [120]: a_rank_0 == b
Out[120]: True

but

In [119]: b[0]
Out[119]: 0.0

works but a_rank_0[0] doesn't. I thought b is a rank-0 array which it
apparently is not since it can be indexed. So maybe b[0] should fail for
consistency?

N.


[Numpy-discussion] truth value of dtypes

2010-12-10 Thread Nils Becker
Hi,

why is

 bool(np.dtype(np.float))
False

?

I came across this when using this python idiom:

def f(dtype=None):
    if not dtype:
        print 'using default dtype'

If there is no good reason to have a False truth value, I would vote for
making it True since that is what one would expect (no?)
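
As an aside, a minimal sketch of the workaround in the meantime is to test
against None explicitly instead of relying on truthiness (the function below
is only an illustration):

import numpy as np

def f(dtype=None):
    # 'if not dtype' treats a dtype whose truth value is False (as reported
    # above) as "not given"; an explicit None check sidesteps the question.
    if dtype is None:
        print('using default dtype')
    else:
        print('using dtype %s' % np.dtype(dtype))

f()               # using default dtype
f(np.float64)     # using dtype float64
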
N.


Re: [Numpy-discussion] Schedule for 1.5.1?

2010-10-07 Thread Nils Becker
Hi,

What about the normed=True bug in numpy.histogram? It was discussed here
a while ago and fixed (although I did not find it on the tracker), but the
message

aanlktikwbsrxq0ynf3u3jo3ekrikszmwqy30pnsfg...@mail.gmail.com

suggests it just missed 1.5.0? I don't have 1.5 installed, so I can't
check right now.

Sorry to mention my personal pet bug, but maybe it's also a candidate?

Nils


Re: [Numpy-discussion] numpy histogram normed=True (bug / confusing behavior)

2010-08-30 Thread Nils Becker

> I think (a corrected) density histogram is core functionality for
> unequal bin lengths.
>
> The graph with raw count in the case of unequal bin sizes would be
> quite misleading when plotted and interpreted on the real line and not
> on discrete points (shaded areas instead of vertical lines). And as
> the origin of this thread showed, it's not trivial to figure out what
> the correct normalization is.
> So, I think, if we drop the density normalization, we just need a new
> function that does it.
>
> My 2c,
>
> Josef

(Not a dev, but) I agree with Josef. Throwing out normalization
completely would, in addition, break backwards compatibility.
It's another question whether the frequency (not density!)
normalization is so simple that it is better left out, since everyone
can be expected to be able to do a raw histogram and divide by the
number of data points without a headache.

Nils


Re: [Numpy-discussion] numpy histogram normed=True (bug / confusing behavior)

2010-08-29 Thread Nils Becker
> On Sat, Aug 28, 2010 at 04:12, Zbyszek Szmek zbys...@in.waw.pl wrote:
>> Hi,
>>
>> On Fri, Aug 27, 2010 at 06:43:26PM -0600, Charles R Harris wrote:
>>> On Fri, Aug 27, 2010 at 2:47 PM, Robert Kern robert.k...@gmail.com
>>> wrote:
>>>
>>>> On Fri, Aug 27, 2010 at 15:32, David Huard david.hu...@gmail.com
>>>> wrote:
>>>>> Nils and Joseph,
>>>>> Thanks for the bug report, this is now fixed in SVN (r8672).
>>>>
>>>> While we're at it, can we change the name of the argument? normed
>>>> has caused so much confusion over the years. We could deprecate
>>>> normed=True in favor of pdf=True or density=True.
>>
>> I think it might be a good moment to also include a different type of
>> normalization:
>>     n = n / n.sum()
>> i.e. the frequency of counts in each bin. This one is of course very
>> simple to calculate by hand, but very common. I think it would be useful
>> to have this normalization available too.
>> [http://www.itl.nist.gov/div898/handbook/eda/section3/histogra.htm]
>
> My feeling is that this is trivial to do by hand. I do not see a
> reason to add an option to histogram() to do this.
 
Hi,

+1 for not silently changing the behavior of normed=True. (I'm one of
the people who have worked around it.)

One argument in favor of offering both normalization styles, 'frequency' and
'density', may be that the documentation would automatically become very
clear. A user sees all the options and there is little chance of a
misunderstanding. Of course, a sentence like "if you want frequency
normalization, take the counts from histogram(data, normed=False) and
divide by len(data)" would also make things clear, without adding the
frequency option.
Nils


[Numpy-discussion] numpy histogram normed=True (bug / confusing behavior)

2010-08-06 Thread Nils Becker
Hi,

I found what looks like a bug in histogram, when the option normed=True
is used together with non-uniform bins.

Consider this example:

import numpy as np
data = np.array([1, 2, 3, 4])
bins = np.array([.5, 1.5, 4.5])
bin_widths = np.diff(bins)
(counts, dummy) = np.histogram(data, bins)
(densities, dummy) = np.histogram(data, bins, normed=True)

What this gives is:

bin_widths
array([ 1.,  3.])

counts
array([1, 3])

densities
array([ 0.1,  0.3])

The documentation claims that histogram with normed=True gives a
density, which integrates to 1. In this example, it is true that
(densities * bin_widths).sum() is 1. However, clearly the data are
equally spaced, so their density should be uniform and equal to 0.25.
Note that (0.25 * bin_widths).sum() is also 1.

I believe np.histogram(data, bins, normed=True) effectively does:
np.histogram(data, bins, normed=False) / (bins[-1] - bins[0]).

However, it _should_ do
np.histogram(data, bins, normed=False) / bin_widths

to get a true density over the data coordinate as a result. It's easy to
fix by hand, but I think the documentation is at least misleading?!

Sorry if this has been discussed before; I did not find it anywhere (numpy 1.3).
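
Spelling out, for this example, the normalization that actually yields the
expected uniform density (dividing by both the bin widths and the total
number of counts; this is my own by-hand check, not numpy code):

import numpy as np

data = np.array([1, 2, 3, 4])
bins = np.array([.5, 1.5, 4.5])
counts, _ = np.histogram(data, bins)
bin_widths = np.diff(bins)

density = counts / bin_widths / counts.sum()
print(density)                          # 0.25 in both bins, as expected
print((density * bin_widths).sum())     # integrates to 1.0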







Re: [Numpy-discussion] numpy histogram normed=True (bug / confusing behavior)

2010-08-06 Thread Nils Becker
Hi again,

first a correction: I posted

> I believe np.histogram(data, bins, normed=True) effectively does:
> np.histogram(data, bins, normed=False) / (bins[-1] - bins[0]).
>
> However, it _should_ do
> np.histogram(data, bins, normed=False) / bin_widths

but there is a normalization missing; it should read

I believe np.histogram(data, bins, normed=True) effectively does
np.histogram(data, bins, normed=False) / (bins[-1] - bins[0]) / data.sum()

However, it _should_ do
np.histogram(data, bins, normed=False) / bin_widths / data.sum()

Bruce Southey replied:
> As I recall, there were issues with this aspect.
> Please search the discussion regarding histogram, especially David
> Huard's reply in this thread:
> http://thread.gmane.org/gmane.comp.python.numeric.general/22445
I think this discussion pertains to a switch in calling conventions
which happened at the time. The last reply of D. Huard (to me) seems to
say that they did not fix anything in the _old_ semantics, but that the
new semantics is expected to work properly.

I tried with an infinite bin:
counts, dmy = np.histogram([1,2,3,4], [0.5,1.5,np.inf])
counts
array([1,3])
ncounts, dmy = np.histogram([1,2,3,4], [0.5,1.5,np.inf], normed=1)
ncounts
array([0.,0.])

This also does not make a lot of sense to me. A better result would be
array([0.25, 0.]), since 25% of the points fall in the first bin; 75%
fall in the second but are spread out over an infinite interval, giving
0. This is what my second proposal would give. I cannot find anything
wrong with it so far...

Cheers, Nils