Re: [Numpy-discussion] Memory allocation cleanup

2014-01-09 Thread Nathaniel Smith
On Thu, Jan 9, 2014 at 11:21 PM, Charles R Harris
 wrote:
> Apropos Julian's changes to use the PyObject_* allocation suite for some
> parts of numpy, I posted the following
>
> I think numpy memory management is due a cleanup. Currently we have
>
> PyDataMem_*
> PyDimMem_*
> PyArray_*
>
> Plus the malloc, PyMem_*, and PyObject_* interfaces. That is six ways to
> manage heap allocations. As far as I can tell, PyArray_* is always PyMem_*
> in practice. We probably need to keep the PyDataMem family as it has a
> memory tracking option, but PyDimMem just confuses things, I'd rather just
> use PyMem_* with explicit size. Curiously, the PyObject_Malloc family is not
> documented apart from some release notes.
>
> We should also check for the macro versions of PyMem_* as they are
> deprecated for extension modules.
>
> Nathaniel then suggested that we consider going all Python allocators,
> especially as new memory tracing tools are coming online in 3.4. Given that
> these changes could have some impact on current extension writers I thought
> I'd bring this up on the list to gather opinions.
>
> Thoughts?

After a bit more research, some further points to keep in mind:

Currently, PyDimMem_* and PyArray_* are just aliases for malloc/free,
and PyDataMem_* is an alias for malloc/free with some extra tracing
hooks wrapped around it. (AFAIK, these tracing hooks are not used by
anyone anywhere -- at least, if they are I haven't heard about it, and
there is no code on github that uses them.)

There is one substantial difference between the PyMem_* and PyObject_*
interfaces as compared to malloc(), which is that the Py* interfaces
require that the GIL be held when they are called. (@Julian -- I think
your PR we just merged fulfills this requirement, is that right?) I
strongly suspect that we have PyDataMem_* calls outside of the GIL --
e.g., when allocating ufunc buffers -- and third-party code might, too.

Python 3.4's new memory allocation API and tracing stuff is documented here:
  http://www.python.org/dev/peps/pep-0445/
  http://docs.python.org/dev/c-api/memory.html
  http://docs.python.org/dev/library/tracemalloc.html
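
For a concrete picture of what that buys us, here is a minimal sketch of the
3.4 tracing from the Python side (assuming numpy's data allocations end up
routed through a traced allocator, which is exactly what the switch being
discussed here would give us):

    import tracemalloc
    import numpy as np

    tracemalloc.start()

    a = np.zeros((1000, 1000))      # should show up in the trace if the data
                                    # buffer comes from a traced allocator

    snapshot = tracemalloc.take_snapshot()
    for stat in snapshot.statistics('lineno')[:5]:
        print(stat)                 # top allocation sites, by line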

In particular, 3.4 adds a set of PyRawMem_* functions, which do not
require the GIL. Checking the current source code for _tracemalloc.c,
it appears that PyRawMem_* functions *are* traced, so that's nice -
that means that switching PyDataMem_* to use PyRawMem_* would both be
safe and provide benefits. However, PyRawMem_* does not provide the
pymalloc optimizations for small allocations.

Also, none of the Py* interfaces implement calloc(), which is annoying
because it messes up our new optimization of using calloc() for
np.zeros. (calloc() is generally faster than malloc()+explicit
zeroing, because it can use OS-specific virtual memory tricks to zero
out the memory "for free". These same tricks also mean that if you use
np.zeros() to allocate a large array, and then only write to a few
entries in that array, the total memory used is proportional to the
number of non-zero entries, rather than to the actual size of the
array, which can be extremely useful in some situations as a kind of
"poor man's sparse array".)

I'm pretty sure that the vast majority of our allocations do occur
with GIL protection, so we might want to switch to using PyObject_*
for most cases to take advantage of the small-object optimizations,
and use PyRawMem_* for any non-GIL cases (like possibly ufunc
buffers), with a compatibility wrapper to replace PyRawMem_* with
malloc() on pre-3.4 pythons. Of course this will need some profiling
to see if PyObject_* is actually better than malloc() in practice. For
calloc(), we could try to convince python-dev to add it; or np.zeros()
could explicitly use calloc() even when other code uses the Py*
interfaces, and then use an ndarray flag or a special .base object to
keep track of the fact that we need free() to deallocate this memory;
or we could give up on the calloc optimization.

-n


Re: [Numpy-discussion] adding fused multiply and add to numpy

2014-01-09 Thread Julian Taylor
On 10.01.2014 01:49, Frédéric Bastien wrote:
> 
> Do you know if those instructions are automatically used by gcc if we
> use the right architecture parameter?
> 
> 

They are used if you enable -ffp-contract=fast. Do not set it to `on`;
due to the semantics of C that is an alias for `off`.
-ffast-math enables it in gcc 4.7 and 4.8 but not in 4.9, though that
might be a bug; I filed one a while ago.

You also need to set -mfma or -march=bdver{1,2,3,4}. It's not part of
-mavx2, last I checked.

But there are not many places in numpy where the compiler can use it;
only dot comes to mind, and that goes through BLAS libraries in the
high-performance case.


Re: [Numpy-discussion] adding fused multiply and add to numpy

2014-01-09 Thread Frédéric Bastien
Good question: where do we stop?

I agree with you that an fma with guarantees is a good new feature. But
if it is made available, people will want to use it for speed. Some
people won't want to pull in another library or dependency, and they
won't want random speed-ups or slow-downs. So why not add madd and fma
and draw the line at operations for which the CPU implements a fused
version? That seems a sensible limit to me.

Anyway, we won't use it directly. This is just my thought.

Do you know if those instructions are automatically used by gcc if we
use the right architecture parameter?


Fred

On Thu, Jan 9, 2014 at 12:07 PM, Nathaniel Smith  wrote:
> On Thu, Jan 9, 2014 at 3:30 PM, Julian Taylor
>  wrote:
>> On Thu, Jan 9, 2014 at 3:50 PM, Frédéric Bastien  wrote:
>>> How hard would it be to provide the choice to the user? We could
>>> provide two functions, like fma_fast() and fma_prec() (for precision).
>>> Or this could be a parameter or a user configuration option, like for
>>> the overflow/underflow errors.
>>
>> I like Freddie Witherden's proposal to name the function madd, which does
>> not guarantee a single rounding operation. This leaves the namespace open
>> for a special fma function with that guarantee. It can use the libc fma
>> function, which is sometimes very slow but platform independent. This is
>> assuming Apple did not again take shortcuts like they did with their libc
>> hypot implementation; can someone disassemble Apple's libc to check what
>> they are doing for C99 fma?
>> And it leaves users the possibility of using the faster madd function if
>> they do not need the precision guarantee.
>
> If madd doesn't provide any rounding guarantees, then its only reason
> for existence is that it provides a fused a*b+c loop that better
> utilizes memory bandwidth, right? I'm guessing that speed-wise it
> doesn't really matter whether you use the fancy AVX instructions or
> not, since even the naive implementation is memory bound -- the
> advantage is just in the fusion?
>
> Lack of loop fusion is obviously a major limitation of numpy, but it's
> a very general problem. I'm sceptical about whether we want to get
> into the business of adding functions whose only purpose is to provide
> pre-fused loops. After madd, what other operations should we provide
> like this? msub (a*b-c)? add3 (a+b+c)? maddm (a*b+c*d)? mult3 (a*b*c)?
> How do we decide? Surely it's better to direct people who are hitting
> memory bottlenecks to much more powerful and general solutions to this
> problem, like numexpr/cython/numba/theano?
>
> (OTOH the version that gives rounding guarantees is obviously a unique
> new feature.)
>
> -n


Re: [Numpy-discussion] Memory allocation cleanup

2014-01-09 Thread Frédéric Bastien
This shouldn't affect Theano, so I have no objection.

Making things faster and more trackable is always good, so I think it
seems like a good idea.

Fred

On Thu, Jan 9, 2014 at 6:21 PM, Charles R Harris
 wrote:
> Apropos Julian's changes to use the PyObject_* allocation suite for some
> parts of numpy, I posted the following
>
> I think numpy memory management is due a cleanup. Currently we have
>
> PyDataMem_*
> PyDimMem_*
> PyArray_*
>
> Plus the malloc, PyMem_*, and PyObject_* interfaces. That is six ways to
> manage heap allocations. As far as I can tell, PyArray_* is always PyMem_*
> in practice. We probably need to keep the PyDataMem family as it has a
> memory tracking option, but PyDimMem just confuses things, I'd rather just
> use PyMem_* with explicit size. Curiously, the PyObject_Malloc family is not
> documented apart from some release notes.
>
> We should also check for the macro versions of PyMem_* as they are
> deprecated for extension modules.
>
> Nathaniel then suggested that we consider going all Python allocators,
> especially as new memory tracing tools are coming online in 3.4. Given that
> these changes could have some impact on current extension writers I thought
> I'd bring this up on the list to gather opinions.
>
> Thoughts?
>


[Numpy-discussion] Memory allocation cleanup

2014-01-09 Thread Charles R Harris
Apropos Julian's changes to use the PyObject_* allocation suite for some
parts of numpy, I posted the following

I think numpy memory management is due a cleanup. Currently we have

PyDataMem_*
PyDimMem_*
PyArray_*

Plus the malloc, PyMem_*, and PyObject_* interfaces. That is six ways to
manage heap allocations. As far as I can tell, PyArray_* is always PyMem_*
in practice. We probably need to keep the PyDataMem family as it has a
memory tracking option, but PyDimMem just confuses things, I'd rather just
use PyMem_* with explicit size. Curiously, the PyObject_Malloc family is
not documented apart from some release notes.

We should also check for the macro versions of PyMem_* as they are
deprecated for extension modules.

Nathaniel then suggested that we consider going all Python allocators,
especially as new memory tracing tools are coming online in 3.4. Given that
these changes could have some impact on current extension writers I thought
I'd bring this up on the list to gather opinions.

Thoughts?


[Numpy-discussion] ENH: add a 'return_counts=' keyword argument to `np.unique`

2014-01-09 Thread Jaime Fernández del Río
Hi,

I have just sent a PR, adding a `return_counts` keyword argument to
`np.unique` that does exactly what the name suggests: counting the number
of times each unique item comes up in the array. It reuses the `flag` array
that is constructed whenever any optional index is requested, extracts the
indices of the `True`s in it, and returns their diff. You can check it here:

https://github.com/numpy/numpy/pull/4180
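
For anyone curious what the counting amounts to, here is a rough pure-numpy
sketch of the idea (the PR itself reuses the flag array `np.unique` already
builds rather than recomputing it):

    import numpy as np

    def unique_with_counts(a):
        a = np.sort(np.asarray(a).ravel())
        # True at the first occurrence of each distinct value
        flag = np.concatenate(([True], a[1:] != a[:-1]))
        # positions where a new value starts, plus the end of the array
        idx = np.concatenate((np.nonzero(flag)[0], [a.size]))
        return a[flag], np.diff(idx)

    print(unique_with_counts([1, 3, 1, 2, 3, 3]))
    # (array([1, 2, 3]), array([2, 1, 3]))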

Regards,

Jaime

-- 
(\__/)
( O.o)
( > <) This is Rabbit. Copy Rabbit into your signature and help him with his
plans for world domination.


Re: [Numpy-discussion] adding fused multiply and add to numpy

2014-01-09 Thread Nathaniel Smith
On Thu, Jan 9, 2014 at 3:30 PM, Julian Taylor
 wrote:
> On Thu, Jan 9, 2014 at 3:50 PM, Frédéric Bastien  wrote:
>> How hard would it be to provide the choice to the user? We could
>> provide two functions, like fma_fast() and fma_prec() (for precision).
>> Or this could be a parameter or a user configuration option, like for
>> the overflow/underflow errors.
>
> I like Freddie Witherden's proposal to name the function madd, which does
> not guarantee a single rounding operation. This leaves the namespace open
> for a special fma function with that guarantee. It can use the libc fma
> function, which is sometimes very slow but platform independent. This is
> assuming Apple did not again take shortcuts like they did with their libc
> hypot implementation; can someone disassemble Apple's libc to check what
> they are doing for C99 fma?
> And it leaves users the possibility of using the faster madd function if
> they do not need the precision guarantee.

If madd doesn't provide any rounding guarantees, then its only reason
for existence is that it provides a fused a*b+c loop that better
utilizes memory bandwidth, right? I'm guessing that speed-wise it
doesn't really matter whether you use the fancy AVX instructions or
not, since even the naive implementation is memory bound -- the
advantage is just in the fusion?

Lack of loop fusion is obviously a major limitation of numpy, but it's
a very general problem. I'm sceptical about whether we want to get
into the business of adding functions whose only purpose is to provide
pre-fused loops. After madd, what other operations should we provide
like this? msub (a*b-c)? add3 (a+b+c)? maddm (a*b+c*d)? mult3 (a*b*c)?
How do we decide? Surely it's better to direct people who are hitting
memory bottlenecks to much more powerful and general solutions to this
problem, like numexpr/cython/numba/theano?
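
E.g. numexpr already gives the fused single-pass evaluation (no rounding
guarantee, just no temporaries) -- a quick sketch:

    import numpy as np
    import numexpr as ne

    a, b, c = (np.random.rand(10**7) for _ in range(3))

    d1 = a * b + c               # two passes plus a temporary array
    d2 = ne.evaluate("a*b + c")  # one fused pass over the inputs
    assert np.allclose(d1, d2)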

(OTOH the version that gives rounding guarantees is obviously a unique
new feature.)

-n


Re: [Numpy-discussion] adding fused multiply and add to numpy

2014-01-09 Thread Julian Taylor
On Thu, Jan 9, 2014 at 3:50 PM, Frédéric Bastien  wrote:

> Hi,
>
> It happens frequently that NumPy isn't compiled with all the instructions
> that are available where it runs, for example in distros. So if the
> decision is made to use the fast version when the newer instructions
> aren't used, the user needs a way to know that. So the library needs a
> function/attribute to tell that.
>

As these instructions are very new, runtime CPU feature detection is
required. That way distribution users also get the fast code if their CPU
supports it.
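
As a crude illustration of what such detection could look like from Python
(Linux only, just reading /proc/cpuinfo; a real implementation would use
cpuid from C):

    def cpu_has(feature):
        # Linux-only sketch: scan the flags line of /proc/cpuinfo
        with open("/proc/cpuinfo") as f:
            return any(feature in line.split()
                       for line in f if line.startswith("flags"))

    print(cpu_has("fma"))    # True on CPUs exposing the FMA3 flag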



>
> How hard would it be to provide the choice to the user? We could
> provide two functions, like fma_fast() and fma_prec() (for precision).
> Or this could be a parameter or a user configuration option, like for
> the overflow/underflow errors.
>

I like Freddie Witherden's proposal to name the function madd, which does
not guarantee a single rounding operation. This leaves the namespace open
for a special fma function with that guarantee. It can use the libc fma
function, which is sometimes very slow but platform independent. This is
assuming Apple did not again take shortcuts like they did with their libc
hypot implementation; can someone disassemble Apple's libc to check what
they are doing for C99 fma?
And it leaves users the possibility of using the faster madd function if
they do not need the precision guarantee.

Another option would be a precision context manager which tells numpy which
variant to use. This would also be useful for other code (like
abs/hypot/abs2/sum/reciprocal sqrt), but it probably involves more work.

with numpy.precision_mode('fast'):
    ...  # allow no fma, use fast hypot, fast sum,
         # ignore overflow/invalid errors

with numpy.precision_mode('precise'):
    ...  # require fma, use precise hypot, use exact summation (math.fsum)
         # or at least kahan summation, full overflow/invalid checks, etc.
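
A minimal sketch of how the Python side of such a (hypothetical) context
manager could look; the name and the module-level switch are made up here,
and the real work would be in the inner loops that consult it:

    import contextlib

    _precision_mode = 'fast'        # hypothetical module-level switch

    @contextlib.contextmanager
    def precision_mode(mode):
        global _precision_mode
        old = _precision_mode
        _precision_mode = mode      # loops would pick the fast or precise
        try:                        # variant based on this
            yield
        finally:
            _precision_mode = old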



>
>
> On Thu, Jan 9, 2014 at 9:43 AM, Freddie Witherden 
> wrote:
> > On 08/01/14 21:39, Julian Taylor wrote:
> >> An issue is software emulation of real fma. This can be enabled in the
> >> test ufunc with npfma.set_type("libc").
> >> This is unfortunately incredibly slow, about a factor of 300 on my
> >> machine without hardware fma.
> >> This means we either have a function that is fast on some platforms and
> >> slow on others but always gives the same result, or we have a fast
> >> function that gives better results on some platforms.
> >> Given that we are not worse than what numpy currently provides, I favor
> >> the latter.
> >>
> >> Any opinions on whether this should go into numpy or maybe stay a third
> >> party ufunc?
> >
> > My preference would be to initially add an "madd" intrinsic.  This can
> > be supported on all platforms and can be documented to permit the use of
> > FMA where available.
> >
> > A 'true' FMA intrinsic function should only be provided when hardware
> > FMA support is available.  Many of the more interesting applications of
> > FMA depend on there only being a single rounding step and as such "FMA"
> > should probably mean "a*b + c with only a single rounding".
> >
> > Regards, Freddie.
> >
> >


Re: [Numpy-discussion] adding fused multiply and add to numpy

2014-01-09 Thread Julian Taylor
On Thu, Jan 9, 2014 at 3:54 PM, Daπid  wrote:

>
> On 8 January 2014 22:39, Julian Taylor wrote:
>
>> As you can see, even without real hardware support it is about 30% faster
>> than inplace unblocked numpy due to better use of memory bandwidth. It's
>> even more than two times faster than unoptimized numpy.
>>
>
> I have an i5, and AVX crashes, even though it is supported by my CPU.
>

I forgot about the 32-byte alignment that AVX (as it is used in this code)
requires. I pushed a new version that takes care of it.
It should now work with AVX.
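
For reference, a common way to hand such code a 32-byte-aligned buffer from
Python, in case anyone wants to rule alignment out on their side (just a
sketch; the updated npfma should no longer need it):

    import numpy as np

    def aligned_zeros(n, dtype=np.float64, align=32):
        # over-allocate by 'align' bytes and slice to an aligned offset
        itemsize = np.dtype(dtype).itemsize
        buf = np.zeros(n * itemsize + align, dtype=np.uint8)
        offset = (-buf.ctypes.data) % align
        return buf[offset:offset + n * itemsize].view(dtype)

    a = aligned_zeros(1000)
    assert a.ctypes.data % 32 == 0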


> Following the instructions in the readme, there is only one compiled file,
> npfma.so, but no .o.
>

The .o files are in the build/ subfolder.


Re: [Numpy-discussion] adding fused multiply and add to numpy

2014-01-09 Thread Daπid
On 8 January 2014 22:39, Julian Taylor wrote:

> As you can see, even without real hardware support it is about 30% faster
> than inplace unblocked numpy due to better use of memory bandwidth. It's
> even more than two times faster than unoptimized numpy.
>

I have an i5, and AVX crashes, even though it is supported by my CPU.
Here are my timings:

SSE2:

In [24]: %timeit npfma.fma(a, b, c)
10 loops, best of 3: 15 us per loop

In [28]: %timeit npfma.fma(a, b, c)
100 loops, best of 3: 2.36 ms per loop

In [29]: %timeit npfma.fms(a, b, c)
100 loops, best of 3: 2.36 ms per loop

In [31]: %timeit pure_numpy_fma(a, b, c)
100 loops, best of 3: 7.5 ms per loop

In [33]: %timeit pure_numpy_fma2(a, b, c)
100 loops, best of 3: 4.41 ms per loop

The model supports all the way to sse4_2

libc:

In [24]: %timeit npfma.fma(a, b, c)
1000 loops, best of 3: 883 us per loop

In [28]: %timeit npfma.fma(a, b, c)
10 loops, best of 3: 88.7 ms per loop

In [29]: %timeit npfma.fms(a, b, c)
10 loops, best of 3: 87.4 ms per loop

In [31]: %timeit pure_numpy_fma(a, b, c)
100 loops, best of 3: 7.94 ms per loop

In [33]: %timeit pure_numpy_fma2(a, b, c)
100 loops, best of 3: 3.03 ms per loop



> If you have a machine capable of fma instructions give it a spin to see
> if you get similar or better results. Please verify the assembly
> (objdump -d fma-.o) to check if the compiler properly used the
> machine fma.
>

Following the instructions in the readme, there is only one compiled file,
npfma.so, but no .o.


/David.


Re: [Numpy-discussion] adding fused multiply and add to numpy

2014-01-09 Thread Frédéric Bastien
Hi,

It happens frequently that NumPy isn't compiled with all the instructions
that are available where it runs, for example in distros. So if the
decision is made to use the fast version when the newer instructions
aren't used, the user needs a way to know that. So the library needs a
function/attribute to tell that.

How hard would it be to provide the choice to the user? We could
provide two functions, like fma_fast() and fma_prec() (for precision).
Or this could be a parameter or a user configuration option, like for
the overflow/underflow errors.

Fred

On Thu, Jan 9, 2014 at 9:43 AM, Freddie Witherden  wrote:
> On 08/01/14 21:39, Julian Taylor wrote:
>> An issue is software emulation of real fma. This can be enabled in the
>> test ufunc with npfma.set_type("libc").
>> This is unfortunately incredibly slow, about a factor of 300 on my
>> machine without hardware fma.
>> This means we either have a function that is fast on some platforms and
>> slow on others but always gives the same result, or we have a fast
>> function that gives better results on some platforms.
>> Given that we are not worse than what numpy currently provides, I favor
>> the latter.
>>
>> Any opinions on whether this should go into numpy or maybe stay a third
>> party ufunc?
>
> My preference would be to initially add an "madd" intrinsic.  This can
> be supported on all platforms and can be documented to permit the use of
> FMA where available.
>
> A 'true' FMA intrinsic function should only be provided when hardware
> FMA support is available.  Many of the more interesting applications of
> FMA depend on there only being a single rounding step and as such "FMA"
> should probably mean "a*b + c with only a single rounding".
>
> Regards, Freddie.
>
>


Re: [Numpy-discussion] adding fused multiply and add to numpy

2014-01-09 Thread Freddie Witherden
On 08/01/14 21:39, Julian Taylor wrote:
> An issue is software emulation of real fma. This can be enabled in the
> test ufunc with npfma.set_type("libc").
> This is unfortunately incredibly slow, about a factor of 300 on my
> machine without hardware fma.
> This means we either have a function that is fast on some platforms and
> slow on others but always gives the same result, or we have a fast
> function that gives better results on some platforms.
> Given that we are not worse than what numpy currently provides, I favor
> the latter.
> 
> Any opinions on whether this should go into numpy or maybe stay a third
> party ufunc?

My preference would be to initially add an "madd" intrinsic.  This can
be supported on all platforms and can be documented to permit the use of
FMA where available.

A 'true' FMA intrinsic function should only be provided when hardware
FMA support is available.  Many of the more interesting applications of
FMA depend on there only being a single rounding step and as such "FMA"
should probably mean "a*b + c with only a single rounding".
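
To make the single-rounding point concrete, a small Python sketch (using
exact rational arithmetic to stand in for a true fma, since plain Python has
none):

    from fractions import Fraction

    a = 1.0 + 2.0**-27
    b = 1.0 - 2.0**-27
    c = -1.0

    print(a * b + c)       # 0.0: a*b rounds to 1.0 before the add
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    print(float(exact))    # -5.551115123125783e-17: what a single-rounding
                           # fma would return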

Regards, Freddie.





Re: [Numpy-discussion] adding fused multiply and add to numpy

2014-01-09 Thread Neal Becker
Charles R Harris wrote:

> On Wed, Jan 8, 2014 at 2:39 PM, Julian Taylor wrote:
> 
...
> 
> Another function that could be useful is a |a|**2 function, abs2 perhaps.
> 
> Chuck

I use mag_sqr all the time.  It should be much faster for complex, if computed
via:

x.real**2 + x.imag**2

avoiding the sqrt that abs performs.
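
Something along these lines, as a plain Python-level sketch (a real abs2
would presumably be a proper ufunc):

    import numpy as np

    def abs2(x):
        # |x|**2 without the square root that np.abs(x)**2 implies for
        # complex input
        x = np.asarray(x)
        if np.iscomplexobj(x):
            return x.real**2 + x.imag**2
        return x * x

    z = np.array([3 + 4j, 1 - 1j])
    print(abs2(z))    # [25.  2.], same as np.abs(z)**2 but cheaper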
