Re: [Python-Dev] [numpy wishlist] Interpreter support for temporary elision in third-party classes

2014-06-07 Thread Nathaniel Smith
On Sat, Jun 7, 2014 at 1:37 AM, Greg Ewing greg.ew...@canterbury.ac.nz wrote:
 Julian Taylor wrote:

 tp_can_elide receives two objects and returns one of three values:
 * can work inplace, operation is associative
 * can work inplace but not associative
 * cannot work inplace


 Does it really need to be that complicated? Isn't it
 sufficient just to ask the object potentially being
 overwritten whether it's okay to overwrite it?
 I.e. a parameterless method returning a boolean.

For the numpy case, we really need to see all the operands, *and* know
what the operation in question is. Consider

tmp1 = np.ones((3, 1))
tmp2 = np.ones((1, 3))
tmp1 + tmp2

which returns an array with shape (3, 3).

Both input arrays are temporaries, but neither of them can be stolen
to use for the output array.

Or suppose 'a' is an array of integers and 'b' is an array of floats,
then 'a + b' and 'a += b' have very different results (the former
upcasts 'a' to float, the latter has to either downcast 'b' to int or
raise an error). But the casting rules depend on the particular input
types and the particular operation -- operations like << and >> want to
cast to int, < and > return bools, etc. So one really needs to know
all the details of the operation before one can determine whether
temporary elision is possible.
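A pure-Python sketch of that constraint (the helpers below are illustrative, not NumPy's actual implementation): an input buffer is only reusable when it already has the result's shape *and* dtype, and both of those depend on all the operands and the operation.

```python
from itertools import zip_longest

def broadcast_shape(s1, s2):
    """Result shape under NumPy's broadcasting rule (trailing-aligned)."""
    out = []
    for a, b in zip_longest(reversed(s1), reversed(s2), fillvalue=1):
        if a != b and 1 not in (a, b):
            raise ValueError("incompatible shapes")
        out.append(max(a, b))
    return tuple(reversed(out))

def can_steal(operand_shape, operand_dtype, result_shape, result_dtype):
    """An operand's buffer can hold the output only if shape and dtype match."""
    return operand_shape == result_shape and operand_dtype == result_dtype

# (3, 1) + (1, 3) broadcasts to (3, 3): neither temporary can be stolen.
shape = broadcast_shape((3, 1), (1, 3))
print(shape)                                           # (3, 3)
print(can_steal((3, 1), "float64", shape, "float64"))  # False
# int + float upcasts to float: the int operand can't be reused either.
print(can_steal((3,), "int64", (3,), "float64"))       # False
```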

-n

-- 
Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] [numpy wishlist] Interpreter support for temporary elision in third-party classes

2014-06-07 Thread Greg Ewing

Nathaniel Smith wrote:

For the numpy case, we really need to see all the operands, *and* know
what the operation in question is...


Okay, I see what you mean now.

Given all that, it might be simpler just to have the
method perform the operation itself if it can. It has
all the information necessary to do so, after all.

This would also make it possible for the inplace
operators to have different semantics from temp-elided
non-inplace ones if desired.

--
Greg


Re: [Python-Dev] [numpy wishlist] Interpreter support for temporary elision in third-party classes

2014-06-06 Thread Julian Taylor
On 06.06.2014 04:18, Sturla Molden wrote:
 On 05/06/14 22:51, Nathaniel Smith wrote:
 
 This gets evaluated as:

 tmp1 = a + b
 tmp2 = tmp1 + c
 result = tmp2 / c

 All these temporaries are very expensive. Suppose that a, b, c are
 arrays with N bytes each, and N is large. For simple arithmetic like
 this, then costs are dominated by memory access. Allocating an N byte
 array requires the kernel to clear the memory, which incurs N bytes of
 memory traffic.
 
 It seems to be the case that a large portion of the run-time in Python
 code using NumPy can be spent in the kernel zeroing pages (which the
 kernel does for security reasons).
 
 I think this can also be seen as a 'malloc problem'. It comes about
 because each new NumPy array starts with a fresh buffer allocated by
 malloc. Perhaps buffers can be reused?
 
 Sturla
 
 

Caching memory inside of numpy would indeed solve this issue too. There
has even been a paper written on this which contains some more serious
benchmarks than the Laplace case, which runs on very old hardware (and
the in-place and out-of-place cases are actually not the same: one
computes array/scalar, the other array * (1 / scalar)):

hiperfit.dk/pdf/Doubling.pdf
"The result is an improvement of as much as 2.29 times speedup, on
average 1.32 times speedup across a benchmark suite of 15 applications"

The problem with this approach is that it is already difficult enough to
handle memory in numpy. Having a cache that potentially stores gigabytes
of memory out of the user's sight will just make things worse.

This would not be needed if we could come up with a way for Python to
help numpy elide the temporaries.


Re: [Python-Dev] [numpy wishlist] Interpreter support for temporary elision in third-party classes

2014-06-06 Thread Sturla Molden
Julian Taylor jtaylor.deb...@googlemail.com wrote:
 
 The problem with this approach is that it is already difficult enough to
 handle memory in numpy.

I would not do this in a way that complicates memory management in NumPy. I
would just replace malloc and free with temporarily cached versions. From
the perspective of NumPy the API should be the same.

 Having a cache that potentially stores gigabytes
 of memory out of the users sight will just make things worse.

Buffers don't need to stay in the cache forever, just long enough to allow
reuse within an expression. We are probably talking about delaying the call
to free by just a few microseconds.

We could e.g. have a setup like this:

NumPy thread on malloc:
- tries to grab memory off the internal heap
- calls system malloc on failure

NumPy thread on free:
- returns a buffer to the internal heap
- signals a condition

Background daemonic GC thread:
- wakes after sleeping on the condition
- sleeps for another N microseconds (N = magic number)
- flushes or shrinks the internal heap with system free
- goes back to sleeping on the condition 
 
It can be implemented with the same API as malloc and free, and plugged
directly into the existing NumPy code. 

We would in total need two mutexes, one condition variable, a pthread, and
a heap.
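The setup above can be sketched in Python (all names here are hypothetical, and threading stands in for pthreads; a real version would wrap malloc/free in C):

```python
import threading
import time

class CachingAllocator:
    """free() parks buffers on an internal heap and signals a condition;
    a background thread wakes, waits N microseconds, then returns the
    cached buffers to the system."""

    def __init__(self, flush_delay):
        self._cond = threading.Condition()   # lock + condition variable
        self._heap = {}                      # size -> cached buffers
        self._delay = flush_delay            # the magic number N
        threading.Thread(target=self._flusher, daemon=True).start()

    def malloc(self, size):
        with self._cond:
            cached = self._heap.get(size)
            if cached:
                return cached.pop()          # grab off the internal heap
        return bytearray(size)               # fall back to system malloc

    def free(self, buf):
        with self._cond:
            self._heap.setdefault(len(buf), []).append(buf)
            self._cond.notify()              # signal the flusher

    def _flusher(self):
        while True:
            with self._cond:
                self._cond.wait()            # sleep on the condition
            time.sleep(self._delay)          # let nearby ops reuse buffers
            with self._cond:
                self._heap.clear()           # "system free"

# With a long delay, a buffer freed by one operation is reused by the next:
alloc = CachingAllocator(flush_delay=60.0)
a = alloc.malloc(1024)
alloc.free(a)
b = alloc.malloc(1024)
print(b is a)   # True: no fresh (kernel-zeroed) allocation was needed
```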

Sturla



Re: [Python-Dev] [numpy wishlist] Interpreter support for temporary elision in third-party classes

2014-06-06 Thread Nathaniel Smith
On 6 Jun 2014 17:07, Sturla Molden sturla.mol...@gmail.com wrote:
 We would in total need two mutexes, one condition variable, a pthread, and
 a heap.

The proposal in my initial email requires zero pthreads, and is
substantially more effective. (Your proposal reduces only the alloc
overhead for large arrays; mine reduces both alloc and memory access
overhead for both large and small arrays.)

-n


Re: [Python-Dev] [numpy wishlist] Interpreter support for temporary elision in third-party classes

2014-06-06 Thread Julian Taylor
On 06.06.2014 04:26, Greg Ewing wrote:
 Nathaniel Smith wrote:
 
 I'd be a little nervous about whether anyone has implemented, say, an
 iadd with side effects such that you can tell whether a copy was made,
 even if the object being copied is immediately destroyed.
 
 I can think of at least one plausible scenario where
 this could occur: the operand is a view object that
 wraps another object, and its __iadd__ method updates
 that other object.
 
 In fact, now that I think about it, exactly this
 kind of thing happens in numpy when you slice an
 array!
 
 So the opt-in indicator would need to be dynamic, on
 a per-object basis, rather than a type flag.
 

yes an opt-in indicator would need to receive both operand objects so it
would need to be a slot in the object or number type object.
Would the addition of a tp_can_elide slot to the object types be
acceptable for this rather specialized case?


tp_can_elide receives two objects and returns one of three values:
* can work inplace, operation is associative
* can work inplace but not associative
* cannot work inplace

Implementation could e.g. look about like this:

TARGET(BINARY_SUBTRACT) {
    fl = left->ob_type->tp_can_elide;
    fr = right->ob_type->tp_can_elide;
    elide = 0;
    if (unlikely(fl)) {
        elide = fl(left, right);
    }
    else if (unlikely(fr)) {
        elide = fr(left, right);
    }
    if (unlikely(elide == YES) && left->ob_refcnt == 1) {
        PyNumber_InPlaceSubtract(left, right);
    }
    else if (unlikely(elide == SWAPPABLE) && right->ob_refcnt == 1) {
        PyNumber_InPlaceSubtract(right, left);
    }
    else {
        PyNumber_Subtract(left, right);
    }
}


Re: [Python-Dev] [numpy wishlist] Interpreter support for temporary elision in third-party classes

2014-06-06 Thread Sturla Molden
Nathaniel Smith n...@pobox.com wrote:

 The proposal in my initial email requires zero pthreads, and is
 substantially more effective. (Your proposal reduces only the alloc
 overhead for large arrays; mine reduces both alloc and memory access
 overhead for both large and small arrays.)

My suggestion prevents the kernel from zeroing pages in the middle of a
computation, which is an important part. It would also be an optimization
the Python interpreter could benefit from independently of NumPy, by allowing
reuse of allocated memory pages within CPU bound portions of the Python
code. And no, the method I suggested does not only work for large arrays.

If we really want to take out the memory access overhead, we need to
consider lazy evaluation. E.g. a context manager that collects a symbolic
expression and triggers evaluation on exit:

with numpy.accelerate:
x = expression
y = expression
z = expression
# evaluation of x,y,z happens here
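One hedged sketch of how collecting a symbolic expression could look (the `Sym` class is purely illustrative; `numpy.accelerate` is a hypothetical API, and a real version would hook assignment rather than require an explicit evaluate call):

```python
class Sym:
    """Operators build an expression tree; nothing is computed until
    evaluate() is called, so the whole pipeline is visible at once."""

    def __init__(self, value=None, op=None, args=()):
        self.value, self.op, self.args = value, op, args

    def __add__(self, other):
        return Sym(op="+", args=(self, other))

    def __mul__(self, other):
        return Sym(op="*", args=(self, other))

    def evaluate(self):
        if self.op is None:          # leaf: a concrete value
            return self.value
        left, right = (a.evaluate() for a in self.args)
        return left + right if self.op == "+" else left * right

x = (Sym(2) + Sym(3)) * Sym(4)   # nothing is computed yet...
print(x.op, x.value)             # * None
print(x.evaluate())              # 20
```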

Sturla



Re: [Python-Dev] [numpy wishlist] Interpreter support for temporary elision in third-party classes

2014-06-06 Thread Nathaniel Smith
On Fri, Jun 6, 2014 at 11:33 PM, Sturla Molden sturla.mol...@gmail.com wrote:
 Nathaniel Smith n...@pobox.com wrote:

 The proposal in my initial email requires zero pthreads, and is
 substantially more effective. (Your proposal reduces only the alloc
 overhead for large arrays; mine reduces both alloc and memory access
 overhead for both large and small arrays.)

 My suggestion prevents the kernel from zeroing pages in the middle of a
 computation, which is an important part. It would also be an optimization
 the Python interpreter could benefit from independently of NumPy, by allowing
 reuse of allocated memory pages within CPU bound portions of the Python
 code. And no, the method I suggested does not only work for large arrays.

Small allocations are already recycled within process and don't touch
the kernel, so your method doesn't affect them at all.

My guess is that PyMalloc is unlikely to start spawning background
threads any time soon, but if you'd like to propose it maybe you
should start a new thread for that?

 If we really want to take out the memory access overhead, we need to
 consider lazy evaluation. E.g. a context manager that collects a symbolic
 expression and triggers evaluation on exit:

 with numpy.accelerate:
 x = expression
 y = expression
 z = expression
 # evaluation of x,y,z happens here

Using an alternative evaluation engine is indeed another way to
optimize execution, which is why projects like numexpr, numba, theano,
etc. exist. But this is basically switching to a different language in
a different VM. I think people will keep running plain-old-CPython
code for some time yet, and the point of this thread is that there's
some low-hanging fruit for making all *that* code faster.

-n

-- 
Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org


Re: [Python-Dev] [numpy wishlist] Interpreter support for temporary elision in third-party classes

2014-06-06 Thread Greg Ewing

Julian Taylor wrote:

tp_can_elide receives two objects and returns one of three values:
* can work inplace, operation is associative
* can work inplace but not associative
* cannot work inplace


Does it really need to be that complicated? Isn't it
sufficient just to ask the object potentially being
overwritten whether it's okay to overwrite it?
I.e. a parameterless method returning a boolean.

--
Greg


Re: [Python-Dev] [numpy wishlist] Interpreter support for temporary elision in third-party classes

2014-06-06 Thread Sturla Molden
Greg Ewing greg.ew...@canterbury.ac.nz wrote:
 Julian Taylor wrote:
 tp_can_elide receives two objects and returns one of three values:
 * can work inplace, operation is associative
 * can work inplace but not associative
 * cannot work inplace
 
 Does it really need to be that complicated? Isn't it
 sufficient just to ask the object potentially being
 overwritten whether it's okay to overwrite it?

How can it know this without help from the interpreter? 

Sturla



Re: [Python-Dev] [numpy wishlist] Interpreter support for temporary elision in third-party classes

2014-06-06 Thread Sturla Molden
Nathaniel Smith n...@pobox.com wrote:

 with numpy.accelerate:
 x = expression
 y = expression
 z = expression
 # evaluation of x,y,z happens here
 
 Using an alternative evaluation engine is indeed another way to
 optimize execution, which is why projects like numexpr, numba, theano,
 etc. exist. But this is basically switching to a different language in
 a different VM.

I was not thinking that complicated. Let us focus on what an unmodified
CPython can do.

A compound expression with arrays can also be seen as a pipeline. Imagine
what would happen if in NumPy 2.0 arithmetic operators returned
coroutines instead of temporary arrays. That way an expression could be
evaluated chunkwise, and the chunks would be small enough to fit in cache.
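The chunkwise pipeline can be sketched with plain generators (lists stand in for arrays, and the chunk size here is illustrative; a real version would pick a cache-sized chunk):

```python
CHUNK = 4  # chunk length; real code would size this to fit in cache

def chunks(seq):
    """Split a sequence into fixed-size chunks."""
    for i in range(0, len(seq), CHUNK):
        yield seq[i:i + CHUNK]

def add(xs, ys):
    """Elementwise add, one chunk at a time -- no full temporary."""
    for cx, cy in zip(xs, ys):
        yield [a + b for a, b in zip(cx, cy)]

def div(xs, ys):
    for cx, cy in zip(xs, ys):
        yield [a / b for a, b in zip(cx, cy)]

def evaluate(stream):
    """Materialize the final result; only here is a full array built."""
    out = []
    for chunk in stream:
        out.extend(chunk)
    return out

a = [1.0] * 8
b = [2.0] * 8
c = [4.0] * 8
# result = (a + b + c) / c, evaluated chunk by chunk:
result = evaluate(div(add(add(chunks(a), chunks(b)), chunks(c)), chunks(c)))
print(result[:2])   # [1.75, 1.75]
```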

Sturla



[Python-Dev] [numpy wishlist] Interpreter support for temporary elision in third-party classes

2014-06-05 Thread Nathaniel Smith
Hi all,

There's a very valuable optimization -- temporary elision -- which
numpy can *almost* do. It gives something like a 10-30% speedup for
lots of common real-world expressions. It would probably would be
useful for non-numpy code too. (In fact it generalizes the str += str
special case that's currently hardcoded in ceval.c.) But it can't be
done safely without help from the interpreter, and possibly not even
then. So I thought I'd raise it here and see if we can get any
consensus on whether and how CPython could support this.

=== The dream ===

Here's the idea. Take an innocuous expression like:

   result = (a + b + c) / c

This gets evaluated as:

   tmp1 = a + b
   tmp2 = tmp1 + c
   result = tmp2 / c

All these temporaries are very expensive. Suppose that a, b, c are
arrays with N bytes each, and N is large. For simple arithmetic like
this, then costs are dominated by memory access. Allocating an N byte
array requires the kernel to clear the memory, which incurs N bytes of
memory traffic. If all the operands are already allocated, then
performing a three-operand operation like tmp1 = a + b involves 3N
bytes of memory traffic (reading the two inputs plus writing the
output). In total our example does 3 allocations and has 9 operands,
so it does 12N bytes of memory access.

If our arrays are small, then the kernel doesn't get involved and some
of these accesses will hit the cache, but OTOH the overhead of things
like malloc won't be amortized out; the best case starting from a cold
cache is 3 mallocs and 6N bytes worth of cache misses (or maybe 5N if
we get lucky and malloc'ing 'result' returns the same memory that tmp1
used, and it's still in cache).

There's an obvious missed optimization in this code, though, which is
that it keeps allocating new temporaries and throwing away old ones.
It would be better to just allocate a temporary once and re-use it:

   tmp1 = a + b
   tmp1 += c
   tmp1 /= c
   result = tmp1

Now we have only 1 allocation and 7 operands, so we touch only 8N
bytes of memory. For large arrays -- that don't fit into cache, and
for which per-op overhead is amortized out -- this gives a theoretical
33% speedup, and we can realistically get pretty close to this. For
smaller arrays, the re-use of tmp1 means that in the best case we have
only 1 malloc and 4N bytes worth of cache misses, and we also have a
smaller cache footprint, which means this best case will be achieved
more often in practice. For small arrays it's harder to estimate the
total speedup here, but 66% fewer mallocs and 33% fewer cache misses
is certainly enough to make a practical difference.
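The accounting above can be checked directly (this just mirrors the email's model: each allocation costs N bytes of kernel zeroing, each operand read or written costs N bytes):

```python
def traffic(n_allocations, n_operands):
    """Total memory traffic, in units of N bytes."""
    return n_allocations + n_operands

naive = traffic(3, 9)    # tmp1 = a+b; tmp2 = tmp1+c; result = tmp2/c
elided = traffic(1, 7)   # tmp1 = a+b; tmp1 += c; tmp1 /= c
print(naive, elided)         # 12 8
print(1 - elided / naive)    # ~0.333 -> the theoretical 33% speedup
```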

Such optimizations are important enough that numpy operations always
give the option of explicitly specifying the output array (like
in-place operators but more general and with clumsier syntax). Here's
an example small-array benchmark that IIUC uses Jacobi iteration to
solve Laplace's equation. It's been written in both natural and
hand-optimized formats (compare num_update to num_inplace):

   https://yarikoptic.github.io/numpy-vbench/vb_vb_app.html#laplace-inplace

num_inplace is totally unreadable, but because we've manually elided
temporaries, it's 10-15% faster than num_update. With our prototype
automatic temporary elision turned on, this difference disappears --
the natural code gets 10-15% faster, *and* we remove the temptation to
write horrible things like num_inplace.

What do I mean by "automatic temporary elision"? It's *almost*
possible for numpy to automatically convert the first example into the
second. The idea is: we want to replace

  tmp2 = tmp1 + c

with

  tmp1 += c
  tmp2 = tmp1

And we can do this by defining

   def __add__(self, other):
   if is_about_to_be_thrown_away(self):
   return self.__iadd__(other)
   else:
   ...

now tmp1.__add__(c) does an in-place add and returns tmp1, no
allocation occurs, woohoo.

The only little problem is implementing is_about_to_be_thrown_away().

=== The sneaky-but-flawed approach ===

The following implementation may make you cringe, but it comes
tantalizingly close to working:

bool is_about_to_be_thrown_away(PyObject * obj) {
return (Py_REFCNT(obj) == 1);
}
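A Python-level sketch of the same dispatch (hedged: the real check is `Py_REFCNT(obj) == 1` in C, which has no reliable Python-level analogue, so an explicit `temp` flag stands in for it here; `Arr` is a toy stand-in for an array type):

```python
class Arr:
    def __init__(self, data, temp=False):
        self.data = list(data)
        self.temp = temp   # stand-in for "the interpreter holds the only ref"

    def __iadd__(self, other):
        self.data = [a + b for a, b in zip(self.data, other.data)]
        return self

    def __add__(self, other):
        if self.temp:                    # C version: Py_REFCNT(self) == 1
            return self.__iadd__(other)  # steal the dying temporary's buffer
        # otherwise allocate a fresh result, itself marked as a temporary
        return Arr([a + b for a, b in zip(self.data, other.data)], temp=True)

a = Arr([1, 2]); b = Arr([3, 4]); c = Arr([5, 6])
tmp = a + b         # fresh allocation, flagged as a temporary
result = tmp + c    # elided: tmp is updated in place and returned
print(result is tmp, result.data)   # True [9, 12]
```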

In fact, AFAICT it's 100% correct for libraries being called by
regular python code (which is why I'm able to quote benchmarks at you
:-)). The bytecode eval loop always holds a reference to all operands,
and then immediately DECREFs them after the operation completes. If
one of our arguments has no other references besides this one, then we
can be sure that it is a dead obj walking, and steal its corpse.

But this has a fatal flaw: people are unreasonable creatures, and
sometimes they call Python libraries without going through ceval.c
:-(. It's legal for random C code to hold an array object with a
single reference count, and then call PyNumber_Add on it, and then
expect the original array object to still be valid. But who writes
code like that in practice? Well, Cython does. So, this is no-go.

Re: [Python-Dev] [numpy wishlist] Interpreter support for temporary elision in third-party classes

2014-06-05 Thread Paul Moore
On 5 June 2014 21:51, Nathaniel Smith n...@pobox.com wrote:
 Is there a better idea I'm missing?

Just a thought, but the temporaries come from the stack manipulation
done by the likes of the BINARY_ADD opcode. (After all the bytecode
doesn't use temporaries, it's a stack machine). Maybe BINARY_ADD and
friends could allow for an alternative fast calling convention for
__add__ implementations that uses the stack slots directly? This may be
something that's only plausible from C code, though. Or may not be
plausible at all. I haven't looked at ceval.c for many years...

If this is an insane idea, please feel free to ignore me :-)

Paul


Re: [Python-Dev] [numpy wishlist] Interpreter support for temporary elision in third-party classes

2014-06-05 Thread Nathaniel Smith
On Thu, Jun 5, 2014 at 10:37 PM, Paul Moore p.f.mo...@gmail.com wrote:
 On 5 June 2014 21:51, Nathaniel Smith n...@pobox.com wrote:
 Is there a better idea I'm missing?

 Just a thought, but the temporaries come from the stack manipulation
 done by the likes of the BINARY_ADD opcode. (After all the bytecode
 doesn't use temporaries, it's a stack machine). Maybe BINARY_ADD and
 friends could allow for an alternative fast calling convention for
 __add__ implementations that uses the stack slots directly? This may be
 something that's only plausible from C code, though. Or may not be
 plausible at all. I haven't looked at ceval.c for many years...

 If this is an insane idea, please feel free to ignore me :-)

To make sure I understand correctly, you're suggesting something like
adding a new set of special method slots, __te_add__, __te_mul__,
etc., which BINARY_ADD and friends would check for and if found,
dispatch to without going through PyNumber_Add? And this way, a type
like numpy's array could have a special implementation for __te_add__
that works the same as __add__, except with the added wrinkle that it
knows that it will only be called by the interpreter and thus any
arguments with refcnt 1 must be temporaries?

-n

-- 
Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org


Re: [Python-Dev] [numpy wishlist] Interpreter support for temporary elision in third-party classes

2014-06-05 Thread Paul Moore
On 5 June 2014 22:47, Nathaniel Smith n...@pobox.com wrote:
 To make sure I understand correctly, you're suggesting something like
 adding a new set of special method slots, __te_add__, __te_mul__,
 etc.

I wasn't thinking in that much detail, TBH. I'm not sure adding a
whole set of new slots is sensible for such a specialised case. I
think I was more assuming that the special method implementations
could use an alternative calling convention, METH_STACK in place of
METH_VARARGS, for example. That would likely only be viable for types
implemented in C.

But either way, it may be more complicated than the advantages would justify...
Paul


Re: [Python-Dev] [numpy wishlist] Interpreter support for temporary elision in third-party classes

2014-06-05 Thread Terry Reedy

On 6/5/2014 4:51 PM, Nathaniel Smith wrote:


In fact, AFAICT it's 100% correct for libraries being called by
regular python code (which is why I'm able to quote benchmarks at you
:-)). The bytecode eval loop always holds a reference to all operands,
and then immediately DECREFs them after the operation completes. If
one of our arguments has no other references besides this one, then we
can be sure that it is a dead obj walking, and steal its corpse.

But this has a fatal flaw: people are unreasonable creatures, and
sometimes they call Python libraries without going through ceval.c
:-(. It's legal for random C code to hold an array object with a
single reference count, and then call PyNumber_Add on it, and then
expect the original array object to still be valid. But who writes
code like that in practice? Well, Cython does. So, this is no-go.


I understand that a lot of numpy/scipy code is compiled with Cython, so 
you really want the optimization to continue working when so compiled. 
Is there a simple change to Cython that would work, perhaps in 
coordination with a change to numpy? Is so, you could get the result 
before 3.5 comes out.


I realized that there are other compilers than Cython and non-numpy code 
that could benefit, so that a more generic solution would also be good. 
In particular


 Here's the idea. Take an innocuous expression like:

 result = (a + b + c) / c

 This gets evaluated as:

 tmp1 = a + b
 tmp2 = tmp1 + c
 result = tmp2 / c
...
 There's an obvious missed optimization in this code, though, which is
 that it keeps allocating new temporaries and throwing away old ones.
 It would be better to just allocate a temporary once and re-use it:
 tmp1 = a + b
 tmp1 += c
 tmp1 /= c
 result = tmp1

Could this transformation be done in the ast? And would that help?

A prolonged discussion might be better on python-ideas. See what others say.

--
Terry Jan Reedy



Re: [Python-Dev] [numpy wishlist] Interpreter support for temporary elision in third-party classes

2014-06-05 Thread Nathaniel Smith
On Thu, Jun 5, 2014 at 11:12 PM, Paul Moore p.f.mo...@gmail.com wrote:
 On 5 June 2014 22:47, Nathaniel Smith n...@pobox.com wrote:
 To make sure I understand correctly, you're suggesting something like
 adding a new set of special method slots, __te_add__, __te_mul__,
 etc.

 I wasn't thinking in that much detail, TBH. I'm not sure adding a
 whole set of new slots is sensible for such a specialised case. I
 think I was more assuming that the special method implementations
 could use an alternative calling convention, METH_STACK in place of
 METH_VARARGS, for example. That would likely only be viable for types
 implemented in C.

 But either way, it may be more complicated than the advantages would 
 justify...

Oh, I see, that's clever. But, unfortunately most __special__ methods
at the C level don't use METH_*, they just have hard-coded calling
conventions:
  https://docs.python.org/3/c-api/typeobj.html#number-structs

-- 
Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org


Re: [Python-Dev] [numpy wishlist] Interpreter support for temporary elision in third-party classes

2014-06-05 Thread Nikolaus Rath
Nathaniel Smith n...@pobox.com writes:
 Such optimizations are important enough that numpy operations always
 give the option of explicitly specifying the output array (like
 in-place operators but more general and with clumsier syntax). Here's
 an example small-array benchmark that IIUC uses Jacobi iteration to
 solve Laplace's equation. It's been written in both natural and
 hand-optimized formats (compare num_update to num_inplace):

https://yarikoptic.github.io/numpy-vbench/vb_vb_app.html#laplace-inplace

 num_inplace is totally unreadable, but because we've manually elided
 temporaries, it's 10-15% faster than num_update. 

Does it really have to be that ugly? Shouldn't using

  tmp += u[2:,1:-1]
  tmp *= dy2
  
instead of

  np.add(tmp, u[2:,1:-1], out=tmp)
  np.multiply(tmp, dy2, out=tmp)

give the same performance? (yes, not as nice as what you're proposing,
but I'm still curious).


Best,
-Nikolaus

-- 
GPG encrypted emails preferred. Key id: 0xD113FCAC3C4E599F
Fingerprint: ED31 791B 2C5C 1613 AF38 8B8A D113 FCAC 3C4E 599F

 »Time flies like an arrow, fruit flies like a Banana.«


Re: [Python-Dev] [numpy wishlist] Interpreter support for temporary elision in third-party classes

2014-06-05 Thread Nathaniel Smith
On 6 Jun 2014 02:16, Nikolaus Rath nikol...@rath.org wrote:

 Nathaniel Smith n...@pobox.com writes:
  Such optimizations are important enough that numpy operations always
  give the option of explicitly specifying the output array (like
  in-place operators but more general and with clumsier syntax). Here's
  an example small-array benchmark that IIUC uses Jacobi iteration to
  solve Laplace's equation. It's been written in both natural and
  hand-optimized formats (compare num_update to num_inplace):
 
 
https://yarikoptic.github.io/numpy-vbench/vb_vb_app.html#laplace-inplace
 
  num_inplace is totally unreadable, but because we've manually elided
  temporaries, it's 10-15% faster than num_update.

 Does it really have to be that ugly? Shouldn't using

   tmp += u[2:,1:-1]
   tmp *= dy2

 instead of

   np.add(tmp, u[2:,1:-1], out=tmp)
   np.multiply(tmp, dy2, out=tmp)

 give the same performance? (yes, not as nice as what you're proposing,
 but I'm still curious).

Yes, only the last line actually requires the out= syntax, everything else
could use in place operators instead (and automatic temporary elision
wouldn't work for the last line anyway). I guess whoever wrote it did it
that way for consistency (and perhaps in hopes of eking out a tiny bit more
speed - in numpy currently the in-place operators are implemented by
dispatching to function calls like those).

Not sure how much difference it really makes in practice though. It'd still
be 8 statements and two named temporaries to do the work of one infix
expression, with order of operations implicit.

-n


Re: [Python-Dev] [numpy wishlist] Interpreter support for temporary elision in third-party classes

2014-06-05 Thread Nathaniel Smith
On 5 Jun 2014 23:58, Terry Reedy tjre...@udel.edu wrote:

 On 6/5/2014 4:51 PM, Nathaniel Smith wrote:

 In fact, AFAICT it's 100% correct for libraries being called by
 regular python code (which is why I'm able to quote benchmarks at you
 :-)). The bytecode eval loop always holds a reference to all operands,
 and then immediately DECREFs them after the operation completes. If
 one of our arguments has no other references besides this one, then we
 can be sure that it is a dead obj walking, and steal its corpse.

 But this has a fatal flaw: people are unreasonable creatures, and
 sometimes they call Python libraries without going through ceval.c
 :-(. It's legal for random C code to hold an array object with a
 single reference count, and then call PyNumber_Add on it, and then
 expect the original array object to still be valid. But who writes
 code like that in practice? Well, Cython does. So, this is no-go.


 I understand that a lot of numpy/scipy code is compiled with Cython, so
you really want the optimization to continue working when so compiled. Is
there a simple change to Cython that would work, perhaps in coordination
with a change to numpy? Is so, you could get the result before 3.5 comes
out.

Unfortunately we don't actually know whether Cython is the only culprit
(such code *could* be written by hand), and even if we fixed Cython it
would take some unknowable amount of time before all downstream users
upgraded their Cythons. (It's pretty common for projects to check in
Cython-generated .c files, and only regenerate when the Cython source
actually gets modified.) Pretty risky for an optimization.

 I realized that there are other compilers than Cython and non-numpy code
that could benefit, so that a more generic solution would also be good. In
particular

  Here's the idea. Take an innocuous expression like:
 
  result = (a + b + c) / c
 
  This gets evaluated as:
 
  tmp1 = a + b
  tmp2 = tmp1 + c
  result = tmp2 / c
 ...

  There's an obvious missed optimization in this code, though, which is
  that it keeps allocating new temporaries and throwing away old ones.
  It would be better to just allocate a temporary once and re-use it:
  tmp1 = a + b
  tmp1 += c
  tmp1 /= c
  result = tmp1

 Could this transformation be done in the ast? And would that help?

I don't think it could be done in the ast because I don't think you can
work with anonymous temporaries there. But, now that you mention it, it
could be done on the fly in the implementation of the relevant opcodes.
I.e., BIN_ADD could do

if (Py_REFCNT(left) == 1)
    result = PyNumber_InPlaceAdd(left, right);
else
    result = PyNumber_Add(left, right);

Upside: all packages automagically benefit!

Potential downsides to consider:
- Subtle but real and user-visible change in Python semantics. I'd be a
little nervous about whether anyone has implemented, say, an __iadd__ with
side effects such that you can tell whether a copy was made, even if the
object being copied is immediately destroyed. Maybe this doesn't make sense
though.
- Only works when the left operand is the temporary (remember that a*b + c
is faster than c + a*b), and only for arithmetic (no benefit for np.sin(a +
b)). Probably does cover the majority of cases though.
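(For illustration only, that opcode tweak can be caricatured in pure Python. sys.getrefcount cannot reproduce the C-level Py_REFCNT(left) == 1 test exactly, since it sees extra references from the call machinery, so the threshold below is a rough, version-dependent assumption:)

```python
import operator
import sys

def elided_add(left, right):
    # Python-level caricature of the proposed BIN_ADD change. The
    # refcount threshold is an assumption: from Python, `left` also
    # carries references from this frame's argument slot and from
    # getrefcount's own argument, so "refcount 1" at the C level shows
    # up here as a small constant that varies across CPython versions.
    if sys.getrefcount(left) <= 3 and hasattr(type(left), '__iadd__'):
        return operator.iadd(left, right)   # reuse the doomed temporary
    return operator.add(left, right)        # normal out-of-place add

x = [1, 2]
print(elided_add(x, [3]))          # [1, 2, 3] either way
print(elided_add([1, 2], [3]))     # [1, 2, 3], possibly reusing the temp
```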

 A prolonged discussion might be better on python-ideas. See what others
say.

Yeah, I wasn't sure which list to use for this one, happy to move if it
would work better.

-n


Re: [Python-Dev] [numpy wishlist] Interpreter support for temporary elision in third-party classes

2014-06-05 Thread Chris Angelico
On Fri, Jun 6, 2014 at 11:47 AM, Nathaniel Smith n...@pobox.com wrote:
 Unfortunately we don't actually know whether Cython is the only culprit
 (such code *could* be written by hand), and even if we fixed Cython it would
 take some unknowable amount of time before all downstream users upgraded
 their Cythons. (It's pretty common for projects to check in Cython-generated
 .c files, and only regenerate when the Cython source actually gets
 modified.) Pretty risky for an optimization.

But code will still work, right? I mean, you miss out on an
optimization, but it won't actually be wrong code? It should be
possible to say "After upgrading to Cython version x.y, regenerate all
your .c files to take advantage of this new optimization."

ChrisA


Re: [Python-Dev] [numpy wishlist] Interpreter support for temporary elision in third-party classes

2014-06-05 Thread Greg Ewing

Nathaniel Smith wrote:

I.e., BIN_ADD could do

if (Py_REFCNT(left) == 1)
    result = PyNumber_InPlaceAdd(left, right);
else
    result = PyNumber_Add(left, right);

Upside: all packages automagically benefit!

Potential downsides to consider:
- Subtle but real and user-visible change in Python semantics.


That would be a real worry. Even if such cases were rare,
they'd be damnably difficult to debug when they did occur.

I think for safety's sake this should only be done if the
type concerned opts in somehow, perhaps by a tp_flag
indicating that the type is eligible for temporary
elision.

--
Greg


Re: [Python-Dev] [numpy wishlist] Interpreter support for temporary elision in third-party classes

2014-06-05 Thread Sturla Molden

On 05/06/14 22:51, Nathaniel Smith wrote:


This gets evaluated as:

tmp1 = a + b
tmp2 = tmp1 + c
result = tmp2 / c

All these temporaries are very expensive. Suppose that a, b, c are
arrays with N bytes each, and N is large. For simple arithmetic like
this, the costs are dominated by memory access. Allocating an N byte
array requires the kernel to clear the memory, which incurs N bytes of
memory traffic.


It seems to be the case that a large portion of the run-time in Python 
code using NumPy can be spent in the kernel zeroing pages (which the 
kernel does for security reasons).


I think this can also be seen as a 'malloc problem'. It comes about 
because each new NumPy array starts with a fresh buffer allocated by 
malloc. Perhaps buffers can be reused?


Sturla



Re: [Python-Dev] [numpy wishlist] Interpreter support for temporary elision in third-party classes

2014-06-05 Thread Greg Ewing

Nathaniel Smith wrote:

I'd be a 
little nervous about whether anyone has implemented, say, an __iadd__ with 
side effects such that you can tell whether a copy was made, even if the 
object being copied is immediately destroyed.


I can think of at least one plausible scenario where
this could occur: the operand is a view object that
wraps another object, and its __iadd__ method updates
that other object.

In fact, now that I think about it, exactly this
kind of thing happens in numpy when you slice an
array!

So the opt-in indicator would need to be dynamic, on
a per-object basis, rather than a type flag.
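(A pure-Python sketch of this hazard, with a made-up View class loosely modelling a slice that writes through to its base object:)

```python
class View:
    """Wraps a list and exposes a window [start, stop) into it."""

    def __init__(self, base, start, stop):
        self.base, self.start, self.stop = base, start, stop

    def __add__(self, other):
        # Out-of-place: build a fresh list; the base is untouched.
        return [x + other for x in self.base[self.start:self.stop]]

    def __iadd__(self, other):
        # In-place: mutates the wrapped base object.
        for i in range(self.start, self.stop):
            self.base[i] += other
        return self

base = [0, 0, 0, 0]
result = View(base, 1, 3) + 1      # the view is a refcount-1 temporary
print(base)                        # [0, 0, 0, 0] -- unchanged

# Had the interpreter silently rewritten that `+` as `+=` because the
# view was a temporary, base would now read [0, 1, 1, 0]: user-visible.
```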

--
Greg


Re: [Python-Dev] [numpy wishlist] Interpreter support for temporary elision in third-party classes

2014-06-05 Thread Nikolaus Rath
Nathaniel Smith n...@pobox.com writes:
  tmp1 = a + b
  tmp1 += c
  tmp1 /= c
  result = tmp1

 Could this transformation be done in the ast? And would that help?

 I don't think it could be done in the ast because I don't think you can
 work with anonymous temporaries there. But, now that you mention it, it
 could be done on the fly in the implementation of the relevant opcodes.
 I.e., BIN_ADD could do

 if (Py_REFCNT(left) == 1)
     result = PyNumber_InPlaceAdd(left, right);
 else
     result = PyNumber_Add(left, right);

 Upside: all packages automagically benefit!

 Potential downsides to consider:
 - Subtle but real and user-visible change in Python semantics. I'd be a
 little nervous about whether anyone has implemented, say, an __iadd__ with side
 effects such that you can tell whether a copy was made, even if the object
 being copied is immediately destroyed. Maybe this doesn't make sense
 though.

Hmm. I don't think this is as unlikely as it may sound. Consider e.g. the
h5py module:

with h5py.File('database.h5') as fh:
    result = fh['key'] + np.ones(42)

if this were transformed to

with h5py.File('database.h5') as fh:
    tmp = fh['key']
    tmp += np.ones(42)
    result = tmp

then the database.h5 file would get modified, *and* result would be of
type h5py.Dataset rather than np.array.
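(The same point in runnable form, with a made-up Dataset class standing in for h5py.Dataset; a plain dict plays the role of the file here:)

```python
class Dataset:
    """Toy stand-in for h5py.Dataset: a lazy handle into a store."""

    def __init__(self, store, key):
        self.store, self.key = store, key

    def __add__(self, other):
        # Out-of-place add: returns a plain list, store untouched.
        return [x + y for x, y in zip(self.store[self.key], other)]

    def __iadd__(self, other):
        # In-place add: writes the result back into the "file".
        self.store[self.key] = [x + y
                                for x, y in zip(self.store[self.key], other)]
        return self

store = {'key': [1, 2, 3]}
result = Dataset(store, 'key') + [1, 1, 1]

print(type(result).__name__)   # list, not Dataset
print(store['key'])            # [1, 2, 3] -- the "file" is untouched

# Rewriting that `+` as `+=` would both modify store['key'] and leave
# `result` as a Dataset: exactly the user-visible change described above.
```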


Best,
-Nikolaus

-- 
GPG encrypted emails preferred. Key id: 0xD113FCAC3C4E599F
Fingerprint: ED31 791B 2C5C 1613 AF38 8B8A D113 FCAC 3C4E 599F

 »Time flies like an arrow, fruit flies like a Banana.«