Re: [Numpy-discussion] Optimizing numpy's einsum expression (again)

2015-01-16 Thread Eelco Hoogendoorn
Thanks for taking the time to think about this; good work.

Personally, I don't think a factor-of-5 memory overhead is much to sweat over.
The most complex einsum I have ever needed in a production environment was
5 or 6 terms, and for what this anecdote is worth, speed was a far
bigger concern to me than memory.

On Fri, Jan 16, 2015 at 12:30 AM, Daniel Smith dgasm...@icloud.com wrote:

 Hello everyone,
 I originally brought an optimized einsum routine forward a few weeks back
 that attempts to contract numpy arrays together in an optimal way. This can
 greatly reduce the scaling and overall cost of the einsum expression for
 the cost of a few intermediate arrays. The current version (and more
 details) can be found here:
 https://github.com/dgasmith/opt_einsum
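
 To make the scaling point concrete, here is a minimal sketch in plain
 numpy (not opt_einsum's actual API): a three-term product evaluated by a
 single einsum call loops over all indices at once, while two pairwise
 contractions trade one small intermediate array for much better scaling.

     import numpy as np

     a = np.random.rand(30, 40)
     b = np.random.rand(40, 50)
     c = np.random.rand(50, 20)

     # One shot: np.einsum currently evaluates this as a single nested
     # loop over all four indices, roughly i*j*k*l operations.
     d1 = np.einsum('ij,jk,kl->il', a, b, c)

     # Pairwise: two matrix-product-sized contractions plus one
     # intermediate array of shape (30, 50).
     ab = np.einsum('ij,jk->ik', a, b)
     d2 = np.einsum('ik,kl->il', ab, c)

     assert np.allclose(d1, d2)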

 I think this routine is close to a finalized version, but there are two
 problems which I would like the community to weigh in on:

 Memory usage-
 Currently the routine only considers the maximum size of the
 intermediates created, rather than cumulative memory usage. If we only
 use np.einsum it is straightforward to calculate cumulative memory usage,
 as einsum does not require any copies of its inputs to be made; however,
 if we attempt to use a vendor BLAS the memory usage can become quite
 complex. For example, the np.tensordot routine always forces a copy for
 ndarrays because it uses ndarray.transpose(…).reshape(…). A more
 memory-conscious way to do this is to first try ndarray.reshape(…).T,
 which does not copy the data, so that numpy can just pass a transpose
 flag to the vendor BLAS (see the sketch after the list below). The caveat
 here is that the summed indices must be in the correct order; if not, a
 copy is required. The maximum array size is usually close to the total
 overhead of the opt_einsum function, but can occasionally be 2-5 times
 this size. I see the following ways forward:

- Ignore cumulative memory and stick with maximum array size as the
limiting memory factor.
    - Implement logic to figure out if the input arrays need to be copied
    to use BLAS, compute the extra memory required, and add an extra
    dimension to the optimization algorithms (use BLAS or do not use BLAS
    at each step). Some of this is already done, but may get fairly
    complex.
- Build an in-place nd-transpose algorithm.
    - Cut out BLAS entirely, keeping in mind that vendor BLAS can be
    orders of magnitude faster than a pure einsum implementation,
    especially when BLAS threading is used.
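
 As a quick check of the copy-vs-view behavior described above (an
 illustrative sketch in plain numpy, not opt_einsum code):

     import numpy as np

     a = np.random.rand(4, 6)

     # transpose first, then reshape: the transposed view is not
     # contiguous, so the reshape is forced to copy the data.
     t = a.T.reshape(24)
     print(t.base is None)       # True: a copy was made

     # reshape first, then transpose: the reshape of a contiguous array
     # is free, and .T only swaps strides, which numpy can hand to a
     # vendor BLAS as a transpose flag.
     v = a.reshape(8, 3).T
     print(v.base is not None)   # True: still a view, no copy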


 Path algorithms-
 There are two algorithms: “optimal” (a brute-force algorithm, scales like
 N!) and “opportunistic” (a greedy algorithm, scales like N^3; a toy
 version of the greedy chooser is sketched after the list below). The
 optimal path can take seconds to minutes to calculate for a 7-10 term
 expression, while the opportunistic path takes microseconds even for 20+
 term expressions. The opportunistic algorithm works surprisingly well and
 appears to obtain the correct scaling in all test cases that I can think
 of. Both algorithms use the maximum array size as a sieve; this is
 beneficial in several respects. The problem occurs when a much-needed
 intermediate cannot be created due to this limit; on occasion, not making
 this intermediate can cause slowdowns of orders of magnitude even for
 small systems. This leads to certain (and sometimes unexpected)
 performance walls. Possible solutions:

- Warn the user if the ratio between an unlimited memory solution and
a limited memory solution becomes large.
- Do not worry about it.
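
 To make the greedy idea concrete, a toy path chooser (a sketch only: the
 real opportunistic algorithm in opt_einsum is more refined, and the cost
 model here is simply the product of the dimensions each contraction
 touches):

     from itertools import combinations

     def greedy_path(terms, dims, output):
         # terms: list of index strings, e.g. ['ij', 'jk', 'kl']
         # dims: mapping from index label to its size
         # output: labels of the final result, e.g. 'il'
         terms = [set(t) for t in terms]
         path = []
         while len(terms) > 1:
             best = None
             for i, j in combinations(range(len(terms)), 2):
                 touched = terms[i] | terms[j]
                 # labels that must survive this contraction
                 rest = set(output).union(
                     *(t for k, t in enumerate(terms) if k not in (i, j)))
                 cost = 1
                 for label in touched:
                     cost *= dims[label]
                 if best is None or cost < best[0]:
                     best = (cost, i, j, touched & rest)
             _, i, j, kept = best
             terms = [t for k, t in enumerate(terms)
                      if k not in (i, j)] + [kept]
             path.append((i, j))
         return path

     # Picks the 'jk,kl' contraction first (40*50*20 flops beats 30*40*50):
     print(greedy_path(['ij', 'jk', 'kl'], dict(i=30, j=40, k=50, l=20), 'il'))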


 Thank you for your time,
 -Daniel Smith




Re: [Numpy-discussion] Optimizing numpy's einsum expression (again)

2015-01-16 Thread Dave Hirschfeld
Daniel Smith dgasmith at icloud.com writes:

 
 Hello everyone, I originally brought an optimized einsum routine
forward a few weeks back that attempts to contract numpy arrays together
in an optimal way. This can greatly reduce the scaling and overall cost
of the einsum expression for the cost of a few intermediate arrays. The
current version (and more details) can be found here:
 https://github.com/dgasmith/opt_einsum
 
 I think this routine is close to a finalized version, but there are 
two problems which I would like the community to weigh in on:
 
 Thank you for your time,
 
 
 -Daniel Smith
 

I wasn't aware of this work, but it's very interesting to me as a user
of einsum whose principal reason for using it is speed. Even though I
use it on largish arrays, I'm only concerned with performance: I'm on
x64 with plenty of memory, even if it were to require temporaries 5x the
original size.

I don't use einsum that much because I've noticed its performance can be
very problem-dependent, so I've always profiled it to check. Hopefully
this work will make the performance more consistent, allowing it to be
used more generally throughout my code.

Thanks,
Dave


* An anecdotal example from a user only.




Re: [Numpy-discussion] Sorting refactor

2015-01-16 Thread Lars Buitinck
2015-01-16 11:55 GMT+01:00  numpy-discussion-requ...@scipy.org:
 Message: 2
 Date: Thu, 15 Jan 2015 21:24:00 -0800
 From: Jaime Fernández del Río jaime.f...@gmail.com
 Subject: [Numpy-discussion] Sorting refactor
 To: Discussion of Numerical Python numpy-discussion@scipy.org
 Message-ID:
 capowhwkf6rnwcrgmcwsmq_lo3hshjgbvlsrn19z-mdpe25e...@mail.gmail.com
 Content-Type: text/plain; charset=utf-8

 These changes will make it easier for me to add a Timsort generic type
 function to numpy's arsenal of sorting routines. And I think they have
 value by cleaning up the source code on their own.

Yes, they do. I've been looking at the sorting functions as well and
I've found the following:

* The code is generally hard to read because it prefers pointers over
indices. I'm wondering if it would get slower using indices. The
closer these algorithms are to the textbook, the easier it is to insert
fancy optimizations.

* The heap sort exploits undefined behavior by using a pointer that
points before the start of the array. However, rewriting it to always
point within the array made it slower. I haven't tried rewriting it
using indices.

* Quicksort has a quadratic time worst case. I think it should be
turned into an introsort [1] for O(n log n) worst case; we have the
heapsort needed to do that.

* Quicksort is robust to repeated elements, but doesn't exploit them.
It can be made to run in linear time if the input array has only O(1)
distinct elements [2]. This may come at the expense of some
performance on arrays with no repeated elements.

* Using optimal sorting networks instead of insertion sort as the base
case can speed up quicksort on float arrays by 5-10%, but only if NaNs
are moved out of the way first so that comparisons become cheaper [3].
This has consequences for the selection algorithms that I haven't
fully worked out yet.

* Using Cilk Plus to parallelize merge sort can make it significantly
faster than quicksort at the expense of only a few lines of code, but
I haven't checked whether Cilk Plus plays nicely with multiprocessing
and other parallelism options (remember the trouble with OpenMP-ified
OpenBLAS?).

This isn't really an answer to your questions, more like a brain dump
from someone who's stared at the same code for a while and did some
experiments. I'm not saying we should implement all of this, but keep
in mind that there are some interesting options besides implementing
timsort.

[1] https://en.wikipedia.org/wiki/Introsort
[2] http://www.sorting-algorithms.com/static/QuicksortIsOptimal.pdf
[3] https://github.com/larsmans/numpy/tree/sorting-nets
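
A compact sketch of the introsort idea [1], in Python for brevity (numpy's
real version would be C inside npy_sort); the three-way partition also
shows how repeated elements can be absorbed cheaply, in the spirit of [2]:

    import heapq
    from math import log2

    def heapsort(a):
        heapq.heapify(a)
        return [heapq.heappop(a) for _ in range(len(a))]

    def introsort(a, depth=None):
        a = list(a)
        if depth is None:
            # recursion budget ~ 2*log2(n); blowing it means the pivots
            # were bad, so fall back to heapsort for O(n log n) worst case
            depth = 2 * max(1, int(log2(max(len(a), 1))))
        if len(a) <= 16:
            return sorted(a)    # stands in for the insertion-sort base case
        if depth == 0:
            return heapsort(a)  # the worst-case guarantee kicks in here
        pivot = a[len(a) // 2]
        lo = [x for x in a if x < pivot]
        eq = [x for x in a if x == pivot]  # equal keys handled in one pass
        hi = [x for x in a if x > pivot]
        return introsort(lo, depth - 1) + eq + introsort(hi, depth - 1)

    data = [5, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5]
    assert introsort(data) == sorted(data)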


Re: [Numpy-discussion] Sorting refactor

2015-01-16 Thread Julian Taylor
On 16.01.2015 12:33, Lars Buitinck wrote:
 2015-01-16 11:55 GMT+01:00  numpy-discussion-requ...@scipy.org:
 Message: 2
 Date: Thu, 15 Jan 2015 21:24:00 -0800
 From: Jaime Fernández del Río jaime.f...@gmail.com
 Subject: [Numpy-discussion] Sorting refactor
 To: Discussion of Numerical Python numpy-discussion@scipy.org
 Message-ID:
 capowhwkf6rnwcrgmcwsmq_lo3hshjgbvlsrn19z-mdpe25e...@mail.gmail.com
 Content-Type: text/plain; charset=utf-8

 These changes will make it easier for me to add a Timsort generic type
 function to numpy's arsenal of sorting routines. And I think they have
 value by cleaning up the source code on their own.
 
 Yes, they do. I've been looking at the sorting functions as well and
 I've found the following:
 
 * The code is generally hard to read because it prefers pointers over
 indices. I'm wondering if it would get slower using indices. The
 closer these algorithms are to the textbook, the easier it is to insert
 fancy optimizations.
 
 * The heap sort exploits undefined behavior by using a pointer that
 points before the start of the array. However, rewriting it to always
 point within the array made it slower. I haven't tried rewriting it
 using indices.
 
 * Quicksort has a quadratic time worst case. I think it should be
 turned into an introsort [1] for O(n log n) worst case; we have the
 heapsort needed to do that.

This probably rarely happens in numeric data, and we do have guaranteed
n log n runtime algorithms available.
But it also is not costly to do; e.g., the selection code is an
introselect instead of a normal quickselect.
I'd say it's not high priority, but if someone wants to do it I don't see
why not.

 
 * Quicksort is robust to repeated elements, but doesn't exploit them.
 It can be made to run in linear time if the input array has only O(1)
 distinct elements [2]. This may come at the expense of some
 performance on arrays with no repeated elements.
 
 * Using optimal sorting networks instead of insertion sort as the base
 case can speed up quicksort on float arrays by 5-10%, but only if NaNs
 are moved out of the way first so that comparisons become cheaper [3].
 This has consequences for the selection algorithms that I haven't
 fully worked out yet.

I was also thinking about this; an advantage of a sorting network is
that it can be vectorized to be significantly faster than an insertion
sort. Handling NaNs should also be directly possible.
The issue is that it's probably too much complicated code for only a very
small gain.

 
 * Using Cilk Plus to parallelize merge sort can make it significantly
 faster than quicksort at the expense of only a few lines of code, but
 I haven't checked whether Cilk Plus plays nicely with multiprocessing
 and other parallelism options (remember the trouble with OpenMP-ified
 OpenBLAS?).

You should also be able to do this with OpenMP tasks, though it will be
a little less efficient, as Cilk+ has a better scheduler for this type of
work.
But I assume you will run into the same trouble as with OpenMP; that
needs testing. Also, Cilk+ in gcc is not really production-ready yet; I
got lots of crashes when I last tried it (it will be in 5.0 though).


 
 This isn't really an answer to your questions, more like a brain dump
 from someone who's stared at the same code for a while and did some
 experiments. I'm not saying we should implement all of this, but keep
 in mind that there are some interesting options besides implementing
 timsort.
 
 [1] https://en.wikipedia.org/wiki/Introsort
 [2] http://www.sorting-algorithms.com/static/QuicksortIsOptimal.pdf
 [3] https://github.com/larsmans/numpy/tree/sorting-nets
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion
 

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Sorting refactor

2015-01-16 Thread Eelco Hoogendoorn
I don't know if there is a general consensus or guideline on these matters,
but I am personally not entirely charmed by the use of behind-the-scenes
parallelism, unless explicitly requested.

Perhaps an algorithm can be made faster, but often these multicore
algorithms are also less efficient, and a less data-dependent way of
putting my cores to good use would have been preferable. Potentially,
other code could slow down due to cache thrashing if too many parallel
tasks run in parallel. I'd rather be in charge of such matters myself;
but I imagine adding a keyword arg for these matters would not be much
trouble?

On Fri, Jan 16, 2015 at 12:43 PM, Julian Taylor 
jtaylor.deb...@googlemail.com wrote:

 On 16.01.2015 12:33, Lars Buitinck wrote:
  2015-01-16 11:55 GMT+01:00  numpy-discussion-requ...@scipy.org:
  Message: 2
  Date: Thu, 15 Jan 2015 21:24:00 -0800
  From: Jaime Fernández del Río jaime.f...@gmail.com
  Subject: [Numpy-discussion] Sorting refactor
  To: Discussion of Numerical Python numpy-discussion@scipy.org
  Message-ID:
  
 capowhwkf6rnwcrgmcwsmq_lo3hshjgbvlsrn19z-mdpe25e...@mail.gmail.com
  Content-Type: text/plain; charset=utf-8
 
  These changes will make it easier for me to add a Timsort generic type
  function to numpy's arsenal of sorting routines. And I think they have
  value by cleaning up the source code on their own.
 
  Yes, they do. I've been looking at the sorting functions as well and
  I've found the following:
 
  * The code is generally hard to read because it prefers pointers over
  indices. I'm wondering if it would get slower using indices. The
  closer these algorithms are to the textbook, the easier it is to insert
  fancy optimizations.
 
  * The heap sort exploits undefined behavior by using a pointer that
  points before the start of the array. However, rewriting it to always
  point within the array made it slower. I haven't tried rewriting it
  using indices.
 
  * Quicksort has a quadratic time worst case. I think it should be
  turned into an introsort [1] for O(n log n) worst case; we have the
  heapsort needed to do that.

 This probably rarely happens in numeric data, and we do have guaranteed
 n log n runtime algorithms available.
 But it also is not costly to do; e.g., the selection code is an
 introselect instead of a normal quickselect.
 I'd say it's not high priority, but if someone wants to do it I don't see
 why not.

 
  * Quicksort is robust to repeated elements, but doesn't exploit them.
  It can be made to run in linear time if the input array has only O(1)
  distinct elements [2]. This may come at the expense of some
  performance on arrays with no repeated elements.
 
  * Using optimal sorting networks instead of insertion sort as the base
  case can speed up quicksort on float arrays by 5-10%, but only if NaNs
  are moved out of the way first so that comparisons become cheaper [3].
  This has consequences for the selection algorithms that I haven't
  fully worked out yet.

 I was also thinking about this; an advantage of a sorting network is
 that it can be vectorized to be significantly faster than an insertion
 sort. Handling NaNs should also be directly possible.
 The issue is that it's probably too much complicated code for only a very
 small gain.

 
  * Using Cilk Plus to parallelize merge sort can make it significantly
  faster than quicksort at the expense of only a few lines of code, but
  I haven't checked whether Cilk Plus plays nicely with multiprocessing
  and other parallelism options (remember the trouble with OpenMP-ified
  OpenBLAS?).

 You should also be able to do this with OpenMP tasks, though it will be
 a little less efficient, as Cilk+ has a better scheduler for this type of
 work.
 But I assume you will run into the same trouble as with OpenMP; that
 needs testing. Also, Cilk+ in gcc is not really production-ready yet; I
 got lots of crashes when I last tried it (it will be in 5.0 though).


 
  This isn't really an answer to your questions, more like a brain dump
  from someone who's stared at the same code for a while and did some
  experiments. I'm not saying we should implement all of this, but keep
  in mind that there are some interesting options besides implementing
  timsort.
 
  [1] https://en.wikipedia.org/wiki/Introsort
  [2] http://www.sorting-algorithms.com/static/QuicksortIsOptimal.pdf
  [3] https://github.com/larsmans/numpy/tree/sorting-nets


Re: [Numpy-discussion] Sorting refactor

2015-01-16 Thread Daπid
On 16 January 2015 at 13:15, Eelco Hoogendoorn
hoogendoorn.ee...@gmail.com wrote:
 Perhaps an algorithm can be made faster, but often these multicore
 algorithms are also less efficient, and a less data-dependent way of
 putting my cores to good use would have been preferable. Potentially,
 other code could slow down due to cache thrashing if too many parallel
 tasks run in parallel. I'd rather be in charge of such matters myself;
 but I imagine adding a keyword arg for these matters would not be much
 trouble?

As I understand it, that is where the strength of Cilk+ lies. It does
not force parallelisation, it just suggests it. The decision to actually
spawn in parallel is made at runtime, depending on the load of the
other cores.


/David.


Re: [Numpy-discussion] Sorting refactor

2015-01-16 Thread Jaime Fernández del Río
On Fri, Jan 16, 2015 at 4:19 AM, Matthew Brett matthew.br...@gmail.com
wrote:

 Hi,

 On Fri, Jan 16, 2015 at 5:24 AM, Jaime Fernández del Río
 jaime.f...@gmail.com wrote:
  Hi all,
 
  I have been taking a deep look at the sorting functionality in numpy, and
  I think it could use a face lift in the form of a big code refactor, to
  get rid of some of the ugliness in the code and make it easier to
  maintain. What I have in mind basically amounts to:

  1. Refactor _new_argsortlike to get rid of code duplication (there are
  two branches, one with buffering, one without, virtually identical, that
  could be merged into a single one).
  2. Modify _new_argsortlike so that it can properly handle byte-swapped
  inputs of any dtype, see gh-5441 (a quick demo of the issue follows this
  quoted list). Add proper handling of types with references, in
  preparation for the rest of the changes.
  3. Add three functions to the npy_sort library: npy_aquicksort,
  npy_aheapsort, npy_amergesort, with a signature compatible with
  PyArray_ArgSortFunc, i.e. (char* start, npy_intp* result, npy_intp
  length, PyArrayObject *arr). These turn out to be almost identical to the
  string and unicode sort functions, but using the dtype's compare function
  to handle comparisons.
  4. Modify PyArray_ArgSort (and PyArray_ArgPartition) to always call
  _new_argsortlike, even when there is no type-specific argsort function,
  by using the newly added npy_axxx functions. This simplifies
  PyArray_ArgSort a lot, gets rid of some of the global variable ugliness
  in the current code, and makes argsorting over a non-contiguous axis more
  memory efficient.
  5. Refactor _new_sortlike similarly to _new_argsortlike.
  6. Modify the npy_quicksort, npy_mergesort and npy_heapsort functions in
  npy_sort to have a signature compatible with PyArray_SortFunc, i.e.
  (char* start, npy_intp length, PyArrayObject *arr). npy_quicksort will no
  longer rely on libc's qsort, but be very similar to the string or unicode
  quicksort functions.
  7. Modify PyArray_Sort (and PyArray_Partition) to always call
  _new_sortlike, similarly to what was done with PyArray_ArgSort. This
  allows completing the removal of the remaining global variable ugliness,
  and brings similar benefits as for argsort.

  These changes will make it easier for me to add a Timsort generic type
  function to numpy's arsenal of sorting routines. And I think they have
  value by cleaning up the source code on their own. So my questions,
  mostly to the poor souls that will have to code-review changes to several
  hundred lines of code:

  1. Does this make sense, or is it better left alone? A subset of 1, 2 and
  5 is a must to address the issues in gh-5441; the rest could arguably be
  left as is.
  2. Would you rather see it submitted as one ginormous PR, or split into 4
  or 5 incremental ones?
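
  A quick illustration of the gh-5441 class of problem from point 2, as a
  Python sketch (hypothetical values; the point is only that byte order
  must not change the result):

      import numpy as np

      # Same values, native and non-native byte order; argsort should
      # give identical results for both.
      native = np.array([3, 1, 2], dtype=np.int32)
      swapped = native.astype(native.dtype.newbyteorder())
      print(np.argsort(native))   # [1 2 0]
      print(np.argsort(swapped))  # must match, regardless of byte order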

 Do you think it would be possible to split this into several PRs, with
 the initial one being the refactoring, and the subsequent ones being
 additions to sorting functionality?


Just to be clear, nothing in the long list of changes I posted earlier is
truly a change in the sorting functionality, except in the case of
quicksort (and argquicksort) for generic types, which would no longer rely
on qsort, but on our own implementation of it, just like every other sort
in numpy.

So yes, the refactor PR can precede new functionality. And it can even be
split into 3 or 4 incremental PRs, to make the reviewers' lives easier.
Since it is most likely Charles and/or Julian who are going to have to
swallow the review pill, I'd like to hear from them how they would like it
best.

Jaime

-- 
(\__/)
( O.o)
(  ) This is Conejo. Copy Conejo into your signature and help him with his
plans for world domination.


Re: [Numpy-discussion] NumPy-Discussion Digest, Vol 100, Issue 28

2015-01-16 Thread Lars Buitinck
2015-01-16 13:29 GMT+01:00  numpy-discussion-requ...@scipy.org:
 Date: Fri, 16 Jan 2015 12:43:43 +0100
 From: Julian Taylor jtaylor.deb...@googlemail.com
 Subject: Re: [Numpy-discussion] Sorting refactor
 To: Discussion of Numerical Python numpy-discussion@scipy.org
 Message-ID: 54b8f96f.7090...@googlemail.com
 Content-Type: text/plain; charset=windows-1252

 On 16.01.2015 12:33, Lars Buitinck wrote:
 * Quicksort has a quadratic time worst case. I think it should be
 turned into an introsort [1] for O(n log n) worst case; we have the
 heapsort needed to do that.

 This probably rarely happens in numeric data, and we do have guaranteed
 n log n runtime algorithms available.

It's no more likely or unlikely than in any other type of data
(AFAIK), but it's possible for an adversary to DoS any (web server)
code that uses np.sort.

 I was also thinking about this; an advantage of a sorting network is
 that it can be vectorized to be significantly faster than an insertion
 sort. Handling NaNs should also be directly possible.

Tried that, and it didn't give any speedup, at least not without
explicit vector instructions.

Just moving the NaNs aside didn't cost anything in my preliminary
benchmarks (without sorting nets); the cost of the operation was
almost exactly compensated by the simpler comparisons.

 The issue is that it's probably too much complicated code for only a very
 small gain.

Maybe. The thing is that the selection algorithms are optimized for
NaNs and seem to depend on the current comparison code. We'd need
distinct TYPE_LT and TYPE_LT_NONAN for each TYPE.
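
In numpy terms the distinction looks roughly like this (a Python sketch of
the comparison semantics only, not the actual C macros):

    import numpy as np

    def lt(a, b):
        # NaN-aware less-than as the float sorts need it: NaN compares
        # as largest, so it sinks to the end of the array
        return a < b or (np.isnan(b) and not np.isnan(a))

    def lt_nonan(a, b):
        # the cheaper comparison, valid only once NaNs have been swept
        # out of the range being sorted
        return a < b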

The sorting nets themselves aren't complicated, just lengthy. My
branch has the length-optimal (not optimally parallel) ones for n <= 16.
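
For flavor, the optimal network for n = 4 written as compare-exchange
pairs (a Python sketch; the branch-free C version is what vectorizes well):

    NET4 = [(0, 1), (2, 3), (0, 2), (1, 3), (1, 2)]  # 5 comparators, optimal

    def net_sort4(a):
        for i, j in NET4:
            if a[j] < a[i]:
                a[i], a[j] = a[j], a[i]  # compare-exchange
        return a

    print(net_sort4([3.0, 1.0, 4.0, 2.0]))  # [1.0, 2.0, 3.0, 4.0]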

 But I assume you will run into the same trouble as with OpenMP; that
 needs testing. Also, Cilk+ in gcc is not really production-ready yet; I
 got lots of crashes when I last tried it (it will be in 5.0 though).

The data parallel constructs tend to crash the compiler, but task
spawning seems to be stable in 4.9.2. I've yet to see how it handles
multiprocessing/fork.

What do you mean by "will be in 5.0"; did they do a big push?


 Date: Fri, 16 Jan 2015 13:28:56 +0100
 From: Daπid davidmen...@gmail.com
 Subject: Re: [Numpy-discussion] Sorting refactor
 To: Discussion of Numerical Python numpy-discussion@scipy.org
 Message-ID:
 CAJhcF=1O5Own_5ydzu+To8HHbm3e66k=iunqreiasdy23dn...@mail.gmail.com
 Content-Type: text/plain; charset=UTF-8

 On 16 January 2015 at 13:15, Eelco Hoogendoorn
 hoogendoorn.ee...@gmail.com wrote:
 Perhaps an algorithm can be made faster, but often these multicore
 algorithms are also less efficient, and a less data-dependent way of
 putting my cores to good use would have been preferable. Potentially,
 other code could slow down due to cache thrashing if too many parallel
 tasks run in parallel. I'd rather be in charge of such matters myself;
 but I imagine adding a keyword arg for these matters would not be much
 trouble?

 As I understand it, that is where the strength of Cilk+ lies. It does
 not force parallelisation, it just suggests it. The decision to actually
 spawn in parallel is made at runtime, depending on the load of the
 other cores.

cilk+ guarantees that the amount of space used by a pool of P threads
is at most P times the stack space used by the sequential version (+ a
constant). The idea is that you can say

for (i = 0; i < 1000000; i++) {
    cilk_spawn f(a[i]);  /* suggests, but does not force, a new task */
}

and it will never create more than P work items in memory, rather than
1e6, even if each f() spawns a bunch itself. Of course, it won't
guarantee that OpenMP will not also spawn P threads, and/or check that
you're one of P processes cooperating on a task using multiprocessing.

Personally I'd rather have an np.setnumthreads() to turn this on or
off for a process and have the runtime distribute work for me instead
of having to do it myself.


Re: [Numpy-discussion] Sorting refactor

2015-01-16 Thread Lars Buitinck
2015-01-16 15:14 GMT+01:00  numpy-discussion-requ...@scipy.org:
 Date: Fri, 16 Jan 2015 06:11:29 -0800
 From: Jaime Fernández del Río jaime.f...@gmail.com
 Subject: Re: [Numpy-discussion] Sorting refactor
 To: Discussion of Numerical Python numpy-discussion@scipy.org

 Most of my proposed original changes do not affect the core sorting
 functionality, just the infrastructure around it. But if we agree that
 sorting has potential for being an actively developed part of the code
 base, then cleaning up its surroundings for clarity makes sense, so I'm
 taking your brain dump as an aye for my proposal. ;-)

It is!


Re: [Numpy-discussion] Sorting refactor

2015-01-16 Thread Eelco Hoogendoorn
I agree; an np.setnumthreads to manage a numpy-global threadpool makes
sense to me.

Of course there are a great many cases where just spawning as many threads
as there are cores is a sensible default, but if this kind of behavior
could not be overridden, I could see that greatly reducing performance in
some of my more complex projects.

On Fri, Jan 16, 2015 at 4:11 PM, Julian Taylor 
jtaylor.deb...@googlemail.com wrote:

 On 01/16/2015 03:14 PM, Lars Buitinck wrote:
  2015-01-16 13:29 GMT+01:00  numpy-discussion-requ...@scipy.org:
  Date: Fri, 16 Jan 2015 12:43:43 +0100
  From: Julian Taylor jtaylor.deb...@googlemail.com
  Subject: Re: [Numpy-discussion] Sorting refactor
  To: Discussion of Numerical Python numpy-discussion@scipy.org
  Message-ID: 54b8f96f.7090...@googlemail.com
  Content-Type: text/plain; charset=windows-1252
 
  On 16.01.2015 12:33, Lars Buitinck wrote:
  * Quicksort has a quadratic time worst case. I think it should be
  turned into an introsort [1] for O(n log n) worst case; we have the
  heapsort needed to do that.
 
  This probably rarely happens in numeric data, and we do have guaranteed
  n log n runtime algorithms available.
 
  It's no more likely or unlikely than in any other type of data
  (AFAIK), but it's possible for an adversary to DoS any (web server)
  code that uses np.sort.

 if you are using numpy where an arbitrary user is allowed to control the
 data passed to a non-isolated environment, you have a problem anyway.
 numpy is far from secure software, and there are likely hundreds of ways
 to produce a DoS and dozens of ways to get code execution in any
 nontrivial numpy-using application.

 
  I was also thinking about this; an advantage of a sorting network is
  that it can be vectorized to be significantly faster than an insertion
  sort. Handling NaNs should also be directly possible.
 
  Tried that, and it didn't give any speedup, at least not without
  explicit vector instructions.
 
  Just moving the NaNs aside didn't cost anything in my preliminary
  benchmarks (without sorting nets); the cost of the operation was
  almost exactly compensated by the simpler comparisons.

 an SSE2 implementation of a 16-entry bitonic sort is available here:
 https://github.com/mischasan/sse2/blob/master/ssesort.c
 There is also a benchmark; on my machine it's 6 times faster than
 insertion sort.
 But again, this would only gain us a 5-10% improvement at best, as the
 partition part of quicksort is still the major time-consuming part.

 
  The issue is that it's probably too much complicated code for only a very
  small gain.
 
  Maybe. The thing is that the selection algorithms are optimized for
  NaNs and seem to depend on the current comparison code. We'd need
  distinct TYPE_LT and TYPE_LT_NONAN for each TYPE.
 
  The sorting nets themselves aren't complicated, just lengthy. My
  branch has the length-optimal (not optimally parallel) ones for
  n <= 16.
 
  But I assume you will run into the same trouble as with OpenMP; that
  needs testing. Also, Cilk+ in gcc is not really production-ready yet; I
  got lots of crashes when I last tried it (it will be in 5.0 though).
 
  The data parallel constructs tend to crash the compiler, but task
  spawning seems to be stable in 4.9.2. I've yet to see how it handles
  multiprocessing/fork.

  What do you mean by "will be in 5.0"; did they do a big push?

 The gcc 5.0 changelog reports full support for Cilk Plus. Also, all bugs
 I have filed have been fixed in 5.0.

 
 
  Date: Fri, 16 Jan 2015 13:28:56 +0100
  From: Daπid davidmen...@gmail.com
  Subject: Re: [Numpy-discussion] Sorting refactor
  To: Discussion of Numerical Python numpy-discussion@scipy.org
  Message-ID:
  CAJhcF=1O5Own_5ydzu+To8HHbm3e66k=iunqreiasdy23dn...@mail.gmail.com
  Content-Type: text/plain; charset=UTF-8
 
  On 16 January 2015 at 13:15, Eelco Hoogendoorn
  hoogendoorn.ee...@gmail.com wrote:
  Perhaps an algorithm can be made faster, but often these multicore
  algorithms are also less efficient, and a less data-dependent way of
  putting my cores to good use would have been preferable. Potentially,
  other code could slow down due to cache thrashing if too many parallel
  tasks run in parallel. I'd rather be in charge of such matters myself;
  but I imagine adding a keyword arg for these matters would not be much
  trouble?
 
  As I understand it, that is where the strength of Cilk+ lies. It does
  not force parallelisation, it just suggests it. The decision to actually
  spawn in parallel is made at runtime, depending on the load of the
  other cores.
 
  cilk+ guarantees that the amount of space used by a pool of P threads
  is at most P times the stack space used by the sequential version (+ a
  constant). The idea is that you can say

  for (i = 0; i < 1000000; i++) {
      cilk_spawn f(a[i]);
  }
 
  and it will never create more than P work items in memory, rather than
  1e6, even if each f() spawns a bunch itself. Of course, it won't
  guarantee that OpenMP will not also spawn P threads and/or 

Re: [Numpy-discussion] Sorting refactor

2015-01-16 Thread Charles R Harris
On Fri, Jan 16, 2015 at 7:11 AM, Jaime Fernández del Río 
jaime.f...@gmail.com wrote:

 On Fri, Jan 16, 2015 at 3:33 AM, Lars Buitinck larsm...@gmail.com wrote:

 2015-01-16 11:55 GMT+01:00  numpy-discussion-requ...@scipy.org:
  Message: 2
  Date: Thu, 15 Jan 2015 21:24:00 -0800
 From: Jaime Fernández del Río jaime.f...@gmail.com
  Subject: [Numpy-discussion] Sorting refactor
  To: Discussion of Numerical Python numpy-discussion@scipy.org
  Message-ID:
  
 capowhwkf6rnwcrgmcwsmq_lo3hshjgbvlsrn19z-mdpe25e...@mail.gmail.com
  Content-Type: text/plain; charset=utf-8
 
 These changes will make it easier for me to add a Timsort generic type
 function to numpy's arsenal of sorting routines. And I think they have
 value by cleaning up the source code on their own.

 Yes, they do. I've been looking at the sorting functions as well and
 I've found the following:

 * The code is generally hard to read because it prefers pointers over
 indices. I'm wondering if it would get slower using indices. The
 closer these algorithms are to the textbook, the easier it is to insert
 fancy optimizations.


 They are harder to read, but so cute to look at! C code just wouldn't feel
 the same without some magical pointer arithmetic thrown in here and there.
 ;-)


Pointers were faster than indexing. That advantage can be hardware
dependent, but it is typical for small numbers of pointers.




 * The heap sort exploits undefined behavior by using a pointer that
 points before the start of the array. However, rewriting it to always
 point within the array made it slower. I haven't tried rewriting it
 using indices.


Fortran uses the same pointer trick for one-based indexing, or at least
the old DEC compilers did ;) There is no reason to avoid it.



 * Quicksort has a quadratic time worst case. I think it should be
 turned into an introsort [1] for O(n log n) worst case; we have the
 heapsort needed to do that.

 * Quicksort is robust to repeated elements, but doesn't exploit them.
 It can be made to run in linear time if the input array has only O(1)
 distinct elements [2]. This may come at the expense of some
 performance on arrays with no repeated elements.


 Java famously changed its library implementation of quicksort to a
 dual-pivot one invented by Vladimir Yaroslavskiy [1], claiming
 substantial performance gains. I tried to implement that for numpy [2],
 but couldn't get it to work any faster than the current code.


For sorting, simple often beats fancy.



 * Using optimal sorting networks instead of insertion sort as the base
 case can speed up quicksort on float arrays by 5-10%, but only if NaNs
 are moved out of the way first so that comparisons become cheaper [3].
 This has consequences for the selection algorithms that I haven't
 fully worked out yet.



I expect the gains here would be for small sorts, which tend to be
dominated by call overhead.


 Even if we stick with selection sort, we should spin it off into an inline
 smallsort function within the npy_sort library, and have quicksort and
 mergesort call the same function, instead of each implementing their own.
 It would make optimizations like the sorting networks easier to implement
 for all sorts. We could even expose it outside npy_sort, as there are a few
 places around the code base that have ad-hoc implementations of sorting.


Good idea, I've thought of doing it myself.



 * Using Cilk Plus to parallelize merge sort can make it significantly
 faster than quicksort at the expense of only a few lines of code, but
 I haven't checked whether Cilk Plus plays nicely with multiprocessing
 and other parallelism options (remember the trouble with OpenMP-ified
 OpenBLAS?).

 This isn't really an answer to your questions, more like a brain dump
 from someone who's stared at the same code for a while and did some
 experiments. I'm not saying we should implement all of this, but keep
 in mind that there are some interesting options besides implementing
 timsort.


 Timsort came up in a discussion several months ago, where I proposed
 adding a mergesorted function (which I have mostly ready, by the way,
 [3]) to speed up some operations in arraysetops. I have serious doubts
 that it will perform comparably to the other sorts unless comparisons are
 terribly expensive, which they typically aren't in numpy, but it has been
 an interesting learning exercise so far, and I don't mind taking it all
 the way.

 Most of my proposed original changes do not affect the core sorting
 functionality, just the infrastructure around it. But if we agree that
 sorting has potential for being an actively developed part of the code
 base, then cleaning up its surroundings for clarity makes sense, so I'm
 taking your brain dump as an aye for my proposal. ;-)
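
The mergesorted idea mentioned above can be sketched in a few lines of
numpy (a hypothetical helper, not the actual proposed implementation):
given two already-sorted arrays, compute where one slots into the other
instead of sorting the concatenation.

    import numpy as np

    def mergesorted(a, b):
        # final positions of b's elements within the merged output
        idx = np.searchsorted(a, b) + np.arange(len(b))
        out = np.empty(len(a) + len(b), dtype=np.result_type(a, b))
        mask = np.zeros(len(out), dtype=bool)
        mask[idx] = True
        out[mask] = b
        out[~mask] = a
        return out

    a = np.array([1, 3, 5, 7])
    b = np.array([2, 4, 6])
    print(mergesorted(a, b))  # [1 2 3 4 5 6 7]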


I have a generic quicksort with standard interface sitting around somewhere
in an ancient branch. Sorting objects needs to be sensitive to comparison
exceptions, which is something to keep in mind. I'd also like to 
