Re: [Numpy-discussion] Optimizing numpy's einsum expression (again)
Thanks for taking the time to think about this; good work. Personally, I don't think a factor-5 memory overhead is much to sweat over. The most complex einsum I have ever needed in a production environment was 5-6 terms, and for what this anecdote is worth, speed was a far bigger concern to me than memory.

On Fri, Jan 16, 2015 at 12:30 AM, Daniel Smith dgasm...@icloud.com wrote:

Hello everyone, I originally brought an optimized einsum routine forward a few weeks back that attempts to contract numpy arrays together in an optimal way. This can greatly reduce the scaling and overall cost of the einsum expression at the cost of a few intermediate arrays. The current version (and more details) can be found here: https://github.com/dgasmith/opt_einsum

I think this routine is close to a finalized version, but there are two problems which I would like the community to weigh in on:

Memory usage- Currently the routine only considers the maximum size of the intermediates created, rather than cumulative memory usage. If we only use np.einsum, it is straightforward to calculate cumulative memory usage, as einsum does not require any copies of the inputs to be made; however, if we attempt to use a vendor BLAS, the memory usage can become quite complex. For example, the np.tensordot routine always forces a copy for ndarrays because it uses ndarray.transpose(…).reshape(…). A more memory-conscious way to do this is to first try ndarray.reshape(…).T, which does not copy the data, so numpy can just pass a transpose flag to the vendor BLAS. The caveat here is that the summed indices must be in the correct order; if not, a copy is required. The maximum array size is usually close to the total overhead of the opt_einsum function, but can occasionally be 2-5 times this size. I see the following ways forward:

- Ignore cumulative memory and stick with maximum array size as the limiting memory factor.
- Implement logic to figure out if the input arrays need to be copied to use BLAS, compute the extra memory required, and add an extra dimension to the optimization algorithms (use BLAS or do not use BLAS at each step). Some of this is already done, but it may get fairly complex.
- Build an in-place nd-transpose algorithm.
- Cut out BLAS entirely. Keep in mind that vendor BLAS can be orders of magnitude faster than a pure einsum implementation, especially if BLAS threading is used.

Path algorithms- There are two algorithms, “optimal” (a brute-force algorithm, scales like N!) and “opportunistic” (a greedy algorithm, scales like N^3). The optimal path can take seconds to minutes to calculate for a 7-10 term expression, while the opportunistic path takes microseconds even for 20+ term expressions. The opportunistic algorithm works surprisingly well and appears to obtain the correct scaling in all test cases that I can think of. Both algorithms use the maximum array size as a sieve; this is very beneficial in several respects. The problem occurs when a much-needed intermediate cannot be created due to this limit; on occasion, not making this intermediate can cause slowdowns of orders of magnitude even for small systems. This leads to certain (and sometimes unexpected) performance walls. Possible solutions:

- Warn the user if the ratio between an unlimited-memory solution and a limited-memory solution becomes large.
- Do not worry about it.

Thank you for your time,
-Daniel Smith

___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
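The scaling win a contraction path buys can be reproduced with plain np.einsum by hand-picking intermediates; this is a two-step version of what opt_einsum automates (sizes here are illustrative):

```python
import numpy as np

n = 30
A = np.random.rand(n, n)
B = np.random.rand(n, n)
C = np.random.rand(n, n)

# Single-shot einsum loops over all four indices at once: O(n^4) work.
naive = np.einsum('ij,jk,kl->il', A, B, C)

# Contracting pairwise costs one n*n intermediate array but is O(n^3)
# overall -- the kind of path the "opportunistic" (greedy) algorithm picks.
AB = np.einsum('ij,jk->ik', A, B)
paired = np.einsum('ik,kl->il', AB, C)

print(np.allclose(naive, paired))  # True: same result, better scaling
```

The memory question in the email is exactly about intermediates like `AB`: the optimizer must decide whether creating them (and possibly copying inputs for BLAS) stays under the memory sieve.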
Re: [Numpy-discussion] Optimizing numpy's einsum expression (again)
Daniel Smith dgasmith at icloud.com writes:

Hello everyone, I originally brought an optimized einsum routine forward a few weeks back that attempts to contract numpy arrays together in an optimal way. This can greatly reduce the scaling and overall cost of the einsum expression at the cost of a few intermediate arrays. The current version (and more details) can be found here: https://github.com/dgasmith/opt_einsum I think this routine is close to a finalized version, but there are two problems which I would like the community to weigh in on: [...] Thank you for your time, -Daniel Smith

I wasn't aware of this work, but it's very interesting to me as a user of einsum whose principal reason for using it is speed. Even though I use it on largish arrays, I'm only concerned with the performance, as I'm on x64 with plenty of memory, even were it to require temporaries 5x the original size. I don't use einsum that much because I've noticed the performance can be very problem-dependent, so I've always profiled it to check. Hopefully this work will make the performance more consistent, allowing it to be used more generally throughout my code.

Thanks, Dave

* An anecdotal example from a user only.
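The problem-dependent performance Dave mentions is easy to check for oneself: for a plain matrix product, np.einsum (which does not dispatch to BLAS) can be much slower than np.dot, so a quick profile like this sketch is worth doing (timings vary by machine and BLAS, so none are asserted here):

```python
import time
import numpy as np

a = np.random.rand(500, 500)
b = np.random.rand(500, 500)

for name, fn in [("einsum", lambda: np.einsum('ij,jk->ik', a, b)),
                 ("dot   ", lambda: np.dot(a, b))]:
    t0 = time.perf_counter()
    fn()
    print(f"{name}: {time.perf_counter() - t0:.4f} s")

# Both routes compute the same product; only the speed differs.
print(np.allclose(np.einsum('ij,jk->ik', a, b), np.dot(a, b)))  # True
```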
Re: [Numpy-discussion] Sorting refactor
2015-01-16 11:55 GMT+01:00 numpy-discussion-requ...@scipy.org:

Date: Thu, 15 Jan 2015 21:24:00 -0800
From: Jaime Fernández del Río jaime.f...@gmail.com
Subject: [Numpy-discussion] Sorting refactor

These changes will make it easier for me to add a Timsort generic-type function to numpy's arsenal of sorting routines. And I think they have value by cleaning the source code on their own.

Yes, they do. I've been looking at the sorting functions as well and I've found the following:

* The code is generally hard to read because it prefers pointers over indices. I'm wondering if it would get slower using indices. The closer these algorithms are to the textbook, the easier it is to insert fancy optimizations.
* The heap sort exploits undefined behavior by using a pointer that points before the start of the array. However, rewriting it to always point within the array made it slower. I haven't tried rewriting it using indices.
* Quicksort has a quadratic-time worst case. I think it should be turned into an introsort [1] for O(n log n) worst case; we have the heapsort needed to do that.
* Quicksort is robust to repeated elements, but doesn't exploit them. It can be made to run in linear time if the input array has only O(1) distinct elements [2]. This may come at the expense of some performance on arrays with no repeated elements.
* Using optimal sorting networks instead of insertion sort as the base case can speed up quicksort on float arrays by 5-10%, but only if NaNs are moved out of the way first so that comparisons become cheaper [3]. This has consequences for the selection algorithms that I haven't fully worked out yet.
* Using Cilk Plus to parallelize merge sort can make it significantly faster than quicksort at the expense of only a few lines of code, but I haven't checked whether Cilk Plus plays nicely with multiprocessing and other parallelism options (remember the trouble with OpenMP-ified OpenBLAS?).

This isn't really an answer to your questions, more like a brain dump from someone who's stared at the same code for a while and done some experiments. I'm not saying we should implement all of this, but keep in mind that there are some interesting options besides implementing timsort.

[1] https://en.wikipedia.org/wiki/Introsort
[2] http://www.sorting-algorithms.com/static/QuicksortIsOptimal.pdf
[3] https://github.com/larsmans/numpy/tree/sorting-nets
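The introsort of [1] is just quicksort with a recursion-depth budget: when the budget runs out, fall back to the O(n log n) heapsort that already exists. A pure-Python sketch of the idea (numpy's version would live in C next to npy_heapsort; heapq stands in for it here):

```python
import math
import heapq

def introsort(a):
    """Quicksort with a depth budget; past the budget, heapsort the slice,
    bounding the worst case at O(n log n). Sketch only."""
    def heapsort_slice(lo, hi):
        heap = a[lo:hi]
        heapq.heapify(heap)
        a[lo:hi] = [heapq.heappop(heap) for _ in range(len(heap))]

    def sort(lo, hi, depth):
        while hi - lo > 1:
            if depth == 0:
                heapsort_slice(lo, hi)  # adversarial input: bail out
                return
            depth -= 1
            pivot = a[(lo + hi) // 2]
            i, j = lo, hi - 1            # Hoare-style partition
            while i <= j:
                while a[i] < pivot: i += 1
                while a[j] > pivot: j -= 1
                if i <= j:
                    a[i], a[j] = a[j], a[i]
                    i += 1
                    j -= 1
            sort(lo, j + 1, depth)       # recurse on the left half
            lo = i                       # loop on the right half

    sort(0, len(a), 2 * max(1, int(math.log2(max(len(a), 2)))))
    return a

print(introsort([5, 3, 8, 1, 9, 2, 7]))  # [1, 2, 3, 5, 7, 8, 9]
```

The usual depth budget of 2·log2(n) means well-behaved inputs never hit the fallback, so average-case quicksort performance is unchanged.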
Re: [Numpy-discussion] Sorting refactor
On 16.01.2015 12:33, Lars Buitinck wrote:

* Quicksort has a quadratic-time worst case. I think it should be turned into an introsort [1] for O(n log n) worst case; we have the heapsort needed to do that.

This probably rarely happens in numeric data, and we do have guaranteed n log n runtime algorithms available. But it also is not costly to do; e.g., the selection code is an introselect instead of a normal quickselect. I'd say not high priority, but if someone wants to do it I don't see why not.

* Using optimal sorting networks instead of insertion sort as the base case can speed up quicksort on float arrays by 5-10%, but only if NaNs are moved out of the way first so that comparisons become cheaper [3]. This has consequences for the selection algorithms that I haven't fully worked out yet.

I was also thinking about this; an advantage of a sorting network is that it can be vectorized to be significantly faster than an insertion sort. Handling NaNs should also be directly possible. The issue is that it's probably too much complicated code for only a very small gain.

* Using Cilk Plus to parallelize merge sort can make it significantly faster than quicksort at the expense of only a few lines of code, but I haven't checked whether Cilk Plus plays nicely with multiprocessing and other parallelism options (remember the trouble with OpenMP-ified OpenBLAS?).

You should also be able to do this with OpenMP tasks, though it will be a little less efficient, as Cilk+ has a better scheduler for this type of work. But I assume you will get the same trouble as OpenMP, though that needs testing; also, Cilk+ in gcc is not really production-ready yet, I got lots of crashes when I last tried it (it will be in 5.0 though).

[1] https://en.wikipedia.org/wiki/Introsort
[2] http://www.sorting-algorithms.com/static/QuicksortIsOptimal.pdf
[3] https://github.com/larsmans/numpy/tree/sorting-nets
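The vectorization advantage Julian mentions comes from the fact that a sorting network's compare-exchanges are branch-free and data-independent, so many small sorts can run in lockstep. A numpy sketch using the optimal 5-comparator network for 4 inputs (the C version would use SIMD min/max instead of np.minimum/np.maximum):

```python
import numpy as np

# Optimal 4-input sorting network: 5 compare-exchange operations.
NETWORK4 = [(0, 1), (2, 3), (0, 2), (1, 3), (1, 2)]

def sort4_batched(a):
    """Sort each row of an (n, 4) array with a sorting network.
    Every compare-exchange applies to all rows at once, branch-free --
    the vectorization advantage over insertion sort."""
    a = a.copy()
    for i, j in NETWORK4:
        lo = np.minimum(a[:, i], a[:, j])
        hi = np.maximum(a[:, i], a[:, j])
        a[:, i], a[:, j] = lo, hi
    return a

batch = np.random.rand(1000, 4)
out = sort4_batched(batch)
print(np.all(out[:, :-1] <= out[:, 1:]))  # True: every row is sorted
```

Note that np.minimum/np.maximum propagate NaNs rather than ordering them, which is exactly why Lars's branch moves NaNs out of the way before the network runs.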
Re: [Numpy-discussion] Sorting refactor
I don't know if there is a general consensus or guideline on these matters, but I am personally not entirely charmed by the use of behind-the-scenes parallelism, unless explicitly requested. Perhaps an algorithm can be made faster, but often these multicore algorithms are also less efficient, and a less data-dependent way of putting my cores to good use would have been preferable. Potentially, other code could slow down due to cache thrashing if too many tasks run in parallel. I'd rather be in charge of such matters myself; but I imagine adding a keyword arg for these matters would not be much trouble?

On Fri, Jan 16, 2015 at 12:43 PM, Julian Taylor jtaylor.deb...@googlemail.com wrote: [...]
Re: [Numpy-discussion] Sorting refactor
On 16 January 2015 at 13:15, Eelco Hoogendoorn hoogendoorn.ee...@gmail.com wrote:

Perhaps an algorithm can be made faster, but often these multicore algorithms are also less efficient, and a less data-dependent way of putting my cores to good use would have been preferable. Potentially, other code could slow down due to cache thrashing if too many tasks run in parallel. I'd rather be in charge of such matters myself; but I imagine adding a keyword arg for these matters would not be much trouble?

As I understand it, that is where the strength of Cilk+ lies. It does not force parallelisation, it just suggests it. The decision to actually spawn in parallel is made at runtime depending on the load of the other cores.

/David.
Re: [Numpy-discussion] Sorting refactor
On Fri, Jan 16, 2015 at 4:19 AM, Matthew Brett matthew.br...@gmail.com wrote:

Hi,

On Fri, Jan 16, 2015 at 5:24 AM, Jaime Fernández del Río jaime.f...@gmail.com wrote:

Hi all, I have been taking a deep look at the sorting functionality in numpy, and I think it could use a face lift in the form of a big code refactor, to get rid of some of the ugliness in the code and make it easier to maintain. What I have in mind basically amounts to:

1. Refactor _new_argsortlike to get rid of code duplication (there are two branches, one with buffering, one without, virtually identical, that could be merged into a single one).
2. Modify _new_argsortlike so that it can properly handle byte-swapped inputs of any dtype, see gh-5441.
3. Add proper handling of types with references, in preparation for the rest of the changes.
4. Add three functions to the npy_sort library: npy_aquicksort, npy_aheapsort, npy_amergesort, with a signature compatible with PyArray_ArgSortFunc, i.e. (char* start, npy_intp* result, npy_intp length, PyArrayObject *arr). These turn out to be almost identical to the string and unicode sort functions, but using the dtype's compare function to handle comparisons.
5. Modify PyArray_ArgSort (and PyArray_ArgPartition) to always call _new_argsortlike, even when there is no type-specific argsort function, by using the newly added npy_axxx functions. This simplifies PyArray_ArgSort a lot, gets rid of some of the global-variable ugliness in the current code, and makes argsorting over a non-contiguous axis more memory efficient.
6. Refactor _new_sortlike similarly to _new_argsortlike.
7. Modify the npy_quicksort, npy_mergesort and npy_heapsort functions in npy_sort to have a signature compatible with PyArray_SortFunc, i.e. (char* start, npy_intp length, PyArrayObject *arr). npy_quicksort will no longer rely on libc's qsort, but will be very similar to the string or unicode quicksort functions.
8. Modify PyArray_Sort (and PyArray_Partition) to always call _new_sortlike, similarly to what was done with PyArray_ArgSort. This allows completing the removal of the remaining global-variable ugliness, with similar benefits as for argsort before.

These changes will make it easier for me to add a Timsort generic-type function to numpy's arsenal of sorting routines. And I think they have value by cleaning the source code on their own. So my questions, mostly to the poor souls that will have to code review changes to several hundred lines of code:

- Does this make sense, or is it better left alone? A subset of 1, 2 and 5 is a must to address the issues in gh-5441; the rest could arguably be left as is.
- Would you rather see it submitted as one ginormous PR? Or split into 4 or 5 incremental ones?

Do you think it would be possible to split this into several PRs, with the initial one being the refactoring, and the subsequent ones being additions to sorting functionality?

Just to be clear, nothing in the long list of changes I posted earlier is truly a change in the sorting functionality, except in the case of quicksort (and argquicksort) for generic types, which would no longer rely on qsort but on our own implementation of it, just like every other sort in numpy. So yes, the refactor PR can precede new functionality. And it can even be split into 3 or 4 incremental PRs, to make the reviewers' lives easier. Since it is most likely Charles and/or Julian that are going to have to swallow the review pill, I'd like to hear from them how they would like it better.

Jaime

-- (\__/) ( O.o) ( ) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus planes de dominación mundial.
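The npy_aquicksort/npy_aheapsort/npy_amergesort functions described above sort an index array while comparing elements only through the dtype's compare slot. The shape of that idea in Python terms (a sketch of the concept, not numpy's actual C code; the name generic_argsort is made up here):

```python
def generic_argsort(data, compare):
    """Return a list of indices that sorts `data`, using only an opaque
    compare(a, b) -> negative/zero/positive function -- the way the
    generic npy_axxx sorts would use a dtype's compare function."""
    result = list(range(len(data)))

    def quicksort(lo, hi):
        if hi - lo <= 1:
            return
        pivot = data[result[(lo + hi) // 2]]
        i, j = lo, hi - 1
        while i <= j:
            # All element access is indirect, through the index array.
            while compare(data[result[i]], pivot) < 0: i += 1
            while compare(data[result[j]], pivot) > 0: j -= 1
            if i <= j:
                result[i], result[j] = result[j], result[i]
                i += 1
                j -= 1
        quicksort(lo, j + 1)
        quicksort(i, hi)

    quicksort(0, len(data))
    return result

idx = generic_argsort([30, 10, 20], lambda a, b: a - b)
print(idx)  # [1, 2, 0]
```

Only the index array is ever permuted, which is why the same machinery also makes argsorting over a non-contiguous axis cheaper: the data itself never moves.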
Re: [Numpy-discussion] NumPy-Discussion Digest, Vol 100, Issue 28
2015-01-16 13:29 GMT+01:00 numpy-discussion-requ...@scipy.org:

Julian Taylor jtaylor.deb...@googlemail.com wrote:

On 16.01.2015 12:33, Lars Buitinck wrote:

* Quicksort has a quadratic-time worst case. I think it should be turned into an introsort [1] for O(n log n) worst case; we have the heapsort needed to do that.

This probably rarely happens in numeric data, and we do have guaranteed n log n runtime algorithms available.

It's no more likely or unlikely than in any other type of data (AFAIK), but it's possible for an adversary to DOS any (web server) code that uses np.sort.

I was also thinking about this; an advantage of a sorting network is that it can be vectorized to be significantly faster than an insertion sort. Handling NaNs should also be directly possible.

Tried that, and it didn't give any speedup, at least not without explicit vector instructions. Just moving the NaNs aside didn't cost anything in my preliminary benchmarks (without sorting nets); the cost of the operation was almost exactly compensated by simpler comparisons.

The issue is that it's probably too much complicated code for only a very small gain.

Maybe. The thing is that the selection algorithms are optimized for NaNs and seem to depend on the current comparison code. We'd need distinct TYPE_LT and TYPE_LT_NONAN for each TYPE. The sorting nets themselves aren't complicated, just lengthy. My branch has the length-optimal (not optimally parallel) ones for n <= 16.

But I assume you will get the same trouble as OpenMP, but that needs testing; also Cilk+ in gcc is not really production-ready yet, I got lots of crashes when I last tried it (it will be in 5.0 though).
The data-parallel constructs tend to crash the compiler, but task spawning seems to be stable in 4.9.2. I've still to see how it handles multiprocessing/fork. What do you mean by "will be in 5.0", did they do a big push?

Daπid davidmen...@gmail.com wrote:

As I understand it, that is where the strength of Cilk+ lies. It does not force parallelisation, it just suggests it. The decision to actually spawn in parallel is decided at runtime depending on the load of the other cores.

Cilk+ guarantees that the amount of space used by a pool of P threads is at most P times the stack space used by the sequential version (plus a constant). The idea is that you can say

for (i = 0; i < 100; i++) { cilk_spawn f(a[i]); }

and it will never create more than P work items in memory, rather than 1e6, even if each f() spawns a bunch itself. Of course, it won't guarantee that OpenMP will not also spawn P threads, and/or check that you're one of P processes cooperating on a task using multiprocessing. Personally I'd rather have an np.setnumthreads() to turn this on or off for a process and have the runtime distribute work for me instead of having to do it myself.
Re: [Numpy-discussion] Sorting refactor
2015-01-16 15:14 GMT+01:00 numpy-discussion-requ...@scipy.org:

Jaime Fernández del Río jaime.f...@gmail.com wrote:

Most of my proposed original changes do not affect the core sorting functionality, just the infrastructure around it. But if we agree that sorting has potential for being an actively developed part of the code base, then cleaning up its surroundings for clarity makes sense, so I'm taking your brain dump as an aye for my proposal. ;-)

It is!
Re: [Numpy-discussion] Sorting refactor
I agree; an np.setnumthreads to manage a numpy-global threadpool makes sense to me. Of course there are a great many cases where just spawning as many threads as cores is a sensible default, but if this kind of behavior could not be overridden, I could see that greatly reducing performance for some of my more complex projects.

On Fri, Jan 16, 2015 at 4:11 PM, Julian Taylor jtaylor.deb...@googlemail.com wrote:

On 01/16/2015 03:14 PM, Lars Buitinck wrote:

It's no more likely or unlikely than in any other type of data (AFAIK), but it's possible for an adversary to DOS any (web server) code that uses np.sort.

If you are using numpy where an arbitrary user is allowed to control the data passed to a non-isolated environment, you have a problem anyway. numpy is far from secure software, and there are likely hundreds of ways to produce DOS and dozens of ways to do code execution in any nontrivial numpy-using application.

Tried that, and it didn't give any speedup, at least not without explicit vector instructions.
Just moving the NaNs aside didn't cost anything in my preliminary benchmarks (without sorting nets); the cost of the operation was almost exactly compensated by simpler comparisons.

An SSE2 implementation of a 16-entry bitonic sort is available here: https://github.com/mischasan/sse2/blob/master/ssesort.c There is also a benchmark; on my machine it's 6 times faster than insertion sort. But again, this would only gain us a 5-10% improvement at best, as the partition part of quicksort is still the major time-consuming part.

Maybe. The thing is that the selection algorithms are optimized for NaNs and seem to depend on the current comparison code. We'd need distinct TYPE_LT and TYPE_LT_NONAN for each TYPE. The sorting nets themselves aren't complicated, just lengthy. My branch has the length-optimal (not optimally parallel) ones for n <= 16.

The data-parallel constructs tend to crash the compiler, but task spawning seems to be stable in 4.9.2. I've still to see how it handles multiprocessing/fork. What do you mean by "will be in 5.0", did they do a big push?

gcc 5.0's changelog reports full support for Cilk Plus. Also, all the bugs I have filed have been fixed in 5.0.
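What makes a 16-entry bitonic sort SSE-friendly is its regular stage structure: within each stage every compare-exchange is independent, so they map onto vector min/max instructions. The canonical scalar algorithm, as a Python sketch for any power-of-two length:

```python
def bitonic_sort(a):
    """In-place bitonic sort; len(a) must be a power of two. The inner
    loop's compare-exchanges are data-independent, which is the property
    the SSE2 implementation linked above exploits with vector min/max."""
    n = len(a)
    k = 2
    while k <= n:               # size of the bitonic runs being merged
        j = k >> 1
        while j >= 1:           # compare-exchange distance within a stage
            for i in range(n):
                l = i ^ j
                if l > i:
                    # direction of the compare alternates between k-blocks
                    if ((i & k) == 0 and a[i] > a[l]) or \
                       ((i & k) != 0 and a[i] < a[l]):
                        a[i], a[l] = a[l], a[i]
            j >>= 1
        k <<= 1
    return a

print(bitonic_sort([7, 3, 15, 0, 9, 1, 12, 5]))  # [0, 1, 3, 5, 7, 9, 12, 15]
```

Note the contrast with the length-optimal networks in Lars's branch: bitonic networks use more comparators than optimal ones, but their regularity is what makes the vectorized version simple.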
Re: [Numpy-discussion] Sorting refactor
On Fri, Jan 16, 2015 at 7:11 AM, Jaime Fernández del Río jaime.f...@gmail.com wrote:

On Fri, Jan 16, 2015 at 3:33 AM, Lars Buitinck larsm...@gmail.com wrote:

* The code is generally hard to read because it prefers pointers over indices. I'm wondering if it would get slower using indices. The closer these algorithms are to the textbook, the easier it is to insert fancy optimizations.

They are harder to read, but so cute to look at! C code just wouldn't feel the same without some magical pointer arithmetic thrown in here and there. ;-)

Pointers were faster than indexing. That advantage can be hardware dependent, but for small numbers of pointers it is typical.

* The heap sort exploits undefined behavior by using a pointer that points before the start of the array. However, rewriting it to always point within the array made it slower. I haven't tried rewriting it using indices.

Fortran uses the same pointer trick for one-based indexing, or at least the old DEC compilers did ;) There is no reason to avoid it.

* Quicksort has a quadratic-time worst case. I think it should be turned into an introsort [1] for O(n log n) worst case; we have the heapsort needed to do that.
* Quicksort is robust to repeated elements, but doesn't exploit them.
It can be made to run in linear time if the input array has only O(1) distinct elements [2]. This may come at the expense of some performance on arrays with no repeated elements. Java famously changed its library implementation of quicksort to a dual pivot one invented by Vladimir Yaroslavskiy[1], they claim that with substantial performance gains. I tried to implement that for numpy [2], but couldn't get it to work any faster than the current code. For sorting, simple often beats fancy. * Using optimal sorting networks instead of insertion sort as the base case can speed up quicksort on float arrays by 5-10%, but only if NaNs are moved out of the way first so that comparisons become cheaper [3]. This has consequences for the selection algorithms that I haven't fully worked out yet. I expect the gains here would be for small sorts, which tend to be dominated by call overhead. Even if we stick with selection sort, we should spin it off into an inline smallsort function within the npy_sort library, and have quicksort and mergesort call the same function, instead of each implementing their own. It would make optimizations like the sorting networks easier to implement for all sorts. We could even expose it outside npy_sort, as there are a few places around the code base that have ad-hoc implementations of sorting. Good idea, I've thought of doing it myself. * Using Cilk Plus to parallelize merge sort can make it significantly faster than quicksort at the expense of only a few lines of code, but I haven't checked whether Cilk Plus plays nicely with multiprocessing and other parallelism options (remember the trouble with OpenMP-ified OpenBLAS?). This isn't really an answer to your questions, more like a brain dump from someone who's stared at the same code for a while and did some experiments. I'm not saying we should implement all of this, but keep in mind that there are some interesting options besides implementing timsort. 
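The introsort suggestion above is easy to prototype: run quicksort with a recursion-depth budget of roughly 2·log2(n), and when the budget runs out, finish the partition with heapsort, which caps the worst case at O(n log n). A hedged pure-Python sketch (illustration only; numpy's real implementation would be in C, and this fallback copies the partition rather than heapsorting in place):

```python
import heapq
import math
import random

def introsort(a):
    """Sort the list a in place: quicksort with a depth budget,
    falling back to heapsort to guarantee O(n log n) worst case."""
    max_depth = 2 * max(1, int(math.log2(len(a) + 1)))
    _introsort(a, 0, len(a), max_depth)

def _introsort(a, lo, hi, depth):
    if hi - lo <= 16:
        # Small partitions: plain insertion sort (the usual base case).
        for i in range(lo + 1, hi):
            x, j = a[i], i
            while j > lo and a[j - 1] > x:
                a[j] = a[j - 1]
                j -= 1
            a[j] = x
        return
    if depth == 0:
        # Depth budget exhausted: heapsort this partition.
        heap = a[lo:hi]
        heapq.heapify(heap)
        for i in range(lo, hi):
            a[i] = heapq.heappop(heap)
        return
    # Median-of-three pivot, Hoare-style partition.
    mid = (lo + hi) // 2
    pivot = sorted((a[lo], a[mid], a[hi - 1]))[1]
    i, j = lo, hi - 1
    while i <= j:
        while a[i] < pivot:
            i += 1
        while a[j] > pivot:
            j -= 1
        if i <= j:
            a[i], a[j] = a[j], a[i]
            i += 1
            j -= 1
    _introsort(a, lo, j + 1, depth - 1)
    _introsort(a, i, hi, depth - 1)

data = [random.randrange(100) for _ in range(1000)]
expected = sorted(data)
introsort(data)
assert data == expected
```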
> Timsort came up in a discussion several months ago, where I proposed adding a mergesorted function (which I have mostly ready, by the way, [3]) to speed up some operations in arraysetops. I have serious doubts that it will perform comparably to the other sorts unless comparisons are terribly expensive, which they typically aren't in numpy, but it has been an interesting learning exercise so far, and I don't mind taking it all the way.
>
> Most of my proposed original changes do not affect the core sorting functionality, just the infrastructure around it. But if we agree that sorting has potential for being an actively developed part of the code base, then cleaning up its surroundings for clarity makes sense, so I'm taking your brain dump as an aye for my proposal. ;-)

I have a generic quicksort with a standard interface sitting around somewhere in an ancient branch. Sorting objects needs to be sensitive to comparison exceptions, which is something to keep in mind. I'd also like to
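For context on the mergesorted idea: once both inputs are already sorted, arraysetops-style operations reduce to a single O(n+m) merge pass instead of sorting the concatenation. A hedged pure-Python sketch of that pass (the names merge_sorted and intersect_sorted are hypothetical illustrations, not the proposed numpy API):

```python
def merge_sorted(a, b):
    """Merge two already-sorted sequences in O(len(a) + len(b))."""
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    # One side is exhausted; the other is already sorted.
    out.extend(a[i:])
    out.extend(b[j:])
    return out

def intersect_sorted(a, b):
    """Intersection of two sorted sequences of unique values --
    the kind of pass intersect1d could use without re-sorting."""
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

assert merge_sorted([1, 3, 5], [2, 3, 6]) == [1, 2, 3, 3, 5, 6]
assert intersect_sorted([1, 3, 5, 7], [3, 4, 7]) == [3, 7]
```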