Re: [Numpy-discussion] Does a `mergesorted` function make sense?
Re: [Numpy-discussion] Does a `mergesorted` function make sense? On 02/09/2014 8:40 PM, Charles R Harris wrote: On Mon, Sep 1, 2014 at 7:58 AM, Eelco Hoogendoorn hoogendoorn.ee...@gmail.com wrote: On Mon, Sep 1, 2014 at 2:05 PM, Charles R Harris charlesr.har...@gmail.com wrote: On Mon, Sep 1, 2014 at 1:49 AM, Eelco Hoogendoorn hoogendoorn.ee...@gmail.com wrote: Sure, I'd like to hash these things out, but I would also like some preliminary feedback as to whether this is going in a direction anyone else sees the point of, whether it conflicts with other plans, and indeed whether we can agree that numpy is the right place for it; a point which I would very much like to defend. If there is some obvious no-go that I'm missing, I can do without the drudgery of writing proper documentation ;). These are good issues that need to be discussed and resolved. Python has the benefit of having a BDFL. Numpy has no similar arrangement. In the post-numarray period, Travis Oliphant took that role and advanced the package in many ways. It seems that Travis is no longer available. There is a need for someone, or maybe some group of three, who would be guided by the sort of thinking Charles Harris sets out above, and who would make the final decision. Colin W. ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Does a `mergesorted` function make sense?
On Wed, Sep 3, 2014 at 4:07 AM, Jaime Fernández del Río jaime.f...@gmail.com wrote: On Tue, Sep 2, 2014 at 5:40 PM, Charles R Harris charlesr.har...@gmail.com wrote: What do you think about the suggestion of timsort? One would need to concatenate the arrays before sorting, but it should be fairly efficient. Timsort is very cool, and it would definitely be fun to implement in numpy. It is also a lot more work than merging two sorted arrays! I think +1 if someone else does it, but although I would love to be part of it, I am not sure I will be able to find time to look into it seriously in the next couple of months. From a setops point of view, merging two sorted arrays makes it very straightforward to compute, together with (or instead of) the result of the merge, index arrays that let you calculate things like `in1d` faster. Although perhaps an `argtimsort` could provide the same functionality, I am not sure. I will probably wrap up what I have, put a lace on it, and submit it as a PR. Even if it is not destined to be merged, it may serve as a warning to others. I have also been thinking lately that one of the problems with all these unique-related computations may be a case of having a hammer and seeing everything as nails. Perhaps the algorithm that needs to be ported from Python is not the sorting one, but the hash table... Jaime Not sure about the hashing. Indeed one can also build an index of a set by means of a hash table, but it's questionable whether this leads to improved performance over performing an argsort. Hashing may have better asymptotic time complexity in theory, but many datasets used in practice are very easy to sort (O(N)-ish), and the time constant of hashing is higher. But more importantly, using a hash table guarantees poor cache behavior for many operations using this index. By contrast, sorting may (but need not) make one random-access pass to build the index, and may (but need not) perform one random access to reorder values for grouping.
But insofar as the keys are better behaved than pure random, this coherence will be exploited. Also, getting the unique values/keys in sorted order is a nice side benefit for many applications.
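To make the sort-based approach described above concrete: a minimal sketch of grouping via a single stable argsort, in pure numpy. The helper name `group_by_sort` is illustrative only, not an existing numpy function; note that the unique keys come out in sorted order as a side benefit, as mentioned.

```python
import numpy as np

def group_by_sort(keys, values):
    """Sort-based grouping: one argsort of the keys, then each group
    is a contiguous slice of the reordered values. Illustrative sketch;
    assumes non-empty 1-D inputs of equal length."""
    order = np.argsort(keys, kind="mergesort")   # stable sort of the keys
    sorted_keys = keys[order]
    sorted_vals = values[order]                  # the one reorder pass mentioned above
    # boundaries between runs of equal keys
    boundaries = np.flatnonzero(np.diff(sorted_keys)) + 1
    unique_keys = sorted_keys[np.concatenate(([0], boundaries))]
    groups = np.split(sorted_vals, boundaries)
    return unique_keys, groups

keys = np.array([2, 1, 2, 0, 1, 2])
vals = np.array([10, 20, 30, 40, 50, 60])
uk, groups = group_by_sort(keys, vals)
# uk is [0, 1, 2]; groups are [40], [20, 50], [10, 30, 60]
```

If the keys are nearly sorted already, the argsort is cheap and the reorder pass touches memory almost sequentially, which is the cache-coherence argument being made.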
Re: [Numpy-discussion] Does a `mergesorted` function make sense?
On Wed, Sep 3, 2014 at 6:41 AM, Eelco Hoogendoorn hoogendoorn.ee...@gmail.com wrote: Not sure about the hashing. Indeed one can also build an index of a set by means of a hash table, but it's questionable whether this leads to improved performance over performing an argsort. Hashing may have better asymptotic time complexity in theory, but many datasets used in practice are very easy to sort (O(N)-ish), and the time constant of hashing is higher. But more importantly, using a hash table guarantees poor cache behavior for many operations using this index. By contrast, sorting may (but need not) make one random-access pass to build the index, and may (but need not) perform one random access to reorder values for grouping. But insofar as the keys are better behaved than pure random, this coherence will be exploited. If you want to give it a try, this branch of my numpy fork has hash table based implementations of unique (with no extra indices) and in1d: https://github.com/jaimefrio/numpy/tree/hash-unique A use case where the hash table is clearly better:

In [1]: import numpy as np
In [2]: from numpy.lib._compiled_base import _unique, _in1d
In [3]: a = np.random.randint(10, size=(1,))
In [4]: %timeit np.unique(a)
1000 loops, best of 3: 258 us per loop
In [5]: %timeit _unique(a)
1 loops, best of 3: 143 us per loop
In [6]: %timeit np.sort(_unique(a))
1 loops, best of 3: 149 us per loop

It typically performs between 1.5x and 4x faster than sorting. I haven't profiled it properly to know, but there may be quite a bit of performance to dig out: have type-specific comparison functions, optimize the starting hash table size based on the size of the array to avoid reinsertions...
If getting the elements sorted is a necessity, and the array contains very few or no repeated items, then the hash table approach may even perform worse:

In [8]: a = np.random.randint(1, size=(5000,))
In [9]: %timeit np.unique(a)
1000 loops, best of 3: 277 us per loop
In [10]: %timeit np.sort(_unique(a))
1000 loops, best of 3: 320 us per loop

But the hash table still wins in extracting the unique items only:

In [11]: %timeit _unique(a)
1 loops, best of 3: 187 us per loop

Where the hash table shines is in more elaborate situations. If you keep the first index where each value was found, and the number of repeats, in the hash table, you can get return_index and return_counts almost for free, which means you are performing an extra 3x faster than with sorting. return_inverse requires a little more trickery, so I won't attempt to quantify the improvement. But I wouldn't be surprised if, after fine tuning it, there is close to an order of magnitude overall improvement. The speed-up for in1d is also nice:

In [16]: a = np.random.randint(1000, size=(1000,))
In [17]: b = np.random.randint(1000, size=(500,))
In [18]: %timeit np.in1d(a, b)
1000 loops, best of 3: 178 us per loop
In [19]: %timeit _in1d(a, b)
1 loops, best of 3: 30.1 us per loop

Of course, there is no point in
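A pure-Python sketch of why return_index and return_counts come almost for free from a single hash-table pass (the branch above implements this in C with an open-addressing table; the dict-based version here is only illustrative and relies on Python dicts preserving insertion order):

```python
def hash_unique(a):
    """One pass over the data; the table maps each value to a pair
    (first index seen, running count). Illustrative sketch only, not
    the actual implementation in the linked branch."""
    table = {}
    for i, x in enumerate(a):
        if x in table:
            first, count = table[x]
            table[x] = (first, count + 1)
        else:
            table[x] = (i, 1)
    # unique values in first-seen order; sort afterwards if needed
    uniques = list(table)
    first_index = [table[x][0] for x in uniques]
    counts = [table[x][1] for x in uniques]
    return uniques, first_index, counts

u, idx, cnt = hash_unique([3, 1, 3, 2, 1, 3])
# u is [3, 1, 2], idx is [0, 1, 3], cnt is [3, 2, 1]
```

The extra bookkeeping is a constant amount of work per element, which is why the index and count outputs cost so little on top of the basic unique extraction.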
Re: [Numpy-discussion] Does a `mergesorted` function make sense?
On Wed, Sep 3, 2014 at 9:33 AM, Jaime Fernández del Río jaime.f...@gmail.com wrote: [snip: previous message quoted in full] Ooops!!! Hit the send button too quick. Not to extend myself too long: if we are going to rethink all of this, we should approach it with an open mind. Still, and this post is not helping with that either, I am afraid that we are discussing implementation details, but are missing a broader vision of what we want to accomplish and why. That vision of what numpy's grouping functionality, if any, should be, and how it complements or conflicts with what pandas is providing, should precede anything else. I know I haven't, but has anyone looked at how pandas implements grouping?
Their documentation on the subject is well worth a read: http://pandas.pydata.org/pandas-docs/stable/groupby.html Does numpy need to replicate this? What/why/how can we add to that? Jaime -- (\__/) ( O.o) ( ) This is Bunny. Copy Bunny into your signature and help him with his plans for world domination.
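For reference, the split-apply-combine interface documented on that page looks like this (standard pandas API; the data is made up for illustration):

```python
import pandas as pd

# A small frame with a grouping key and a value column
df = pd.DataFrame({"key": ["a", "b", "a", "b", "a"],
                   "val": [1, 2, 3, 4, 5]})

# Split rows by key, apply a reduction to each group, combine the results
sums = df.groupby("key")["val"].sum()
# sums["a"] is 9, sums["b"] is 6
```

Internally pandas also relies on hash tables (via its own factorize machinery) to map keys to group labels, which is one data point for the hash-table side of this discussion.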
Re: [Numpy-discussion] Does a `mergesorted` function make sense?
On Wed, Sep 3, 2014 at 5:26 AM, cjw c...@ncf.ca wrote: These are good issues that need to be discussed and resolved. Python has the benefit of having a BDFL. Numpy has no similar arrangement. In the post-numarray period, Travis Oliphant took that role and advanced the package in many ways. It seems that Travis is no longer available. There is a need for someone, or maybe some group of three, who would be guided by the sort of thinking Charles Harris sets out above, and who would make the final decision. We should crown Charles philosopher king, and let him wisely rule over us with the aid of his aristocracy of devs with merge rights. We would make Plato proud! :-) Jaime
Re: [Numpy-discussion] Does a `mergesorted` function make sense?
On Wed, Sep 3, 2014 at 10:53 AM, Jaime Fernández del Río jaime.f...@gmail.com wrote: [snip: cjw's message quoted in full] We should crown Charles philosopher king, and let him wisely rule over us with the aid of his aristocracy of devs with merge rights. We would make Plato proud! :-) Charles I needs an heir in case of execution by the Parliamentary forces. Chuck
[Numpy-discussion] odd (?) behavior: negative integer scalar in exponent
What should be the value of `2**np.int_(-32)`? It is apparently currently computed as `1. / (2**np.int_(32))`, so the computation overflows (when a C long is 32 bits). I would have hoped for it to be computed as `1./(2.**np.int_(32))`. Cheers, Alan Isaac
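A quick illustration of a workaround until this is resolved: forcing a float base makes the power be computed entirely in floating point, so no integer intermediate can overflow. (The behavior of the integer-base expression itself is platform-dependent, per the report above, so it is not shown here.)

```python
import numpy as np

# Float base: the exponentiation happens in floating point throughout,
# avoiding the overflowing integer intermediate 2**32 on 32-bit longs.
safe = 2.0 ** np.int_(-32)
expected = 1.0 / 2.0 ** 32   # the value the report hopes for
# safe == expected, i.e. 2**-32 as an exactly representable float
```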
[Numpy-discussion] Give Jaime Fernandez commit rights.
Hi All, I'd like to give Jaime commit rights. Having at least three active developers with commit rights is the goal, and Jaime has been pretty consistent with code submissions and discussion participation. Thoughts? Chuck
Re: [Numpy-discussion] Give Jaime Fernandez commit rights.
On Wed, Sep 3, 2014 at 10:47 PM, Charles R Harris charlesr.har...@gmail.com wrote: Hi All, I'd like to give Jaime commit rights. Having at least three active developers with commit rights is the goal and Jaime has been pretty consistent with code submissions and discussion participation. +1 -- Robert Kern
Re: [Numpy-discussion] Give Jaime Fernandez commit rights.
On Wed, Sep 3, 2014 at 11:48 PM, Robert Kern robert.k...@gmail.com wrote: On Wed, Sep 3, 2014 at 10:47 PM, Charles R Harris charlesr.har...@gmail.com wrote: Hi All, I'd like to give Jaime commit rights. Having at least three active developers with commit rights is the goal and Jaime has been pretty consistent with code submissions and discussion participation. +1 +1 excellent idea Ralf
Re: [Numpy-discussion] odd (?) behavior: negative integer scalar in exponent
On Wed, Sep 3, 2014 at 3:19 PM, Alan G Isaac alan.is...@gmail.com wrote: What should be the value of `2**np.int_(-32)`? It is apparently currently computed as `1. / (2**np.int_(32))`, so the computation overflows (when a C long is 32 bits). I would have hoped for it to be computed as `1./(2.**np.int_(32))`. Looks like a bug to me. Chuck
Re: [Numpy-discussion] Give Jaime Fernandez commit rights.
+1; though I am relatively new to the scene, Jaime's contributions have always stood out to me as thoughtful. On Thu, Sep 4, 2014 at 12:42 AM, Ralf Gommers ralf.gomm...@gmail.com wrote: [snip: earlier +1s quoted in full] +1 excellent idea Ralf
Re: [Numpy-discussion] Give Jaime Fernandez commit rights.
On Wed, Sep 3, 2014 at 5:48 PM, Eelco Hoogendoorn hoogendoorn.ee...@gmail.com wrote: +1; though I am relatively new to the scene, Jaime's contributions have always stood out to me as thoughtful. [snip: earlier +1s quoted in full] I think the ayes will have it. Chuck
Re: [Numpy-discussion] Give Jaime Fernandez commit rights.
On Wed, Sep 3, 2014 at 5:47 PM, Charles R Harris charlesr.har...@gmail.com wrote: I think the ayes will have it. As I told Chuck (because I now get to call Charles Chuck, right? :-)), I am not sure I am fully qualified for the job: looking at the names on that list is a humbling experience. But even if I am the idiot brother, it feels good to be part of the family. Numpy has provided me with countless hours of learning and enjoyment, and I really look forward to giving back, even if only a fraction of that. Thanks a lot for the trust! Jaime