Re: [Numpy-discussion] Does a `mergesorted` function make sense?
On Wed, Sep 3, 2014 at 6:46 PM, Jaime Fernández del Río jaime.f...@gmail.com wrote: On Wed, Sep 3, 2014 at 9:33 AM, Jaime Fernández del Río jaime.f...@gmail.com wrote: On Wed, Sep 3, 2014 at 6:41 AM, Eelco Hoogendoorn hoogendoorn.ee...@gmail.com wrote: Not sure about the hashing. Indeed one can also build an index of a set by means of a hash table, but its questionable if this leads to improved performance over performing an argsort. Hashing may have better asymptotic time complexity in theory, but many datasets used in practice are very easy to sort (O(N)-ish), and the time-constant of hashing is higher. But more importantly, using a hash-table guarantees poor cache behavior for many operations using this index. By contrast, sorting may (but need not) make one random access pass to build the index, and may (but need not) perform one random access to reorder values for grouping. But insofar as the keys are better behaved than pure random, this coherence will be exploited. If you want to give it a try, these branch of my numpy fork has hash table based implementations of unique (with no extra indices) and in1d: https://github.com/jaimefrio/numpy/tree/hash-unique A use cases where the hash table is clearly better: In [1]: import numpy as np In [2]: from numpy.lib._compiled_base import _unique, _in1d In [3]: a = np.random.randint(10, size=(1,)) In [4]: %timeit np.unique(a) 1000 loops, best of 3: 258 us per loop In [5]: %timeit _unique(a) 1 loops, best of 3: 143 us per loop In [6]: %timeit np.sort(_unique(a)) 1 loops, best of 3: 149 us per loop It typically performs between 1.5x and 4x faster than sorting. I haven't profiled it properly to know, but there may be quite a bit of performance to dig out: have type specific comparison functions, optimize the starting hash table size based on the size of the array to avoid reinsertions... If getting the elements sorted is a necessity, and the array contains very few or no repeated items, then the hash table approach may even perform worse,: In [8]: a = np.random.randint(1, size=(5000,)) In [9]: %timeit np.unique(a) 1000 loops, best of 3: 277 us per loop In [10]: %timeit np.sort(_unique(a)) 1000 loops, best of 3: 320 us per loop But the hash table still wins in extracting the unique items only: In [11]: %timeit _unique(a) 1 loops, best of 3: 187 us per loop Where the hash table shines is in more elaborate situations. If you keep the first index where it was found, and the number of repeats, in the hash table, you can get return_index and return_counts almost for free, which means you are performing an extra 3x faster than with sorting. return_inverse requires a little more trickery, so I won;t attempt to quantify the improvement. But I wouldn't be surprised if, after fine tuning it, there is close to an order of magnitude overall improvement The spped-up for in1d is also nice: In [16]: a = np.random.randint(1000, size=(1000,)) In [17]: b = np.random.randint(1000, size=(500,)) In [18]: %timeit np.in1d(a, b) 1000 loops, best of 3: 178 us per loop In [19]: %timeit _in1d(a, b) 1 loops, best of 3: 30.1 us per loop Of course, there is no point in Ooops!!! Hit the send button too quick. Not to extend myself too long: if we are going to rethink all of this, we should approach it with an open mind. Still, and this post is not helping with that either, I am afraid that we are discussing implementation details, but are missing a broader vision of what we want to accomplish and why. That vision of what numpy's grouping functionality, if any, should be, and how it complements or conflicts with what pandas is providing, should precede anything else. I now I haven't, but has anyone looked at how pandas implements grouping? Their documentation on the subject is well worth a read: http://pandas.pydata.org/pandas-docs/stable/groupby.html Does numpy need to replicate this? What/why/how can we add to that? Jaime -- (\__/) ( O.o) ( ) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus planes de dominación mundial. ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion I would certainly not be opposed to having a hashing based indexing mechanism; I think it would make sense design-wise to have a HashIndex class with the same interface as the rest, and use that subclass in those arraysetops where it makes sense. The 'how to' of indexing and its applications are largely orthogonal I think (with some tiny performance compromises which are worth the abstraction imo). For datasets which are not purely random, have many unique items, and which do not fit into cache, I would expect sorting to come out on top, but indeed it depends on the dataset. Yeah, the question how pandas does grouping, and whether we can do better, is a relevant one. From what
Re: [Numpy-discussion] Does a `mergesorted` function make sense?
On Thu, Sep 4, 2014 at 10:31 AM, Eelco Hoogendoorn hoogendoorn.ee...@gmail.com wrote: On Wed, Sep 3, 2014 at 6:46 PM, Jaime Fernández del Río jaime.f...@gmail.com wrote: On Wed, Sep 3, 2014 at 9:33 AM, Jaime Fernández del Río jaime.f...@gmail.com wrote: On Wed, Sep 3, 2014 at 6:41 AM, Eelco Hoogendoorn hoogendoorn.ee...@gmail.com wrote: Not sure about the hashing. Indeed one can also build an index of a set by means of a hash table, but its questionable if this leads to improved performance over performing an argsort. Hashing may have better asymptotic time complexity in theory, but many datasets used in practice are very easy to sort (O(N)-ish), and the time-constant of hashing is higher. But more importantly, using a hash-table guarantees poor cache behavior for many operations using this index. By contrast, sorting may (but need not) make one random access pass to build the index, and may (but need not) perform one random access to reorder values for grouping. But insofar as the keys are better behaved than pure random, this coherence will be exploited. If you want to give it a try, these branch of my numpy fork has hash table based implementations of unique (with no extra indices) and in1d: https://github.com/jaimefrio/numpy/tree/hash-unique A use cases where the hash table is clearly better: In [1]: import numpy as np In [2]: from numpy.lib._compiled_base import _unique, _in1d In [3]: a = np.random.randint(10, size=(1,)) In [4]: %timeit np.unique(a) 1000 loops, best of 3: 258 us per loop In [5]: %timeit _unique(a) 1 loops, best of 3: 143 us per loop In [6]: %timeit np.sort(_unique(a)) 1 loops, best of 3: 149 us per loop It typically performs between 1.5x and 4x faster than sorting. I haven't profiled it properly to know, but there may be quite a bit of performance to dig out: have type specific comparison functions, optimize the starting hash table size based on the size of the array to avoid reinsertions... If getting the elements sorted is a necessity, and the array contains very few or no repeated items, then the hash table approach may even perform worse,: In [8]: a = np.random.randint(1, size=(5000,)) In [9]: %timeit np.unique(a) 1000 loops, best of 3: 277 us per loop In [10]: %timeit np.sort(_unique(a)) 1000 loops, best of 3: 320 us per loop But the hash table still wins in extracting the unique items only: In [11]: %timeit _unique(a) 1 loops, best of 3: 187 us per loop Where the hash table shines is in more elaborate situations. If you keep the first index where it was found, and the number of repeats, in the hash table, you can get return_index and return_counts almost for free, which means you are performing an extra 3x faster than with sorting. return_inverse requires a little more trickery, so I won;t attempt to quantify the improvement. But I wouldn't be surprised if, after fine tuning it, there is close to an order of magnitude overall improvement The spped-up for in1d is also nice: In [16]: a = np.random.randint(1000, size=(1000,)) In [17]: b = np.random.randint(1000, size=(500,)) In [18]: %timeit np.in1d(a, b) 1000 loops, best of 3: 178 us per loop In [19]: %timeit _in1d(a, b) 1 loops, best of 3: 30.1 us per loop Of course, there is no point in Ooops!!! Hit the send button too quick. Not to extend myself too long: if we are going to rethink all of this, we should approach it with an open mind. Still, and this post is not helping with that either, I am afraid that we are discussing implementation details, but are missing a broader vision of what we want to accomplish and why. That vision of what numpy's grouping functionality, if any, should be, and how it complements or conflicts with what pandas is providing, should precede anything else. I now I haven't, but has anyone looked at how pandas implements grouping? Their documentation on the subject is well worth a read: http://pandas.pydata.org/pandas-docs/stable/groupby.html Does numpy need to replicate this? What/why/how can we add to that? Jaime -- (\__/) ( O.o) ( ) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus planes de dominación mundial. ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion I would certainly not be opposed to having a hashing based indexing mechanism; I think it would make sense design-wise to have a HashIndex class with the same interface as the rest, and use that subclass in those arraysetops where it makes sense. The 'how to' of indexing and its applications are largely orthogonal I think (with some tiny performance compromises which are worth the abstraction imo). For datasets which are not purely random, have many unique items, and which do not fit into cache, I would expect sorting to come out on top, but indeed it depends on the dataset.
Re: [Numpy-discussion] Does a `mergesorted` function make sense?
On Thu, Sep 4, 2014 at 10:39 AM, Eelco Hoogendoorn hoogendoorn.ee...@gmail.com wrote: On Thu, Sep 4, 2014 at 10:31 AM, Eelco Hoogendoorn hoogendoorn.ee...@gmail.com wrote: On Wed, Sep 3, 2014 at 6:46 PM, Jaime Fernández del Río jaime.f...@gmail.com wrote: On Wed, Sep 3, 2014 at 9:33 AM, Jaime Fernández del Río jaime.f...@gmail.com wrote: On Wed, Sep 3, 2014 at 6:41 AM, Eelco Hoogendoorn hoogendoorn.ee...@gmail.com wrote: Not sure about the hashing. Indeed one can also build an index of a set by means of a hash table, but its questionable if this leads to improved performance over performing an argsort. Hashing may have better asymptotic time complexity in theory, but many datasets used in practice are very easy to sort (O(N)-ish), and the time-constant of hashing is higher. But more importantly, using a hash-table guarantees poor cache behavior for many operations using this index. By contrast, sorting may (but need not) make one random access pass to build the index, and may (but need not) perform one random access to reorder values for grouping. But insofar as the keys are better behaved than pure random, this coherence will be exploited. If you want to give it a try, these branch of my numpy fork has hash table based implementations of unique (with no extra indices) and in1d: https://github.com/jaimefrio/numpy/tree/hash-unique A use cases where the hash table is clearly better: In [1]: import numpy as np In [2]: from numpy.lib._compiled_base import _unique, _in1d In [3]: a = np.random.randint(10, size=(1,)) In [4]: %timeit np.unique(a) 1000 loops, best of 3: 258 us per loop In [5]: %timeit _unique(a) 1 loops, best of 3: 143 us per loop In [6]: %timeit np.sort(_unique(a)) 1 loops, best of 3: 149 us per loop It typically performs between 1.5x and 4x faster than sorting. I haven't profiled it properly to know, but there may be quite a bit of performance to dig out: have type specific comparison functions, optimize the starting hash table size based on the size of the array to avoid reinsertions... If getting the elements sorted is a necessity, and the array contains very few or no repeated items, then the hash table approach may even perform worse,: In [8]: a = np.random.randint(1, size=(5000,)) In [9]: %timeit np.unique(a) 1000 loops, best of 3: 277 us per loop In [10]: %timeit np.sort(_unique(a)) 1000 loops, best of 3: 320 us per loop But the hash table still wins in extracting the unique items only: In [11]: %timeit _unique(a) 1 loops, best of 3: 187 us per loop Where the hash table shines is in more elaborate situations. If you keep the first index where it was found, and the number of repeats, in the hash table, you can get return_index and return_counts almost for free, which means you are performing an extra 3x faster than with sorting. return_inverse requires a little more trickery, so I won;t attempt to quantify the improvement. But I wouldn't be surprised if, after fine tuning it, there is close to an order of magnitude overall improvement The spped-up for in1d is also nice: In [16]: a = np.random.randint(1000, size=(1000,)) In [17]: b = np.random.randint(1000, size=(500,)) In [18]: %timeit np.in1d(a, b) 1000 loops, best of 3: 178 us per loop In [19]: %timeit _in1d(a, b) 1 loops, best of 3: 30.1 us per loop Of course, there is no point in Ooops!!! Hit the send button too quick. Not to extend myself too long: if we are going to rethink all of this, we should approach it with an open mind. Still, and this post is not helping with that either, I am afraid that we are discussing implementation details, but are missing a broader vision of what we want to accomplish and why. That vision of what numpy's grouping functionality, if any, should be, and how it complements or conflicts with what pandas is providing, should precede anything else. I now I haven't, but has anyone looked at how pandas implements grouping? Their documentation on the subject is well worth a read: http://pandas.pydata.org/pandas-docs/stable/groupby.html Does numpy need to replicate this? What/why/how can we add to that? Jaime -- (\__/) ( O.o) ( ) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus planes de dominación mundial. ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion I would certainly not be opposed to having a hashing based indexing mechanism; I think it would make sense design-wise to have a HashIndex class with the same interface as the rest, and use that subclass in those arraysetops where it makes sense. The 'how to' of indexing and its applications are largely orthogonal I think (with some tiny performance compromises which are worth the abstraction imo). For datasets which are not purely random, have many unique items, and which do not fit into
Re: [Numpy-discussion] Does a `mergesorted` function make sense?
I should clarify: I am speaking about my implementation, I havnt looked at the numpy implementation for a while so im not sure what it is up to. Note that by 'almost free', we are still talking about three passes over the whole array plus temp allocations, but I am assuming a use-case where the various sorts involved are the dominant cost, which I imagine they are, for out-of-cache sorts. Perhaps this isn't too realistic an assumption about the average use case though, I don't know. Though I suppose its a reasonable guideline to assume that either the dataset is big, or performance isn't that big a concern in the first place. On Thu, Sep 4, 2014 at 7:55 PM, Jaime Fernández del Río jaime.f...@gmail.com wrote: On Thu, Sep 4, 2014 at 10:39 AM, Eelco Hoogendoorn hoogendoorn.ee...@gmail.com wrote: On Thu, Sep 4, 2014 at 10:31 AM, Eelco Hoogendoorn hoogendoorn.ee...@gmail.com wrote: On Wed, Sep 3, 2014 at 6:46 PM, Jaime Fernández del Río jaime.f...@gmail.com wrote: On Wed, Sep 3, 2014 at 9:33 AM, Jaime Fernández del Río jaime.f...@gmail.com wrote: On Wed, Sep 3, 2014 at 6:41 AM, Eelco Hoogendoorn hoogendoorn.ee...@gmail.com wrote: Not sure about the hashing. Indeed one can also build an index of a set by means of a hash table, but its questionable if this leads to improved performance over performing an argsort. Hashing may have better asymptotic time complexity in theory, but many datasets used in practice are very easy to sort (O(N)-ish), and the time-constant of hashing is higher. But more importantly, using a hash-table guarantees poor cache behavior for many operations using this index. By contrast, sorting may (but need not) make one random access pass to build the index, and may (but need not) perform one random access to reorder values for grouping. But insofar as the keys are better behaved than pure random, this coherence will be exploited. If you want to give it a try, these branch of my numpy fork has hash table based implementations of unique (with no extra indices) and in1d: https://github.com/jaimefrio/numpy/tree/hash-unique A use cases where the hash table is clearly better: In [1]: import numpy as np In [2]: from numpy.lib._compiled_base import _unique, _in1d In [3]: a = np.random.randint(10, size=(1,)) In [4]: %timeit np.unique(a) 1000 loops, best of 3: 258 us per loop In [5]: %timeit _unique(a) 1 loops, best of 3: 143 us per loop In [6]: %timeit np.sort(_unique(a)) 1 loops, best of 3: 149 us per loop It typically performs between 1.5x and 4x faster than sorting. I haven't profiled it properly to know, but there may be quite a bit of performance to dig out: have type specific comparison functions, optimize the starting hash table size based on the size of the array to avoid reinsertions... If getting the elements sorted is a necessity, and the array contains very few or no repeated items, then the hash table approach may even perform worse,: In [8]: a = np.random.randint(1, size=(5000,)) In [9]: %timeit np.unique(a) 1000 loops, best of 3: 277 us per loop In [10]: %timeit np.sort(_unique(a)) 1000 loops, best of 3: 320 us per loop But the hash table still wins in extracting the unique items only: In [11]: %timeit _unique(a) 1 loops, best of 3: 187 us per loop Where the hash table shines is in more elaborate situations. If you keep the first index where it was found, and the number of repeats, in the hash table, you can get return_index and return_counts almost for free, which means you are performing an extra 3x faster than with sorting. return_inverse requires a little more trickery, so I won;t attempt to quantify the improvement. But I wouldn't be surprised if, after fine tuning it, there is close to an order of magnitude overall improvement The spped-up for in1d is also nice: In [16]: a = np.random.randint(1000, size=(1000,)) In [17]: b = np.random.randint(1000, size=(500,)) In [18]: %timeit np.in1d(a, b) 1000 loops, best of 3: 178 us per loop In [19]: %timeit _in1d(a, b) 1 loops, best of 3: 30.1 us per loop Of course, there is no point in Ooops!!! Hit the send button too quick. Not to extend myself too long: if we are going to rethink all of this, we should approach it with an open mind. Still, and this post is not helping with that either, I am afraid that we are discussing implementation details, but are missing a broader vision of what we want to accomplish and why. That vision of what numpy's grouping functionality, if any, should be, and how it complements or conflicts with what pandas is providing, should precede anything else. I now I haven't, but has anyone looked at how pandas implements grouping? Their documentation on the subject is well worth a read: http://pandas.pydata.org/pandas-docs/stable/groupby.html Does numpy need to replicate this? What/why/how can we add to that? Jaime -- (\__/) ( O.o) ( ) Este es Conejo. Copia a Conejo en
Re: [Numpy-discussion] Does a `mergesorted` function make sense?
On Thu, Sep 4, 2014 at 8:14 PM, Eelco Hoogendoorn hoogendoorn.ee...@gmail.com wrote: I should clarify: I am speaking about my implementation, I havnt looked at the numpy implementation for a while so im not sure what it is up to. Note that by 'almost free', we are still talking about three passes over the whole array plus temp allocations, but I am assuming a use-case where the various sorts involved are the dominant cost, which I imagine they are, for out-of-cache sorts. Perhaps this isn't too realistic an assumption about the average use case though, I don't know. Though I suppose its a reasonable guideline to assume that either the dataset is big, or performance isn't that big a concern in the first place. On Thu, Sep 4, 2014 at 7:55 PM, Jaime Fernández del Río jaime.f...@gmail.com wrote: On Thu, Sep 4, 2014 at 10:39 AM, Eelco Hoogendoorn hoogendoorn.ee...@gmail.com wrote: On Thu, Sep 4, 2014 at 10:31 AM, Eelco Hoogendoorn hoogendoorn.ee...@gmail.com wrote: On Wed, Sep 3, 2014 at 6:46 PM, Jaime Fernández del Río jaime.f...@gmail.com wrote: On Wed, Sep 3, 2014 at 9:33 AM, Jaime Fernández del Río jaime.f...@gmail.com wrote: On Wed, Sep 3, 2014 at 6:41 AM, Eelco Hoogendoorn hoogendoorn.ee...@gmail.com wrote: Not sure about the hashing. Indeed one can also build an index of a set by means of a hash table, but its questionable if this leads to improved performance over performing an argsort. Hashing may have better asymptotic time complexity in theory, but many datasets used in practice are very easy to sort (O(N)-ish), and the time-constant of hashing is higher. But more importantly, using a hash-table guarantees poor cache behavior for many operations using this index. By contrast, sorting may (but need not) make one random access pass to build the index, and may (but need not) perform one random access to reorder values for grouping. But insofar as the keys are better behaved than pure random, this coherence will be exploited. If you want to give it a try, these branch of my numpy fork has hash table based implementations of unique (with no extra indices) and in1d: https://github.com/jaimefrio/numpy/tree/hash-unique A use cases where the hash table is clearly better: In [1]: import numpy as np In [2]: from numpy.lib._compiled_base import _unique, _in1d In [3]: a = np.random.randint(10, size=(1,)) In [4]: %timeit np.unique(a) 1000 loops, best of 3: 258 us per loop In [5]: %timeit _unique(a) 1 loops, best of 3: 143 us per loop In [6]: %timeit np.sort(_unique(a)) 1 loops, best of 3: 149 us per loop It typically performs between 1.5x and 4x faster than sorting. I haven't profiled it properly to know, but there may be quite a bit of performance to dig out: have type specific comparison functions, optimize the starting hash table size based on the size of the array to avoid reinsertions... If getting the elements sorted is a necessity, and the array contains very few or no repeated items, then the hash table approach may even perform worse,: In [8]: a = np.random.randint(1, size=(5000,)) In [9]: %timeit np.unique(a) 1000 loops, best of 3: 277 us per loop In [10]: %timeit np.sort(_unique(a)) 1000 loops, best of 3: 320 us per loop But the hash table still wins in extracting the unique items only: In [11]: %timeit _unique(a) 1 loops, best of 3: 187 us per loop Where the hash table shines is in more elaborate situations. If you keep the first index where it was found, and the number of repeats, in the hash table, you can get return_index and return_counts almost for free, which means you are performing an extra 3x faster than with sorting. return_inverse requires a little more trickery, so I won;t attempt to quantify the improvement. But I wouldn't be surprised if, after fine tuning it, there is close to an order of magnitude overall improvement The spped-up for in1d is also nice: In [16]: a = np.random.randint(1000, size=(1000,)) In [17]: b = np.random.randint(1000, size=(500,)) In [18]: %timeit np.in1d(a, b) 1000 loops, best of 3: 178 us per loop In [19]: %timeit _in1d(a, b) 1 loops, best of 3: 30.1 us per loop Of course, there is no point in Ooops!!! Hit the send button too quick. Not to extend myself too long: if we are going to rethink all of this, we should approach it with an open mind. Still, and this post is not helping with that either, I am afraid that we are discussing implementation details, but are missing a broader vision of what we want to accomplish and why. That vision of what numpy's grouping functionality, if any, should be, and how it complements or conflicts with what pandas is providing, should precede anything else. I now I haven't, but has anyone looked at how pandas implements grouping? Their documentation on the subject is well worth a read: http://pandas.pydata.org/pandas-docs/stable/groupby.html Does numpy need to replicate this?
Re: [Numpy-discussion] Does a `mergesorted` function make sense?
Naturally, youd want to avoid redoing the indexing where you can, which is another good reason to factor out the indexing mechanisms into separate classes. A factor two performance difference does not get me too excited; again, I think it would be the other way around for an out-of-cache dataset being grouped. But this by itself is ofcourse another argument for factoring out the indexing behind a uniform interface, so we can play around with those implementation details later, and specialize the indexing to serve different scenarios. Also, it really helps with code maintainability; most arraysetops are almost trivial to implement once you have abstracted away the indexing machinery. On Thu, Sep 4, 2014 at 8:36 PM, Jeff Reback jeffreb...@gmail.com wrote: FYI pandas DOES use a very performant hash table impl for unique (and value_counts). Sorted state IS maintained by underlying Index implmentation. https://github.com/pydata/pandas/blob/master/pandas/hashtable.pyx In [8]: a = np.random.randint(10, size=(1,)) In [9]: %timeit np.unique(a) 1000 loops, best of 3: 284 µs per loop In [10]: %timeit Series(a).unique() 1 loops, best of 3: 161 µs per loop In [11]: s = Series(a) # without the creation overhead In [12]: %timeit s.unique() 1 loops, best of 3: 75.3 µs per loop On Thu, Sep 4, 2014 at 2:29 PM, Eelco Hoogendoorn hoogendoorn.ee...@gmail.com wrote: On Thu, Sep 4, 2014 at 8:14 PM, Eelco Hoogendoorn hoogendoorn.ee...@gmail.com wrote: I should clarify: I am speaking about my implementation, I havnt looked at the numpy implementation for a while so im not sure what it is up to. Note that by 'almost free', we are still talking about three passes over the whole array plus temp allocations, but I am assuming a use-case where the various sorts involved are the dominant cost, which I imagine they are, for out-of-cache sorts. Perhaps this isn't too realistic an assumption about the average use case though, I don't know. Though I suppose its a reasonable guideline to assume that either the dataset is big, or performance isn't that big a concern in the first place. On Thu, Sep 4, 2014 at 7:55 PM, Jaime Fernández del Río jaime.f...@gmail.com wrote: On Thu, Sep 4, 2014 at 10:39 AM, Eelco Hoogendoorn hoogendoorn.ee...@gmail.com wrote: On Thu, Sep 4, 2014 at 10:31 AM, Eelco Hoogendoorn hoogendoorn.ee...@gmail.com wrote: On Wed, Sep 3, 2014 at 6:46 PM, Jaime Fernández del Río jaime.f...@gmail.com wrote: On Wed, Sep 3, 2014 at 9:33 AM, Jaime Fernández del Río jaime.f...@gmail.com wrote: On Wed, Sep 3, 2014 at 6:41 AM, Eelco Hoogendoorn hoogendoorn.ee...@gmail.com wrote: Not sure about the hashing. Indeed one can also build an index of a set by means of a hash table, but its questionable if this leads to improved performance over performing an argsort. Hashing may have better asymptotic time complexity in theory, but many datasets used in practice are very easy to sort (O(N)-ish), and the time-constant of hashing is higher. But more importantly, using a hash-table guarantees poor cache behavior for many operations using this index. By contrast, sorting may (but need not) make one random access pass to build the index, and may (but need not) perform one random access to reorder values for grouping. But insofar as the keys are better behaved than pure random, this coherence will be exploited. If you want to give it a try, these branch of my numpy fork has hash table based implementations of unique (with no extra indices) and in1d: https://github.com/jaimefrio/numpy/tree/hash-unique A use cases where the hash table is clearly better: In [1]: import numpy as np In [2]: from numpy.lib._compiled_base import _unique, _in1d In [3]: a = np.random.randint(10, size=(1,)) In [4]: %timeit np.unique(a) 1000 loops, best of 3: 258 us per loop In [5]: %timeit _unique(a) 1 loops, best of 3: 143 us per loop In [6]: %timeit np.sort(_unique(a)) 1 loops, best of 3: 149 us per loop It typically performs between 1.5x and 4x faster than sorting. I haven't profiled it properly to know, but there may be quite a bit of performance to dig out: have type specific comparison functions, optimize the starting hash table size based on the size of the array to avoid reinsertions... If getting the elements sorted is a necessity, and the array contains very few or no repeated items, then the hash table approach may even perform worse,: In [8]: a = np.random.randint(1, size=(5000,)) In [9]: %timeit np.unique(a) 1000 loops, best of 3: 277 us per loop In [10]: %timeit np.sort(_unique(a)) 1000 loops, best of 3: 320 us per loop But the hash table still wins in extracting the unique items only: In [11]: %timeit _unique(a) 1 loops, best of 3: 187 us per loop Where the hash table shines is in more elaborate situations. If you keep the first index where it was found, and the number of repeats, in
Re: [Numpy-discussion] Does a `mergesorted` function make sense?
Re: [Numpy-discussion] Does a `mergesorted` function make sense? On 02/09/2014 8:40 PM, Charles R Harris wrote: On Mon, Sep 1, 2014 at 7:58 AM, Eelco Hoogendoorn hoogendoorn.ee...@gmail.com mailto:hoogendoorn.ee...@gmail.com wrote: On Mon, Sep 1, 2014 at 2:05 PM, Charles R Harris charlesr.har...@gmail.com mailto:charlesr.har...@gmail.com wrote: On Mon, Sep 1, 2014 at 1:49 AM, Eelco Hoogendoorn hoogendoorn.ee...@gmail.com mailto:hoogendoorn.ee...@gmail.com wrote: Sure, id like to do the hashing things out, but I would also like some preliminary feedback as to whether this is going in a direction anyone else sees the point of, if it conflicts with other plans, and indeed if we can agree that numpy is the right place for it; a point which I would very much like to defend. If there is some obvious no-go that im missing, I can do without the drudgery of writing proper documentation ;). These are good issues, that need to be discussed and resolved. Python has the benefit of having a BDFL. Numpy has no similar arrangement. In the post-numarray period, Travis Oliphant took that role and advanced the package in many ways. It seems that Travis is no longer available. There is a need for someone, or maybe some group of three, who would be guided by the sort of thinking Charles Harris sets out above, who would make the final decision. Colin W. ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Does a `mergesorted` function make sense?
On Wed, Sep 3, 2014 at 4:07 AM, Jaime Fernández del Río jaime.f...@gmail.com wrote: On Tue, Sep 2, 2014 at 5:40 PM, Charles R Harris charlesr.har...@gmail.com wrote: What do you think about the suggestion of timsort? One would need to concatenate the arrays before sorting, but it should be fairly efficient. Timsort is very cool, and it would definitely be fun to implement in numpy. It is also a lot more work that merging two sorted arrays! I think +1 if someone else does it, but although I would love to be part of it, I am not sure I will be able to find time to look into it seriously in the next couple of months. From a setops point of view, merging two sorted arrays makes it very straightforward to compute, together with (or instead of) the result of the merge, index arrays that let you calculate things like `in1d` faster. Although perhaps an `argtimsort` could provide the same functionality, not sure. I will probably wrap up what I have, put a lace on it, and submit it as a PR. Even if it is not destined to be merged, it may serve as a warning to others. I have also been thinking lately that one of the problems with all these unique-related computations may be a case of having a hammer and seeing everything as nails. Perhaps the algorithm that needs to be ported from Python is not the sorting one, but the hash table... Jaime Not sure about the hashing. Indeed one can also build an index of a set by means of a hash table, but its questionable if this leads to improved performance over performing an argsort. Hashing may have better asymptotic time complexity in theory, but many datasets used in practice are very easy to sort (O(N)-ish), and the time-constant of hashing is higher. But more importantly, using a hash-table guarantees poor cache behavior for many operations using this index. By contrast, sorting may (but need not) make one random access pass to build the index, and may (but need not) perform one random access to reorder values for grouping. But insofar as the keys are better behaved than pure random, this coherence will be exploited. Also, getting the unique values/keys in sorted order is a nice side-benefit for many applications. ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Does a `mergesorted` function make sense?
On Wed, Sep 3, 2014 at 6:41 AM, Eelco Hoogendoorn hoogendoorn.ee...@gmail.com wrote: Not sure about the hashing. Indeed one can also build an index of a set by means of a hash table, but its questionable if this leads to improved performance over performing an argsort. Hashing may have better asymptotic time complexity in theory, but many datasets used in practice are very easy to sort (O(N)-ish), and the time-constant of hashing is higher. But more importantly, using a hash-table guarantees poor cache behavior for many operations using this index. By contrast, sorting may (but need not) make one random access pass to build the index, and may (but need not) perform one random access to reorder values for grouping. But insofar as the keys are better behaved than pure random, this coherence will be exploited. If you want to give it a try, these branch of my numpy fork has hash table based implementations of unique (with no extra indices) and in1d: https://github.com/jaimefrio/numpy/tree/hash-unique A use cases where the hash table is clearly better: In [1]: import numpy as np In [2]: from numpy.lib._compiled_base import _unique, _in1d In [3]: a = np.random.randint(10, size=(1,)) In [4]: %timeit np.unique(a) 1000 loops, best of 3: 258 us per loop In [5]: %timeit _unique(a) 1 loops, best of 3: 143 us per loop In [6]: %timeit np.sort(_unique(a)) 1 loops, best of 3: 149 us per loop It typically performs between 1.5x and 4x faster than sorting. I haven't profiled it properly to know, but there may be quite a bit of performance to dig out: have type specific comparison functions, optimize the starting hash table size based on the size of the array to avoid reinsertions... If getting the elements sorted is a necessity, and the array contains very few or no repeated items, then the hash table approach may even perform worse,: In [8]: a = np.random.randint(1, size=(5000,)) In [9]: %timeit np.unique(a) 1000 loops, best of 3: 277 us per loop In [10]: %timeit np.sort(_unique(a)) 1000 loops, best of 3: 320 us per loop But the hash table still wins in extracting the unique items only: In [11]: %timeit _unique(a) 1 loops, best of 3: 187 us per loop Where the hash table shines is in more elaborate situations. If you keep the first index where it was found, and the number of repeats, in the hash table, you can get return_index and return_counts almost for free, which means you are performing an extra 3x faster than with sorting. return_inverse requires a little more trickery, so I won;t attempt to quantify the improvement. But I wouldn't be surprised if, after fine tuning it, there is close to an order of magnitude overall improvement The spped-up for in1d is also nice: In [16]: a = np.random.randint(1000, size=(1000,)) In [17]: b = np.random.randint(1000, size=(500,)) In [18]: %timeit np.in1d(a, b) 1000 loops, best of 3: 178 us per loop In [19]: %timeit _in1d(a, b) 1 loops, best of 3: 30.1 us per loop Of course, there is no point in ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Does a `mergesorted` function make sense?
On Wed, Sep 3, 2014 at 9:33 AM, Jaime Fernández del Río jaime.f...@gmail.com wrote: On Wed, Sep 3, 2014 at 6:41 AM, Eelco Hoogendoorn hoogendoorn.ee...@gmail.com wrote: Not sure about the hashing. Indeed one can also build an index of a set by means of a hash table, but its questionable if this leads to improved performance over performing an argsort. Hashing may have better asymptotic time complexity in theory, but many datasets used in practice are very easy to sort (O(N)-ish), and the time-constant of hashing is higher. But more importantly, using a hash-table guarantees poor cache behavior for many operations using this index. By contrast, sorting may (but need not) make one random access pass to build the index, and may (but need not) perform one random access to reorder values for grouping. But insofar as the keys are better behaved than pure random, this coherence will be exploited. If you want to give it a try, these branch of my numpy fork has hash table based implementations of unique (with no extra indices) and in1d: https://github.com/jaimefrio/numpy/tree/hash-unique A use cases where the hash table is clearly better: In [1]: import numpy as np In [2]: from numpy.lib._compiled_base import _unique, _in1d In [3]: a = np.random.randint(10, size=(1,)) In [4]: %timeit np.unique(a) 1000 loops, best of 3: 258 us per loop In [5]: %timeit _unique(a) 1 loops, best of 3: 143 us per loop In [6]: %timeit np.sort(_unique(a)) 1 loops, best of 3: 149 us per loop It typically performs between 1.5x and 4x faster than sorting. I haven't profiled it properly to know, but there may be quite a bit of performance to dig out: have type specific comparison functions, optimize the starting hash table size based on the size of the array to avoid reinsertions... If getting the elements sorted is a necessity, and the array contains very few or no repeated items, then the hash table approach may even perform worse,: In [8]: a = np.random.randint(1, size=(5000,)) In [9]: %timeit np.unique(a) 1000 loops, best of 3: 277 us per loop In [10]: %timeit np.sort(_unique(a)) 1000 loops, best of 3: 320 us per loop But the hash table still wins in extracting the unique items only: In [11]: %timeit _unique(a) 1 loops, best of 3: 187 us per loop Where the hash table shines is in more elaborate situations. If you keep the first index where it was found, and the number of repeats, in the hash table, you can get return_index and return_counts almost for free, which means you are performing an extra 3x faster than with sorting. return_inverse requires a little more trickery, so I won;t attempt to quantify the improvement. But I wouldn't be surprised if, after fine tuning it, there is close to an order of magnitude overall improvement The spped-up for in1d is also nice: In [16]: a = np.random.randint(1000, size=(1000,)) In [17]: b = np.random.randint(1000, size=(500,)) In [18]: %timeit np.in1d(a, b) 1000 loops, best of 3: 178 us per loop In [19]: %timeit _in1d(a, b) 1 loops, best of 3: 30.1 us per loop Of course, there is no point in Ooops!!! Hit the send button too quick. Not to extend myself too long: if we are going to rethink all of this, we should approach it with an open mind. Still, and this post is not helping with that either, I am afraid that we are discussing implementation details, but are missing a broader vision of what we want to accomplish and why. That vision of what numpy's grouping functionality, if any, should be, and how it complements or conflicts with what pandas is providing, should precede anything else. I now I haven't, but has anyone looked at how pandas implements grouping? Their documentation on the subject is well worth a read: http://pandas.pydata.org/pandas-docs/stable/groupby.html Does numpy need to replicate this? What/why/how can we add to that? Jaime -- (\__/) ( O.o) ( ) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus planes de dominación mundial. ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Does a `mergesorted` function make sense?
On Wed, Sep 3, 2014 at 5:26 AM, cjw c...@ncf.ca wrote: These are good issues, that need to be discussed and resolved. Python has the benefit of having a BDFL. Numpy has no similar arrangement. In the post-numarray period, Travis Oliphant took that role and advanced the package in many ways. It seems that Travis is no longer available. There is a need for someone, or maybe some group of three, who would be guided by the sort of thinking Charles Harris sets out above, who would make the final decision. We should crown Charles philosopher king, and let him wisely rule over us with the aid of his aristocracy of devs with merge rights. We would make Plato proud! :-) Jaime -- (\__/) ( O.o) ( ) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus planes de dominación mundial. ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Does a `mergesorted` function make sense?
On Wed, Sep 3, 2014 at 10:53 AM, Jaime Fernández del Río jaime.f...@gmail.com wrote: On Wed, Sep 3, 2014 at 5:26 AM, cjw c...@ncf.ca wrote: These are good issues, that need to be discussed and resolved. Python has the benefit of having a BDFL. Numpy has no similar arrangement. In the post-numarray period, Travis Oliphant took that role and advanced the package in many ways. It seems that Travis is no longer available. There is a need for someone, or maybe some group of three, who would be guided by the sort of thinking Charles Harris sets out above, who would make the final decision. We should crown Charles philosopher king, and let him wisely rule over us with the aid of his aristocracy of devs with merge rights. We would make Plato proud! :-) Charles I needs an heir in case of execution by the Parliamentary forces. Chuck ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Does a `mergesorted` function make sense?
On Mon, Sep 1, 2014 at 7:58 AM, Eelco Hoogendoorn hoogendoorn.ee...@gmail.com wrote: On Mon, Sep 1, 2014 at 2:05 PM, Charles R Harris charlesr.har...@gmail.com wrote: On Mon, Sep 1, 2014 at 1:49 AM, Eelco Hoogendoorn hoogendoorn.ee...@gmail.com wrote: Sure, id like to do the hashing things out, but I would also like some preliminary feedback as to whether this is going in a direction anyone else sees the point of, if it conflicts with other plans, and indeed if we can agree that numpy is the right place for it; a point which I would very much like to defend. If there is some obvious no-go that im missing, I can do without the drudgery of writing proper documentation ;). As for whether this belongs in numpy: yes, I would say so. There are the extension of functionality to functions already in numpy, which are a no-brainer (it need not cost anything performance wise, and ive needed unique graph edges many many times), and there is the grouping functionality, which is the main novelty. However, note that the grouping functionality itself is a very small addition, just a few 100 lines of pure python, given that the indexing logic has been factored out of the classic arraysetops. At least from a developers perspective, it very much feels like a logical extension of the same 'thing'. But also from a conceptual numpy perspective, grouping is really more an 'elementary manipulation of an ndarray' than a 'special purpose algorithm'. It is useful for literally all kinds of programming; hence there is similar functionality in the python standard library (itertools.groupby); so why not have an efficient vectorized equivalent in numpy? It belongs there more than the linalg module, arguably. Also, from a community perspective, a significant fraction of all stackoverflow numpy questions are (unknowingly) exactly about 'how to do grouping in numpy'. What I'm trying to say is that numpy is a community project. We don't have a central planning committee, the only difference between developers and everyone else is activity and commit rights. Which is to say if you develop and push this topic it is likely to go in. There certainly seems to be interest in this functionality. The reason that I brought up scipy is that there are some graph algorithms there that went in a couple of years ago. Note that the convention on the list is bottom posting. snip Chuck ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion I understand that numpy is a community project, so that the decision isn't up to any one particular person; but some early stage feedback from those active in the community would be welcome. I am generally confident that this addition makes sense, but I have not contributed to numpy before, and you don't know what you don't know and all... given that there are multiple suggestions for changing arraysetops, some coordination would be useful I think. Note that I use graph edges merely as an example; the proposed functionality is much more general than graphing algorithms specifically. The radial reduction https://github.com/EelcoHoogendoorn/Numpy_arraysetops_EP/blob/master/examples.pyexample I included on github is particularly illustrative of the general utility of grouping functionality I think. Operations like radial reductions are rather common, and a custom implementation is quite lengthy, very bug prone, and potentially very slow. Thanks for the heads up on posting convention; ive always let gmail do my thinking for me, which works well enough for me, but I can see how not following this convention is annoying to others. What do you think about the suggestion of timsort? One would need to concatenate the arrays before sorting, but it should be fairly efficient. Chuck ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Does a `mergesorted` function make sense?
On Tue, Sep 2, 2014 at 5:40 PM, Charles R Harris charlesr.har...@gmail.com wrote: What do you think about the suggestion of timsort? One would need to concatenate the arrays before sorting, but it should be fairly efficient. Timsort is very cool, and it would definitely be fun to implement in numpy. It is also a lot more work that merging two sorted arrays! I think +1 if someone else does it, but although I would love to be part of it, I am not sure I will be able to find time to look into it seriously in the next couple of months. From a setops point of view, merging two sorted arrays makes it very straightforward to compute, together with (or instead of) the result of the merge, index arrays that let you calculate things like `in1d` faster. Although perhaps an `argtimsort` could provide the same functionality, not sure. I will probably wrap up what I have, put a lace on it, and submit it as a PR. Even if it is not destined to be merged, it may serve as a warning to others. I have also been thinking lately that one of the problems with all these unique-related computations may be a case of having a hammer and seeing everything as nails. Perhaps the algorithm that needs to be ported from Python is not the sorting one, but the hash table... Jaime -- (\__/) ( O.o) ( ) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus planes de dominación mundial. ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Does a `mergesorted` function make sense?
Sure, id like to do the hashing things out, but I would also like some preliminary feedback as to whether this is going in a direction anyone else sees the point of, if it conflicts with other plans, and indeed if we can agree that numpy is the right place for it; a point which I would very much like to defend. If there is some obvious no-go that im missing, I can do without the drudgery of writing proper documentation ;). As for whether this belongs in numpy: yes, I would say so. There are the extension of functionality to functions already in numpy, which are a no-brainer (it need not cost anything performance wise, and ive needed unique graph edges many many times), and there is the grouping functionality, which is the main novelty. However, note that the grouping functionality itself is a very small addition, just a few 100 lines of pure python, given that the indexing logic has been factored out of the classic arraysetops. At least from a developers perspective, it very much feels like a logical extension of the same 'thing'. But also from a conceptual numpy perspective, grouping is really more an 'elementary manipulation of an ndarray' than a 'special purpose algorithm'. It is useful for literally all kinds of programming; hence there is similar functionality in the python standard library (itertools.groupby); so why not have an efficient vectorized equivalent in numpy? It belongs there more than the linalg module, arguably. Also, from a community perspective, a significant fraction of all stackoverflow numpy questions are (unknowingly) exactly about 'how to do grouping in numpy'. On Mon, Sep 1, 2014 at 4:36 AM, Charles R Harris charlesr.har...@gmail.com wrote: On Sun, Aug 31, 2014 at 1:48 PM, Eelco Hoogendoorn hoogendoorn.ee...@gmail.com wrote: Ive organized all code I had relating to this subject in a github repository https://github.com/EelcoHoogendoorn/Numpy_arraysetops_EP. That should facilitate shooting around ideas. Ive also added more documentation and structure to make it easier to see what is going on. Hopefully we can converge on a common vision, and then improve the documentation and testing to make it worthy of including in the numpy master. Note that there is also a complete rewrite of the classic numpy.arraysetops, such that they are also generalized to more complex input, such as finding unique graph edges, and so on. You mentioned getting the numpy core developers involved; are they not subscribed to this mailing list? I wouldn't be surprised; youd hope there is a channel of discussion concerning development with higher signal to noise There are only about 2.5 of us at the moment. Those for whom this is an itch that need scratching should hash things out and make a PR. The main question for me is if it belongs in numpy, scipy, or somewhere else. Chuck ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Does a `mergesorted` function make sense?
On Mon, Sep 1, 2014 at 1:49 AM, Eelco Hoogendoorn hoogendoorn.ee...@gmail.com wrote: Sure, id like to do the hashing things out, but I would also like some preliminary feedback as to whether this is going in a direction anyone else sees the point of, if it conflicts with other plans, and indeed if we can agree that numpy is the right place for it; a point which I would very much like to defend. If there is some obvious no-go that im missing, I can do without the drudgery of writing proper documentation ;). As for whether this belongs in numpy: yes, I would say so. There are the extension of functionality to functions already in numpy, which are a no-brainer (it need not cost anything performance wise, and ive needed unique graph edges many many times), and there is the grouping functionality, which is the main novelty. However, note that the grouping functionality itself is a very small addition, just a few 100 lines of pure python, given that the indexing logic has been factored out of the classic arraysetops. At least from a developers perspective, it very much feels like a logical extension of the same 'thing'. But also from a conceptual numpy perspective, grouping is really more an 'elementary manipulation of an ndarray' than a 'special purpose algorithm'. It is useful for literally all kinds of programming; hence there is similar functionality in the python standard library (itertools.groupby); so why not have an efficient vectorized equivalent in numpy? It belongs there more than the linalg module, arguably. Also, from a community perspective, a significant fraction of all stackoverflow numpy questions are (unknowingly) exactly about 'how to do grouping in numpy'. What I'm trying to say is that numpy is a community project. We don't have a central planning committee, the only difference between developers and everyone else is activity and commit rights. Which is to say if you develop and push this topic it is likely to go in. There certainly seems to be interest in this functionality. The reason that I brought up scipy is that there are some graph algorithms there that went in a couple of years ago. Note that the convention on the list is bottom posting. snip Chuck ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Does a `mergesorted` function make sense?
On Mon, Sep 1, 2014 at 8:49 AM, Eelco Hoogendoorn hoogendoorn.ee...@gmail.com wrote: Sure, id like to do the hashing things out, but I would also like some preliminary feedback as to whether this is going in a direction anyone else sees the point of, if it conflicts with other plans, and indeed if we can agree that numpy is the right place for it; a point which I would very much like to defend. If there is some obvious no-go that im missing, I can do without the drudgery of writing proper documentation ;). As for whether this belongs in numpy: yes, I would say so. There are the extension of functionality to functions already in numpy, which are a no-brainer (it need not cost anything performance wise, and ive needed unique graph edges many many times), and there is the grouping functionality, which is the main novelty. However, note that the grouping functionality itself is a very small addition, just a few 100 lines of pure python, given that the indexing logic has been factored out of the classic arraysetops. At least from a developers perspective, it very much feels like a logical extension of the same 'thing'. My 2 cents: I definitely agree that this is very useful fundamental functionality, and it would be great if numpy had a solution for it out of the box. My main concern is that this is a fairly complicated set of functionality and there are a lot of small decisions to be made in setting up the API for it. IME it's very hard to just read through an API like this and reason out the best way to do it by pure logic; usually it needs to get banged on for a bit in real uses before it becomes clear what the right set of trade-offs is. And numpy itself is not a great environment these kinds of iterations. So, IMO the main challenge is: how do we get the functionality into a state where we can convince ourselves that it'll be supportable in numpy indefinitely, and not need to be replaced in a year or two? Some things that might help with this convincing: - releasing it as a small standalone package on pypi and getting some real users to bang on it - any real code written against the APIs - feedback from the pandas community since they've spent a lot of time working on these issues - ...? -n -- Nathaniel J. Smith Postdoctoral researcher - Informatics - University of Edinburgh http://vorpus.org ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Does a `mergesorted` function make sense?
On Mon, Sep 1, 2014 at 2:05 PM, Charles R Harris charlesr.har...@gmail.com wrote: On Mon, Sep 1, 2014 at 1:49 AM, Eelco Hoogendoorn hoogendoorn.ee...@gmail.com wrote: Sure, id like to do the hashing things out, but I would also like some preliminary feedback as to whether this is going in a direction anyone else sees the point of, if it conflicts with other plans, and indeed if we can agree that numpy is the right place for it; a point which I would very much like to defend. If there is some obvious no-go that im missing, I can do without the drudgery of writing proper documentation ;). As for whether this belongs in numpy: yes, I would say so. There are the extension of functionality to functions already in numpy, which are a no-brainer (it need not cost anything performance wise, and ive needed unique graph edges many many times), and there is the grouping functionality, which is the main novelty. However, note that the grouping functionality itself is a very small addition, just a few 100 lines of pure python, given that the indexing logic has been factored out of the classic arraysetops. At least from a developers perspective, it very much feels like a logical extension of the same 'thing'. But also from a conceptual numpy perspective, grouping is really more an 'elementary manipulation of an ndarray' than a 'special purpose algorithm'. It is useful for literally all kinds of programming; hence there is similar functionality in the python standard library (itertools.groupby); so why not have an efficient vectorized equivalent in numpy? It belongs there more than the linalg module, arguably. Also, from a community perspective, a significant fraction of all stackoverflow numpy questions are (unknowingly) exactly about 'how to do grouping in numpy'. What I'm trying to say is that numpy is a community project. We don't have a central planning committee, the only difference between developers and everyone else is activity and commit rights. Which is to say if you develop and push this topic it is likely to go in. There certainly seems to be interest in this functionality. The reason that I brought up scipy is that there are some graph algorithms there that went in a couple of years ago. Note that the convention on the list is bottom posting. snip Chuck ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion I understand that numpy is a community project, so that the decision isn't up to any one particular person; but some early stage feedback from those active in the community would be welcome. I am generally confident that this addition makes sense, but I have not contributed to numpy before, and you don't know what you don't know and all... given that there are multiple suggestions for changing arraysetops, some coordination would be useful I think. Note that I use graph edges merely as an example; the proposed functionality is much more general than graphing algorithms specifically. The radial reduction https://github.com/EelcoHoogendoorn/Numpy_arraysetops_EP/blob/master/examples.pyexample I included on github is particularly illustrative of the general utility of grouping functionality I think. Operations like radial reductions are rather common, and a custom implementation is quite lengthy, very bug prone, and potentially very slow. Thanks for the heads up on posting convention; ive always let gmail do my thinking for me, which works well enough for me, but I can see how not following this convention is annoying to others. ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Does a `mergesorted` function make sense?
Ive organized all code I had relating to this subject in a github repository https://github.com/EelcoHoogendoorn/Numpy_arraysetops_EP. That should facilitate shooting around ideas. Ive also added more documentation and structure to make it easier to see what is going on. Hopefully we can converge on a common vision, and then improve the documentation and testing to make it worthy of including in the numpy master. Note that there is also a complete rewrite of the classic numpy.arraysetops, such that they are also generalized to more complex input, such as finding unique graph edges, and so on. You mentioned getting the numpy core developers involved; are they not subscribed to this mailing list? I wouldn't be surprised; youd hope there is a channel of discussion concerning development with higher signal to noise On Thu, Aug 28, 2014 at 1:49 AM, Eelco Hoogendoorn hoogendoorn.ee...@gmail.com wrote: I just checked the docs on ufuncs, and it appears that's a solved problem now, since ufunc.reduceat now comes with an axis argument. Or maybe it already did when I wrote that, but I simply wasn't paying attention. Either way, the code is fully vectorized now, in both grouped and non-grouped axes. Its a lot of code, but all that happens for a grouping other than some O(1) and O(n) stuff is an argsort of the keys, and then the reduction itself, all fully vectorized. Note that I sort the values first, and then use ufunc.reduceat on the groups. It would seem to me that ufunc.at should be more efficient, by avoiding this indirection, but testing very much revealed the opposite, for reasons unclear to me. Perhaps that's changed now as well. On Wed, Aug 27, 2014 at 11:32 PM, Jaime Fernández del Río jaime.f...@gmail.com wrote: Yes, I was aware of that. But the point would be to provide true vectorization on those operations. The way I see it, numpy may not have to have a GroupBy implementation, but it should at least enable implementing one that is fast and efficient over any axis. On Wed, Aug 27, 2014 at 12:38 PM, Eelco Hoogendoorn hoogendoorn.ee...@gmail.com wrote: i.e, if the grouped axis is small but the other axes are not, you could write this, which avoids the python loop over the long axis that np.vectorize would otherwise perform. import numpy as np from grouping import group_by keys = np.random.randint(0,4,10) values = np.random.rand(10,2000) for k,g in zip(*group_by(keys)(values)): print k, g.mean(0) On Wed, Aug 27, 2014 at 9:29 PM, Eelco Hoogendoorn hoogendoorn.ee...@gmail.com wrote: f.i., this works as expected as well (100 keys of 1d int arrays and 100 values of 1d float arrays): group_by(randint(0,4,(100,2))).mean(rand(100,2)) On Wed, Aug 27, 2014 at 9:27 PM, Eelco Hoogendoorn hoogendoorn.ee...@gmail.com wrote: If I understand you correctly, the current implementation supports these operations. All reductions over groups (except for median) are performed through the corresponding ufunc (see GroupBy.reduce). This works on multidimensional arrays as well, although this broadcasting over the non-grouping axes is accomplished using np.vectorize. Actual vectorization only happens over the axis being grouped over, but this is usually a long axis. If it isn't, it is more efficient to perform a reduction by means of splitting the array by its groups first, and then map the iterable of groups over some reduction operation (as noted in the docstring of GroupBy.reduce). On Wed, Aug 27, 2014 at 8:29 PM, Jaime Fernández del Río jaime.f...@gmail.com wrote: Hi Eelco, I took a deeper look into your code a couple of weeks back. I don't think I have fully grasped what it allows completely, but I agree that some form of what you have there is highly desirable. Along the same lines, for sometime I have been thinking that the right place for a `groupby` in numpy is as a method of ufuncs, so that `np.add.groupby(arr, groups)` would do a multidimensional version of `np.bincount(groups, weights=arr)`. You would then need a more powerful version of `np.unique` to produce the `groups`, but that is something that Joe Kington's old PR was very close to achieving, that should probably be resurrected as well. But yes, there seems to be material for a NEP here, and some guidance from one of the numpy devs would be helpful in getting this somewhere. Jaime On Wed, Aug 27, 2014 at 10:35 AM, Eelco Hoogendoorn hoogendoorn.ee...@gmail.com wrote: It wouldn't hurt to have this function, but my intuition is that its use will be minimal. If you are already working with sorted arrays, you already have a flop cost on that order of magnitude, and the optimized merge saves you a factor two at the very most. Using numpy means you are sacrificing factors of two and beyond relative to pure C left right and center anyway, so if this kind of thing matters to you, you probably wont be working in numpy in the first place. That said,
[Numpy-discussion] Does a `mergesorted` function make sense?
A request was open in github to add a `merge` function to numpy that would merge two sorted 1d arrays into a single sorted 1d array. I have been playing around with that idea for a while, and have a branch in my numpy fork that adds a `mergesorted` function to `numpy.lib`: https://github.com/jaimefrio/numpy/commit/ce5d480afecc989a36e5d2bf4ea1d1ba58a83b0a I drew inspiration from C++ STL algorithms, and merged into a single function what merge, set_union, set_intersection, set_difference and set_symmetric_difference do there. My first thought when implementing this was to not make it a public function, but use it under the hood to speed-up some of the functions of `arraysetops.py`, which are now merging two already sorted functions by doing `np.sort(np.concatenate((a, b)))`. I would need to revisit my testing, but the speed-ups weren't that great. One other thing I saw value in for some of the `arraysetops.py` functions, but couldn't fully figure out, was in providing extra output aside from the merged arrays, either in the form of indices, or of boolean masks, indicating which items of the original arrays made it into the merged one, and/or where did they end up in it. Since there is at least one other person out there that likes it, is there any more interest in such a function? If yes, any comments on what the proper interface for extra output should be? Although perhaps the best is to leave that out for starters and see what use people make of it, if any. Jaime -- (\__/) ( O.o) ( ) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus planes de dominación mundial. ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Does a `mergesorted` function make sense?
It wouldn't hurt to have this function, but my intuition is that its use will be minimal. If you are already working with sorted arrays, you already have a flop cost on that order of magnitude, and the optimized merge saves you a factor two at the very most. Using numpy means you are sacrificing factors of two and beyond relative to pure C left right and center anyway, so if this kind of thing matters to you, you probably wont be working in numpy in the first place. That said, I share your interest in overhauling arraysetops. I see many opportunities for expanding its functionality. There is a question that amounts to 'how do I do group-by in numpy' on stackoverflow almost every week. That would have my top priority, but also things like extending np.unique to things like graph edges, or other more complex input, is very often useful to me. Ive written up a draft http://pastebin.com/c5WLWPbpa while ago which accomplishes all of the above and more. It reimplements functions like np.unique around a common Index object. This index object encapsulates the precomputation (sorting) required for efficient set-ops on different datatypes, and provides a common interface to obtain the kind of information you are talking about (which is used extensively internally in the implementation of group_by, for instance). ie, this functionality allows you to write neat things like group_by(randint(0,9,(100,2))).median(rand(100)) But I have the feeling much more could be done in this direction, and I feel this draft could really use a bit of back and forth. If we are going to completely rewrite arraysetops, we might as well do it right. On Wed, Aug 27, 2014 at 7:02 PM, Jaime Fernández del Río jaime.f...@gmail.com wrote: A request was open in github to add a `merge` function to numpy that would merge two sorted 1d arrays into a single sorted 1d array. I have been playing around with that idea for a while, and have a branch in my numpy fork that adds a `mergesorted` function to `numpy.lib`: https://github.com/jaimefrio/numpy/commit/ce5d480afecc989a36e5d2bf4ea1d1ba58a83b0a I drew inspiration from C++ STL algorithms, and merged into a single function what merge, set_union, set_intersection, set_difference and set_symmetric_difference do there. My first thought when implementing this was to not make it a public function, but use it under the hood to speed-up some of the functions of `arraysetops.py`, which are now merging two already sorted functions by doing `np.sort(np.concatenate((a, b)))`. I would need to revisit my testing, but the speed-ups weren't that great. One other thing I saw value in for some of the `arraysetops.py` functions, but couldn't fully figure out, was in providing extra output aside from the merged arrays, either in the form of indices, or of boolean masks, indicating which items of the original arrays made it into the merged one, and/or where did they end up in it. Since there is at least one other person out there that likes it, is there any more interest in such a function? If yes, any comments on what the proper interface for extra output should be? Although perhaps the best is to leave that out for starters and see what use people make of it, if any. Jaime -- (\__/) ( O.o) ( ) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus planes de dominación mundial. ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Does a `mergesorted` function make sense?
Hi Eelco, I took a deeper look into your code a couple of weeks back. I don't think I have fully grasped what it allows completely, but I agree that some form of what you have there is highly desirable. Along the same lines, for sometime I have been thinking that the right place for a `groupby` in numpy is as a method of ufuncs, so that `np.add.groupby(arr, groups)` would do a multidimensional version of `np.bincount(groups, weights=arr)`. You would then need a more powerful version of `np.unique` to produce the `groups`, but that is something that Joe Kington's old PR was very close to achieving, that should probably be resurrected as well. But yes, there seems to be material for a NEP here, and some guidance from one of the numpy devs would be helpful in getting this somewhere. Jaime On Wed, Aug 27, 2014 at 10:35 AM, Eelco Hoogendoorn hoogendoorn.ee...@gmail.com wrote: It wouldn't hurt to have this function, but my intuition is that its use will be minimal. If you are already working with sorted arrays, you already have a flop cost on that order of magnitude, and the optimized merge saves you a factor two at the very most. Using numpy means you are sacrificing factors of two and beyond relative to pure C left right and center anyway, so if this kind of thing matters to you, you probably wont be working in numpy in the first place. That said, I share your interest in overhauling arraysetops. I see many opportunities for expanding its functionality. There is a question that amounts to 'how do I do group-by in numpy' on stackoverflow almost every week. That would have my top priority, but also things like extending np.unique to things like graph edges, or other more complex input, is very often useful to me. Ive written up a draft http://pastebin.com/c5WLWPbpa while ago which accomplishes all of the above and more. It reimplements functions like np.unique around a common Index object. This index object encapsulates the precomputation (sorting) required for efficient set-ops on different datatypes, and provides a common interface to obtain the kind of information you are talking about (which is used extensively internally in the implementation of group_by, for instance). ie, this functionality allows you to write neat things like group_by(randint(0,9,(100,2))).median(rand(100)) But I have the feeling much more could be done in this direction, and I feel this draft could really use a bit of back and forth. If we are going to completely rewrite arraysetops, we might as well do it right. On Wed, Aug 27, 2014 at 7:02 PM, Jaime Fernández del Río jaime.f...@gmail.com wrote: A request was open in github to add a `merge` function to numpy that would merge two sorted 1d arrays into a single sorted 1d array. I have been playing around with that idea for a while, and have a branch in my numpy fork that adds a `mergesorted` function to `numpy.lib`: https://github.com/jaimefrio/numpy/commit/ce5d480afecc989a36e5d2bf4ea1d1ba58a83b0a I drew inspiration from C++ STL algorithms, and merged into a single function what merge, set_union, set_intersection, set_difference and set_symmetric_difference do there. My first thought when implementing this was to not make it a public function, but use it under the hood to speed-up some of the functions of `arraysetops.py`, which are now merging two already sorted functions by doing `np.sort(np.concatenate((a, b)))`. I would need to revisit my testing, but the speed-ups weren't that great. One other thing I saw value in for some of the `arraysetops.py` functions, but couldn't fully figure out, was in providing extra output aside from the merged arrays, either in the form of indices, or of boolean masks, indicating which items of the original arrays made it into the merged one, and/or where did they end up in it. Since there is at least one other person out there that likes it, is there any more interest in such a function? If yes, any comments on what the proper interface for extra output should be? Although perhaps the best is to leave that out for starters and see what use people make of it, if any. Jaime -- (\__/) ( O.o) ( ) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus planes de dominación mundial. ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion -- (\__/) ( O.o) ( ) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus planes de dominación mundial. ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Does a `mergesorted` function make sense?
If I understand you correctly, the current implementation supports these operations. All reductions over groups (except for median) are performed through the corresponding ufunc (see GroupBy.reduce). This works on multidimensional arrays as well, although this broadcasting over the non-grouping axes is accomplished using np.vectorize. Actual vectorization only happens over the axis being grouped over, but this is usually a long axis. If it isn't, it is more efficient to perform a reduction by means of splitting the array by its groups first, and then map the iterable of groups over some reduction operation (as noted in the docstring of GroupBy.reduce). On Wed, Aug 27, 2014 at 8:29 PM, Jaime Fernández del Río jaime.f...@gmail.com wrote: Hi Eelco, I took a deeper look into your code a couple of weeks back. I don't think I have fully grasped what it allows completely, but I agree that some form of what you have there is highly desirable. Along the same lines, for sometime I have been thinking that the right place for a `groupby` in numpy is as a method of ufuncs, so that `np.add.groupby(arr, groups)` would do a multidimensional version of `np.bincount(groups, weights=arr)`. You would then need a more powerful version of `np.unique` to produce the `groups`, but that is something that Joe Kington's old PR was very close to achieving, that should probably be resurrected as well. But yes, there seems to be material for a NEP here, and some guidance from one of the numpy devs would be helpful in getting this somewhere. Jaime On Wed, Aug 27, 2014 at 10:35 AM, Eelco Hoogendoorn hoogendoorn.ee...@gmail.com wrote: It wouldn't hurt to have this function, but my intuition is that its use will be minimal. If you are already working with sorted arrays, you already have a flop cost on that order of magnitude, and the optimized merge saves you a factor two at the very most. Using numpy means you are sacrificing factors of two and beyond relative to pure C left right and center anyway, so if this kind of thing matters to you, you probably wont be working in numpy in the first place. That said, I share your interest in overhauling arraysetops. I see many opportunities for expanding its functionality. There is a question that amounts to 'how do I do group-by in numpy' on stackoverflow almost every week. That would have my top priority, but also things like extending np.unique to things like graph edges, or other more complex input, is very often useful to me. Ive written up a draft http://pastebin.com/c5WLWPbpa while ago which accomplishes all of the above and more. It reimplements functions like np.unique around a common Index object. This index object encapsulates the precomputation (sorting) required for efficient set-ops on different datatypes, and provides a common interface to obtain the kind of information you are talking about (which is used extensively internally in the implementation of group_by, for instance). ie, this functionality allows you to write neat things like group_by(randint(0,9,(100,2))).median(rand(100)) But I have the feeling much more could be done in this direction, and I feel this draft could really use a bit of back and forth. If we are going to completely rewrite arraysetops, we might as well do it right. On Wed, Aug 27, 2014 at 7:02 PM, Jaime Fernández del Río jaime.f...@gmail.com wrote: A request was open in github to add a `merge` function to numpy that would merge two sorted 1d arrays into a single sorted 1d array. I have been playing around with that idea for a while, and have a branch in my numpy fork that adds a `mergesorted` function to `numpy.lib`: https://github.com/jaimefrio/numpy/commit/ce5d480afecc989a36e5d2bf4ea1d1ba58a83b0a I drew inspiration from C++ STL algorithms, and merged into a single function what merge, set_union, set_intersection, set_difference and set_symmetric_difference do there. My first thought when implementing this was to not make it a public function, but use it under the hood to speed-up some of the functions of `arraysetops.py`, which are now merging two already sorted functions by doing `np.sort(np.concatenate((a, b)))`. I would need to revisit my testing, but the speed-ups weren't that great. One other thing I saw value in for some of the `arraysetops.py` functions, but couldn't fully figure out, was in providing extra output aside from the merged arrays, either in the form of indices, or of boolean masks, indicating which items of the original arrays made it into the merged one, and/or where did they end up in it. Since there is at least one other person out there that likes it, is there any more interest in such a function? If yes, any comments on what the proper interface for extra output should be? Although perhaps the best is to leave that out for starters and see what use people make of it, if any. Jaime -- (\__/) ( O.o) ( ) Este es Conejo. Copia a Conejo en
Re: [Numpy-discussion] Does a `mergesorted` function make sense?
i.e, if the grouped axis is small but the other axes are not, you could write this, which avoids the python loop over the long axis that np.vectorize would otherwise perform. import numpy as np from grouping import group_by keys = np.random.randint(0,4,10) values = np.random.rand(10,2000) for k,g in zip(*group_by(keys)(values)): print k, g.mean(0) On Wed, Aug 27, 2014 at 9:29 PM, Eelco Hoogendoorn hoogendoorn.ee...@gmail.com wrote: f.i., this works as expected as well (100 keys of 1d int arrays and 100 values of 1d float arrays): group_by(randint(0,4,(100,2))).mean(rand(100,2)) On Wed, Aug 27, 2014 at 9:27 PM, Eelco Hoogendoorn hoogendoorn.ee...@gmail.com wrote: If I understand you correctly, the current implementation supports these operations. All reductions over groups (except for median) are performed through the corresponding ufunc (see GroupBy.reduce). This works on multidimensional arrays as well, although this broadcasting over the non-grouping axes is accomplished using np.vectorize. Actual vectorization only happens over the axis being grouped over, but this is usually a long axis. If it isn't, it is more efficient to perform a reduction by means of splitting the array by its groups first, and then map the iterable of groups over some reduction operation (as noted in the docstring of GroupBy.reduce). On Wed, Aug 27, 2014 at 8:29 PM, Jaime Fernández del Río jaime.f...@gmail.com wrote: Hi Eelco, I took a deeper look into your code a couple of weeks back. I don't think I have fully grasped what it allows completely, but I agree that some form of what you have there is highly desirable. Along the same lines, for sometime I have been thinking that the right place for a `groupby` in numpy is as a method of ufuncs, so that `np.add.groupby(arr, groups)` would do a multidimensional version of `np.bincount(groups, weights=arr)`. You would then need a more powerful version of `np.unique` to produce the `groups`, but that is something that Joe Kington's old PR was very close to achieving, that should probably be resurrected as well. But yes, there seems to be material for a NEP here, and some guidance from one of the numpy devs would be helpful in getting this somewhere. Jaime On Wed, Aug 27, 2014 at 10:35 AM, Eelco Hoogendoorn hoogendoorn.ee...@gmail.com wrote: It wouldn't hurt to have this function, but my intuition is that its use will be minimal. If you are already working with sorted arrays, you already have a flop cost on that order of magnitude, and the optimized merge saves you a factor two at the very most. Using numpy means you are sacrificing factors of two and beyond relative to pure C left right and center anyway, so if this kind of thing matters to you, you probably wont be working in numpy in the first place. That said, I share your interest in overhauling arraysetops. I see many opportunities for expanding its functionality. There is a question that amounts to 'how do I do group-by in numpy' on stackoverflow almost every week. That would have my top priority, but also things like extending np.unique to things like graph edges, or other more complex input, is very often useful to me. Ive written up a draft http://pastebin.com/c5WLWPbpa while ago which accomplishes all of the above and more. It reimplements functions like np.unique around a common Index object. This index object encapsulates the precomputation (sorting) required for efficient set-ops on different datatypes, and provides a common interface to obtain the kind of information you are talking about (which is used extensively internally in the implementation of group_by, for instance). ie, this functionality allows you to write neat things like group_by(randint(0,9,(100,2))).median(rand(100)) But I have the feeling much more could be done in this direction, and I feel this draft could really use a bit of back and forth. If we are going to completely rewrite arraysetops, we might as well do it right. On Wed, Aug 27, 2014 at 7:02 PM, Jaime Fernández del Río jaime.f...@gmail.com wrote: A request was open in github to add a `merge` function to numpy that would merge two sorted 1d arrays into a single sorted 1d array. I have been playing around with that idea for a while, and have a branch in my numpy fork that adds a `mergesorted` function to `numpy.lib`: https://github.com/jaimefrio/numpy/commit/ce5d480afecc989a36e5d2bf4ea1d1ba58a83b0a I drew inspiration from C++ STL algorithms, and merged into a single function what merge, set_union, set_intersection, set_difference and set_symmetric_difference do there. My first thought when implementing this was to not make it a public function, but use it under the hood to speed-up some of the functions of `arraysetops.py`, which are now merging two already sorted functions by doing `np.sort(np.concatenate((a, b)))`. I would need to revisit my testing, but the speed-ups weren't that
Re: [Numpy-discussion] Does a `mergesorted` function make sense?
Yes, I was aware of that. But the point would be to provide true vectorization on those operations. The way I see it, numpy may not have to have a GroupBy implementation, but it should at least enable implementing one that is fast and efficient over any axis. On Wed, Aug 27, 2014 at 12:38 PM, Eelco Hoogendoorn hoogendoorn.ee...@gmail.com wrote: i.e, if the grouped axis is small but the other axes are not, you could write this, which avoids the python loop over the long axis that np.vectorize would otherwise perform. import numpy as np from grouping import group_by keys = np.random.randint(0,4,10) values = np.random.rand(10,2000) for k,g in zip(*group_by(keys)(values)): print k, g.mean(0) On Wed, Aug 27, 2014 at 9:29 PM, Eelco Hoogendoorn hoogendoorn.ee...@gmail.com wrote: f.i., this works as expected as well (100 keys of 1d int arrays and 100 values of 1d float arrays): group_by(randint(0,4,(100,2))).mean(rand(100,2)) On Wed, Aug 27, 2014 at 9:27 PM, Eelco Hoogendoorn hoogendoorn.ee...@gmail.com wrote: If I understand you correctly, the current implementation supports these operations. All reductions over groups (except for median) are performed through the corresponding ufunc (see GroupBy.reduce). This works on multidimensional arrays as well, although this broadcasting over the non-grouping axes is accomplished using np.vectorize. Actual vectorization only happens over the axis being grouped over, but this is usually a long axis. If it isn't, it is more efficient to perform a reduction by means of splitting the array by its groups first, and then map the iterable of groups over some reduction operation (as noted in the docstring of GroupBy.reduce). On Wed, Aug 27, 2014 at 8:29 PM, Jaime Fernández del Río jaime.f...@gmail.com wrote: Hi Eelco, I took a deeper look into your code a couple of weeks back. I don't think I have fully grasped what it allows completely, but I agree that some form of what you have there is highly desirable. Along the same lines, for sometime I have been thinking that the right place for a `groupby` in numpy is as a method of ufuncs, so that `np.add.groupby(arr, groups)` would do a multidimensional version of `np.bincount(groups, weights=arr)`. You would then need a more powerful version of `np.unique` to produce the `groups`, but that is something that Joe Kington's old PR was very close to achieving, that should probably be resurrected as well. But yes, there seems to be material for a NEP here, and some guidance from one of the numpy devs would be helpful in getting this somewhere. Jaime On Wed, Aug 27, 2014 at 10:35 AM, Eelco Hoogendoorn hoogendoorn.ee...@gmail.com wrote: It wouldn't hurt to have this function, but my intuition is that its use will be minimal. If you are already working with sorted arrays, you already have a flop cost on that order of magnitude, and the optimized merge saves you a factor two at the very most. Using numpy means you are sacrificing factors of two and beyond relative to pure C left right and center anyway, so if this kind of thing matters to you, you probably wont be working in numpy in the first place. That said, I share your interest in overhauling arraysetops. I see many opportunities for expanding its functionality. There is a question that amounts to 'how do I do group-by in numpy' on stackoverflow almost every week. That would have my top priority, but also things like extending np.unique to things like graph edges, or other more complex input, is very often useful to me. Ive written up a draft http://pastebin.com/c5WLWPbpa while ago which accomplishes all of the above and more. It reimplements functions like np.unique around a common Index object. This index object encapsulates the precomputation (sorting) required for efficient set-ops on different datatypes, and provides a common interface to obtain the kind of information you are talking about (which is used extensively internally in the implementation of group_by, for instance). ie, this functionality allows you to write neat things like group_by(randint(0,9,(100,2))).median(rand(100)) But I have the feeling much more could be done in this direction, and I feel this draft could really use a bit of back and forth. If we are going to completely rewrite arraysetops, we might as well do it right. On Wed, Aug 27, 2014 at 7:02 PM, Jaime Fernández del Río jaime.f...@gmail.com wrote: A request was open in github to add a `merge` function to numpy that would merge two sorted 1d arrays into a single sorted 1d array. I have been playing around with that idea for a while, and have a branch in my numpy fork that adds a `mergesorted` function to `numpy.lib`: https://github.com/jaimefrio/numpy/commit/ce5d480afecc989a36e5d2bf4ea1d1ba58a83b0a I drew inspiration from C++ STL algorithms, and merged into a single function what merge, set_union, set_intersection, set_difference and
Re: [Numpy-discussion] Does a `mergesorted` function make sense?
On 27 August 2014 19:02, Jaime Fernández del Río jaime.f...@gmail.com wrote: Since there is at least one other person out there that likes it, is there any more interest in such a function? If yes, any comments on what the proper interface for extra output should be? Although perhaps the best is to leave that out for starters and see what use people make of it, if any. I think a perhaps more useful thing would be to implement timsort. I understand it is capable of take full advantage of the partially sorted arrays, with the extra safety of not making the assumption that the individual arrays are sorted. This will also be useful for other real world cases where the data is already partially sorted. ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Does a `mergesorted` function make sense?
I just checked the docs on ufuncs, and it appears that's a solved problem now, since ufunc.reduceat now comes with an axis argument. Or maybe it already did when I wrote that, but I simply wasn't paying attention. Either way, the code is fully vectorized now, in both grouped and non-grouped axes. Its a lot of code, but all that happens for a grouping other than some O(1) and O(n) stuff is an argsort of the keys, and then the reduction itself, all fully vectorized. Note that I sort the values first, and then use ufunc.reduceat on the groups. It would seem to me that ufunc.at should be more efficient, by avoiding this indirection, but testing very much revealed the opposite, for reasons unclear to me. Perhaps that's changed now as well. On Wed, Aug 27, 2014 at 11:32 PM, Jaime Fernández del Río jaime.f...@gmail.com wrote: Yes, I was aware of that. But the point would be to provide true vectorization on those operations. The way I see it, numpy may not have to have a GroupBy implementation, but it should at least enable implementing one that is fast and efficient over any axis. On Wed, Aug 27, 2014 at 12:38 PM, Eelco Hoogendoorn hoogendoorn.ee...@gmail.com wrote: i.e, if the grouped axis is small but the other axes are not, you could write this, which avoids the python loop over the long axis that np.vectorize would otherwise perform. import numpy as np from grouping import group_by keys = np.random.randint(0,4,10) values = np.random.rand(10,2000) for k,g in zip(*group_by(keys)(values)): print k, g.mean(0) On Wed, Aug 27, 2014 at 9:29 PM, Eelco Hoogendoorn hoogendoorn.ee...@gmail.com wrote: f.i., this works as expected as well (100 keys of 1d int arrays and 100 values of 1d float arrays): group_by(randint(0,4,(100,2))).mean(rand(100,2)) On Wed, Aug 27, 2014 at 9:27 PM, Eelco Hoogendoorn hoogendoorn.ee...@gmail.com wrote: If I understand you correctly, the current implementation supports these operations. All reductions over groups (except for median) are performed through the corresponding ufunc (see GroupBy.reduce). This works on multidimensional arrays as well, although this broadcasting over the non-grouping axes is accomplished using np.vectorize. Actual vectorization only happens over the axis being grouped over, but this is usually a long axis. If it isn't, it is more efficient to perform a reduction by means of splitting the array by its groups first, and then map the iterable of groups over some reduction operation (as noted in the docstring of GroupBy.reduce). On Wed, Aug 27, 2014 at 8:29 PM, Jaime Fernández del Río jaime.f...@gmail.com wrote: Hi Eelco, I took a deeper look into your code a couple of weeks back. I don't think I have fully grasped what it allows completely, but I agree that some form of what you have there is highly desirable. Along the same lines, for sometime I have been thinking that the right place for a `groupby` in numpy is as a method of ufuncs, so that `np.add.groupby(arr, groups)` would do a multidimensional version of `np.bincount(groups, weights=arr)`. You would then need a more powerful version of `np.unique` to produce the `groups`, but that is something that Joe Kington's old PR was very close to achieving, that should probably be resurrected as well. But yes, there seems to be material for a NEP here, and some guidance from one of the numpy devs would be helpful in getting this somewhere. Jaime On Wed, Aug 27, 2014 at 10:35 AM, Eelco Hoogendoorn hoogendoorn.ee...@gmail.com wrote: It wouldn't hurt to have this function, but my intuition is that its use will be minimal. If you are already working with sorted arrays, you already have a flop cost on that order of magnitude, and the optimized merge saves you a factor two at the very most. Using numpy means you are sacrificing factors of two and beyond relative to pure C left right and center anyway, so if this kind of thing matters to you, you probably wont be working in numpy in the first place. That said, I share your interest in overhauling arraysetops. I see many opportunities for expanding its functionality. There is a question that amounts to 'how do I do group-by in numpy' on stackoverflow almost every week. That would have my top priority, but also things like extending np.unique to things like graph edges, or other more complex input, is very often useful to me. Ive written up a draft http://pastebin.com/c5WLWPbpa while ago which accomplishes all of the above and more. It reimplements functions like np.unique around a common Index object. This index object encapsulates the precomputation (sorting) required for efficient set-ops on different datatypes, and provides a common interface to obtain the kind of information you are talking about (which is used extensively internally in the implementation of group_by, for instance). ie, this functionality allows you to write neat things like