Re: [Numpy-discussion] Does a `mergesorted` function make sense?

2014-09-04 Thread Eelco Hoogendoorn
On Wed, Sep 3, 2014 at 6:46 PM, Jaime Fernández del Río 
jaime.f...@gmail.com wrote:

 snip

I would certainly not be opposed to having a hashing-based indexing
mechanism; I think it would make sense design-wise to have a HashIndex
class with the same interface as the rest, and to use that subclass in those
arraysetops where it makes sense. The 'how to' of indexing and its
applications are largely orthogonal, I think (with some tiny performance
compromises which are worth the abstraction, imo). For datasets which are
not purely random, have many unique items, and do not fit into cache,
I would expect sorting to come out on top, but indeed it depends on the
dataset.
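
To make that concrete: a rough, hypothetical sketch of the kind of uniform
interface meant here (class and method names are illustrative, not part of
any actual proposal):

import numpy as np

class SortingIndex:
    """Index over a set of keys, built by a stable argsort (sketch)."""
    def __init__(self, keys):
        self.sorter = np.argsort(keys, kind='mergesort')
        self.sorted = keys[self.sorter]
        # flag the start of each run of equal keys
        self.flag = np.concatenate(([True],
                                    self.sorted[1:] != self.sorted[:-1]))

    def unique(self):
        return self.sorted[self.flag]

class HashIndex:
    """Same interface, built with a hash table; no sorted order implied."""
    def __init__(self, keys):
        self.first = {}                        # key -> first index seen
        for i, k in enumerate(keys.tolist()):
            self.first.setdefault(k, i)

    def unique(self):
        return np.array(list(self.first))

def unique(keys, index=SortingIndex):          # arraysetops pick an index class
    return index(keys).unique()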

Yeah, the question of how pandas does grouping, and whether we can do better,
is a relevant one.

From what 

Re: [Numpy-discussion] Does a `mergesorted` function make sense?

2014-09-04 Thread Eelco Hoogendoorn
On Thu, Sep 4, 2014 at 10:31 AM, Eelco Hoogendoorn 
hoogendoorn.ee...@gmail.com wrote:


 snip

Re: [Numpy-discussion] Does a `mergesorted` function make sense?

2014-09-04 Thread Jaime Fernández del Río
On Thu, Sep 4, 2014 at 10:39 AM, Eelco Hoogendoorn 
hoogendoorn.ee...@gmail.com wrote:


 snip

Re: [Numpy-discussion] Does a `mergesorted` function make sense?

2014-09-04 Thread Eelco Hoogendoorn
I should clarify: I am speaking about my implementation; I haven't looked at
the numpy implementation for a while, so I'm not sure what it is up to. Note
that by 'almost free', we are still talking about three passes over the
whole array plus temp allocations, but I am assuming a use case where the
various sorts involved are the dominant cost, which I imagine they are for
out-of-cache sorts. Perhaps this isn't too realistic an assumption about
the average use case though, I don't know. Though I suppose it's a
reasonable guideline to assume that either the dataset is big, or
performance isn't that big a concern in the first place.
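
To spell out where those passes come from, here is a rough sketch of a
sort-based unique with inverse and counts (illustrative only, not the
implementation under discussion):

import numpy as np

def unique_inverse_counts(keys):
    sorter = np.argsort(keys, kind='mergesort')   # the sort itself
    sorted_keys = keys[sorter]                    # pass 1: gather into sorted order
    flag = np.concatenate(([True], sorted_keys[1:] != sorted_keys[:-1]))
    group = np.cumsum(flag) - 1                   # pass 2: scan for group ids
    inverse = np.empty_like(group)
    inverse[sorter] = group                       # pass 3: scatter back to input order
    counts = np.diff(np.flatnonzero(np.concatenate((flag, [True]))))
    return sorted_keys[flag], inverse, counts

keys = np.array([3, 1, 3, 2, 1, 3])
u, inv, cnt = unique_inverse_counts(keys)
assert np.array_equal(u[inv], keys) and cnt.sum() == len(keys)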


On Thu, Sep 4, 2014 at 7:55 PM, Jaime Fernández del Río 
jaime.f...@gmail.com wrote:

 snip

Re: [Numpy-discussion] Does a `mergesorted` function make sense?

2014-09-04 Thread Eelco Hoogendoorn
On Thu, Sep 4, 2014 at 8:14 PM, Eelco Hoogendoorn 
hoogendoorn.ee...@gmail.com wrote:

 snip

Re: [Numpy-discussion] Does a `mergesorted` function make sense?

2014-09-04 Thread Eelco Hoogendoorn
Naturally, you'd want to avoid redoing the indexing where you can, which is
another good reason to factor out the indexing mechanisms into separate
classes. A factor-two performance difference does not get me too excited;
again, I think it would be the other way around for an out-of-cache dataset
being grouped. But this by itself is of course another argument for
factoring out the indexing behind a uniform interface, so we can play
around with those implementation details later and specialize the indexing
to serve different scenarios. Also, it really helps with code
maintainability; most arraysetops are almost trivial to implement once you
have abstracted away the indexing machinery.
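
As a toy illustration of that last claim, with np.unique standing in for
the pluggable index machinery, a couple of arraysetops reduce to one-liners
once something hands you the unique keys:

import numpy as np

def intersect1d_sketch(a, b, unique=np.unique):   # `unique` = pluggable index
    ua, ub = unique(a), unique(b)
    return ua[np.in1d(ua, ub)]                    # keys present in both sets

def setdiff1d_sketch(a, b, unique=np.unique):
    ua = unique(a)
    return ua[~np.in1d(ua, b)]                    # keys of a absent from b

a, b = np.array([1, 2, 2, 3]), np.array([2, 4])
assert np.array_equal(intersect1d_sketch(a, b), np.intersect1d(a, b))
assert np.array_equal(setdiff1d_sketch(a, b), np.setdiff1d(a, b))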


On Thu, Sep 4, 2014 at 8:36 PM, Jeff Reback jeffreb...@gmail.com wrote:

 FYI pandas DOES use a very performant hash table implementation for unique
 (and value_counts). Sorted state IS maintained by the underlying Index
 implementation.
 https://github.com/pydata/pandas/blob/master/pandas/hashtable.pyx

 In [8]: a = np.random.randint(10, size=(1,))

 In [9]: %timeit np.unique(a)
 1000 loops, best of 3: 284 µs per loop

 In [10]: %timeit Series(a).unique()
 1 loops, best of 3: 161 µs per loop

 In [11]: s = Series(a)

 # without the creation overhead
 In [12]: %timeit s.unique()
 1 loops, best of 3: 75.3 µs per loop



 On Thu, Sep 4, 2014 at 2:29 PM, Eelco Hoogendoorn 
 hoogendoorn.ee...@gmail.com wrote:


 snip

Re: [Numpy-discussion] Does a `mergesorted` function make sense?

2014-09-03 Thread cjw

On 02/09/2014 8:40 PM, Charles R Harris wrote:

 snip

These are good issues that need to be discussed and resolved. Python has
the benefit of having a BDFL. Numpy has no similar arrangement.
In the post-numarray period, Travis Oliphant took that role and advanced
the package in many ways.

It seems that Travis is no longer available. There is a need for someone,
or maybe some group of three, who would be guided by the sort of thinking
Charles Harris sets out above and who would make the final decision.


Colin W.


Re: [Numpy-discussion] Does a `mergesorted` function make sense?

2014-09-03 Thread Eelco Hoogendoorn
On Wed, Sep 3, 2014 at 4:07 AM, Jaime Fernández del Río 
jaime.f...@gmail.com wrote:

 snip
 I have also been thinking lately that one of the problems with all these
 unique-related computations may be a case of having a hammer and seeing
 everything as nails. Perhaps the algorithm that needs to be ported from
 Python is not the sorting one, but the hash table...

 Jaime


Not sure about the hashing. Indeed one can also build an index of a set by
means of a hash table, but it's questionable whether this leads to improved
performance over performing an argsort. Hashing may have better asymptotic
time complexity in theory, but many datasets used in practice are very easy
to sort (O(N)-ish), and the time constant of hashing is higher. More
importantly, using a hash table guarantees poor cache behavior for many
operations using this index. By contrast, sorting may (but need not) make
one random-access pass to build the index, and may (but need not) perform
one random access to reorder values for grouping. But insofar as the keys
are better behaved than purely random, this coherence will be exploited.

Also, getting the unique values/keys in sorted order is a nice side-benefit
for many applications.
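
For readers following along, a minimal sketch of the two strategies being
contrasted (pure Python/numpy for clarity; a real implementation would live
in C):

import numpy as np

def unique_via_sort(keys):
    """Sort-based unique: one argsort plus a linear scan; output is sorted."""
    sorted_keys = keys[np.argsort(keys, kind='mergesort')]
    flag = np.concatenate(([True], sorted_keys[1:] != sorted_keys[:-1]))
    return sorted_keys[flag]

def unique_via_hash(keys):
    """Hash-based unique: one pass through a hash table; output is in
    first-occurrence order, so sorted output costs an extra sort."""
    seen = {}
    for k in keys.tolist():
        seen.setdefault(k, None)
    return np.fromiter(seen, dtype=keys.dtype, count=len(seen))

a = np.random.randint(100, size=1000)
assert np.array_equal(unique_via_sort(a), np.sort(unique_via_hash(a)))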


Re: [Numpy-discussion] Does a `mergesorted` function make sense?

2014-09-03 Thread Jaime Fernández del Río
On Wed, Sep 3, 2014 at 6:41 AM, Eelco Hoogendoorn 
hoogendoorn.ee...@gmail.com wrote:

 snip


If you want to give it a try, this branch of my numpy fork has hash-table-based
implementations of unique (with no extra indices) and in1d:

https://github.com/jaimefrio/numpy/tree/hash-unique

A use case where the hash table is clearly better:

In [1]: import numpy as np
In [2]: from numpy.lib._compiled_base import _unique, _in1d

In [3]: a = np.random.randint(10, size=(1,))
In [4]: %timeit np.unique(a)
1000 loops, best of 3: 258 us per loop
In [5]: %timeit _unique(a)
1 loops, best of 3: 143 us per loop
In [6]: %timeit np.sort(_unique(a))
1 loops, best of 3: 149 us per loop

It typically performs between 1.5x and 4x faster than sorting. I haven't
profiled it properly to know, but there may be quite a bit of performance
still to dig out: type-specific comparison functions, optimizing the starting
hash table size based on the size of the array to avoid reinsertions...

If getting the elements sorted is a necessity, and the array contains very
few or no repeated items, then the hash table approach may even perform
worse:

In [8]: a = np.random.randint(1, size=(5000,))
In [9]: %timeit np.unique(a)
1000 loops, best of 3: 277 us per loop
In [10]: %timeit np.sort(_unique(a))
1000 loops, best of 3: 320 us per loop

But the hash table still wins in extracting the unique items only:

In [11]: %timeit _unique(a)
1 loops, best of 3: 187 us per loop

Where the hash table shines is in more elaborate situations. If you keep,
in the hash table, the first index where each item was found and the number
of repeats, you can get return_index and return_counts almost for free,
which means you are performing an extra 3x faster than with sorting.
return_inverse requires a little more trickery, so I won't attempt to
quantify the improvement. But I wouldn't be surprised if, after fine-tuning
it, there is close to an order of magnitude overall improvement.
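
A rough pure-Python sketch of the bookkeeping described above (the linked
branch does this in C; names here are only illustrative):

import numpy as np

def unique_index_counts(keys):
    first = {}                               # key -> index of first occurrence
    counts = {}                              # key -> number of repeats
    for i, k in enumerate(keys.tolist()):    # a single pass over the data
        if k not in first:
            first[k] = i
        counts[k] = counts.get(k, 0) + 1
    uniques = list(first)                    # insertion order = first-seen order
    return (np.array(uniques),
            np.array([first[k] for k in uniques]),
            np.array([counts[k] for k in uniques]))

a = np.array([5, 3, 5, 5, 3, 7])
u, idx, cnt = unique_index_counts(a)
assert np.array_equal(a[idx], u) and cnt.sum() == len(a)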

The speed-up for in1d is also nice:

In [16]: a = np.random.randint(1000, size=(1000,))
In [17]: b = np.random.randint(1000, size=(500,))
In [18]: %timeit np.in1d(a, b)
1000 loops, best of 3: 178 us per loop
In [19]: %timeit _in1d(a, b)
1 loops, best of 3: 30.1 us per loop
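
The in1d strategy is simple to sketch: build a hash set of one array, then
membership-test the other in a single pass (a pure-Python stand-in for the
C version in the branch):

import numpy as np

def in1d_via_hash(a, b):
    table = set(b.tolist())                        # O(len(b)) build
    return np.fromiter((x in table for x in a.tolist()),
                       dtype=bool, count=len(a))   # one probe per element of a

a = np.random.randint(1000, size=1000)
b = np.random.randint(1000, size=500)
assert np.array_equal(in1d_via_hash(a, b), np.in1d(a, b))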

Of course, there is no point in


Re: [Numpy-discussion] Does a `mergesorted` function make sense?

2014-09-03 Thread Jaime Fernández del Río
On Wed, Sep 3, 2014 at 9:33 AM, Jaime Fernández del Río 
jaime.f...@gmail.com wrote:

 snip


Ooops!!! Hit the send button too quick. Not to extend myself too long: if
we are going to rethink all of this, we should approach it with an open
mind. Still, and this post is not helping with that either, I am afraid
that we are discussing implementation details but are missing a broader
vision of what we want to accomplish and why. That vision of what numpy's
grouping functionality, if any, should be, and how it complements or
conflicts with what pandas is providing, should precede anything else. I
know I haven't, but has anyone looked at how pandas implements grouping?
Their documentation on the subject is well worth a read:

http://pandas.pydata.org/pandas-docs/stable/groupby.html
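
For concreteness, the split-apply-combine pattern documented there looks
like this (a minimal example):

import numpy as np
import pandas as pd

df = pd.DataFrame({'key': ['a', 'b', 'a', 'b', 'a'],
                   'val': np.arange(5.0)})
# one aggregate per unique key: 'a' -> 0+2+4, 'b' -> 1+3
print(df.groupby('key')['val'].sum())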

Does numpy need to replicate this? What/why/how can we add to that?

Jaime

-- 
(\__/)
( O.o)
(  ) This is Bunny. Copy Bunny into your signature and help him with his
plans for world domination.


Re: [Numpy-discussion] Does a `mergesorted` function make sense?

2014-09-03 Thread Jaime Fernández del Río
On Wed, Sep 3, 2014 at 5:26 AM, cjw c...@ncf.ca wrote:

 These are good issues, that need to be discussed and resolved.  Python has
 the benefit of having a BDFL.  Numpy has no similar arrangement.
 In the post-numarray period, Travis Oliphant took that role and advanced
 the package in many ways.

 It seems that Travis is no longer available.  There is a need for someone,
 or maybe some group of three, who would be guided by the sort of thinking
 Charles Harris sets out above, who would make the final decision.


We should crown Charles philosopher king, and let him wisely rule over us
with the aid of his aristocracy of devs with merge rights. We would make
Plato proud! :-)

Jaime



Re: [Numpy-discussion] Does a `mergesorted` function make sense?

2014-09-03 Thread Charles R Harris
On Wed, Sep 3, 2014 at 10:53 AM, Jaime Fernández del Río 
jaime.f...@gmail.com wrote:

 snip

 We should crown Charles philosopher king, and let him wisely rule over us
 with the aid of his aristocracy of devs with merge rights. We would make
 Plato proud! :-)


Charles I needs an heir in case of execution by the Parliamentary forces.

Chuck


Re: [Numpy-discussion] Does a `mergesorted` function make sense?

2014-09-02 Thread Charles R Harris
On Mon, Sep 1, 2014 at 7:58 AM, Eelco Hoogendoorn 
hoogendoorn.ee...@gmail.com wrote:

 snip

 I understand that numpy is a community project, so that the decision isn't
 up to any one particular person; but some early-stage feedback from those
 active in the community would be welcome. I am generally confident that
 this addition makes sense, but I have not contributed to numpy before,
 and you don't know what you don't know, and all that... given that there
 are multiple suggestions for changing arraysetops, some coordination would
 be useful, I think.

 Note that I use graph edges merely as an example; the proposed
 functionality is much more general than graphing algorithms specifically.
 The radial reduction example I included on github
 (https://github.com/EelcoHoogendoorn/Numpy_arraysetops_EP/blob/master/examples.py)
 is particularly illustrative of the general utility of grouping
 functionality, I think. Operations like radial reductions are rather
 common, and a custom implementation is quite lengthy, very bug prone, and
 potentially very slow.
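
The radial reduction pattern referred to above can be sketched with
grouping via np.bincount (an illustration of the pattern only, not the code
from the linked example):

import numpy as np

# mean intensity of an image over concentric rings about its center
image = np.random.rand(64, 64)
y, x = np.indices(image.shape)
r = np.hypot(x - 32, y - 32).astype(int)    # integer radial bin per pixel

sums = np.bincount(r.ravel(), weights=image.ravel())
counts = np.bincount(r.ravel())             # pixels per ring
radial_mean = sums / counts                 # one mean per ring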

 Thanks for the heads up on posting convention; I've always let gmail do my
 thinking for me, which works well enough for me, but I can see how not
 following this convention is annoying to others.


What do you think about the suggestion of timsort? One would need to
concatenate the arrays before sorting, but it should be fairly efficient.

Chuck


Re: [Numpy-discussion] Does a `mergesorted` function make sense?

2014-09-02 Thread Jaime Fernández del Río
On Tue, Sep 2, 2014 at 5:40 PM, Charles R Harris charlesr.har...@gmail.com
wrote:


 What do you think about the suggestion of timsort? One would need to
 concatenate the arrays before sorting, but it should be fairly efficient.


Timsort is very cool, and it would definitely be fun to implement in numpy.
It is also a lot more work than merging two sorted arrays! I think +1 if
someone else does it, but although I would love to be part of it, I am not
sure I will be able to find time to look into it seriously in the next
couple of months.

From a setops point of view, merging two sorted arrays makes it very
straightforward to compute, together with (or instead of) the result of the
merge, index arrays that let you calculate things like `in1d` faster.
Although perhaps an `argtimsort` could provide the same functionality, not
sure. I will probably wrap up what I have, put a lace on it, and submit it
as a PR. Even if it is not destined to be merged, it may serve as a warning
to others.
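
One way such index arrays help: with `b` already sorted, membership can be
tested by binary search (a sketch with searchsorted standing in for the
single merge pass being described, not the code in the PR):

import numpy as np

def in1d_sorted(a, b_sorted):
    """Membership of `a` in an already-sorted `b`, with no hashing."""
    idx = np.searchsorted(b_sorted, a)   # where each a[i] would land in b
    idx[idx == len(b_sorted)] = 0        # guard out-of-range insertion points
    return b_sorted[idx] == a

a = np.random.randint(1000, size=1000)
b = np.sort(np.random.randint(1000, size=500))
assert np.array_equal(in1d_sorted(a, b), np.in1d(a, b))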

I have also been thinking lately that one of the problems with all these
unique-related computations may be a case of having a hammer and seeing
everything as nails. Perhaps the algorithm that needs to be ported from
Python is not the sorting one, but the hash table...

Jaime



Re: [Numpy-discussion] Does a `mergesorted` function make sense?

2014-09-01 Thread Eelco Hoogendoorn
Sure, I'd like to hash things out, but I would also like some
preliminary feedback as to whether this is going in a direction anyone else
sees the point of, whether it conflicts with other plans, and indeed whether
we can agree that numpy is the right place for it; a point which I would
very much like to defend. If there is some obvious no-go that I'm missing,
I can do without the drudgery of writing proper documentation ;).

As for whether this belongs in numpy: yes, I would say so. There is the
extension of functionality to functions already in numpy, which is a
no-brainer (it need not cost anything performance-wise, and I've needed
unique graph edges many, many times), and there is the grouping
functionality, which is the main novelty.

However, note that the grouping functionality itself is a very small
addition, just a few hundred lines of pure Python, given that the indexing
logic has been factored out of the classic arraysetops. At least from a
developer's perspective, it very much feels like a logical extension of the
same 'thing'.

But also from a conceptual numpy perspective, grouping is really more an
'elementary manipulation of an ndarray' than a 'special-purpose algorithm'.
It is useful for literally all kinds of programming; hence there is similar
functionality in the Python standard library (itertools.groupby), so why
not have an efficient vectorized equivalent in numpy? It belongs there more
than the linalg module does, arguably.
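
To make the parallel concrete, a group-by-sum with itertools.groupby next
to a vectorized numpy equivalent (a sketch, not proposed API):

import itertools
import numpy as np

keys = np.array([2, 0, 1, 0, 2, 2])
vals = np.array([1., 2., 3., 4., 5., 6.])

# pure-Python baseline: groupby requires sorted input
order = np.argsort(keys, kind='mergesort')
expected = [sum(v for _, v in group)
            for _, group in itertools.groupby(zip(keys[order], vals[order]),
                                              key=lambda kv: kv[0])]

# vectorized equivalent: unique + bincount do the grouping
unique, inverse = np.unique(keys, return_inverse=True)
sums = np.bincount(inverse, weights=vals)
assert np.allclose(sums, expected)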

Also, from a community perspective, a significant fraction of all
stackoverflow numpy questions are (unknowingly) exactly about 'how to do
grouping in numpy'.


On Mon, Sep 1, 2014 at 4:36 AM, Charles R Harris charlesr.har...@gmail.com
wrote:




 On Sun, Aug 31, 2014 at 1:48 PM, Eelco Hoogendoorn 
 hoogendoorn.ee...@gmail.com wrote:

 I've organized all the code I had relating to this subject in a github
 repository: https://github.com/EelcoHoogendoorn/Numpy_arraysetops_EP.
 That should facilitate shooting around ideas. I've also added more
 documentation and structure to make it easier to see what is going on.

 Hopefully we can converge on a common vision, and then improve the
 documentation and testing to make it worthy of including in the numpy
 master.

 Note that there is also a complete rewrite of the classic
 numpy.arraysetops, such that they are also generalized to more complex
 input, such as finding unique graph edges, and so on.

 You mentioned getting the numpy core developers involved; are they not
 subscribed to this mailing list? I wouldn't be surprised; you'd hope there
 is a channel of discussion concerning development with a higher
 signal-to-noise ratio.


 There are only about 2.5 of us at the moment. Those for whom this is an
 itch that needs scratching should hash things out and make a PR. The main
 question for me is if it belongs in numpy, scipy, or somewhere else.

 Chuck





Re: [Numpy-discussion] Does a `mergesorted` function make sense?

2014-09-01 Thread Charles R Harris
On Mon, Sep 1, 2014 at 1:49 AM, Eelco Hoogendoorn 
hoogendoorn.ee...@gmail.com wrote:

 snip


What I'm trying to say is that numpy is a community project. We don't have
a central planning committee; the only difference between developers and
everyone else is activity and commit rights. Which is to say, if you develop
and push this topic it is likely to go in. There certainly seems to be
interest in this functionality. The reason that I brought up scipy is that
there are some graph algorithms there that went in a couple of years ago.

Note that the convention on the list is bottom posting.

snip

Chuck


Re: [Numpy-discussion] Does a `mergesorted` function make sense?

2014-09-01 Thread Nathaniel Smith
On Mon, Sep 1, 2014 at 8:49 AM, Eelco Hoogendoorn
hoogendoorn.ee...@gmail.com wrote:
 snip

My 2 cents: I definitely agree that this is very useful fundamental
functionality, and it would be great if numpy had a solution for it
out of the box. My main concern is that this is a fairly complicated
set of functionality and there are a lot of small decisions to be made
in setting up the API for it. IME it's very hard to just read through
an API like this and reason out the best way to do it by pure logic;
usually it needs to get banged on for a bit in real uses before it
becomes clear what the right set of trade-offs is. And numpy itself is
not a great environment for these kinds of iterations. So, IMO the main
challenge is: how do we get the functionality into a state where we
can convince ourselves that it'll be supportable in numpy
indefinitely, and not need to be replaced in a year or two?

Some things that might help with this convincing:
- releasing it as a small standalone package on pypi and getting some
real users to bang on it
- any real code written against the APIs
- feedback from the pandas community since they've spent a lot of time
working on these issues
- ...?

-n

-- 
Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org


Re: [Numpy-discussion] Does a `mergesorted` function make sense?

2014-09-01 Thread Eelco Hoogendoorn
On Mon, Sep 1, 2014 at 2:05 PM, Charles R Harris charlesr.har...@gmail.com
wrote:




 On Mon, Sep 1, 2014 at 1:49 AM, Eelco Hoogendoorn 
 hoogendoorn.ee...@gmail.com wrote:

 Sure, id like to do the hashing things out, but I would also like some
 preliminary feedback as to whether this is going in a direction anyone else
 sees the point of, if it conflicts with other plans, and indeed if we can
 agree that numpy is the right place for it; a point which I would very much
 like to defend. If there is some obvious no-go that im missing, I can do
 without the drudgery of writing proper documentation ;).

 As for whether this belongs in numpy: yes, I would say so. There are the
 extension of functionality to functions already in numpy, which are a
 no-brainer (it need not cost anything performance wise, and ive needed
 unique graph edges many many times), and there is the grouping
 functionality, which is the main novelty.

 However, note that the grouping functionality itself is a very small
 addition, just a few 100 lines of pure python, given that the indexing
 logic has been factored out of the classic arraysetops. At least from a
 developers perspective, it very much feels like a logical extension of the
 same 'thing'.

 But also from a conceptual numpy perspective, grouping is really more an
 'elementary manipulation of an ndarray' than a 'special purpose algorithm'.
 It is useful for literally all kinds of programming; hence there is similar
 functionality in the python standard library (itertools.groupby); so why
 not have an efficient vectorized equivalent in numpy? It belongs there more
 than the linalg module, arguably.

 Also, from a community perspective, a significant fraction of all
 stackoverflow numpy questions are (unknowingly) exactly about 'how to do
 grouping in numpy'.


 What I'm trying to say is that numpy is a community project. We don't have
 a central planning committee, the only difference between developers and
 everyone else is activity and commit rights. Which is to say if you develop
 and push this topic it is likely to go in. There certainly seems to be
 interest in this functionality. The reason that I brought up scipy is that
 there are some graph algorithms there that went in a couple of years ago.

 Note that the convention on the list is bottom posting.

 snip

 Chuck




I understand that numpy is a community project, so the decision isn't up
to any one particular person; but some early-stage feedback from those
active in the community would be welcome. I am generally confident that
this addition makes sense, but I have not contributed to numpy before,
and you don't know what you don't know, and all that... Given that there
are multiple suggestions for changing arraysetops, some coordination would
be useful, I think.

Note that I use graph edges merely as an example; the proposed
functionality is much more general than graph algorithms specifically.
The radial reduction example I included on github
(https://github.com/EelcoHoogendoorn/Numpy_arraysetops_EP/blob/master/examples.py)
is particularly illustrative of the general utility of grouping
functionality, I think. Operations like radial reductions are rather
common, and a custom implementation is quite lengthy, very bug prone, and
potentially very slow.
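
For concreteness, a minimal sketch of such a radial reduction written with
plain numpy primitives (the binning scheme and all names here are my own
illustration, not the code from the repository):

import numpy as np

x, y = np.random.rand(2, 1000) - 0.5           # samples around the origin
values = np.random.rand(1000)                  # quantity to reduce radially
r = np.sqrt(x**2 + y**2)
bins = np.digitize(r, np.linspace(0, 0.75, 30))    # radial bin per sample
order = np.argsort(bins)                       # one argsort does the grouping
sorted_bins = bins[order]
starts = np.flatnonzero(
    np.concatenate(([True], sorted_bins[1:] != sorted_bins[:-1])))
counts = np.diff(np.concatenate((starts, [len(sorted_bins)])))
radial_mean = np.add.reduceat(values[order], starts) / counts

The proposed group_by wraps exactly this argsort-plus-reduceat pattern
behind a single call.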

Thanks for the heads up on the posting convention; I've always let gmail do
my thinking for me, which works well enough, but I can see how not
following the convention is annoying to others.


Re: [Numpy-discussion] Does a `mergesorted` function make sense?

2014-08-31 Thread Eelco Hoogendoorn
I've organized all the code I had relating to this subject in a github
repository (https://github.com/EelcoHoogendoorn/Numpy_arraysetops_EP). That
should make it easier to shoot ideas around. I've also added more
documentation and structure to make it easier to see what is going on.

Hopefully we can converge on a common vision, and then improve the
documentation and testing to make it worthy of inclusion in numpy master.

Note that there is also a complete rewrite of the classic
numpy.arraysetops, generalized to more complex input, such as finding
unique graph edges, and so on.
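
To illustrate what generalizing np.unique to graph edges amounts to: with
plain numpy one can already get there by viewing each edge (row) as a
single opaque item, which is the kind of key handling the rewrite
encapsulates. The snippet below is my own illustration, not code from the
repository:

import numpy as np

# find the unique undirected edges among 100 random ones
edges = np.random.randint(0, 10, (100, 2))
edges.sort(axis=1)                 # canonicalize: make each edge (low, high)
# view each row as one opaque void scalar, so np.unique can compare rows
voids = np.ascontiguousarray(edges).view(
    np.dtype((np.void, edges.dtype.itemsize * edges.shape[1])))
_, index = np.unique(voids, return_index=True)
unique_edges = edges[index]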

You mentioned getting the numpy core developers involved; are they not
subscribed to this mailing list? I wouldn't be surprised if they aren't;
you'd hope there is a development discussion channel with a higher
signal-to-noise ratio.


On Thu, Aug 28, 2014 at 1:49 AM, Eelco Hoogendoorn 
hoogendoorn.ee...@gmail.com wrote:

 I just checked the docs on ufuncs, and it appears that's a solved problem
 now, since ufunc.reduceat now comes with an axis argument. Or maybe it
 already did when I wrote that, but I simply wasn't paying attention. Either
 way, the code is fully vectorized now, in both grouped and non-grouped
 axes. Its a lot of code, but all that happens for a grouping other than
 some O(1) and O(n) stuff is an argsort of the keys, and then the reduction
 itself, all fully vectorized.

 Note that I sort the values first, and then use ufunc.reduceat on the
 groups. It would seem to me that ufunc.at should be more efficient, by
 avoiding this indirection, but testing very much revealed the opposite, for
 reasons unclear to me. Perhaps that's changed now as well.


 On Wed, Aug 27, 2014 at 11:32 PM, Jaime Fernández del Río 
 jaime.f...@gmail.com wrote:

 Yes, I was aware of that. But the point would be to provide true
 vectorization on those operations.

 The way I see it, numpy may not have to have a GroupBy implementation,
 but it should at least enable implementing one that is fast and efficient
 over any axis.


 On Wed, Aug 27, 2014 at 12:38 PM, Eelco Hoogendoorn 
 hoogendoorn.ee...@gmail.com wrote:

 i.e, if the grouped axis is small but the other axes are not, you could
 write this, which avoids the python loop over the long axis that
 np.vectorize would otherwise perform.

 import numpy as np
 from grouping import group_by
 keys = np.random.randint(0,4,10)
 values = np.random.rand(10,2000)
 for k,g in zip(*group_by(keys)(values)):
     print k, g.mean(0)




 On Wed, Aug 27, 2014 at 9:29 PM, Eelco Hoogendoorn 
 hoogendoorn.ee...@gmail.com wrote:

 f.i., this works as expected as well (100 keys of 1d int arrays and 100
 values of 1d float arrays):

 group_by(randint(0,4,(100,2))).mean(rand(100,2))


 On Wed, Aug 27, 2014 at 9:27 PM, Eelco Hoogendoorn 
 hoogendoorn.ee...@gmail.com wrote:

 If I understand you correctly, the current implementation supports
 these operations. All reductions over groups (except for median) are
 performed through the corresponding ufunc (see GroupBy.reduce). This works
 on multidimensional arrays as well, although this broadcasting over the
 non-grouping axes is accomplished using np.vectorize. Actual vectorization
 only happens over the axis being grouped over, but this is usually a long
 axis. If it isn't, it is more efficient to perform a reduction by means of
 splitting the array by its groups first, and then map the iterable of
 groups over some reduction operation (as noted in the docstring of
 GroupBy.reduce).


 On Wed, Aug 27, 2014 at 8:29 PM, Jaime Fernández del Río 
 jaime.f...@gmail.com wrote:

 Hi Eelco,

 I took a deeper look into your code a couple of weeks back. I don't
 think I have fully grasped what it allows completely, but I agree that 
 some
 form of what you have there is highly desirable. Along the same lines, 
 for
 sometime I have been thinking that the right place for a `groupby` in 
 numpy
 is as a method of ufuncs, so that `np.add.groupby(arr, groups)` would do 
 a
 multidimensional version of `np.bincount(groups, weights=arr)`. You would
 then need a more powerful version of `np.unique` to produce the `groups`,
 but that is something that Joe Kington's old PR was very close to
 achieving, that should probably be resurrected as well. But yes, there
 seems to be material for a NEP here, and some guidance from one of the
 numpy devs would be helpful in getting this somewhere.

 Jaime


 On Wed, Aug 27, 2014 at 10:35 AM, Eelco Hoogendoorn 
 hoogendoorn.ee...@gmail.com wrote:

 It wouldn't hurt to have this function, but my intuition is that its
 use will be minimal. If you are already working with sorted arrays, you
 already have a flop cost on that order of magnitude, and the optimized
 merge saves you a factor two at the very most. Using numpy means you are
 sacrificing factors of two and beyond relative to pure C left right and
 center anyway, so if this kind of thing matters to you, you probably 
 wont
 be working in numpy in the first place.

 That said, 

[Numpy-discussion] Does a `mergesorted` function make sense?

2014-08-27 Thread Jaime Fernández del Río
A request was opened on github to add a `merge` function to numpy that
would merge two sorted 1d arrays into a single sorted 1d array. I have been
playing around with that idea for a while, and have a branch in my numpy
fork that adds a `mergesorted` function to `numpy.lib`:

https://github.com/jaimefrio/numpy/commit/ce5d480afecc989a36e5d2bf4ea1d1ba58a83b0a

I drew inspiration from C++ STL algorithms, and merged into a single
function what merge, set_union, set_intersection, set_difference and
set_symmetric_difference do there.

My first thought when implementing this was to not make it a public
function, but to use it under the hood to speed up some of the functions of
`arraysetops.py`, which currently merge two already sorted arrays by
doing `np.sort(np.concatenate((a, b)))`. I would need to revisit my
testing, but the speed-ups weren't that great.

One other thing I saw value in for some of the `arraysetops.py` functions,
but couldn't fully figure out, was providing extra output aside from the
merged arrays, either in the form of indices or of boolean masks,
indicating which items of the original arrays made it into the merged one,
and/or where they ended up in it.
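
As a rough illustration of the kind of extra output I have in mind (the
names and exact semantics below are only a sketch, not a settled
interface):

import numpy as np

a = np.array([1, 3, 5, 7])
b = np.array([2, 3, 8])
merged = np.sort(np.concatenate((a, b)))   # what arraysetops effectively does now
# index-style extra output: where each item of b lands in the merged array
position_of_b = np.searchsorted(merged, b)
# mask-style extra output: which merged items equal some item of b
mask_from_b = np.in1d(merged, b)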

Since there is at least one other person out there that likes it, is there
any more interest in such a function? If yes, any comments on what the
proper interface for extra output should be? Although perhaps the best is
to leave that out for starters and see what use people make of it, if any.

Jaime

-- 
(\__/)
( O.o)
(  ) This is Conejo. Copy Conejo into your signature and help him with his
plans for world domination.


Re: [Numpy-discussion] Does a `mergesorted` function make sense?

2014-08-27 Thread Eelco Hoogendoorn
It wouldn't hurt to have this function, but my intuition is that its use
will be minimal. If you are already working with sorted arrays, you have
already paid a flop cost of that order of magnitude, and the optimized
merge saves you a factor of two at the very most. Using numpy means you are
sacrificing factors of two and beyond relative to pure C left, right, and
center anyway, so if this kind of thing matters to you, you probably won't
be working in numpy in the first place.

That said, I share your interest in overhauling arraysetops. I see many
opportunities for expanding its functionality. There is a question that
amounts to 'how do I do group-by in numpy' on stackoverflow almost every
week. That would have my top priority; but things like extending
np.unique to graph edges, or other more complex input, are also very often
useful to me.

Some time ago I wrote up a draft (http://pastebin.com/c5WLWPbpa) which
accomplishes all of the above and more. It reimplements functions like
np.unique around a common Index object. This Index object encapsulates the
precomputation (sorting) required for efficient set operations on different
datatypes, and provides a common interface for obtaining the kind of
information you are talking about (which is used extensively internally in
the implementation of group_by, for instance).

I.e., this functionality allows you to write neat things like
group_by(randint(0,9,(100,2))).median(rand(100))
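
For a rough idea of the shape of the Index object underneath, here is a
minimal 1d-only sketch (my own simplification; the draft generalizes the
keys to more complex types):

import numpy as np

class Index(object):
    def __init__(self, keys):
        # the one expensive precomputation, shared by all set/group operations
        self.sorter = np.argsort(keys)
        self.sorted = keys[self.sorter]
        # True at the start of each run of equal keys
        self.flag = np.concatenate(([True], self.sorted[1:] != self.sorted[:-1]))
        self.start = np.flatnonzero(self.flag)

    def unique(self):
        return self.sorted[self.flag]

    def count(self):
        return np.diff(np.concatenate((self.start, [len(self.sorted)])))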

But I have the feeling much more could be done in this direction, and I
feel this draft could really use a bit of back and forth. If we are going
to completely rewrite arraysetops, we might as well do it right.


On Wed, Aug 27, 2014 at 7:02 PM, Jaime Fernández del Río 
jaime.f...@gmail.com wrote:

 A request was open in github to add a `merge` function to numpy that would
 merge two sorted 1d arrays into a single sorted 1d array. I have been
 playing around with that idea for a while, and have a branch in my numpy
 fork that adds a `mergesorted` function to `numpy.lib`:


 https://github.com/jaimefrio/numpy/commit/ce5d480afecc989a36e5d2bf4ea1d1ba58a83b0a

 I drew inspiration from C++ STL algorithms, and merged into a single
 function what merge, set_union, set_intersection, set_difference and
 set_symmetric_difference do there.

 My first thought when implementing this was to not make it a public
 function, but use it under the hood to speed-up some of the functions of
 `arraysetops.py`, which are now merging two already sorted functions by
 doing `np.sort(np.concatenate((a, b)))`. I would need to revisit my
 testing, but the speed-ups weren't that great.

 One other thing I saw value in for some of the `arraysetops.py` functions,
 but couldn't fully figure out, was in providing extra output aside from the
 merged arrays, either in the form of indices, or of boolean masks,
 indicating which items of the original arrays made it into the merged one,
 and/or where did they end up in it.

 Since there is at least one other person out there that likes it, is there
 any more interest in such a function? If yes, any comments on what the
 proper interface for extra output should be? Although perhaps the best is
 to leave that out for starters and see what use people make of it, if any.

 Jaime

 --
 (\__/)
 ( O.o)
 (  ) This is Conejo. Copy Conejo into your signature and help him with his
 plans for world domination.



Re: [Numpy-discussion] Does a `mergesorted` function make sense?

2014-08-27 Thread Jaime Fernández del Río
Hi Eelco,

I took a deeper look into your code a couple of weeks back. I don't think I
have fully grasped everything it allows, but I agree that some form of
what you have there is highly desirable. Along the same lines, for some time
I have been thinking that the right place for a `groupby` in numpy is as a
method of ufuncs, so that `np.add.groupby(arr, groups)` would do a
multidimensional version of `np.bincount(groups, weights=arr)`. You would
then need a more powerful version of `np.unique` to produce the `groups`,
but that is something that Joe Kington's old PR was very close to
achieving; it should probably be resurrected as well. But yes, there
seems to be material for a NEP here, and some guidance from one of the
numpy devs would be helpful in getting this somewhere.
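
To make the analogy concrete: `np.add.groupby` does not exist, but here is
the semantics it might have, sketched with today's tools:

import numpy as np

arr = np.random.rand(10)
groups = np.random.randint(0, 4, 10)
group_sums = np.bincount(groups, weights=arr)   # the 1d case works today

# a multidimensional arr currently needs a python loop over the other axis,
# which is what a true ufunc method could avoid
arr2 = np.random.rand(10, 3)
group_sums2 = np.column_stack(
    [np.bincount(groups, weights=arr2[:, j]) for j in range(arr2.shape[1])])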

Jaime


On Wed, Aug 27, 2014 at 10:35 AM, Eelco Hoogendoorn 
hoogendoorn.ee...@gmail.com wrote:

 It wouldn't hurt to have this function, but my intuition is that its use
 will be minimal. If you are already working with sorted arrays, you already
 have a flop cost on that order of magnitude, and the optimized merge saves
 you a factor two at the very most. Using numpy means you are sacrificing
 factors of two and beyond relative to pure C left right and center anyway,
 so if this kind of thing matters to you, you probably wont be working in
 numpy in the first place.

 That said, I share your interest in overhauling arraysetops. I see many
 opportunities for expanding its functionality. There is a question that
 amounts to 'how do I do group-by in numpy' on stackoverflow almost every
 week. That would have my top priority, but also things like extending
 np.unique to things like graph edges, or other more complex input, is very
 often useful to me.

 Ive written up a draft http://pastebin.com/c5WLWPbpa while ago which
 accomplishes all of the above and more. It reimplements functions like
 np.unique around a common Index object. This index object encapsulates the
 precomputation (sorting) required for efficient set-ops on different
 datatypes, and provides a common interface to obtain the kind of
 information you are talking about (which is used extensively internally in
 the implementation of group_by, for instance).

 ie, this functionality allows you to write neat things like
 group_by(randint(0,9,(100,2))).median(rand(100))

 But I have the feeling much more could be done in this direction, and I
 feel this draft could really use a bit of back and forth. If we are going
 to completely rewrite arraysetops, we might as well do it right.


 On Wed, Aug 27, 2014 at 7:02 PM, Jaime Fernández del Río 
 jaime.f...@gmail.com wrote:

 A request was open in github to add a `merge` function to numpy that
 would merge two sorted 1d arrays into a single sorted 1d array. I have been
 playing around with that idea for a while, and have a branch in my numpy
 fork that adds a `mergesorted` function to `numpy.lib`:


 https://github.com/jaimefrio/numpy/commit/ce5d480afecc989a36e5d2bf4ea1d1ba58a83b0a

 I drew inspiration from C++ STL algorithms, and merged into a single
 function what merge, set_union, set_intersection, set_difference and
 set_symmetric_difference do there.

 My first thought when implementing this was to not make it a public
 function, but use it under the hood to speed-up some of the functions of
 `arraysetops.py`, which are now merging two already sorted functions by
 doing `np.sort(np.concatenate((a, b)))`. I would need to revisit my
 testing, but the speed-ups weren't that great.

 One other thing I saw value in for some of the `arraysetops.py`
 functions, but couldn't fully figure out, was in providing extra output
 aside from the merged arrays, either in the form of indices, or of boolean
 masks, indicating which items of the original arrays made it into the
 merged one, and/or where did they end up in it.

 Since there is at least one other person out there that likes it, is
 there any more interest in such a function? If yes, any comments on what
 the proper interface for extra output should be? Although perhaps the best
 is to leave that out for starters and see what use people make of it, if
 any.

 Jaime

 --
 (\__/)
 ( O.o)
 (  ) This is Conejo. Copy Conejo into your signature and help him with his
 plans for world domination.





-- 
(\__/)
( O.o)
(  ) This is Conejo. Copy Conejo into your signature and help him with his
plans for world domination.


Re: [Numpy-discussion] Does a `mergesorted` function make sense?

2014-08-27 Thread Eelco Hoogendoorn
If I understand you correctly, the current implementation supports these
operations. All reductions over groups (except for median) are performed
through the corresponding ufunc (see GroupBy.reduce). This works on
multidimensional arrays as well, although the broadcasting over the
non-grouping axes is accomplished using np.vectorize. Actual vectorization
only happens over the axis being grouped over, but this is usually a long
axis. If it isn't, it is more efficient to perform a reduction by means of
splitting the array by its groups first, and then map the iterable of
groups over some reduction operation (as noted in the docstring of
GroupBy.reduce).
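
A minimal sketch of that split-and-map fallback in plain numpy (my own
restatement of the idea, not the actual implementation):

import numpy as np

keys = np.random.randint(0, 4, 10)     # short grouped axis
values = np.random.rand(10, 2000)      # long non-grouped axis
order = np.argsort(keys)
sorted_keys = keys[order]
# indices where the key changes; np.split cuts out the per-group blocks
cuts = np.flatnonzero(sorted_keys[1:] != sorted_keys[:-1]) + 1
blocks = np.split(values[order], cuts)
means = [block.mean(axis=0) for block in blocks]   # map any reduction over groups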


On Wed, Aug 27, 2014 at 8:29 PM, Jaime Fernández del Río 
jaime.f...@gmail.com wrote:

 Hi Eelco,

 I took a deeper look into your code a couple of weeks back. I don't think
 I have fully grasped what it allows completely, but I agree that some form
 of what you have there is highly desirable. Along the same lines, for
 sometime I have been thinking that the right place for a `groupby` in numpy
 is as a method of ufuncs, so that `np.add.groupby(arr, groups)` would do a
 multidimensional version of `np.bincount(groups, weights=arr)`. You would
 then need a more powerful version of `np.unique` to produce the `groups`,
 but that is something that Joe Kington's old PR was very close to
 achieving, that should probably be resurrected as well. But yes, there
 seems to be material for a NEP here, and some guidance from one of the
 numpy devs would be helpful in getting this somewhere.

 Jaime


 On Wed, Aug 27, 2014 at 10:35 AM, Eelco Hoogendoorn 
 hoogendoorn.ee...@gmail.com wrote:

 It wouldn't hurt to have this function, but my intuition is that its use
 will be minimal. If you are already working with sorted arrays, you already
 have a flop cost on that order of magnitude, and the optimized merge saves
 you a factor two at the very most. Using numpy means you are sacrificing
 factors of two and beyond relative to pure C left right and center anyway,
 so if this kind of thing matters to you, you probably wont be working in
 numpy in the first place.

 That said, I share your interest in overhauling arraysetops. I see many
 opportunities for expanding its functionality. There is a question that
 amounts to 'how do I do group-by in numpy' on stackoverflow almost every
 week. That would have my top priority, but also things like extending
 np.unique to things like graph edges, or other more complex input, is very
 often useful to me.

 Ive written up a draft http://pastebin.com/c5WLWPbpa while ago which
 accomplishes all of the above and more. It reimplements functions like
 np.unique around a common Index object. This index object encapsulates the
 precomputation (sorting) required for efficient set-ops on different
 datatypes, and provides a common interface to obtain the kind of
 information you are talking about (which is used extensively internally in
 the implementation of group_by, for instance).

 ie, this functionality allows you to write neat things like
 group_by(randint(0,9,(100,2))).median(rand(100))

 But I have the feeling much more could be done in this direction, and I
 feel this draft could really use a bit of back and forth. If we are going
 to completely rewrite arraysetops, we might as well do it right.


 On Wed, Aug 27, 2014 at 7:02 PM, Jaime Fernández del Río 
 jaime.f...@gmail.com wrote:

 A request was open in github to add a `merge` function to numpy that
 would merge two sorted 1d arrays into a single sorted 1d array. I have been
 playing around with that idea for a while, and have a branch in my numpy
 fork that adds a `mergesorted` function to `numpy.lib`:


 https://github.com/jaimefrio/numpy/commit/ce5d480afecc989a36e5d2bf4ea1d1ba58a83b0a

 I drew inspiration from C++ STL algorithms, and merged into a single
 function what merge, set_union, set_intersection, set_difference and
 set_symmetric_difference do there.

 My first thought when implementing this was to not make it a public
 function, but use it under the hood to speed-up some of the functions of
 `arraysetops.py`, which are now merging two already sorted functions by
 doing `np.sort(np.concatenate((a, b)))`. I would need to revisit my
 testing, but the speed-ups weren't that great.

 One other thing I saw value in for some of the `arraysetops.py`
 functions, but couldn't fully figure out, was in providing extra output
 aside from the merged arrays, either in the form of indices, or of boolean
 masks, indicating which items of the original arrays made it into the
 merged one, and/or where did they end up in it.

 Since there is at least one other person out there that likes it, is
 there any more interest in such a function? If yes, any comments on what
 the proper interface for extra output should be? Although perhaps the best
 is to leave that out for starters and see what use people make of it, if
 any.

 Jaime

 --
 (\__/)
 ( O.o)
 (  ) Este es Conejo. Copia a Conejo en 

Re: [Numpy-discussion] Does a `mergesorted` function make sense?

2014-08-27 Thread Eelco Hoogendoorn
I.e., if the grouped axis is small but the other axes are not, you could
write the following, which avoids the python loop over the long axis that
np.vectorize would otherwise perform.

import numpy as np
from grouping import group_by

keys = np.random.randint(0, 4, 10)    # short grouped axis: 10 keys in 4 groups
values = np.random.rand(10, 2000)     # long non-grouped axis
# group_by(keys)(values) yields the unique keys and the matching groups
for k, g in zip(*group_by(keys)(values)):
    print k, g.mean(0)                # reduce each group over the grouped axis




On Wed, Aug 27, 2014 at 9:29 PM, Eelco Hoogendoorn 
hoogendoorn.ee...@gmail.com wrote:

 f.i., this works as expected as well (100 keys of 1d int arrays and 100
 values of 1d float arrays):

 group_by(randint(0,4,(100,2))).mean(rand(100,2))


 On Wed, Aug 27, 2014 at 9:27 PM, Eelco Hoogendoorn 
 hoogendoorn.ee...@gmail.com wrote:

 If I understand you correctly, the current implementation supports these
 operations. All reductions over groups (except for median) are performed
 through the corresponding ufunc (see GroupBy.reduce). This works on
 multidimensional arrays as well, although this broadcasting over the
 non-grouping axes is accomplished using np.vectorize. Actual vectorization
 only happens over the axis being grouped over, but this is usually a long
 axis. If it isn't, it is more efficient to perform a reduction by means of
 splitting the array by its groups first, and then map the iterable of
 groups over some reduction operation (as noted in the docstring of
 GroupBy.reduce).


 On Wed, Aug 27, 2014 at 8:29 PM, Jaime Fernández del Río 
 jaime.f...@gmail.com wrote:

 Hi Eelco,

 I took a deeper look into your code a couple of weeks back. I don't
 think I have fully grasped what it allows completely, but I agree that some
 form of what you have there is highly desirable. Along the same lines, for
 sometime I have been thinking that the right place for a `groupby` in numpy
 is as a method of ufuncs, so that `np.add.groupby(arr, groups)` would do a
 multidimensional version of `np.bincount(groups, weights=arr)`. You would
 then need a more powerful version of `np.unique` to produce the `groups`,
 but that is something that Joe Kington's old PR was very close to
 achieving, that should probably be resurrected as well. But yes, there
 seems to be material for a NEP here, and some guidance from one of the
 numpy devs would be helpful in getting this somewhere.

 Jaime


 On Wed, Aug 27, 2014 at 10:35 AM, Eelco Hoogendoorn 
 hoogendoorn.ee...@gmail.com wrote:

 It wouldn't hurt to have this function, but my intuition is that its
 use will be minimal. If you are already working with sorted arrays, you
 already have a flop cost on that order of magnitude, and the optimized
 merge saves you a factor two at the very most. Using numpy means you are
 sacrificing factors of two and beyond relative to pure C left right and
 center anyway, so if this kind of thing matters to you, you probably wont
 be working in numpy in the first place.

 That said, I share your interest in overhauling arraysetops. I see many
 opportunities for expanding its functionality. There is a question that
 amounts to 'how do I do group-by in numpy' on stackoverflow almost every
 week. That would have my top priority, but also things like extending
 np.unique to things like graph edges, or other more complex input, is very
 often useful to me.

 Ive written up a draft http://pastebin.com/c5WLWPbpa while ago which
 accomplishes all of the above and more. It reimplements functions like
 np.unique around a common Index object. This index object encapsulates the
 precomputation (sorting) required for efficient set-ops on different
 datatypes, and provides a common interface to obtain the kind of
 information you are talking about (which is used extensively internally in
 the implementation of group_by, for instance).

 ie, this functionality allows you to write neat things like
 group_by(randint(0,9,(100,2))).median(rand(100))

 But I have the feeling much more could be done in this direction, and I
 feel this draft could really use a bit of back and forth. If we are going
 to completely rewrite arraysetops, we might as well do it right.


 On Wed, Aug 27, 2014 at 7:02 PM, Jaime Fernández del Río 
 jaime.f...@gmail.com wrote:

 A request was open in github to add a `merge` function to numpy that
 would merge two sorted 1d arrays into a single sorted 1d array. I have 
 been
 playing around with that idea for a while, and have a branch in my numpy
 fork that adds a `mergesorted` function to `numpy.lib`:


 https://github.com/jaimefrio/numpy/commit/ce5d480afecc989a36e5d2bf4ea1d1ba58a83b0a

 I drew inspiration from C++ STL algorithms, and merged into a single
 function what merge, set_union, set_intersection, set_difference and
 set_symmetric_difference do there.

 My first thought when implementing this was to not make it a public
 function, but use it under the hood to speed-up some of the functions of
 `arraysetops.py`, which are now merging two already sorted functions by
 doing `np.sort(np.concatenate((a, b)))`. I would need to revisit my
 testing, but the speed-ups weren't that 

Re: [Numpy-discussion] Does a `mergesorted` function make sense?

2014-08-27 Thread Jaime Fernández del Río
Yes, I was aware of that. But the point would be to provide true
vectorization of those operations.

The way I see it, numpy may not need to have a GroupBy implementation, but
it should at least enable implementing one that is fast and efficient over
any axis.


On Wed, Aug 27, 2014 at 12:38 PM, Eelco Hoogendoorn 
hoogendoorn.ee...@gmail.com wrote:

 i.e, if the grouped axis is small but the other axes are not, you could
 write this, which avoids the python loop over the long axis that
 np.vectorize would otherwise perform.

 import numpy as np
 from grouping import group_by
 keys = np.random.randint(0,4,10)
 values = np.random.rand(10,2000)
 for k,g in zip(*group_by(keys)(values)):
     print k, g.mean(0)




 On Wed, Aug 27, 2014 at 9:29 PM, Eelco Hoogendoorn 
 hoogendoorn.ee...@gmail.com wrote:

 f.i., this works as expected as well (100 keys of 1d int arrays and 100
 values of 1d float arrays):

 group_by(randint(0,4,(100,2))).mean(rand(100,2))


 On Wed, Aug 27, 2014 at 9:27 PM, Eelco Hoogendoorn 
 hoogendoorn.ee...@gmail.com wrote:

 If I understand you correctly, the current implementation supports these
 operations. All reductions over groups (except for median) are performed
 through the corresponding ufunc (see GroupBy.reduce). This works on
 multidimensional arrays as well, although this broadcasting over the
 non-grouping axes is accomplished using np.vectorize. Actual vectorization
 only happens over the axis being grouped over, but this is usually a long
 axis. If it isn't, it is more efficient to perform a reduction by means of
 splitting the array by its groups first, and then map the iterable of
 groups over some reduction operation (as noted in the docstring of
 GroupBy.reduce).


 On Wed, Aug 27, 2014 at 8:29 PM, Jaime Fernández del Río 
 jaime.f...@gmail.com wrote:

 Hi Eelco,

 I took a deeper look into your code a couple of weeks back. I don't
 think I have fully grasped what it allows completely, but I agree that some
 form of what you have there is highly desirable. Along the same lines, for
 sometime I have been thinking that the right place for a `groupby` in numpy
 is as a method of ufuncs, so that `np.add.groupby(arr, groups)` would do a
 multidimensional version of `np.bincount(groups, weights=arr)`. You would
 then need a more powerful version of `np.unique` to produce the `groups`,
 but that is something that Joe Kington's old PR was very close to
 achieving, that should probably be resurrected as well. But yes, there
 seems to be material for a NEP here, and some guidance from one of the
 numpy devs would be helpful in getting this somewhere.

 Jaime


 On Wed, Aug 27, 2014 at 10:35 AM, Eelco Hoogendoorn 
 hoogendoorn.ee...@gmail.com wrote:

 It wouldn't hurt to have this function, but my intuition is that its
 use will be minimal. If you are already working with sorted arrays, you
 already have a flop cost on that order of magnitude, and the optimized
 merge saves you a factor two at the very most. Using numpy means you are
 sacrificing factors of two and beyond relative to pure C left right and
 center anyway, so if this kind of thing matters to you, you probably wont
 be working in numpy in the first place.

 That said, I share your interest in overhauling arraysetops. I see
 many opportunities for expanding its functionality. There is a question
 that amounts to 'how do I do group-by in numpy' on stackoverflow almost
 every week. That would have my top priority, but also things like 
 extending
 np.unique to things like graph edges, or other more complex input, is very
 often useful to me.

 Ive written up a draft http://pastebin.com/c5WLWPbpa while ago
 which accomplishes all of the above and more. It reimplements functions
 like np.unique around a common Index object. This index object 
 encapsulates
 the precomputation (sorting) required for efficient set-ops on different
 datatypes, and provides a common interface to obtain the kind of
 information you are talking about (which is used extensively internally in
 the implementation of group_by, for instance).

 ie, this functionality allows you to write neat things like
 group_by(randint(0,9,(100,2))).median(rand(100))

 But I have the feeling much more could be done in this direction, and
 I feel this draft could really use a bit of back and forth. If we are 
 going
 to completely rewrite arraysetops, we might as well do it right.


 On Wed, Aug 27, 2014 at 7:02 PM, Jaime Fernández del Río 
 jaime.f...@gmail.com wrote:

 A request was open in github to add a `merge` function to numpy that
 would merge two sorted 1d arrays into a single sorted 1d array. I have 
 been
 playing around with that idea for a while, and have a branch in my numpy
 fork that adds a `mergesorted` function to `numpy.lib`:


 https://github.com/jaimefrio/numpy/commit/ce5d480afecc989a36e5d2bf4ea1d1ba58a83b0a

 I drew inspiration from C++ STL algorithms, and merged into a single
 function what merge, set_union, set_intersection, set_difference and

Re: [Numpy-discussion] Does a `mergesorted` function make sense?

2014-08-27 Thread Daπid
On 27 August 2014 19:02, Jaime Fernández del Río jaime.f...@gmail.com
wrote:


 Since there is at least one other person out there that likes it, is there
 any more interest in such a function? If yes, any comments on what the
 proper interface for extra output should be? Although perhaps the best is
 to leave that out for starters and see what use people make of it, if any.


I think a perhaps more useful thing would be to implement timsort. I
understand it is capable of taking full advantage of partially sorted
arrays, with the extra safety of not assuming that the individual arrays
are sorted. This would also be useful for other real-world cases where the
data is already partially sorted.
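
For what it's worth, CPython's built-in sort already is timsort, so the
effect is easy to see at the object level; as far as I know, numpy's
np.sort currently offers quicksort, mergesort and heapsort, none of which
is timsort:

import numpy as np

# a concatenation of two sorted arrays is exactly two ascending runs,
# which timsort detects and merges in a single O(n) pass
a = np.sort(np.random.rand(100000))
b = np.sort(np.random.rand(100000))
merged = sorted(np.concatenate((a, b)).tolist())   # sorted() is timsort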


Re: [Numpy-discussion] Does a `mergesorted` function make sense?

2014-08-27 Thread Eelco Hoogendoorn
I just checked the docs on ufuncs, and it appears that's a solved problem
now, since ufunc.reduceat now comes with an axis argument. Or maybe it
already did when I wrote that, but I simply wasn't paying attention. Either
way, the code is fully vectorized now, in both grouped and non-grouped
axes. It's a lot of code, but all that happens for a grouping, other than
some O(1) and O(n) bookkeeping, is an argsort of the keys and then the
reduction itself, all fully vectorized.

Note that I sort the values first, and then use ufunc.reduceat on the
groups. It would seem to me that ufunc.at should be more efficient, by
avoiding this indirection, but testing very much revealed the opposite, for
reasons unclear to me. Perhaps that's changed now as well.
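
For reference, the two strategies in minimal form, for a grouped sum over a
1d array (my own restatement for illustration; if every key value occurs,
both give the same sums up to float rounding):

import numpy as np

keys = np.random.randint(0, 100, 10000)
values = np.random.rand(10000)

# (a) argsort + reduceat: sort values into group order, reduce each run
order = np.argsort(keys)
sorted_keys = keys[order]
starts = np.flatnonzero(
    np.concatenate(([True], sorted_keys[1:] != sorted_keys[:-1])))
sums_sorted = np.add.reduceat(values[order], starts)

# (b) ufunc.at: unbuffered scatter-add straight into the output,
# no sorting needed, but measurably slower in practice
sums_scatter = np.zeros(100)
np.add.at(sums_scatter, keys, values)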


On Wed, Aug 27, 2014 at 11:32 PM, Jaime Fernández del Río 
jaime.f...@gmail.com wrote:

 Yes, I was aware of that. But the point would be to provide true
 vectorization on those operations.

 The way I see it, numpy may not have to have a GroupBy implementation, but
 it should at least enable implementing one that is fast and efficient over
 any axis.


 On Wed, Aug 27, 2014 at 12:38 PM, Eelco Hoogendoorn 
 hoogendoorn.ee...@gmail.com wrote:

 i.e, if the grouped axis is small but the other axes are not, you could
 write this, which avoids the python loop over the long axis that
 np.vectorize would otherwise perform.

 import numpy as np
 from grouping import group_by
 keys = np.random.randint(0,4,10)
 values = np.random.rand(10,2000)
 for k,g in zip(*group_by(keys)(values)):
     print k, g.mean(0)




 On Wed, Aug 27, 2014 at 9:29 PM, Eelco Hoogendoorn 
 hoogendoorn.ee...@gmail.com wrote:

 f.i., this works as expected as well (100 keys of 1d int arrays and 100
 values of 1d float arrays):

 group_by(randint(0,4,(100,2))).mean(rand(100,2))


 On Wed, Aug 27, 2014 at 9:27 PM, Eelco Hoogendoorn 
 hoogendoorn.ee...@gmail.com wrote:

 If I understand you correctly, the current implementation supports
 these operations. All reductions over groups (except for median) are
 performed through the corresponding ufunc (see GroupBy.reduce). This works
 on multidimensional arrays as well, although this broadcasting over the
 non-grouping axes is accomplished using np.vectorize. Actual vectorization
 only happens over the axis being grouped over, but this is usually a long
 axis. If it isn't, it is more efficient to perform a reduction by means of
 splitting the array by its groups first, and then map the iterable of
 groups over some reduction operation (as noted in the docstring of
 GroupBy.reduce).


 On Wed, Aug 27, 2014 at 8:29 PM, Jaime Fernández del Río 
 jaime.f...@gmail.com wrote:

 Hi Eelco,

 I took a deeper look into your code a couple of weeks back. I don't
 think I have fully grasped what it allows completely, but I agree that 
 some
 form of what you have there is highly desirable. Along the same lines, for
 sometime I have been thinking that the right place for a `groupby` in 
 numpy
 is as a method of ufuncs, so that `np.add.groupby(arr, groups)` would do a
 multidimensional version of `np.bincount(groups, weights=arr)`. You would
 then need a more powerful version of `np.unique` to produce the `groups`,
 but that is something that Joe Kington's old PR was very close to
 achieving, that should probably be resurrected as well. But yes, there
 seems to be material for a NEP here, and some guidance from one of the
 numpy devs would be helpful in getting this somewhere.

 Jaime


 On Wed, Aug 27, 2014 at 10:35 AM, Eelco Hoogendoorn 
 hoogendoorn.ee...@gmail.com wrote:

 It wouldn't hurt to have this function, but my intuition is that its
 use will be minimal. If you are already working with sorted arrays, you
 already have a flop cost on that order of magnitude, and the optimized
 merge saves you a factor two at the very most. Using numpy means you are
 sacrificing factors of two and beyond relative to pure C left right and
 center anyway, so if this kind of thing matters to you, you probably wont
 be working in numpy in the first place.

 That said, I share your interest in overhauling arraysetops. I see
 many opportunities for expanding its functionality. There is a question
 that amounts to 'how do I do group-by in numpy' on stackoverflow almost
 every week. That would have my top priority, but also things like 
 extending
 np.unique to things like graph edges, or other more complex input, is 
 very
 often useful to me.

 Ive written up a draft http://pastebin.com/c5WLWPbpa while ago
 which accomplishes all of the above and more. It reimplements functions
 like np.unique around a common Index object. This index object 
 encapsulates
 the precomputation (sorting) required for efficient set-ops on different
 datatypes, and provides a common interface to obtain the kind of
 information you are talking about (which is used extensively internally 
 in
 the implementation of group_by, for instance).

 ie, this functionality allows you to write neat things like