Re: [Numpy-discussion] Does a `mergesorted` function make sense?

2014-09-03 Thread cjw

Re: [Numpy-discussion] Does a `mergesorted` function make sense?

On 02/09/2014 8:40 PM, Charles R Harris wrote:




On Mon, Sep 1, 2014 at 7:58 AM, Eelco Hoogendoorn 
hoogendoorn.ee...@gmail.com mailto:hoogendoorn.ee...@gmail.com wrote:


On Mon, Sep 1, 2014 at 2:05 PM, Charles R Harris
charlesr.har...@gmail.com mailto:charlesr.har...@gmail.com wrote:




On Mon, Sep 1, 2014 at 1:49 AM, Eelco Hoogendoorn
hoogendoorn.ee...@gmail.com
mailto:hoogendoorn.ee...@gmail.com wrote:

Sure, id like to do the hashing things out, but I would
also like some preliminary feedback as to whether this is
going in a direction anyone else sees the point of, if it
conflicts with other plans, and indeed if we can agree
that numpy is the right place for it; a point which I
would very much like to defend. If there is some obvious
no-go that im missing, I can do without the drudgery of
writing proper documentation ;).

These are good issues, that need to be discussed and resolved. Python 
has the benefit of having a BDFL.  Numpy has no similar arrangement.
In the post-numarray period, Travis Oliphant took that role and advanced 
the package in many ways.


It seems that Travis is no longer available.  There is a need for 
someone, or maybe some group of three, who would be guided by the sort 
of thinking Charles Harris sets out above, who would make the final 
decision.


Colin W.
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Does a `mergesorted` function make sense?

2014-09-03 Thread Eelco Hoogendoorn
On Wed, Sep 3, 2014 at 4:07 AM, Jaime Fernández del Río 
jaime.f...@gmail.com wrote:

 On Tue, Sep 2, 2014 at 5:40 PM, Charles R Harris 
 charlesr.har...@gmail.com wrote:


 What do you think about the suggestion of timsort? One would need to
 concatenate the arrays before sorting, but it should be fairly efficient.


 Timsort is very cool, and it would definitely be fun to implement in
 numpy. It is also a lot more work that merging two sorted arrays! I think
 +1 if someone else does it, but although I would love to be part of it, I
 am not sure I will be able to find time to look into it seriously in the
 next couple of months.

 From a setops point of view, merging two sorted arrays makes it very
 straightforward to compute, together with (or instead of) the result of the
 merge, index arrays that let you calculate things like `in1d` faster.
 Although perhaps an `argtimsort` could provide the same functionality, not
 sure. I will probably wrap up what I have, put a lace on it, and submit it
 as a PR. Even if it is not destined to be merged, it may serve as a warning
 to others.

 I have also been thinking lately that one of the problems with all these
 unique-related computations may be a case of having a hammer and seeing
 everything as nails. Perhaps the algorithm that needs to be ported from
 Python is not the sorting one, but the hash table...

 Jaime


 Not sure about the hashing. Indeed one can also build an index of a set by
means of a hash table, but its questionable if this leads to improved
performance over performing an argsort. Hashing may have better asymptotic
time complexity in theory, but many datasets used in practice are very easy
to sort (O(N)-ish), and the time-constant of hashing is higher. But more
importantly, using a hash-table guarantees poor cache behavior for many
operations using this index. By contrast, sorting may (but need not) make
one random access pass to build the index, and may (but need not) perform
one random access to reorder values for grouping. But insofar as the keys
are better behaved than pure random, this coherence will be exploited.

Also, getting the unique values/keys in sorted order is a nice side-benefit
for many applications.
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Does a `mergesorted` function make sense?

2014-09-03 Thread Jaime Fernández del Río
On Wed, Sep 3, 2014 at 6:41 AM, Eelco Hoogendoorn 
hoogendoorn.ee...@gmail.com wrote:

  Not sure about the hashing. Indeed one can also build an index of a set
 by means of a hash table, but its questionable if this leads to improved
 performance over performing an argsort. Hashing may have better asymptotic
 time complexity in theory, but many datasets used in practice are very easy
 to sort (O(N)-ish), and the time-constant of hashing is higher. But more
 importantly, using a hash-table guarantees poor cache behavior for many
 operations using this index. By contrast, sorting may (but need not) make
 one random access pass to build the index, and may (but need not) perform
 one random access to reorder values for grouping. But insofar as the keys
 are better behaved than pure random, this coherence will be exploited.


If you want to give it a try, these branch of my numpy fork has hash table
based implementations of unique (with no extra indices) and in1d:

https://github.com/jaimefrio/numpy/tree/hash-unique

A use cases where the hash table is clearly better:

In [1]: import numpy as np
In [2]: from numpy.lib._compiled_base import _unique, _in1d

In [3]: a = np.random.randint(10, size=(1,))
In [4]: %timeit np.unique(a)
1000 loops, best of 3: 258 us per loop
In [5]: %timeit _unique(a)
1 loops, best of 3: 143 us per loop
In [6]: %timeit np.sort(_unique(a))
1 loops, best of 3: 149 us per loop

It typically performs between 1.5x and 4x faster than sorting. I haven't
profiled it properly to know, but there may be quite a bit of performance
to dig out: have type specific comparison functions, optimize the starting
hash table size based on the size of the array to avoid reinsertions...

If getting the elements sorted is a necessity, and the array contains very
few or no repeated items, then the hash table approach may even perform
worse,:

In [8]: a = np.random.randint(1, size=(5000,))
In [9]: %timeit np.unique(a)
1000 loops, best of 3: 277 us per loop
In [10]: %timeit np.sort(_unique(a))
1000 loops, best of 3: 320 us per loop

But the hash table still wins in extracting the unique items only:

In [11]: %timeit _unique(a)
1 loops, best of 3: 187 us per loop

Where the hash table shines is in more elaborate situations. If you keep
the first index where it was found, and the number of repeats, in the hash
table, you can get return_index and return_counts almost for free, which
means you are performing an extra 3x faster than with sorting.
return_inverse requires a little more trickery, so I won;t attempt to
quantify the improvement. But I wouldn't be surprised if, after fine tuning
it, there is close to an order of magnitude overall improvement

The spped-up for in1d is also nice:

In [16]: a = np.random.randint(1000, size=(1000,))
In [17]: b = np.random.randint(1000, size=(500,))
In [18]: %timeit np.in1d(a, b)
1000 loops, best of 3: 178 us per loop
In [19]: %timeit _in1d(a, b)
1 loops, best of 3: 30.1 us per loop

Of course, there is no point in
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Does a `mergesorted` function make sense?

2014-09-03 Thread Jaime Fernández del Río
On Wed, Sep 3, 2014 at 9:33 AM, Jaime Fernández del Río 
jaime.f...@gmail.com wrote:

 On Wed, Sep 3, 2014 at 6:41 AM, Eelco Hoogendoorn 
 hoogendoorn.ee...@gmail.com wrote:

  Not sure about the hashing. Indeed one can also build an index of a set
 by means of a hash table, but its questionable if this leads to improved
 performance over performing an argsort. Hashing may have better asymptotic
 time complexity in theory, but many datasets used in practice are very easy
 to sort (O(N)-ish), and the time-constant of hashing is higher. But more
 importantly, using a hash-table guarantees poor cache behavior for many
 operations using this index. By contrast, sorting may (but need not) make
 one random access pass to build the index, and may (but need not) perform
 one random access to reorder values for grouping. But insofar as the keys
 are better behaved than pure random, this coherence will be exploited.


 If you want to give it a try, these branch of my numpy fork has hash table
 based implementations of unique (with no extra indices) and in1d:

 https://github.com/jaimefrio/numpy/tree/hash-unique

 A use cases where the hash table is clearly better:

 In [1]: import numpy as np
 In [2]: from numpy.lib._compiled_base import _unique, _in1d

 In [3]: a = np.random.randint(10, size=(1,))
 In [4]: %timeit np.unique(a)
 1000 loops, best of 3: 258 us per loop
 In [5]: %timeit _unique(a)
 1 loops, best of 3: 143 us per loop
 In [6]: %timeit np.sort(_unique(a))
 1 loops, best of 3: 149 us per loop

 It typically performs between 1.5x and 4x faster than sorting. I haven't
 profiled it properly to know, but there may be quite a bit of performance
 to dig out: have type specific comparison functions, optimize the starting
 hash table size based on the size of the array to avoid reinsertions...

 If getting the elements sorted is a necessity, and the array contains very
 few or no repeated items, then the hash table approach may even perform
 worse,:

 In [8]: a = np.random.randint(1, size=(5000,))
 In [9]: %timeit np.unique(a)
 1000 loops, best of 3: 277 us per loop
 In [10]: %timeit np.sort(_unique(a))
 1000 loops, best of 3: 320 us per loop

 But the hash table still wins in extracting the unique items only:

 In [11]: %timeit _unique(a)
 1 loops, best of 3: 187 us per loop

 Where the hash table shines is in more elaborate situations. If you keep
 the first index where it was found, and the number of repeats, in the hash
 table, you can get return_index and return_counts almost for free, which
 means you are performing an extra 3x faster than with sorting.
 return_inverse requires a little more trickery, so I won;t attempt to
 quantify the improvement. But I wouldn't be surprised if, after fine tuning
 it, there is close to an order of magnitude overall improvement

 The spped-up for in1d is also nice:

 In [16]: a = np.random.randint(1000, size=(1000,))
 In [17]: b = np.random.randint(1000, size=(500,))
 In [18]: %timeit np.in1d(a, b)
 1000 loops, best of 3: 178 us per loop
 In [19]: %timeit _in1d(a, b)
 1 loops, best of 3: 30.1 us per loop

 Of course, there is no point in


Ooops!!! Hit the send button too quick. Not to extend myself too long: if
we are going to rethink all of this, we should approach it with an open
mind. Still, and this post is not helping with that either, I am afraid
that we are discussing implementation details, but are missing a broader
vision of what we want to accomplish and why. That vision of what numpy's
grouping functionality, if any, should be, and how it complements or
conflicts with what pandas is providing, should precede anything else. I
now I haven't, but has anyone looked at how pandas implements grouping?
Their documentation on the subject is well worth a read:

http://pandas.pydata.org/pandas-docs/stable/groupby.html

Does numpy need to replicate this? What/why/how can we add to that?

Jaime

-- 
(\__/)
( O.o)
(  ) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus planes
de dominación mundial.
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Does a `mergesorted` function make sense?

2014-09-03 Thread Jaime Fernández del Río
On Wed, Sep 3, 2014 at 5:26 AM, cjw c...@ncf.ca wrote:

 These are good issues, that need to be discussed and resolved.  Python has
 the benefit of having a BDFL.  Numpy has no similar arrangement.
 In the post-numarray period, Travis Oliphant took that role and advanced
 the package in many ways.

 It seems that Travis is no longer available.  There is a need for someone,
 or maybe some group of three, who would be guided by the sort of thinking
 Charles Harris sets out above, who would make the final decision.


We should crown Charles philosopher king, and let him wisely rule over us
with the aid of his aristocracy of devs with merge rights. We would make
Plato proud! :-)

Jaime

-- 
(\__/)
( O.o)
(  ) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus planes
de dominación mundial.
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Does a `mergesorted` function make sense?

2014-09-03 Thread Charles R Harris
On Wed, Sep 3, 2014 at 10:53 AM, Jaime Fernández del Río 
jaime.f...@gmail.com wrote:

 On Wed, Sep 3, 2014 at 5:26 AM, cjw c...@ncf.ca wrote:

 These are good issues, that need to be discussed and resolved.  Python
 has the benefit of having a BDFL.  Numpy has no similar arrangement.
 In the post-numarray period, Travis Oliphant took that role and advanced
 the package in many ways.

 It seems that Travis is no longer available.  There is a need for
 someone, or maybe some group of three, who would be guided by the sort of
 thinking Charles Harris sets out above, who would make the final decision.


 We should crown Charles philosopher king, and let him wisely rule over us
 with the aid of his aristocracy of devs with merge rights. We would make
 Plato proud! :-)


Charles I needs an heir in case of execution by the Parliamentary forces.

Chuck
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] odd (?) behavior: negative integer scalar in exponent

2014-09-03 Thread Alan G Isaac
What should be the value of `2**np.int_(-32)`?
It is apparently currently computed as `1. / (2**np.int_(32))`,
so the computation overflows (when a C long is 32 bits).
I would have hoped for it to be computed as `1./(2.**np.int_(32))`.

Cheers,
Alan Isaac
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] Give Jaime Fernandez commit rights.

2014-09-03 Thread Charles R Harris
Hi All,

I'd like to give Jaime commit rights. Having at three active developers
with commit rights is the goal and Jaime has been pretty consistent with
code submissions and discussion participation.

Thoughts?

Chuck
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Give Jaime Fernandez commit rights.

2014-09-03 Thread Robert Kern
On Wed, Sep 3, 2014 at 10:47 PM, Charles R Harris
charlesr.har...@gmail.com wrote:
 Hi All,

 I'd like to give Jaime commit rights. Having at three active developers with
 commit rights is the goal and Jaime has been pretty consistent with code
 submissions and discussion participation.

+1

-- 
Robert Kern
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Give Jaime Fernandez commit rights.

2014-09-03 Thread Ralf Gommers
On Wed, Sep 3, 2014 at 11:48 PM, Robert Kern robert.k...@gmail.com wrote:

 On Wed, Sep 3, 2014 at 10:47 PM, Charles R Harris
 charlesr.har...@gmail.com wrote:
  Hi All,
 
  I'd like to give Jaime commit rights. Having at three active developers
 with
  commit rights is the goal and Jaime has been pretty consistent with code
  submissions and discussion participation.

 +1


+1 excellent idea

Ralf
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] odd (?) behavior: negative integer scalar in exponent

2014-09-03 Thread Charles R Harris
On Wed, Sep 3, 2014 at 3:19 PM, Alan G Isaac alan.is...@gmail.com wrote:

 What should be the value of `2**np.int_(-32)`?
 It is apparently currently computed as `1. / (2**np.int_(32))`,
 so the computation overflows (when a C long is 32 bits).
 I would have hoped for it to be computed as `1./(2.**np.int_(32))`.


Looks like a bug to me.

Chuck.
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Give Jaime Fernandez commit rights.

2014-09-03 Thread Eelco Hoogendoorn
+1; though I am relatively new to the scene, Jaime's contributions have
always stood out to me as thoughtful.


On Thu, Sep 4, 2014 at 12:42 AM, Ralf Gommers ralf.gomm...@gmail.com
wrote:




 On Wed, Sep 3, 2014 at 11:48 PM, Robert Kern robert.k...@gmail.com
 wrote:

 On Wed, Sep 3, 2014 at 10:47 PM, Charles R Harris
 charlesr.har...@gmail.com wrote:
  Hi All,
 
  I'd like to give Jaime commit rights. Having at three active developers
 with
  commit rights is the goal and Jaime has been pretty consistent with code
  submissions and discussion participation.

 +1


 +1 excellent idea

 Ralf



 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Give Jaime Fernandez commit rights.

2014-09-03 Thread Charles R Harris
On Wed, Sep 3, 2014 at 5:48 PM, Eelco Hoogendoorn 
hoogendoorn.ee...@gmail.com wrote:

 +1; though I am relatively new to the scene, Jaime's contributions have
 always stood out to me as thoughtful.


 On Thu, Sep 4, 2014 at 12:42 AM, Ralf Gommers ralf.gomm...@gmail.com
 wrote:




 On Wed, Sep 3, 2014 at 11:48 PM, Robert Kern robert.k...@gmail.com
 wrote:

 On Wed, Sep 3, 2014 at 10:47 PM, Charles R Harris
 charlesr.har...@gmail.com wrote:
  Hi All,
 
  I'd like to give Jaime commit rights. Having at three active
 developers with
  commit rights is the goal and Jaime has been pretty consistent with
 code
  submissions and discussion participation.

 +1


 +1 excellent idea


I think the ayes will have it.

Chuck
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Give Jaime Fernandez commit rights.

2014-09-03 Thread Jaime Fernández del Río
On Wed, Sep 3, 2014 at 5:47 PM, Charles R Harris charlesr.har...@gmail.com
wrote:


 I think the ayes will have it.


As I told Chuck (because I now get to call Charles Chuck, right? :-)), I am
not sure I am fully qualified for the job: looking at the names on that
list is a humbling experience. But even if I am the idiot brother, it feels
good to be part of the family.

Numpy has provided me with countless hours of learning and enjoyment. and I
really look forward to giving back, even if only a fraction of that.

Thanks a lot for the trust!

Jaime

-- 
(\__/)
( O.o)
(  ) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus planes
de dominación mundial.
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion