Re: [Numpy-discussion] [Suggestion] Labelled Array

2016-02-19 Thread Benjamin Root
matplotlib would be more than happy if numpy could take those functions off
our hands! They don't get nearly the visibility they deserve in matplotlib,
because no one expects them to be in a plotting library, and they don't have
any useful unit tests. None of us wrote them, so we are very hesitant to
update them.

Cheers!
Ben Root

On Fri, Feb 19, 2016 at 1:39 PM,  wrote:

>
>
> On Fri, Feb 19, 2016 at 12:08 PM, Allan Haldane 
> wrote:
>
>> I also want to add a historical note here, that 'groupby' has been
>> discussed a couple times before.
>>
>> Travis Oliphant even made an NEP for it, and Wes McKinney lightly hinted
>> at adding it to numpy.
>>
>>
>> http://thread.gmane.org/gmane.comp.python.numeric.general/37480/focus=37480
>>
>> http://thread.gmane.org/gmane.comp.python.numeric.general/38272/focus=38299
>> http://docs.scipy.org/doc/numpy-1.10.1/neps/groupby_additions.html
>>
>> Travis's idea for a ufunc method 'reduceby' is more along the lines of
>> what I was originally thinking. Just musing about it, it might cover a few
>> small cases pandas groupby might not: it could work on arbitrary ufuncs,
>> and over particular axes of multidimensional data. E.g., to sum over
>> pixels from NxNx3 image data. But maybe pandas can cover the
>> multidimensional case through additional index columns or with Panel.
>>
>
> xarray is now covering that area.
>
> There are also recfunctions in numpy.lib that never got a lot of attention
> or expansion.
> There were plans to cover more of the matplotlib versions in numpy, but I
> don't know and haven't checked what happened to that.
>
> Josef
>
>
>
>>
>> Cheers,
>> Allan
>>
>> On 02/15/2016 05:31 PM, Paul Hobson wrote:
>> > Just for posterity -- any future readers to this thread who need to do
>> > pandas-like operations on record arrays should look at matplotlib's mlab submodule.
>> >
>> > I've been in situations (::cough:: Esri production ::cough::) where I've
>> > had one hand tied behind my back and unable to install pandas. mlab was
>> > a big help there.
>> >
>> > https://goo.gl/M7Mi8B
>> >
>> > -paul
>> >
>> >
>> >
>> > On Mon, Feb 15, 2016 at 1:28 PM, Lluís Vilanova wrote:
>> >
>> > Benjamin Root writes:
>> >
>> > > Seems like you are talking about xarray:
>> https://github.com/pydata/xarray
>> >
>> > Oh, I wasn't aware of xarray, but there's also this:
>> >
>> >
>> >
>> https://people.gso.ac.upc.edu/vilanova/doc/sciexp2/user_guide/data.html#basic-indexing
>> >
>> >
>> https://people.gso.ac.upc.edu/vilanova/doc/sciexp2/user_guide/data.html#dimension-oblivious-indexing
>> >
>> >
>> > Cheers,
>> >   Lluis
>> >
>> >
>> >
>> > > Cheers!
>> > > Ben Root
>> >
>> > > On Fri, Feb 12, 2016 at 9:40 AM, Sérgio wrote:
>> >
>> > > Hello,
>> >
>> >
>> > > This is my first e-mail, I will try to make the idea simple.
>> >
>> >
>> > > Similar to masked array it would be interesting to use a label
>> > array to
>> > > guide operations.
>> >
>> >
>> > > Ex.:
>> > > >>> x
>> > > labelled_array(data =
>> >
>> > > [[0 1 2]
>> > > [3 4 5]
>> > > [6 7 8]],
>> > > label =
>> > > [[0 1 2]
>> > > [0 1 2]
>> > > [0 1 2]])
>> >
>> >
>> > > >>> sum(x)
>> > > array([9, 12, 15])
>> >
>> >
>> > > The operations would create a new axis for label indexing.
>> >
>> >
>> > > You could think of it as a collection of masks, one for each
>> > label.
>> >
>> >
>> > > I don't know a way to make something like this efficiently
>> > without a loop.
>> > > Just wondering...
>> >
>> >
>> > > Sérgio.
>> >

Re: [Numpy-discussion] [Suggestion] Labelled Array

2016-02-19 Thread josef.pktd
On Fri, Feb 19, 2016 at 12:08 PM, Allan Haldane 
wrote:

> I also want to add a historical note here, that 'groupby' has been
> discussed a couple times before.
>
> Travis Oliphant even made an NEP for it, and Wes McKinney lightly hinted
> at adding it to numpy.
>
> http://thread.gmane.org/gmane.comp.python.numeric.general/37480/focus=37480
> http://thread.gmane.org/gmane.comp.python.numeric.general/38272/focus=38299
> http://docs.scipy.org/doc/numpy-1.10.1/neps/groupby_additions.html
>
> Travis's idea for a ufunc method 'reduceby' is more along the lines of
> what I was originally thinking. Just musing about it, it might cover a few
> small cases pandas groupby might not: it could work on arbitrary ufuncs,
> and over particular axes of multidimensional data. E.g., to sum over
> pixels from NxNx3 image data. But maybe pandas can cover the
> multidimensional case through additional index columns or with Panel.
>

xarray is now covering that area.

There are also recfunctions in numpy.lib that never got a lot of attention
or expansion.
There were plans to cover more of the matplotlib versions in numpy, but I
don't know and haven't checked what happened to that.
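
For anyone who hasn't bumped into them, recfunctions covers structured-array
manipulation along these lines (a minimal sketch; the field names here are
made up for illustration):

import numpy as np
from numpy.lib import recfunctions as rfn

a = np.array([(1, 2.0), (3, 4.0)], dtype=[('id', int), ('x', float)])
# add a field; usemask=False gives back a plain structured ndarray
b = rfn.append_fields(a, 'y', data=np.array([10.0, 20.0]), usemask=False)
# and drop one again
c = rfn.drop_fields(b, 'x')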

Josef



>
> Cheers,
> Allan
>
> On 02/15/2016 05:31 PM, Paul Hobson wrote:
> > Just for posterity -- any future readers to this thread who need to do
> > pandas-like operations on record arrays should look at matplotlib's mlab submodule.
> >
> > I've been in situations (::cough:: Esri production ::cough::) where I've
> > had one hand tied behind my back and unable to install pandas. mlab was
> > a big help there.
> >
> > https://goo.gl/M7Mi8B
> >
> > -paul
> >
> >
> >
> > On Mon, Feb 15, 2016 at 1:28 PM, Lluís Vilanova wrote:
> >
> > Benjamin Root writes:
> >
> > > Seems like you are talking about xarray:
> https://github.com/pydata/xarray
> >
> > Oh, I wasn't aware of xarray, but there's also this:
> >
> >
> >
> https://people.gso.ac.upc.edu/vilanova/doc/sciexp2/user_guide/data.html#basic-indexing
> >
> >
> https://people.gso.ac.upc.edu/vilanova/doc/sciexp2/user_guide/data.html#dimension-oblivious-indexing
> >
> >
> > Cheers,
> >   Lluis
> >
> >
> >
> > > Cheers!
> > > Ben Root
> >
> > > On Fri, Feb 12, 2016 at 9:40 AM, Sérgio wrote:
> >
> > > Hello,
> >
> >
> > > This is my first e-mail, I will try to make the idea simple.
> >
> >
> > > Similar to masked array it would be interesting to use a label
> > array to
> > > guide operations.
> >
> >
> > > Ex.:
> > > >>> x
> > > labelled_array(data =
> >
> > > [[0 1 2]
> > > [3 4 5]
> > > [6 7 8]],
> > > label =
> > > [[0 1 2]
> > > [0 1 2]
> > > [0 1 2]])
> >
> >
> > > >>> sum(x)
> > > array([9, 12, 15])
> >
> >
> > > The operations would create a new axis for label indexing.
> >
> >
> > > You could think of it as a collection of masks, one for each
> > label.
> >
> >
> > > I don't know a way to make something like this efficiently
> > without a loop.
> > > Just wondering...
> >
> >
> > > Sérgio.
> >
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] [Suggestion] Labelled Array

2016-02-19 Thread Allan Haldane
I also want to add a historical note here, that 'groupby' has been
discussed a couple times before.

Travis Oliphant even made an NEP for it, and Wes McKinney lightly hinted
at adding it to numpy.

http://thread.gmane.org/gmane.comp.python.numeric.general/37480/focus=37480
http://thread.gmane.org/gmane.comp.python.numeric.general/38272/focus=38299
http://docs.scipy.org/doc/numpy-1.10.1/neps/groupby_additions.html

Travis's idea for a ufunc method 'reduceby' is more along the lines of
what I was originally thinking. Just musing about it, it might cover a few
small cases pandas groupby might not: it could work on arbitrary ufuncs,
and over particular axes of multidimensional data. E.g., to sum over
pixels from NxNx3 image data. But maybe pandas can cover the
multidimensional case through additional index columns or with Panel.
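
Just to make that concrete, one rough way to get a reduceby-style sum with
today's numpy is to sort the flattened labels once and hand the group
boundaries to np.add.reduceat. This is only a sketch, not Travis's proposed
API, and the names are made up for illustration:

import numpy as np

def sum_by_label(image, label):
    # image: (N, N, 3) array; label: (N, N) integer array
    # returns one row of per-channel sums for each distinct label value
    flat_lab = label.ravel()
    flat_img = image.reshape(-1, image.shape[-1])
    order = flat_lab.argsort(kind='mergesort')   # group equal labels together
    srt = flat_lab[order]
    starts = np.r_[0, np.flatnonzero(srt[1:] != srt[:-1]) + 1]
    return np.add.reduceat(flat_img[order], starts, axis=0)

The same pattern works for any ufunc that has a reduceat (np.maximum,
np.minimum, ...), which is roughly the generality the NEP was after.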

Cheers,
Allan

On 02/15/2016 05:31 PM, Paul Hobson wrote:
> Just for posterity -- any future readers to this thread who need to do
> pandas-like operations on record arrays should look at matplotlib's mlab submodule.
> 
> I've been in situations (::cough:: Esri production ::cough::) where I've
> had one hand tied behind my back and unable to install pandas. mlab was
> a big help there.
> 
> https://goo.gl/M7Mi8B
> 
> -paul
> 
> 
> 
> On Mon, Feb 15, 2016 at 1:28 PM, Lluís Vilanova wrote:
> 
> Benjamin Root writes:
> 
> > Seems like you are talking about xarray: 
> https://github.com/pydata/xarray
> 
> Oh, I wasn't aware of xarray, but there's also this:
> 
>  
> 
> https://people.gso.ac.upc.edu/vilanova/doc/sciexp2/user_guide/data.html#basic-indexing
>  
> 
> https://people.gso.ac.upc.edu/vilanova/doc/sciexp2/user_guide/data.html#dimension-oblivious-indexing
> 
> 
> Cheers,
>   Lluis
> 
> 
> 
> > Cheers!
> > Ben Root
> 
> > On Fri, Feb 12, 2016 at 9:40 AM, Sérgio wrote:
> 
> > Hello,
> 
> 
> > This is my first e-mail, I will try to make the idea simple.
> 
> 
> > Similar to masked array it would be interesting to use a label
> array to
> > guide operations.
> 
> 
> > Ex.:
> > >>> x
> > labelled_array(data =
> 
> > [[0 1 2]
> > [3 4 5]
> > [6 7 8]],
> > label =
> > [[0 1 2]
> > [0 1 2]
> > [0 1 2]])
> 
> 
> > >>> sum(x)
> > array([9, 12, 15])
> 
> 
> > The operations would create a new axis for label indexing.
> 
> 
> > You could think of it as a collection of masks, one for each
> label.
> 
> 
> > I don't know a way to make something like this efficiently
> without a loop.
> > Just wondering...
> 
> 
> > Sérgio.
> 

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] [Suggestion] Labelled Array

2016-02-16 Thread Sérgio
Just something I tried with pandas:

>>> image
array([[[ 0,  1,  2,  3,  4],
[ 5,  6,  7,  8,  9],
[10, 11, 12, 13, 14],
[15, 16, 17, 18, 19]],

   [[20, 21, 22, 23, 24],
[25, 26, 27, 28, 29],
[30, 31, 32, 33, 34],
[35, 36, 37, 38, 39]],

   [[40, 41, 42, 43, 44],
[45, 46, 47, 48, 49],
[50, 51, 52, 53, 54],
[55, 56, 57, 58, 59]]])

>>> label
array([[0, 1, 2, 3, 4],
   [1, 2, 3, 4, 5],
   [2, 3, 4, 5, 6],
   [3, 4, 5, 6, 7]])

>>> dt = pd.DataFrame(np.vstack((label.ravel(), image.reshape(3, 20))).T)
>>> labelled_image = dt.groupby(0)

>>> labelled_image.mean().values
array([[ 0, 20, 40],
   [ 3, 23, 43],
   [ 6, 26, 46],
   [ 9, 29, 49],
   [10, 30, 50],
   [13, 33, 53],
   [16, 36, 56],
   [19, 39, 59]])
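
For comparison, the same per-label means without pandas (a rough sketch using
np.bincount on the image/label arrays above; it returns floats rather than the
integer display shown here):

>>> import numpy as np
>>> counts = np.bincount(label.ravel())
>>> means = np.column_stack([
...     np.bincount(label.ravel(), weights=plane.ravel()) / counts
...     for plane in image])        # only loops over the 3 planes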

Sergio


> Date: Sat, 13 Feb 2016 22:41:13 -0500
> From: Allan Haldane <allanhald...@gmail.com>
> To: numpy-discussion@scipy.org
> Subject: Re: [Numpy-discussion] [Suggestion] Labelled Array
> Message-ID: <56bff759.7010...@gmail.com>
> Content-Type: text/plain; charset=windows-1252; format=flowed
>
> Impressive!
>
> Possibly there's still a case for including a 'groupby' function in
> numpy itself since it's a generally useful operation, but I do see less
> of a need given the nice pandas functionality.
>
> At least, next time someone asks a stackoverflow question like the ones
> below someone should tell them to use pandas!
>
> (copied from my gist for future list reference).
>
> http://stackoverflow.com/questions/4373631/sum-array-by-number-in-numpy
>
> http://stackoverflow.com/questions/31483912/split-numpy-array-according-to-values-in-the-array-a-condition/31484134#31484134
>
> http://stackoverflow.com/questions/31863083/python-split-numpy-array-based-on-values-in-the-array
>
> http://stackoverflow.com/questions/28599405/splitting-an-array-into-two-smaller-arrays-in-python
>
> http://stackoverflow.com/questions/7662458/how-to-split-an-array-according-to-a-condition-in-numpy
>
> Allan
>
>
> On 02/13/2016 01:39 PM, Jeff Reback wrote:
> > In [10]: pd.options.display.max_rows=10
> >
> > In [13]: np.random.seed(1234)
> >
> > In [14]: c = np.random.randint(0,32,size=100000)
> >
> > In [15]: v = np.arange(100000)
> >
> > In [16]: df = DataFrame({'v' : v, 'c' : c})
> >
> > In [17]: df
> > Out[17]:
> >  c  v
> > 0  15  0
> > 1  19  1
> > 2   6  2
> > 3  21  3
> > 4  12  4
> > ........
> > 5   7  5
> > 6   2  6
> > 7  27  7
> > 8  28  8
> > 9   7  9
> >
> > [100000 rows x 2 columns]
> >
> > In [19]: df.groupby('c').count()
> > Out[19]:
> > v
> > c
> > 0   3136
> > 1   3229
> > 2   3093
> > 3   3121
> > 4   3041
> > ..   ...
> > 27  3128
> > 28  3063
> > 29  3147
> > 30  3073
> > 31  3090
> >
> > [32 rows x 1 columns]
> >
> > In [20]: %timeit df.groupby('c').count()
> > 100 loops, best of 3: 2 ms per loop
> >
> > In [21]: %timeit df.groupby('c').mean()
> > 100 loops, best of 3: 2.39 ms per loop
> >
> > In [22]: df.groupby('c').mean()
> > Out[22]:
> > v
> > c
> > 0   49883.384885
> > 1   50233.692165
> > 2   48634.116069
> > 3   50811.743992
> > 4   50505.368629
> > ..   ...
> > 27  49715.349425
> > 28  50363.501469
> > 29  50485.395933
> > 30  50190.155223
> > 31  50691.041748
> >
> > [32 rows x 1 columns]
> >
> >
> > On Sat, Feb 13, 2016 at 1:29 PM, <josef.p...@gmail.com> wrote:
> >
> >
> >
> > On Sat, Feb 13, 2016 at 1:01 PM, Allan Haldane
> > <allanhald...@gmail.com> wrote:
> >
> > Sorry, to reply to myself here, but looking at it with fresh
> > eyes maybe the performance of the naive version isn't too bad.
> > Here's a comparison of the naive vs a better implementation:
> >
> > def split_classes_naive(c, v):
> >  return [v[c == u] for u in unique(c)]
> >
> > def split_classes(c, v):
> >  perm = c.argsort()
> >  csrt = c[perm]
> >  div = where(csrt[1:] != csrt[:-1])[0] + 1
> >  return [v[x] for x in split(perm, div)]
> >
> >>> c = randint(0,32,size=100000)

Re: [Numpy-discussion] [Suggestion] Labelled Array

2016-02-15 Thread Paul Hobson
Just for posterity -- any future readers to this thread who need to do
pandas-like operations on record arrays should look at matplotlib's mlab submodule.

I've been in situations (::cough:: Esri production ::cough::) where I've
had one hand tied behind my back and unable to install pandas. mlab was a
big help there.

https://goo.gl/M7Mi8B

-paul



On Mon, Feb 15, 2016 at 1:28 PM, Lluís Vilanova  wrote:

> Benjamin Root writes:
>
> > Seems like you are talking about xarray:
> https://github.com/pydata/xarray
>
> Oh, I wasn't aware of xarray, but there's also this:
>
>
> https://people.gso.ac.upc.edu/vilanova/doc/sciexp2/user_guide/data.html#basic-indexing
>
> https://people.gso.ac.upc.edu/vilanova/doc/sciexp2/user_guide/data.html#dimension-oblivious-indexing
>
>
> Cheers,
>   Lluis
>
>
>
> > Cheers!
> > Ben Root
>
> > On Fri, Feb 12, 2016 at 9:40 AM, Sérgio  wrote:
>
> > Hello,
>
>
> > This is my first e-mail, I will try to make the idea simple.
>
>
> > Similar to masked array it would be interesting to use a label array
> to
> > guide operations.
>
>
> > Ex.:
> > >>> x
> > labelled_array(data =
>
> > [[0 1 2]
> > [3 4 5]
> > [6 7 8]],
> > label =
> > [[0 1 2]
> > [0 1 2]
> > [0 1 2]])
>
>
> > >>> sum(x)
> > array([9, 12, 15])
>
>
> > The operations would create a new axis for label indexing.
>
>
> > You could think of it as a collection of masks, one for each label.
>
>
> > I don't know a way to make something like this efficiently without a
> loop.
> > Just wondering...
>
>
> > Sérgio.
>
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] [Suggestion] Labelled Array

2016-02-15 Thread Lluís Vilanova
Benjamin Root writes:

> Seems like you are talking about xarray: https://github.com/pydata/xarray

Oh, I wasn't aware of xarray, but there's also this:

  
https://people.gso.ac.upc.edu/vilanova/doc/sciexp2/user_guide/data.html#basic-indexing
  
https://people.gso.ac.upc.edu/vilanova/doc/sciexp2/user_guide/data.html#dimension-oblivious-indexing


Cheers,
  Lluis



> Cheers!
> Ben Root

> On Fri, Feb 12, 2016 at 9:40 AM, Sérgio  wrote:

> Hello,


> This is my first e-mail, I will try to make the idea simple.


> Similar to masked array it would be interesting to use a label array to
> guide operations.


> Ex.:
> >>> x
> labelled_array(data = 

> [[0 1 2]
> [3 4 5]
> [6 7 8]],
> label =
> [[0 1 2]
> [0 1 2]
> [0 1 2]])


> >>> sum(x)
> array([9, 12, 15])


> The operations would create a new axis for label indexing.


> You could think of it as a collection of masks, one for each label.


> I don't know a way to make something like this efficiently without a loop.
> Just wondering...


> Sérgio.

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] [Suggestion] Labelled Array

2016-02-13 Thread Allan Haldane
I've had a pretty similar idea for a new indexing function 
'split_classes' which would help in your case, which essentially does


def split_classes(c, v):
    return [v[c == u] for u in unique(c)]

Your example could be coded as

>>> [sum(c) for c in split_classes(label, data)]
[9, 12, 15]

I feel I've come across the need for such a function often enough that 
it might be generally useful to people as part of numpy. The 
implementation of split_classes above has pretty poor performance 
because it creates many temporary boolean arrays, so my plan for a PR 
was to have a speedy version of it that uses a single pass through v.

(I often wanted to use this function on large datasets).

If anyone has any comments on the idea (good idea. bad idea?) I'd love 
to hear.


I have some further notes and examples here: 
https://gist.github.com/ahaldane/1e673d2fe6ffe0be4f21


Allan

On 02/12/2016 09:40 AM, Sérgio wrote:

Hello,

This is my first e-mail, I will try to make the idea simple.

Similar to masked array it would be interesting to use a label array to
guide operations.

Ex.:
 >>> x
labelled_array(data =
  [[0 1 2]
  [3 4 5]
  [6 7 8]],
 label =
  [[0 1 2]
  [0 1 2]
  [0 1 2]])

 >>> sum(x)
array([9, 12, 15])

The operations would create a new axis for label indexing.

You could think of it as a collection of masks, one for each label.

I don't know a way to make something like this efficiently without a
loop. Just wondering...

Sérgio.





___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] [Suggestion] Labelled Array

2016-02-13 Thread Allan Haldane
Sorry, to reply to myself here, but looking at it with fresh eyes maybe 
the performance of the naive version isn't too bad. Here's a comparison 
of the naive vs a better implementation:


def split_classes_naive(c, v):
    return [v[c == u] for u in unique(c)]

def split_classes(c, v):
    perm = c.argsort()
    csrt = c[perm]
    div = where(csrt[1:] != csrt[:-1])[0] + 1
    return [v[x] for x in split(perm, div)]

>>> c = randint(0,32,size=100000)
>>> v = arange(100000)
>>> %timeit split_classes_naive(c,v)
100 loops, best of 3: 8.4 ms per loop
>>> %timeit split_classes(c,v)
100 loops, best of 3: 4.79 ms per loop

In any case, maybe it is useful to Sergio or others.

Allan

On 02/13/2016 12:11 PM, Allan Haldane wrote:

I've had a pretty similar idea for a new indexing function
'split_classes' which would help in your case, which essentially does

 def split_classes(c, v):
 return [v[c == u] for u in unique(c)]

Your example could be coded as

 >>> [sum(c) for c in split_classes(label, data)]
 [9, 12, 15]

I feel I've come across the need for such a function often enough that
it might be generally useful to people as part of numpy. The
implementation of split_classes above has pretty poor performance
because it creates many temporary boolean arrays, so my plan for a PR
was to have a speedy version of it that uses a single pass through v.
(I often wanted to use this function on large datasets).

If anyone has any comments on the idea (good idea. bad idea?) I'd love
to hear.

I have some further notes and examples here:
https://gist.github.com/ahaldane/1e673d2fe6ffe0be4f21

Allan

On 02/12/2016 09:40 AM, Sérgio wrote:

Hello,

This is my first e-mail, I will try to make the idea simple.

Similar to masked array it would be interesting to use a label array to
guide operations.

Ex.:
 >>> x
labelled_array(data =
  [[0 1 2]
  [3 4 5]
  [6 7 8]],
 label =
  [[0 1 2]
  [0 1 2]
  [0 1 2]])

 >>> sum(x)
array([9, 12, 15])

The operations would create a new axis for label indexing.

You could think of it as a collection of masks, one for each label.

I don't know a way to make something like this efficiently without a
loop. Just wondering...

Sérgio.







___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] [Suggestion] Labelled Array

2016-02-13 Thread Nathaniel Smith
I believe this is basically a groupby, which is one of pandas's core
competencies... even if numpy were to add some utilities for this kind of
thing, then I doubt we'd do as well as them, so you might check whether
pandas works for you first :-)
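
For the example quoted below, that would look something like this (a minimal
sketch, not benchmarked):

import numpy as np
import pandas as pd

data = np.arange(9).reshape(3, 3)
label = np.tile(np.arange(3), (3, 1))
df = pd.DataFrame({'label': label.ravel(), 'value': data.ravel()})
df.groupby('label')['value'].sum().values   # -> array([ 9, 12, 15])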
On Feb 12, 2016 6:40 AM, "Sérgio"  wrote:

> Hello,
>
> This is my first e-mail, I will try to make the idea simple.
>
> Similar to masked array it would be interesting to use a label array to
> guide operations.
>
> Ex.:
> >>> x
> labelled_array(data =
>  [[0 1 2]
>  [3 4 5]
>  [6 7 8]],
> label =
>  [[0 1 2]
>  [0 1 2]
>  [0 1 2]])
>
> >>> sum(x)
> array([9, 12, 15])
>
> The operations would create a new axis for label indexing.
>
> You could think of it as a collection of masks, one for each label.
>
> I don't know a way to make something like this efficiently without a loop.
> Just wondering...
>
> Sérgio.
>
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] [Suggestion] Labelled Array

2016-02-13 Thread josef.pktd
On Sat, Feb 13, 2016 at 1:01 PM, Allan Haldane 
wrote:

> Sorry, to reply to myself here, but looking at it with fresh eyes maybe
> the performance of the naive version isn't too bad. Here's a comparison of
> the naive vs a better implementation:
>
> def split_classes_naive(c, v):
> return [v[c == u] for u in unique(c)]
>
> def split_classes(c, v):
> perm = c.argsort()
> csrt = c[perm]
> div = where(csrt[1:] != csrt[:-1])[0] + 1
> return [v[x] for x in split(perm, div)]
>
> >>> c = randint(0,32,size=100000)
> >>> v = arange(100000)
> >>> %timeit split_classes_naive(c,v)
> 100 loops, best of 3: 8.4 ms per loop
> >>> %timeit split_classes(c,v)
> 100 loops, best of 3: 4.79 ms per loop
>

The use cases I recently started to target for similar things are 1 million
or more rows and 10000 uniques in the labels.
The second version should be faster for a large number of uniques, I guess.

Overall numpy is falling far behind pandas in terms of simple groupby
operations. bincount and histogram (IIRC) worked for some cases but are
rather limited.

reduceat looks nice for cases where it applies.

In contrast to the full sized labels in the original post, I only know of
applications where the labels are 1-D corresponding to rows or columns.

Josef



>
> In any case, maybe it is useful to Sergio or others.
>
> Allan
>
>
> On 02/13/2016 12:11 PM, Allan Haldane wrote:
>
>> I've had a pretty similar idea for a new indexing function
>> 'split_classes' which would help in your case, which essentially does
>>
>>  def split_classes(c, v):
>>  return [v[c == u] for u in unique(c)]
>>
>> Your example could be coded as
>>
>>  >>> [sum(c) for c in split_classes(label, data)]
>>  [9, 12, 15]
>>
>> I feel I've come across the need for such a function often enough that
>> it might be generally useful to people as part of numpy. The
>> implementation of split_classes above has pretty poor performance
>> because it creates many temporary boolean arrays, so my plan for a PR
>> was to have a speedy version of it that uses a single pass through v.
>> (I often wanted to use this function on large datasets).
>>
>> If anyone has any comments on the idea (good idea. bad idea?) I'd love
>> to hear.
>>
>> I have some further notes and examples here:
>> https://gist.github.com/ahaldane/1e673d2fe6ffe0be4f21
>>
>> Allan
>>
>> On 02/12/2016 09:40 AM, Sérgio wrote:
>>
>>> Hello,
>>>
>>> This is my first e-mail, I will try to make the idea simple.
>>>
>>> Similar to masked array it would be interesting to use a label array to
>>> guide operations.
>>>
>>> Ex.:
>>>  >>> x
>>> labelled_array(data =
>>>   [[0 1 2]
>>>   [3 4 5]
>>>   [6 7 8]],
>>>  label =
>>>   [[0 1 2]
>>>   [0 1 2]
>>>   [0 1 2]])
>>>
>>>  >>> sum(x)
>>> array([9, 12, 15])
>>>
>>> The operations would create a new axis for label indexing.
>>>
>>> You could think of it as a collection of masks, one for each label.
>>>
>>> I don't know a way to make something like this efficiently without a
>>> loop. Just wondering...
>>>
>>> Sérgio.
>>>
>>>
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] [Suggestion] Labelled Array

2016-02-13 Thread Jeff Reback
In [10]: pd.options.display.max_rows=10

In [13]: np.random.seed(1234)

In [14]: c = np.random.randint(0,32,size=100000)

In [15]: v = np.arange(100000)

In [16]: df = DataFrame({'v' : v, 'c' : c})

In [17]: df
Out[17]:
c  v
0  15  0
1  19  1
2   6  2
3  21  3
4  12  4
........
5   7  5
6   2  6
7  27  7
8  28  8
9   7  9

[100000 rows x 2 columns]

In [19]: df.groupby('c').count()
Out[19]:
   v
c
0   3136
1   3229
2   3093
3   3121
4   3041
..   ...
27  3128
28  3063
29  3147
30  3073
31  3090

[32 rows x 1 columns]

In [20]: %timeit df.groupby('c').count()
100 loops, best of 3: 2 ms per loop

In [21]: %timeit df.groupby('c').mean()
100 loops, best of 3: 2.39 ms per loop

In [22]: df.groupby('c').mean()
Out[22]:
   v
c
0   49883.384885
1   50233.692165
2   48634.116069
3   50811.743992
4   50505.368629
..   ...
27  49715.349425
28  50363.501469
29  50485.395933
30  50190.155223
31  50691.041748

[32 rows x 1 columns]


On Sat, Feb 13, 2016 at 1:29 PM,  wrote:

>
>
> On Sat, Feb 13, 2016 at 1:01 PM, Allan Haldane 
> wrote:
>
>> Sorry, to reply to myself here, but looking at it with fresh eyes maybe
>> the performance of the naive version isn't too bad. Here's a comparison of
>> the naive vs a better implementation:
>>
>> def split_classes_naive(c, v):
>> return [v[c == u] for u in unique(c)]
>>
>> def split_classes(c, v):
>> perm = c.argsort()
>> csrt = c[perm]
>> div = where(csrt[1:] != csrt[:-1])[0] + 1
>> return [v[x] for x in split(perm, div)]
>>
>> >>> c = randint(0,32,size=100000)
>> >>> v = arange(100000)
>> >>> %timeit split_classes_naive(c,v)
>> 100 loops, best of 3: 8.4 ms per loop
>> >>> %timeit split_classes(c,v)
>> 100 loops, best of 3: 4.79 ms per loop
>>
>
> The use cases I recently started to target for similar things are 1 million
> or more rows and 10000 uniques in the labels.
> The second version should be faster for a large number of uniques, I guess.
>
> Overall numpy is falling far behind pandas in terms of simple groupby
> operations. bincount and histogram (IIRC) worked for some cases but are
> rather limited.
>
> reduceat looks nice for cases where it applies.
>
> In contrast to the full sized labels in the original post, I only know of
> applications where the labels are 1-D corresponding to rows or columns.
>
> Josef
>
>
>
>>
>> In any case, maybe it is useful to Sergio or others.
>>
>> Allan
>>
>>
>> On 02/13/2016 12:11 PM, Allan Haldane wrote:
>>
>>> I've had a pretty similar idea for a new indexing function
>>> 'split_classes' which would help in your case, which essentially does
>>>
>>>  def split_classes(c, v):
>>>  return [v[c == u] for u in unique(c)]
>>>
>>> Your example could be coded as
>>>
>>>  >>> [sum(c) for c in split_classes(label, data)]
>>>  [9, 12, 15]
>>>
>>> I feel I've come across the need for such a function often enough that
>>> it might be generally useful to people as part of numpy. The
>>> implementation of split_classes above has pretty poor performance
>>> because it creates many temporary boolean arrays, so my plan for a PR
>>> was to have a speedy version of it that uses a single pass through v.
>>> (I often wanted to use this function on large datasets).
>>>
>>> If anyone has any comments on the idea (good idea. bad idea?) I'd love
>>> to hear.
>>>
>>> I have some further notes and examples here:
>>> https://gist.github.com/ahaldane/1e673d2fe6ffe0be4f21
>>>
>>> Allan
>>>
>>> On 02/12/2016 09:40 AM, Sérgio wrote:
>>>
 Hello,

 This is my first e-mail, I will try to make the idea simple.

 Similar to masked array it would be interesting to use a label array to
 guide operations.

 Ex.:
  >>> x
 labelled_array(data =
   [[0 1 2]
   [3 4 5]
   [6 7 8]],
  label =
   [[0 1 2]
   [0 1 2]
   [0 1 2]])

  >>> sum(x)
 array([9, 12, 15])

 The operations would create a new axis for label indexing.

 You could think of it as a collection of masks, one for each label.

 I don't know a way to make something like this efficiently without a
 loop. Just wondering...

 Sérgio.


___
NumPy-Discussion mailing list

Re: [Numpy-discussion] [Suggestion] Labelled Array

2016-02-13 Thread Jeff Reback
These operations get slower as the number of groups increases, but with a
faster function (e.g. the standard ones, which are cythonized), the constant
on the increase is pretty low.

In [23]: c = np.random.randint(0,10000,size=100000)

In [24]: df = DataFrame({'v' : v, 'c' : c})

In [25]: %timeit df.groupby('c').count()
100 loops, best of 3: 3.18 ms per loop

In [26]: len(df.groupby('c').count())
Out[26]: 10000

In [27]: df.groupby('c').count()
Out[27]:
   v
c
0  9
1 11
2  7
3  8
4 16
...   ..
9995  11
9996  13
9997  13
9998   7
9999  10

[10000 rows x 1 columns]


On Sat, Feb 13, 2016 at 1:39 PM, Jeff Reback  wrote:

> In [10]: pd.options.display.max_rows=10
>
> In [13]: np.random.seed(1234)
>
> In [14]: c = np.random.randint(0,32,size=100000)
>
> In [15]: v = np.arange(100000)
>
> In [16]: df = DataFrame({'v' : v, 'c' : c})
>
> In [17]: df
> Out[17]:
> c  v
> 0  15  0
> 1  19  1
> 2   6  2
> 3  21  3
> 4  12  4
> ........
> 5   7  5
> 6   2  6
> 7  27  7
> 8  28  8
> 9   7  9
>
> [100000 rows x 2 columns]
>
> In [19]: df.groupby('c').count()
> Out[19]:
>v
> c
> 0   3136
> 1   3229
> 2   3093
> 3   3121
> 4   3041
> ..   ...
> 27  3128
> 28  3063
> 29  3147
> 30  3073
> 31  3090
>
> [32 rows x 1 columns]
>
> In [20]: %timeit df.groupby('c').count()
> 100 loops, best of 3: 2 ms per loop
>
> In [21]: %timeit df.groupby('c').mean()
> 100 loops, best of 3: 2.39 ms per loop
>
> In [22]: df.groupby('c').mean()
> Out[22]:
>v
> c
> 0   49883.384885
> 1   50233.692165
> 2   48634.116069
> 3   50811.743992
> 4   50505.368629
> ..   ...
> 27  49715.349425
> 28  50363.501469
> 29  50485.395933
> 30  50190.155223
> 31  50691.041748
>
> [32 rows x 1 columns]
>
>
> On Sat, Feb 13, 2016 at 1:29 PM,  wrote:
>
>>
>>
>> On Sat, Feb 13, 2016 at 1:01 PM, Allan Haldane 
>> wrote:
>>
>>> Sorry, to reply to myself here, but looking at it with fresh eyes maybe
>>> the performance of the naive version isn't too bad. Here's a comparison of
>>> the naive vs a better implementation:
>>>
>>> def split_classes_naive(c, v):
>>> return [v[c == u] for u in unique(c)]
>>>
>>> def split_classes(c, v):
>>> perm = c.argsort()
>>> csrt = c[perm]
>>> div = where(csrt[1:] != csrt[:-1])[0] + 1
>>> return [v[x] for x in split(perm, div)]
>>>
>>> >>> c = randint(0,32,size=100000)
>>> >>> v = arange(100000)
>>> >>> %timeit split_classes_naive(c,v)
>>> 100 loops, best of 3: 8.4 ms per loop
>>> >>> %timeit split_classes(c,v)
>>> 100 loops, best of 3: 4.79 ms per loop
>>>
>>
>> The use cases I recently started to target for similar things are 1 million
>> or more rows and 10000 uniques in the labels.
>> The second version should be faster for a large number of uniques, I guess.
>>
>> Overall numpy is falling far behind pandas in terms of simple groupby
>> operations. bincount and histogram (IIRC) worked for some cases but are
>> rather limited.
>>
>> reduceat looks nice for cases where it applies.
>>
>> In contrast to the full sized labels in the original post, I only know of
>> applications where the labels are 1-D corresponding to rows or columns.
>>
>> Josef
>>
>>
>>
>>>
>>> In any case, maybe it is useful to Sergio or others.
>>>
>>> Allan
>>>
>>>
>>> On 02/13/2016 12:11 PM, Allan Haldane wrote:
>>>
 I've had a pretty similar idea for a new indexing function
 'split_classes' which would help in your case, which essentially does

  def split_classes(c, v):
  return [v[c == u] for u in unique(c)]

 Your example could be coded as

  >>> [sum(c) for c in split_classes(label, data)]
  [9, 12, 15]

 I feel I've come across the need for such a function often enough that
 it might be generally useful to people as part of numpy. The
 implementation of split_classes above has pretty poor performance
 because it creates many temporary boolean arrays, so my plan for a PR
 was to have a speedy version of it that uses a single pass through v.
 (I often wanted to use this function on large datasets).

 If anyone has any comments on the idea (good idea. bad idea?) I'd love
 to hear.

 I have some further notes and examples here:
 https://gist.github.com/ahaldane/1e673d2fe6ffe0be4f21

 Allan

 On 02/12/2016 09:40 AM, Sérgio wrote:

> Hello,
>
> This is my first e-mail, I will try to make the idea simple.
>
> Similar to masked array it would be interesting to use a label array to
> guide operations.
>
> Ex.:
>  >>> x
> labelled_array(data =
>   [[0 1 2]
>   [3 4 5]
>   [6 7 8]],
>  label =
>   [[0 1 2]
>   [0 1 2]
>   [0 1 2]])
>
>  >>> sum(x)
> array([9, 12, 15])
>
> The operations would 

Re: [Numpy-discussion] [Suggestion] Labelled Array

2016-02-13 Thread josef.pktd
On Sat, Feb 13, 2016 at 1:42 PM, Jeff Reback  wrote:

> These operations get slower as the number of groups increases, but with a
> faster function (e.g. the standard ones, which are cythonized), the
> constant on the increase is pretty low.
>
> In [23]: c = np.random.randint(0,10000,size=100000)
>
> In [24]: df = DataFrame({'v' : v, 'c' : c})
>
> In [25]: %timeit df.groupby('c').count()
> 100 loops, best of 3: 3.18 ms per loop
>
> In [26]: len(df.groupby('c').count())
> Out[26]: 10000
>
> In [27]: df.groupby('c').count()
> Out[27]:
>v
> c
> 0  9
> 1 11
> 2  7
> 3  8
> 4 16
> ...   ..
> 9995  11
> 9996  13
> 9997  13
> 9998   7
> 9999  10
>
> [10000 rows x 1 columns]
>
>
One other difference across use cases is whether this is a single operation,
or we want to optimize the data format for a large number of different
calculations. (We have both cases in statsmodels.)

In the latter case it's worth spending some extra computational effort on
rearranging the data to be either sorted or in lists of arrays (I guess,
without having done any timings).
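
Something along these lines, as a rough sketch (pay the argsort once, then
reuse the grouping for many reductions; the names are made up for
illustration):

import numpy as np

class GroupedIndex(object):
    def __init__(self, labels):
        # sort once; every later reduction reuses this permutation
        self.perm = np.argsort(labels, kind='mergesort')
        srt = labels[self.perm]
        self.starts = np.r_[0, np.flatnonzero(srt[1:] != srt[:-1]) + 1]
        self.labels = srt[self.starts]   # one representative label per group

    def reduce(self, values, ufunc=np.add):
        # label-wise reduction of a 1-D array aligned with `labels`
        return ufunc.reduceat(values[self.perm], self.starts)

# gi = GroupedIndex(c)           # pay the argsort once ...
# gi.reduce(v)                   # ... then reuse it for many reductions
# gi.reduce(v, np.maximum)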

Josef




>
> On Sat, Feb 13, 2016 at 1:39 PM, Jeff Reback  wrote:
>
>> In [10]: pd.options.display.max_rows=10
>>
>> In [13]: np.random.seed(1234)
>>
>> In [14]: c = np.random.randint(0,32,size=100000)
>>
>> In [15]: v = np.arange(100000)
>>
>> In [16]: df = DataFrame({'v' : v, 'c' : c})
>>
>> In [17]: df
>> Out[17]:
>> c  v
>> 0  15  0
>> 1  19  1
>> 2   6  2
>> 3  21  3
>> 4  12  4
>> ........
>> 5   7  5
>> 6   2  6
>> 7  27  7
>> 8  28  8
>> 9   7  9
>>
>> [100000 rows x 2 columns]
>>
>> In [19]: df.groupby('c').count()
>> Out[19]:
>>v
>> c
>> 0   3136
>> 1   3229
>> 2   3093
>> 3   3121
>> 4   3041
>> ..   ...
>> 27  3128
>> 28  3063
>> 29  3147
>> 30  3073
>> 31  3090
>>
>> [32 rows x 1 columns]
>>
>> In [20]: %timeit df.groupby('c').count()
>> 100 loops, best of 3: 2 ms per loop
>>
>> In [21]: %timeit df.groupby('c').mean()
>> 100 loops, best of 3: 2.39 ms per loop
>>
>> In [22]: df.groupby('c').mean()
>> Out[22]:
>>v
>> c
>> 0   49883.384885
>> 1   50233.692165
>> 2   48634.116069
>> 3   50811.743992
>> 4   50505.368629
>> ..   ...
>> 27  49715.349425
>> 28  50363.501469
>> 29  50485.395933
>> 30  50190.155223
>> 31  50691.041748
>>
>> [32 rows x 1 columns]
>>
>>
>> On Sat, Feb 13, 2016 at 1:29 PM,  wrote:
>>
>>>
>>>
>>> On Sat, Feb 13, 2016 at 1:01 PM, Allan Haldane 
>>> wrote:
>>>
 Sorry, to reply to myself here, but looking at it with fresh eyes maybe
 the performance of the naive version isn't too bad. Here's a comparison of
 the naive vs a better implementation:

 def split_classes_naive(c, v):
 return [v[c == u] for u in unique(c)]

 def split_classes(c, v):
 perm = c.argsort()
 csrt = c[perm]
 div = where(csrt[1:] != csrt[:-1])[0] + 1
 return [v[x] for x in split(perm, div)]

 >>> c = randint(0,32,size=100000)
 >>> v = arange(100000)
 >>> %timeit split_classes_naive(c,v)
 100 loops, best of 3: 8.4 ms per loop
 >>> %timeit split_classes(c,v)
 100 loops, best of 3: 4.79 ms per loop

>>>
>>> The use cases I recently started to target for similar things are 1
>>> million or more rows and 10000 uniques in the labels.
>>> The second version should be faster for a large number of uniques, I guess.
>>>
>>> Overall numpy is falling far behind pandas in terms of simple groupby
>>> operations. bincount and histogram (IIRC) worked for some cases but are
>>> rather limited.
>>>
>>> reduceat looks nice for cases where it applies.
>>>
>>> In contrast to the full sized labels in the original post, I only know
>>> of applications where the labels are 1-D corresponding to rows or columns.
>>>
>>> Josef
>>>
>>>
>>>

 In any case, maybe it is useful to Sergio or others.

 Allan


 On 02/13/2016 12:11 PM, Allan Haldane wrote:

> I've had a pretty similar idea for a new indexing function
> 'split_classes' which would help in your case, which essentially does
>
>  def split_classes(c, v):
>  return [v[c == u] for u in unique(c)]
>
> Your example could be coded as
>
>  >>> [sum(c) for c in split_classes(label, data)]
>  [9, 12, 15]
>
> I feel I've come across the need for such a function often enough that
> it might be generally useful to people as part of numpy. The
> implementation of split_classes above has pretty poor performance
> because it creates many temporary boolean arrays, so my plan for a PR
> was to have a speedy version of it that uses a single pass through v.
> (I often wanted to use this function on large datasets).
>
> If anyone has any comments on the idea (good idea. bad idea?) I'd love

Re: [Numpy-discussion] [Suggestion] Labelled Array

2016-02-13 Thread Allan Haldane

Impressive!

Possibly there's still a case for including a 'groupby' function in 
numpy itself since it's a generally useful operation, but I do see less 
of a need given the nice pandas functionality.


At least, next time someone asks a stackoverflow question like the ones 
below someone should tell them to use pandas!


(copied from my gist for future list reference).

http://stackoverflow.com/questions/4373631/sum-array-by-number-in-numpy
http://stackoverflow.com/questions/31483912/split-numpy-array-according-to-values-in-the-array-a-condition/31484134#31484134
http://stackoverflow.com/questions/31863083/python-split-numpy-array-based-on-values-in-the-array
http://stackoverflow.com/questions/28599405/splitting-an-array-into-two-smaller-arrays-in-python
http://stackoverflow.com/questions/7662458/how-to-split-an-array-according-to-a-condition-in-numpy

Allan


On 02/13/2016 01:39 PM, Jeff Reback wrote:

In [10]: pd.options.display.max_rows=10

In [13]: np.random.seed(1234)

In [14]: c = np.random.randint(0,32,size=100000)

In [15]: v = np.arange(100000)

In [16]: df = DataFrame({'v' : v, 'c' : c})

In [17]: df
Out[17]:
 c  v
0  15  0
1  19  1
2   6  2
3  21  3
4  12  4
........
5   7  5
6   2  6
7  27  7
8  28  8
9   7  9

[100000 rows x 2 columns]

In [19]: df.groupby('c').count()
Out[19]:
v
c
0   3136
1   3229
2   3093
3   3121
4   3041
..   ...
27  3128
28  3063
29  3147
30  3073
31  3090

[32 rows x 1 columns]

In [20]: %timeit df.groupby('c').count()
100 loops, best of 3: 2 ms per loop

In [21]: %timeit df.groupby('c').mean()
100 loops, best of 3: 2.39 ms per loop

In [22]: df.groupby('c').mean()
Out[22]:
v
c
0   49883.384885
1   50233.692165
2   48634.116069
3   50811.743992
4   50505.368629
..   ...
27  49715.349425
28  50363.501469
29  50485.395933
30  50190.155223
31  50691.041748

[32 rows x 1 columns]


On Sat, Feb 13, 2016 at 1:29 PM,  wrote:



On Sat, Feb 13, 2016 at 1:01 PM, Allan Haldane
wrote:

Sorry, to reply to myself here, but looking at it with fresh
eyes maybe the performance of the naive version isn't too bad.
Here's a comparison of the naive vs a better implementation:

def split_classes_naive(c, v):
 return [v[c == u] for u in unique(c)]

def split_classes(c, v):
 perm = c.argsort()
 csrt = c[perm]
 div = where(csrt[1:] != csrt[:-1])[0] + 1
 return [v[x] for x in split(perm, div)]

> >>> c = randint(0,32,size=100000)
> >>> v = arange(100000)
>>> %timeit split_classes_naive(c,v)
100 loops, best of 3: 8.4 ms per loop
>>> %timeit split_classes(c,v)
100 loops, best of 3: 4.79 ms per loop


The use cases I recently started to target for similar things are 1
million or more rows and 10000 uniques in the labels.
The second version should be faster for a large number of uniques, I
guess.

Overall numpy is falling far behind pandas in terms of simple
groupby operations. bincount and histogram (IIRC) worked for some
cases but are rather limited.

reduceat looks nice for cases where it applies.

In contrast to the full sized labels in the original post, I only
know of applications where the labels are 1-D corresponding to rows
or columns.

Josef


In any case, maybe it is useful to Sergio or others.

Allan


On 02/13/2016 12:11 PM, Allan Haldane wrote:

I've had a pretty similar idea for a new indexing function
'split_classes' which would help in your case, which
essentially does

  def split_classes(c, v):
  return [v[c == u] for u in unique(c)]

Your example could be coded as

  >>> [sum(c) for c in split_classes(label, data)]
  [9, 12, 15]

I feel I've come across the need for such a function often
enough that
it might be generally useful to people as part of numpy. The
implementation of split_classes above has pretty poor
performance
because it creates many temporary boolean arrays, so my plan
for a PR
was to have a speedy version of it that uses a single pass
through v.
(I often wanted to use this function on large datasets).

If anyone has any comments on the idea (good idea. bad
idea?) I'd love
to hear.

I have some further notes and examples here:
https://gist.github.com/ahaldane/1e673d2fe6ffe0be4f21

Allan

On 02/12/2016 09:40 AM, Sérgio wrote:

Hello,

This is my first e-mail, I 

Re: [Numpy-discussion] [Suggestion] Labelled Array

2016-02-12 Thread Benjamin Root
Seems like you are talking about xarray: https://github.com/pydata/xarray

Cheers!
Ben Root

On Fri, Feb 12, 2016 at 9:40 AM, Sérgio  wrote:

> Hello,
>
> This is my first e-mail, I will try to make the idea simple.
>
> Similar to masked array it would be interesting to use a label array to
> guide operations.
>
> Ex.:
> >>> x
> labelled_array(data =
>  [[0 1 2]
>  [3 4 5]
>  [6 7 8]],
> label =
>  [[0 1 2]
>  [0 1 2]
>  [0 1 2]])
>
> >>> sum(x)
> array([9, 12, 15])
>
> The operations would create a new axis for label indexing.
>
> You could think of it as a collection of masks, one for each label.
>
> I don't know a way to make something like this efficiently without a loop.
> Just wondering...
>
> Sérgio.
>
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] [Suggestion] Labelled Array

2016-02-12 Thread Benjamin Root
Re-reading your post, I see you are talking about something different. Not
exactly sure what your use-case is.

Ben Root

On Fri, Feb 12, 2016 at 9:49 AM, Benjamin Root  wrote:

> Seems like you are talking about xarray: https://github.com/pydata/xarray
>
> Cheers!
> Ben Root
>
> On Fri, Feb 12, 2016 at 9:40 AM, Sérgio  wrote:
>
>> Hello,
>>
>> This is my first e-mail, I will try to make the idea simple.
>>
>> Similar to masked array it would be interesting to use a label array to
>> guide operations.
>>
>> Ex.:
>> >>> x
>> labelled_array(data =
>>  [[0 1 2]
>>  [3 4 5]
>>  [6 7 8]],
>> label =
>>  [[0 1 2]
>>  [0 1 2]
>>  [0 1 2]])
>>
>> >>> sum(x)
>> array([9, 12, 15])
>>
>> The operations would create a new axis for label indexing.
>>
>> You could think of it as a collection of masks, one for each label.
>>
>> I don't know a way to make something like this efficiently without a
>> loop. Just wondering...
>>
>> Sérgio.
>>
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion