Re: [Numpy-discussion] how to name "contagious" keyword in np.ma.convolve

2016-10-18 Thread Allan Haldane
On 10/17/2016 01:01 PM, Pierre Haessig wrote:
> Le 16/10/2016 à 11:52, Hanno Klemm a écrit :
>> When I have similar situations, I usually interpolate between the valid 
>> values. I assume there are a lot of use cases for convolutions but I have 
>> difficulties imagining that ignoring a missing value and, for the purpose of 
>> the computation, treating it as zero is useful in many of them. 
> When estimating the autocorrelation of a signal, it makes sense to drop
> missing pairs of values. Only, in this use case, it opens the question of
> whether or not to correct for the number of missing elements when
> computing the mean. I don't remember what the R function "acf" does.
> 
> 
> Also, coming back to the initial question, I feel that it is necessary
> that the name "mask" (or "na" or similar) appears in the parameter name.
> Otherwise, people will wonder: "what on earth is contagious/being
> propagated?"

Based on feedback so far, I think "propagate_mask" sounds like the best
word to use. Let's go with that.

As for whether it should default to "True" or "False", the arguments I
see are:

 * False, because that is the way most functions like `np.ma.sum`
   already work, as well as matlab and octave's similar "nanconv".

 * True, because its effects are more visible and might lead to fewer
   surprises. The "False" case seems like it is often not what the user
   intended. E.g., it affects the overall normalization of normalized
   kernels, and the choice of 0 seems arbitrary.

If no one says anything, I'd probably go with True.

Allan
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] how to name "contagious" keyword in np.ma.convolve

2016-10-18 Thread Allan Haldane
On 10/16/2016 05:52 AM, Hanno Klemm wrote:
> 
> 
>> On 16 Oct 2016, at 03:21, Allan Haldane <allanhald...@gmail.com> wrote:
>>
>>> On 10/14/2016 07:49 PM, Juan Nunez-Iglesias wrote:
>>> +1 for propagate_mask. That is the only proposal that immediately makes
>>> sense to me. "contagious" may be cute but I think approximately 0% of
>>> users would guess its purpose on first use.
>>>
>>> Can you elaborate on what happens with the masks exactly? I didn't quite
>>> get why propagate_mask=False was unintuitive. My expectation is that any
>>> mask present in the input will not be set in the output, but the mask
>>> will be "respected" by the function.
>>
>> Here's an illustration of how the PR currently works with convolve, using 
>> the name "propagate_mask":
>>
>>>>> m = np.ma.masked
>>>>> a = np.ma.array([1,1,1,m,1,1,1,m,m,m,1,1,1])
>>>>> b = np.ma.array([1,1,1])
>>>>>
>>>>> print np.ma.convolve(a, b, propagate_mask=True)
>>[1 2 3 -- -- -- 3 -- -- -- -- -- 3 2 1]
>>>>> print np.ma.convolve(a, b, propagate_mask=False)
>>[1 2 3 2 2 2 3 2 1 -- 1 2 3 2 1]
>>
>> Allan
>>
> 
> Given this behaviour, I'm actually more concerned about the logic ma.convolve 
> uses in the propagate_mask=False case. It appears that the masked values are 
> essentially replaced by zero. Is my interpretation correct and if so does 
> this make sense?
> 

I think that's right.

Its usefulness wasn't obvious to me either, but googling shows that
in matlab people like the file "nanconv.m" which works this way, using
nans similarly to how the mask is used here.

Just as convolution functions often add zero-padding around an image,
here the mask behavior would allow you to have different borders, eg
[m,m,m,1,1,1,1,m,m,m,m]
using my notation from before.

Octave's "nanconv" does this too.

I still agree that in most cases people should be handling the missing
values more carefully (manually) if they are doing convolutions, but
this default behaviour seems reasonable enough to me.
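
For reference, here is a rough sketch (my own illustration, not the PR's
actual code) of what the propagate_mask=False behaviour amounts to: masked
entries contribute zero, and an output element is only masked when every
pair contributing to it was masked.

import numpy as np

def convolve_ignore_masked(a, b):
    # sketch of the propagate_mask=False semantics, not the PR implementation
    a, b = np.ma.asarray(a), np.ma.asarray(b)
    data = np.convolve(a.filled(0), b.filled(0))
    # count how many unmasked pairs contributed to each output element
    valid = np.convolve((~np.ma.getmaskarray(a)).astype(int),
                        (~np.ma.getmaskarray(b)).astype(int))
    return np.ma.masked_array(data, mask=(valid == 0))

With the `a` and `b` from the illustration above this reproduces
[1 2 3 2 2 2 3 2 1 -- 1 2 3 2 1].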

Allan



___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] how to name "contagious" keyword in np.ma.convolve

2016-10-18 Thread Allan Haldane
On 10/17/2016 01:01 PM, Pierre Haessig wrote:
> Hi,
> 
> 
> Le 16/10/2016 à 11:52, Hanno Klemm a écrit :
>> When I have similar situations, I usually interpolate between the valid 
>> values. I assume there are a lot of use cases for convolutions but I have 
>> difficulties imagining that ignoring a missing value and, for the purpose of 
>> the computation, treating it as zero is useful in many of them. 
> When estimating the autocorrelation of a signal, it makes sense to drop
> missing pairs of values. Only, in this use case, it opens the question of
> whether or not to correct for the number of missing elements when
> computing the mean. I don't remember what the R function "acf" does.
> 
> 
> Also, coming back to the initial question, I feel that it is necessary
> that the name "mask" (or "na" or similar) appears in the parameter name.
> Otherwise, people will wonder: "what on earth is contagious/being
> propagated?"
> 
> just thinking of yet another keyword name: ignore_masked (or drop_masked)
> 
> If I remember well, in R it is dropna. It would be nice if the boolean
> switch followed the same logic.

There is an old unimplemented NEP which uses similar language, like
"ignorena", and np.NA.

http://docs.scipy.org/doc/numpy/neps/missing-data.html

But right now that isn't part of numpy, so I think it would be confusing
to use that terminology.

Allan
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] how to name "contagious" keyword in np.ma.convolve

2016-10-15 Thread Allan Haldane

On 10/14/2016 07:49 PM, Juan Nunez-Iglesias wrote:

+1 for propagate_mask. That is the only proposal that immediately makes
sense to me. "contagious" may be cute but I think approximately 0% of
users would guess its purpose on first use.

Can you elaborate on what happens with the masks exactly? I didn't quite
get why propagate_mask=False was unintuitive. My expectation is that any
mask present in the input will not be set in the output, but the mask
will be "respected" by the function.


Here's an illustration of how the PR currently works with convolve, 
using the name "propagate_mask":


>>> m = np.ma.masked
>>> a = np.ma.array([1,1,1,m,1,1,1,m,m,m,1,1,1])
>>> b = np.ma.array([1,1,1])
>>>
>>> print np.ma.convolve(a, b, propagate_mask=True)
[1 2 3 -- -- -- 3 -- -- -- -- -- 3 2 1]
>>> print np.ma.convolve(a, b, propagate_mask=False)
[1 2 3 2 2 2 3 2 1 -- 1 2 3 2 1]

Allan



On 15 Oct. 2016, 5:23 AM +1100, Allan Haldane <allanhald...@gmail.com>,
wrote:

I think the possibilities that have been mentioned so far (here or in
the PR) are:

contagious
contagious_mask
propagate
propagate_mask
propagated

`propagate_mask=False` seemed to imply that the mask would never be set,
so Eric also suggested
propagate_mask='any' or propagate_mask='all'


I would be happy with 'propagated=False' as the name/default. As Eric
pointed out, most MaskedArray functions, like sum, currently don't
propagate the mask by default, so maybe we should do likewise here.


Allan


___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] how to name "contagious" keyword in np.ma.convolve

2016-10-14 Thread Allan Haldane
I think the possibilities that have been mentioned so far (here or in
the PR) are:

contagious
contagious_mask
propagate
propagate_mask
propagated

`propagate_mask=False` seemed to imply that the mask would never be set,
so Eric also suggested
propagate_mask='any' or propagate_mask='all'


I would be happy with 'propagated=False' as the name/default. As Eric
pointed out, most MaskedArray functions, like sum, currently don't
propagate the mask by default, so maybe we should do likewise here.


Allan

On 10/14/2016 01:44 PM, Benjamin Root wrote:
> Why not "propagated"?
> 
> On Fri, Oct 14, 2016 at 1:08 PM, Sebastian Berg
> <sebast...@sipsolutions.net> wrote:
> 
> On Fr, 2016-10-14 at 13:00 -0400, Allan Haldane wrote:
> > Hi all,
> >
> > Eric Wieser has a PR which defines new functions np.ma.correlate and
> > np.ma.convolve:
> >
> > https://github.com/numpy/numpy/pull/7922
> >
> > We're deciding how to name the keyword arg which determines whether
> > masked elements are "propagated" in the convolution sums. Currently
> > we
> > are leaning towards calling it "contagious", with default of True:
> >
> > def convolve(a, v, mode='full', contagious=True):
> >
> > Any thoughts?
> >
> 
> Sounds a bit overly odd to me to be honest. Just brain storming, you
> could think/name it the other way around maybe? Should the masked
> values be considered as zero/ignored?
> 
> - Sebastian
> 
> 
> > Cheers,
> > Allan
> >
> > ___
> > NumPy-Discussion mailing list
> > NumPy-Discussion@scipy.org
> > https://mail.scipy.org/mailman/listinfo/numpy-discussion
> >
> 
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> https://mail.scipy.org/mailman/listinfo/numpy-discussion
> 
> 
> 
> 
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> https://mail.scipy.org/mailman/listinfo/numpy-discussion
> 

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] how to name "contagious" keyword in np.ma.convolve

2016-10-14 Thread Allan Haldane
Hi all,

Eric Wieser has a PR which defines new functions np.ma.correlate and
np.ma.convolve:

https://github.com/numpy/numpy/pull/7922

We're deciding how to name the keyword arg which determines whether
masked elements are "propagated" in the convolution sums. Currently we
are leaning towards calling it "contagious", with default of True:

def convolve(a, v, mode='full', contagious=True):

Any thoughts?

Cheers,
Allan

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Integers to integer powers, let's make a decision

2016-06-13 Thread Allan Haldane
On 06/13/2016 05:05 AM, V. Armando Solé wrote:
> On 11/06/2016 02:28, Allan Haldane wrote:
>>
>> So as an extra twist in this discussion, this means numpy actually
>> *does* return a float value for an integer power in a few cases:
>>
>>  >>> type( np.uint64(2) ** np.int8(3) )
>>  numpy.float64
>>
> 
> Shouldn't that example end the discussion? I find that behaviour for
> any integer power of an np.uint64. If something was going to be
> broken, I guess it already is.
> 
> We were given the choice between:
> 
> 1 - Integers to negative integer powers raise an error.
> 2 - Integers to integer powers always results in floats.
> 
> and we were never given the choice to adapt the returned type to the
> result. Assuming that option is not possible, option 2 is certainly
> better than 1 (why refuse to perform a clearly defined
> operation?) *and* returning a float is already the behaviour for
> integer powers of np.uint64.

Not for any uints: "type( np.uint64(2) ** np.uint8(3) )" is uint64.

Although I brought it up I think the mixed dtype case is a bit of a red
herring. The single-dtype case is better to think about for now, eg
"np.uint64(2) ** np.uint64(3)".

Allan
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Integers to integer powers, let's make a decision

2016-06-10 Thread Allan Haldane
On 06/10/2016 03:38 PM, Alan Isaac wrote:
> >>> np.find_common_type([np.int8],[np.int32])
> dtype('int8')
> >>> (np.arange(10,dtype=np.int8)+np.int32(2**10)).dtype
> dtype('int16')
> 
> And so on.  If these other binary operators upcast based
> on the scalar value, why wouldn't exponentiation?
> I suppose the answer is: they upcast only insofar
> as necessary to fit the scalar value, which I see is
> a simple and enforceable rule.  However, that seems the wrong
> rule for exponentiation, and in fact it is not in play:
> 
> >>> (np.int8(2)**2).dtype
> dtype('int32')

My understanding is that numpy never upcasts based on the values, it
upcasts based on the datatype ranges.

http://docs.scipy.org/doc/numpy-1.10.1/reference/ufuncs.html#casting-rules

For arrays of different datatype, numpy finds the datatype which can
store values in both dtypes' ranges, *not* the type which is large
enough to accurately store the result values.

So for instance,

>>> (np.arange(10, dtype=np.uint8) + np.uint32(2**32-1)).dtype
dtype('uint32')

Overflow has occurred, but numpy didn't upcast to uint64.

This rule has some slightly strange consequences. For example, the
ranges of np.int8 and np.uint64 don't match up, and numpy has decided
that the only type covering both ranges is np.float64.
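
For anyone following along, the promotion can be checked directly (a quick
aside of mine, not part of the original exchange):

>>> import numpy as np
>>> np.result_type(np.uint64, np.int8)   # no integer type covers both ranges
dtype('float64')
>>> np.result_type(np.uint64, np.uint8)  # uint64 covers uint8, stays integer
dtype('uint64')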

So as an extra twist in this discussion, this means numpy actually
*does* return a float value for an integer power in a few cases:

>>> type( np.uint64(2) ** np.int8(3) )
numpy.float64

> OK, my question to those who have argued a**2 should
> produce an int32 when a is an int32: what if a is an int8?
> (Obviously the overflow problem is becoming extremely pressing ...)

To me, whether it's int8 or int32, the user should just be aware of
overflow.

Also, I like to think of numpy as having quite C-like behavior, allowing
you to play with the lowlevel bits and bytes. (I actually wish its
casting behavior was more C-like). I suspect that people working with
uint8 arrays might be doing byte-fiddling hacks and actually *want*
overflow/wraparound to occur, at least when multiplying/adding.

Allan



PS
I would concede that numpy's uint8 integer power currently doesn't
wrap around like multiply does, but it would be cool if it did.
(modulo arithmetic is associative, so it should, right?).

>>> x = np.arange(256, dtype='uint8')
>>> x**8 # returns all 0
>>> x*x*x*x*x*x*x*x  # returns wrapped values
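
A quick consistency check (my own aside, not from the original mail):
since multiplication mod 256 is associative, repeated squaring with uint8
wraparound gives the same result as the chained multiply above:

>>> import numpy as np
>>> x = np.arange(256, dtype='uint8')
>>> y = x.copy()
>>> for _ in range(3):
...     y = y * y        # squaring 3 times gives x**8, mod 256
...
>>> np.array_equal(y, x*x*x*x*x*x*x*x)
True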


___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Integers to integer powers, let's make a decision

2016-06-10 Thread Allan Haldane
On 06/10/2016 01:50 PM, Alan Isaac wrote:
> Again, **almost all** integer combinations overflow: that's the point.

Don't almost all integer combinations overflow for multiplication as well?

I estimate that for unsigned 32 bit integers, only roughly 1 in 2e8
combinations don't overflow.

The fraction is approximately (np.euler_gamma + 32*np.log(2))/2.0**32,
if I didn't make a mistake.
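
Plugging in the numbers (just checking my own arithmetic):

>>> import numpy as np
>>> frac = (np.euler_gamma + 32*np.log(2)) / 2.0**32
>>> print("%.2g, or about 1 in %.2g" % (frac, 1/frac))
5.3e-09, or about 1 in 1.9e+08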

:)
Allan
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Integers to integer powers, let's make a decision

2016-06-10 Thread Allan Haldane
On 06/10/2016 08:10 AM, Alan Isaac wrote:
> Is np.arange(10)**10 also "innocent looking" to a Python user?

This doesn't bother me much because numpy users have to be aware of
overflow issues in lots of other (simple) cases anyway, eg plain
addition and multiplication.

I'll add my +1 for integer powers returning an integer, and an error for
negative powers. Integer powers are a useful operation that I would bet
a lot of code currently depends on.

Allan
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Numpy arrays shareable among related processes (PR #7533)

2016-05-11 Thread Allan Haldane
On 05/11/2016 06:48 PM, Sturla Molden wrote:
> Elliot Hallmark  wrote:
>> Strula, this sounds brilliant!  To be clear, you're talking about
>> serializing the numpy array and reconstructing it in a way that's faster
>> than pickle?
> 
> Yes. We know the binary format of NumPy arrays. We don't need to invoke the
> machinery of pickle to serialize an array and write the bytes to some IPC
> mechanism (pipe, tcp socket, unix socket, shared memory). The choice of IPC
> mechanism might not even be relevant, and could even be deferred to a
> library like ZeroMQ. The point is that if multiple processes are to
> cooperate efficiently, we need a way to let them communicate NumPy arrays
> quickly. That is where using multiprocessing hurts today, and shared memory
> does not help here.
> 
> Sturla

You probably already know this, but I just wanted to note that the
mpi4py module has worked around pickle too. They discuss how they
efficiently transfer numpy arrays in mpi messages here:
http://pythonhosted.org/mpi4py/usrman/overview.html#communicating-python-objects-and-array-data

Of course not everyone is able to install mpi easily.
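
As an aside, a minimal sketch of the kind of thing Sturla describes:
shipping the raw buffer instead of pickling the array. The helper names
and the Connection-based transport are my own assumptions, not mpi4py's
or Sturla's code:

import numpy as np

def send_array(conn, a):
    # send dtype/shape as a tiny (cheaply pickled) header, then raw bytes
    a = np.ascontiguousarray(a)
    conn.send((a.dtype.str, a.shape))
    conn.send_bytes(a.tobytes())

def recv_array(conn):
    dtype, shape = conn.recv()
    # read-only view of the received bytes
    return np.frombuffer(conn.recv_bytes(), dtype=dtype).reshape(shape)

# usage with a multiprocessing.Pipe():
#   parent_conn, child_conn = multiprocessing.Pipe()
#   send_array(parent_conn, np.arange(10))
#   print(recv_array(child_conn))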


___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Numpy arrays shareable among related processes (PR #7533)

2016-05-11 Thread Allan Haldane
On 05/11/2016 06:39 PM, Joe Kington wrote:
> 
> 
> In python2 it appears that multiprocessing uses pickle protocol 0 which
> must cause a big slowdown (a factor of 100) relative to protocol 2, and
> uses pickle instead of cPickle.
> 
> 
> Even on Python 2.x, multiprocessing uses protocol 2, not protocol 0. 
> The default for the `pickle` module changed, but multiprocessing has
> always used a binary pickle protocol to communicate between processes. 
> Have a look at multiprocessing's forking.py in
> Python 2.7.

Are you sure? As far as I understood the code, it uses the default
protocol 0. Also, the file forking.py no longer exists.

https://github.com/python/cpython/tree/master/Lib/multiprocessing
(see reduction.py and queue.py)
http://bugs.python.org/issue23403
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Numpy arrays shareable among related processes (PR #7533)

2016-05-11 Thread Allan Haldane
On 05/11/2016 04:29 AM, Sturla Molden wrote:
> 4. The reason IPC appears expensive with NumPy is because multiprocessing
> pickles the arrays. It is pickle that is slow, not the IPC. Some would say
> that the pickle overhead is an integral part of the IPC overhead, but I
> will argue that it is not. The slowness of pickle is a separate problem
> altogether.

That's interesting. I've also used multiprocessing with numpy and didn't
realize that. Is this true in python3 too?

In python2 it appears that multiprocessing uses pickle protocol 0 which
must cause a big slowdown (a factor of 100) relative to protocol 2, and
uses pickle instead of cPickle.

a = np.arange(40*40)

%timeit pickle.dumps(a)
1000 loops, best of 3: 1.63 ms per loop

%timeit cPickle.dumps(a)
1000 loops, best of 3: 1.56 ms per loop

%timeit cPickle.dumps(a, protocol=2)
10 loops, best of 3: 18.9 µs per loop

Python 3 uses protocol 3 by default:

%timeit pickle.dumps(a)
1 loops, best of 3: 20 µs per loop


> 5. Share memory does not improve on the pickle overhead because also NumPy
> arrays with shared memory must be pickled. Multiprocessing can bypass
> pickling the RawArray object, but the rest of the NumPy array is pickled.
> Using shared memory arrays has no speed advantage over normal NumPy arrays
> when we use multiprocessing.
> 
> 6. It is much easier to write concurrent code that uses queues for message
> passing than anything else. That is why using a Queue object has been the
> popular Pythonic approach to both multithreading and multiprocessing. I
> would like this to continue.
> 
> I am therefore focusing my effort on the multiprocessing.Queue object. If
> you understand the six points I listed you will see where this is going:
> What we really need is a specialized queue that has knowledge about NumPy
> arrays and can bypass pickle. I am therefore focusing my efforts on
> creating a NumPy aware queue object.
> 
> We are not doing the users a favor by encouraging the use of shared memory
> arrays. They help with nothing.
> 
> 
> Sturla Molden


___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] change to memmap subclass propagation

2016-03-30 Thread Allan Haldane
Hi all,

This is a warning for a planned change to np.memmap in
https://github.com/numpy/numpy/pull/7406.

The return values of ufuncs and fancy slices of a memmap will now be
plain ndarrays, since those return values don't point to mem-mapped memory.
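
A small illustration of the planned behaviour (a hypothetical session,
assuming the PR as described; regular slices are unaffected):

>>> import numpy as np
>>> m = np.memmap('data.bin', dtype='f8', mode='w+', shape=(10,))
>>> type(m + 1)         # ufunc results become plain ndarrays
<class 'numpy.ndarray'>
>>> type(m[[0, 2, 4]])  # fancy indexing likewise returns an ndarray
<class 'numpy.ndarray'>
>>> type(m[2:5])        # regular slices still point at the mapped memory
<class 'numpy.memmap'>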

There is a possibility that if you are subclassing memmap using multiple
inheritance something may break. We don't think anyone will be affected,
but please reply if it does affect you.

Allan
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] [Suggestion] Labelled Array

2016-02-19 Thread Allan Haldane
I also want to add a historical note here, that 'groupby' has been
discussed a couple times before.

Travis Oliphant even made an NEP for it, and Wes McKinney lightly hinted
at adding it to numpy.

http://thread.gmane.org/gmane.comp.python.numeric.general/37480/focus=37480
http://thread.gmane.org/gmane.comp.python.numeric.general/38272/focus=38299
http://docs.scipy.org/doc/numpy-1.10.1/neps/groupby_additions.html

Travis's idea for a ufunc method 'reduceby' is more along the lines of
what I was originally thinking. Just musing about it, it might cover a few
small cases pandas' groupby might not: it could work with arbitrary ufuncs,
and over particular axes of multidimensional data. Eg, to sum over
pixels from NxNx3 image data. But maybe pandas can cover the
multidimensional case through additional index columns or with Panel.
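
Just to make the musing concrete, here is a rough sketch of what such a
'reduceby'-style sum could look like with today's tools (my own sketch
built on argsort + reduceat; the name and signature are made up):

import numpy as np

def reduceby_sum(labels, values, axis=0):
    # group `values` along `axis` by the 1-D array `labels`, sum each group
    order = np.argsort(labels, kind='mergesort')   # stable sort
    sorted_labels = labels[order]
    sorted_vals = np.take(values, order, axis=axis)
    # indices where a new run of equal labels starts
    starts = np.flatnonzero(np.r_[True, sorted_labels[1:] != sorted_labels[:-1]])
    return np.unique(labels), np.add.reduceat(sorted_vals, starts, axis=axis)

# eg, summing NxNx3 pixel data over rows grouped by a per-row label:
#   groups, sums = reduceby_sum(row_labels, image, axis=0)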

Cheers,
Allan

On 02/15/2016 05:31 PM, Paul Hobson wrote:
> Just for posterity -- any future readers to this thread who need to do
> pandas-like operations on record arrays should look at matplotlib's mlab submodule.
> 
> I've been in situations (::cough:: Esri production ::cough::) where I've
> had one hand tied behind my back and unable to install pandas. mlab was
> a big help there.
> 
> https://goo.gl/M7Mi8B
> 
> -paul
> 
> 
> 
> On Mon, Feb 15, 2016 at 1:28 PM, Lluís Vilanova wrote:
> 
> Benjamin Root writes:
> 
> > Seems like you are talking about xarray: 
> https://github.com/pydata/xarray
> 
> Oh, I wasn't aware of xarray, but there's also this:
> 
>  
> 
> https://people.gso.ac.upc.edu/vilanova/doc/sciexp2/user_guide/data.html#basic-indexing
>  
> 
> https://people.gso.ac.upc.edu/vilanova/doc/sciexp2/user_guide/data.html#dimension-oblivious-indexing
> 
> 
> Cheers,
>   Lluis
> 
> 
> 
> > Cheers!
> > Ben Root
> 
> > On Fri, Feb 12, 2016 at 9:40 AM, Sérgio wrote:
> 
> > Hello,
> 
> 
> > This is my first e-mail, I will try to make the idea simple.
> 
> 
> > Similar to masked array it would be interesting to use a label
> array to
> > guide operations.
> 
> 
> > Ex.:
> >>> x
> > labelled_array(data =
> 
> > [[0 1 2]
> > [3 4 5]
> > [6 7 8]],
> > label =
> > [[0 1 2]
> > [0 1 2]
> > [0 1 2]])
> 
> 
> >>> sum(x)
> > array([9, 12, 15])
> 
> 
> > The operations would create a new axis for label indexing.
> 
> 
> > You could think of it as a collection of masks, one for each
> label.
> 
> 
> > I don't know a way to make something like this efficiently
> without a loop.
> > Just wondering...
> 
> 
> > Sérgio.
> 
> > ___
> > NumPy-Discussion mailing list
> > NumPy-Discussion@scipy.org 
> > https://mail.scipy.org/mailman/listinfo/numpy-discussion
> 
> 
> 
> 
> > ___
> > NumPy-Discussion mailing list
> > NumPy-Discussion@scipy.org 
> > https://mail.scipy.org/mailman/listinfo/numpy-discussion
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org 
> https://mail.scipy.org/mailman/listinfo/numpy-discussion
> 
> 
> 
> 
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> https://mail.scipy.org/mailman/listinfo/numpy-discussion
> 

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] [Suggestion] Labelled Array

2016-02-13 Thread Allan Haldane
I've had a pretty similar idea for a new indexing function 
'split_classes' which would help in your case, which essentially does


def split_classes(c, v):
    return [v[c == u] for u in unique(c)]

Your example could be coded as

>>> [sum(c) for c in split_classes(label, data)]
[9, 12, 15]

I feel I've come across the need for such a function often enough that 
it might be generally useful to people as part of numpy. The 
implementation of split_classes above has pretty poor performance 
because it creates many temporary boolean arrays, so my plan for a PR 
was to have a speedy version of it that uses a single pass through v.

(I often wanted to use this function on large datasets).

If anyone has any comments on the idea (good idea. bad idea?) I'd love 
to hear.


I have some further notes and examples here: 
https://gist.github.com/ahaldane/1e673d2fe6ffe0be4f21


Allan

On 02/12/2016 09:40 AM, Sérgio wrote:

Hello,

This is my first e-mail, I will try to make the idea simple.

Similar to masked array it would be interesting to use a label array to
guide operations.

Ex.:
 >>> x
labelled_array(data =
  [[0 1 2]
  [3 4 5]
  [6 7 8]],
 label =
  [[0 1 2]
  [0 1 2]
  [0 1 2]])

 >>> sum(x)
array([9, 12, 15])

The operations would create a new axis for label indexing.

You could think of it as a collection of masks, one for each label.

I don't know a way to make something like this efficiently without a
loop. Just wondering...

Sérgio.




___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] [Suggestion] Labelled Array

2016-02-13 Thread Allan Haldane
Sorry, to reply to myself here, but looking at it with fresh eyes maybe 
the performance of the naive version isn't too bad. Here's a comparison 
of the naive vs a better implementation:


def split_classes_naive(c, v):
    return [v[c == u] for u in unique(c)]

def split_classes(c, v):
    perm = c.argsort()
    csrt = c[perm]
    div = where(csrt[1:] != csrt[:-1])[0] + 1
    return [v[x] for x in split(perm, div)]

>>> c = randint(0,32,size=10)
>>> v = arange(10)
>>> %timeit split_classes_naive(c,v)
100 loops, best of 3: 8.4 ms per loop
>>> %timeit split_classes(c,v)
100 loops, best of 3: 4.79 ms per loop

In any case, maybe it is useful to Sergio or others.

Allan

On 02/13/2016 12:11 PM, Allan Haldane wrote:

I've had a pretty similar idea for a new indexing function
'split_classes' which would help in your case, which essentially does

 def split_classes(c, v):
 return [v[c == u] for u in unique(c)]

Your example could be coded as

 >>> [sum(c) for c in split_classes(label, data)]
 [9, 12, 15]

I feel I've come across the need for such a function often enough that
it might be generally useful to people as part of numpy. The
implementation of split_classes above has pretty poor performance
because it creates many temporary boolean arrays, so my plan for a PR
was to have a speedy version of it that uses a single pass through v.
(I often wanted to use this function on large datasets).

If anyone has any comments on the idea (good idea. bad idea?) I'd love
to hear.

I have some further notes and examples here:
https://gist.github.com/ahaldane/1e673d2fe6ffe0be4f21

Allan

On 02/12/2016 09:40 AM, Sérgio wrote:

Hello,

This is my first e-mail, I will try to make the idea simple.

Similar to masked array it would be interesting to use a label array to
guide operations.

Ex.:
 >>> x
labelled_array(data =
  [[0 1 2]
  [3 4 5]
  [6 7 8]],
 label =
  [[0 1 2]
  [0 1 2]
  [0 1 2]])

 >>> sum(x)
array([9, 12, 15])

The operations would create a new axis for label indexing.

You could think of it as a collection of masks, one for each label.

I don't know a way to make something like this efficiently without a
loop. Just wondering...

Sérgio.


___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion





___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] [Suggestion] Labelled Array

2016-02-13 Thread Allan Haldane

Impressive!

Possibly there's still a case for including a 'groupby' function in 
numpy itself since it's a generally useful operation, but I do see less 
of a need given the nice pandas functionality.


At least, next time someone asks a stackoverflow question like the ones 
below someone should tell them to use pandas!


(copied from my gist for future list reference).

http://stackoverflow.com/questions/4373631/sum-array-by-number-in-numpy
http://stackoverflow.com/questions/31483912/split-numpy-array-according-to-values-in-the-array-a-condition/31484134#31484134
http://stackoverflow.com/questions/31863083/python-split-numpy-array-based-on-values-in-the-array
http://stackoverflow.com/questions/28599405/splitting-an-array-into-two-smaller-arrays-in-python
http://stackoverflow.com/questions/7662458/how-to-split-an-array-according-to-a-condition-in-numpy

Allan


On 02/13/2016 01:39 PM, Jeff Reback wrote:

In [10]: pd.options.display.max_rows=10

In [13]: np.random.seed(1234)

In [14]: c = np.random.randint(0,32,size=10)

In [15]: v = np.arange(10)

In [16]: df = DataFrame({'v' : v, 'c' : c})

In [17]: df
Out[17]:
 c  v
0  15  0
1  19  1
2   6  2
3  21  3
4  12  4
........
5   7  5
6   2  6
7  27  7
8  28  8
9   7  9

[10 rows x 2 columns]

In [19]: df.groupby('c').count()
Out[19]:
v
c
0   3136
1   3229
2   3093
3   3121
4   3041
..   ...
27  3128
28  3063
29  3147
30  3073
31  3090

[32 rows x 1 columns]

In [20]: %timeit df.groupby('c').count()
100 loops, best of 3: 2 ms per loop

In [21]: %timeit df.groupby('c').mean()
100 loops, best of 3: 2.39 ms per loop

In [22]: df.groupby('c').mean()
Out[22]:
v
c
0   49883.384885
1   50233.692165
2   48634.116069
3   50811.743992
4   50505.368629
..   ...
27  49715.349425
28  50363.501469
29  50485.395933
30  50190.155223
31  50691.041748

[32 rows x 1 columns]


On Sat, Feb 13, 2016 at 1:29 PM, <josef.p...@gmail.com> wrote:



On Sat, Feb 13, 2016 at 1:01 PM, Allan Haldane
<allanhald...@gmail.com> wrote:

Sorry, to reply to myself here, but looking at it with fresh
eyes maybe the performance of the naive version isn't too bad.
Here's a comparison of the naive vs a better implementation:

def split_classes_naive(c, v):
 return [v[c == u] for u in unique(c)]

def split_classes(c, v):
 perm = c.argsort()
 csrt = c[perm]
 div = where(csrt[1:] != csrt[:-1])[0] + 1
 return [v[x] for x in split(perm, div)]

>>> c = randint(0,32,size=10)
>>> v = arange(10)
>>> %timeit split_classes_naive(c,v)
100 loops, best of 3: 8.4 ms per loop
>>> %timeit split_classes(c,v)
100 loops, best of 3: 4.79 ms per loop


The use cases I recently started to target for similar things are 1
million or more rows and 1 uniques in the labels.
The second version should be faster for a large number of uniques, I
guess.

Overall numpy is falling far behind pandas in terms of simple
groupby operations. bincount and histogram (IIRC) worked for some
cases but are rather limited.

reduce_at looks nice for cases where it applies.

In contrast to the full sized labels in the original post, I only
know of applications where the labels are 1-D corresponding to rows
or columns.

Josef


In any case, maybe it is useful to Sergio or others.

    Allan


On 02/13/2016 12:11 PM, Allan Haldane wrote:

I've had a pretty similar idea for a new indexing function
'split_classes' which would help in your case, which
essentially does

  def split_classes(c, v):
  return [v[c == u] for u in unique(c)]

Your example could be coded as

  >>> [sum(c) for c in split_classes(label, data)]
  [9, 12, 15]

I feel I've come across the need for such a function often
enough that
it might be generally useful to people as part of numpy. The
implementation of split_classes above has pretty poor
performance
because it creates many temporary boolean arrays, so my plan
for a PR
was to have a speedy version of it that uses a single pass
through v.
(I often wanted to use this function on large datasets).

If anyone has any comments on the idea (good idea. bad
idea?) I'd love
to hear.

I have some further notes and examples here:
https://gist.github.com/ahaldane/1e673d2fe6ffe0be4f21

Allan

On 02/12/2016 09:40 AM, Sérgio wrote

Re: [Numpy-discussion] array of random numbers fails to construct

2015-12-08 Thread Allan Haldane
On 12/08/2015 07:40 PM, Stephan Hoyer wrote:
> On Sun, Dec 6, 2015 at 3:55 PM, Allan Haldane <allanhald...@gmail.com> wrote:
> 
> 
> I've also often wanted to generate large datasets of random uint8
> and uint16. As a workaround, this is something I have used:
> 
> np.ndarray(100, 'u1', np.random.bytes(100))
> 
> It has also crossed my mind that np.random.randint and
> np.random.rand could use an extra 'dtype' keyword. It didn't look
> easy to implement though.
> 
>  
> Another workaround that avoids creating a copy is to use the view
> method, e.g.,
> np.random.randint(np.iinfo(int).min, np.iinfo(int).max,
> size=(1,)).view(np.uint8)  # creates 8 random bytes

Just to note, the line I pasted doesn't copy either, according to the
OWNDATA flag.

Cheers,
Allan
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] array of random numbers fails to construct

2015-12-08 Thread Allan Haldane
On 12/08/2015 08:01 PM, Allan Haldane wrote:
> On 12/08/2015 07:40 PM, Stephan Hoyer wrote:
>> On Sun, Dec 6, 2015 at 3:55 PM, Allan Haldane <allanhald...@gmail.com> wrote:
>>
>>
>> I've also often wanted to generate large datasets of random uint8
>> and uint16. As a workaround, this is something I have used:
>>
>> np.ndarray(100, 'u1', np.random.bytes(100))
>>
>> It has also crossed my mind that np.random.randint and
>> np.random.rand could use an extra 'dtype' keyword. It didn't look
>> easy to implement though.
>>
>>  
>> Another workaround that avoids creating a copy is to use the view
>> method, e.g.,
>> np.random.randint(np.iinfo(int).min, np.iinfo(int).max,
>> size=(1,)).view(np.uint8)  # creates 8 random bytes
> 
> Just to note, the line I pasted doesn't copy either, according to the
> OWNDATA flag.
> 
> Cheers,
> Allan

Oops, but I forgot my version is readonly. If you want to write to it
you do need to make a copy, that's true.

Allan

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] array of random numbers fails to construct

2015-12-06 Thread Allan Haldane


I've also often wanted to generate large datasets of random uint8 and 
uint16. As a workaround, this is something I have used:


np.ndarray(100, 'u1', np.random.bytes(100))

It has also crossed my mind that np.random.randint and np.random.rand 
could use an extra 'dtype' keyword. It didn't look easy to implement though.


Allan

On 12/06/2015 04:55 PM, DAVID SAROFF (RIT Student) wrote:

Matthew,

That looks right. I'm concluding that the .astype(np.uint8) is applied
after the array is constructed, instead of during the process. This
random array is a test case. In the production analysis of radio
telescope data this is how the data comes in, and there is no  problem
with 10GBy files.
linearInputData = np.fromfile(dataFile, dtype = np.uint8, count = -1)
spectrumArray = linearInputData.reshape(nSpectra,sizeSpectrum)


On Sun, Dec 6, 2015 at 4:07 PM, Matthew Brett wrote:

Hi,

On Sun, Dec 6, 2015 at 12:39 PM, DAVID SAROFF (RIT Student)
> wrote:
> This works. A big array of eight bit random numbers is constructed:
>
> import numpy as np
>
> spectrumArray = np.random.randint(0,255, (2**20,2**12)).astype(np.uint8)
>
>
>
> This fails. It eats up all 64GBy of RAM:
>
> spectrumArray = np.random.randint(0,255, (2**21,2**12)).astype(np.uint8)
>
>
> The difference is a factor of two, 2**21 rather than 2**20, for the extent
> of the first axis.

I think what's happening is that this:

np.random.randint(0,255, (2**21,2**12))

creates 2**33 random integers, which (on 64-bit) will be of dtype
int64 = 8 bytes, giving total size 2 ** (21 + 12 + 3) = 2 ** 36 bytes
= 64 GiB.

Cheers,

Matthew
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org 
https://mail.scipy.org/mailman/listinfo/numpy-discussion




--
David P. Saroff
Rochester Institute of Technology
54 Lomb Memorial Dr, Rochester, NY 14623
david.sar...@mail.rit.edu  | (434)
227-6242





___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Commit rights for Jonathan J. Helmus

2015-10-28 Thread Allan Haldane

On 10/28/2015 05:27 PM, Nathaniel Smith wrote:

Hi all,

Jonathan J. Helmus (@jjhelmus) has been given commit rights -- let's all
welcome him aboard.

-n


Welcome Jonathan, happy to have you on the team!

Allan

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] correct sizeof for ndarray

2015-10-20 Thread Allan Haldane

On 10/20/2015 12:05 AM, Jason Newton wrote:

Hi folks,

I noticed an unexpected behavior of itemsize for structures with offsets
that are larger than that of a packed structure in memory.  This matters
when parsing in-memory structures from C and some others (recently an
HDF5/h5py detail got me for a bit).

So what is the correct way to get "sizeof" a structure?  AFAIK this is
the size of the last item + its offset.  If this doesn't exist...
shouldn't it?

Thanks,
Jason


Hi Jason,

The 'itemsize' attribute of a dtype object is probably what you're 
looking for. It gives the itemsize in bytes.


"last item + its offset" is not a reliable way to get the itemsize 
because "aligned" (and other) structures can have trailing padding, just 
like C structs:


>>> np.dtype('i4,u1', align=True).itemsize
8

The documentation on all this is a little scattered right now, but there 
are hints in the array.dtypes reference page and the dtype docstring.
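
As a fuller illustration (my own addition; offsets are for a typical
64-bit platform), both the total size and the per-field offsets can be
read off the dtype:

>>> import numpy as np
>>> dt = np.dtype('f8,f4,i4,u8', align=True)
>>> dt.itemsize          # full struct size, padding included
24
>>> [(name, dt.fields[name][1]) for name in dt.names]   # field offsets
[('f0', 0), ('f1', 8), ('f2', 12), ('f3', 16)]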


Cheers,
Allan
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] numpy 1.10.1 reduce operation on recarrays

2015-10-16 Thread Allan Haldane
On 10/16/2015 05:31 PM, josef.p...@gmail.com wrote:
> 
> 
> On Fri, Oct 16, 2015 at 2:21 PM, Charles R Harris
> > wrote:
> 
> 
> 
> On Fri, Oct 16, 2015 at 12:20 PM, Charles R Harris
> > wrote:
> 
> 
> 
> On Fri, Oct 16, 2015 at 11:58 AM,  > wrote:
> 
> was there a change with reduce operations with recarrays in
> 1.10 or 1.10.1?
> 
> Travis shows a new test failure in the statsmodels testsuite
> with 1.10.1:
> 
> ERROR: test suite for <class 'statsmodels.base.tests.test_data.TestRecarrays'>
> 
>   File
> 
> "/home/travis/miniconda/envs/statsmodels-test/lib/python2.7/site-packages/statsmodels-0.8.0-py2.7-linux-x86_64.egg/statsmodels/base/data.py",
> line 131, in _handle_constant
> const_idx = np.where(self.exog.ptp(axis=0) ==
> 0)[0].squeeze()
> TypeError: cannot perform reduce with flexible type
> 
> 
> Sorry for asking so late.
> (statsmodels is short on maintainers, and I'm distracted)
> 
> 
> statsmodels still has code to support recarrays and
> structured dtypes from the time before pandas became
> popular, but I don't think anyone is using them together
> with statsmodels anymore.
> 
> 
> There were several commits dealing both recarrays and ufuncs, so
> this might well be a regression.
> 
> 
> A bisection would be helpful. Also, open an issue.
> 
> 
> 
> The reason for the test failure might be somewhere else hiding behind
> several layers of statsmodels, but only started to show up with numpy 1.10.1
> 
> I already have the reduce exception with my currently installed numpy
> '1.9.2rc1'
> 
> >>> x = np.random.random(9*3).view([('const', 'f8'),('x_1', 'f8'),
> ('x_2', 'f8')]).view(np.recarray)
> 
> >>> np.ptp(x, axis=0)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File
> "C:\programs\WinPython-64bit-3.4.3.1\python-3.4.3.amd64\lib\site-packages\numpy\core\fromnumeric.py",
> line 2047, in ptp
> return ptp(axis, out)
> TypeError: cannot perform reduce with flexible type
> 
> 
> Sounds like fun, and I don't even know how to automatically bisect.
> 
> Josef

That example isn't the problem (ptp should definitely fail on structured
arrays), but I've tracked down what is - it has to do with views of
record arrays.

The fix looks simple, I'll get it in for the next release.

Allan




___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] numpy 1.10.1 reduce operation on recarrays

2015-10-16 Thread Allan Haldane

On 10/16/2015 09:17 PM, josef.p...@gmail.com wrote:



On Fri, Oct 16, 2015 at 8:56 PM, Allan Haldane <allanhald...@gmail.com> wrote:

On 10/16/2015 05:31 PM, josef.p...@gmail.com wrote:
>
>
> On Fri, Oct 16, 2015 at 2:21 PM, Charles R Harris
> <charlesr.har...@gmail.com> wrote:
>
>
>
> On Fri, Oct 16, 2015 at 12:20 PM, Charles R Harris
> <charlesr.har...@gmail.com> wrote:
>
>
>
 > On Fri, Oct 16, 2015 at 11:58 AM, <josef.p...@gmail.com> wrote:
 >
 > was there a change with reduce operations with
recarrays in
 > 1.10 or 1.10.1?
 >
 > Travis shows a new test failure in the statsmodels
testsuite
 > with 1.10.1:
 >
 > ERROR: test suite for <class 'statsmodels.base.tests.test_data.TestRecarrays'>
 >
 >   File
 >
  
"/home/travis/miniconda/envs/statsmodels-test/lib/python2.7/site-packages/statsmodels-0.8.0-py2.7-linux-x86_64.egg/statsmodels/base/data.py",
 > line 131, in _handle_constant
 > const_idx = np.where(self.exog.ptp(axis=0) ==
 > 0)[0].squeeze()
 > TypeError: cannot perform reduce with flexible type
 >
 >
 > Sorry for asking so late.
 > (statsmodels is short on maintainers, and I'm distracted)
 >
 >
 > statsmodels still has code to support recarrays and
 > structured dtypes from the time before pandas became
 > popular, but I don't think anyone is using them together
 > with statsmodels anymore.
 >
 >
 > There were several commits dealing both recarrays and
ufuncs, so
 > this might well be a regression.
 >
 >
 > A bisection would be helpful. Also, open an issue.
 >
 >
 >
 > The reason for the test failure might be somewhere else hiding behind
 > several layers of statsmodels, but only started to show up with
numpy 1.10.1
 >
 > I already have the reduce exception with my currently installed numpy
 > '1.9.2rc1'
 >
 >>>> x = np.random.random(9*3).view([('const', 'f8'),('x_1', 'f8'),
 > ('x_2', 'f8')]).view(np.recarray)
 >
 >>>> np.ptp(x, axis=0)
 > Traceback (most recent call last):
 >   File "<stdin>", line 1, in <module>
 >   File
 >

"C:\programs\WinPython-64bit-3.4.3.1\python-3.4.3.amd64\lib\site-packages\numpy\core\fromnumeric.py",
 > line 2047, in ptp
 > return ptp(axis, out)
 > TypeError: cannot perform reduce with flexible type
 >
 >
 > Sounds like fun, and I don't even know how to automatically bisect.
 >
 > Josef

That example isn't the problem (ptp should definitely fail on structured
arrays), but I've tracked down what is - it has to do with views of
record arrays.

The fix looks simple, I'll get it in for the next release.


Thanks,

I realized that at that point in the statsmodels code we should have
only regular ndarrays, so the array conversion fails somewhere.

AFAICS, the main helper function to convert is

def struct_to_ndarray(arr):
 return arr.view((float, len(arr.dtype.names)))

which doesn't look like it will handle other dtypes than float64. Nobody
ever complained, so maybe our test suite is the only user of this.

What is now the recommended way of converting structured
dtypes/recarrays to ndarrays?

Josef


Yes, that's the code I narrowed it down to as well. I think the code in 
statsmodels is fine; the problem is actually a bug I must admit I 
introduced in changes to the way views of recarrays work.


If you are curious, the bug is in this line:

https://github.com/numpy/numpy/blob/master/numpy/core/records.py#L467

This line was intended to fix the problem that accessing a nested record 
array field would lose the 'np.record' dtype. I only considered void 
structured arrays, and had forgotten about sub-arrays which statsmodels 
uses.


I think the fix is to replace `issubclass(val.type, nt.void)` with 
`val.names` or something similar. I'll take a closer look soon.


Allan

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] A regression in numpy 1.10: VERY slow memory mapped file generation

2015-10-14 Thread Allan Haldane

On 10/14/2015 01:23 AM, Nadav Horesh wrote:
> 
> I have binary files ranging in size from a few MB to 1GB, which I read and 
> process as memory mapped files (via np.memmap). Until numpy 1.9 the creation of 
> a recarray on an existing file (without reading its content) was instantaneous, 
> and now it takes ~6 seconds (system: archlinux on sandy bridge). The top of a 
> profile (using ipython %prun) is:
> 
> 
>    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
>        21    3.037    0.145    4.266    0.203 _internal.py:372(_check_field_overlap)
>   3713431    1.663    0.000    1.663    0.000 _internal.py:366()
>   3713750    0.790    0.000    0.790    0.000 {range}
>   3713709    0.406    0.000    0.406    0.000 {method 'update' of 'set' objects}
>       322    0.320    0.001    1.984    0.006 {method 'extend' of 'list' objects}
> 
> Nadav.

Hi Nadav,

The slowdown is due to a problem in a PR I introduced to add safety checks
to views of structured arrays (to prevent segfaults involving object
fields), which will hopefully be fixed quickly. It is being discussed here:

https://github.com/numpy/numpy/issues/6467

Also, I do not think the problem is with memmap - as far as I have
tested, memmmap is still fast. Most likely what is slowing your script
down is subsequent access to the fields of the array, which is what has
regressed. Is that right?

Allan
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Sign of NaN

2015-09-29 Thread Allan Haldane


On 09/29/2015 11:39 AM, josef.p...@gmail.com wrote:
> 
> 
> On Tue, Sep 29, 2015 at 11:25 AM, Anne Archibald wrote:
> 
> IEEE 754 has signum(NaN)->NaN. So does np.sign on floating-point
> arrays. Why should it be different for object arrays?
> 
> Anne
> 
> P.S. If you want exceptions when NaNs appear, that's what np.seterr
> is for. -A
> 
> 
> 
> I also think NaN should be treated the same way as floating point
> numbers (whatever that is). Otherwise it is difficult to remember when
> nan is essentially a float dtype or another dtype.  
> (given that float is the smallest dtype that can hold a nan)
> 

Note that I've reimplemented np.sign for object arrays along these lines
in this open PR:
https://github.com/numpy/numpy/pull/6320

That PR recursively uses the np.sign ufunc to evaluate object arrays
containing float and complex numbers. This way the behavior on object
arrays is identical to float/complex arrays.

Here is what the np.sign ufunc does (for arbitrary x):

np.sign(np.nan)  ->  nan
np.sign(complex(np.nan, x))  ->  complex(nan, 0)
np.sign(complex(x, np.nan))  ->  complex(nan, 0)

Allan
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Sign of NaN

2015-09-29 Thread Allan Haldane
On 09/29/2015 02:16 PM, Nathaniel Smith wrote:
> On Sep 29, 2015 8:25 AM, "Anne Archibald" wrote:
>>
>> IEEE 754 has signum(NaN)->NaN. So does np.sign on floating-point
> arrays. Why should it be different for object arrays?
> 
> The argument for doing it this way would be that arbitrary python
> objects don't have a sign, and the natural way to implement something
> like np.sign's semantics using only the "object" interface is
> 
> if obj < 0:
>     return -1
> elif obj > 0:
>     return 1
> elif obj == 0:
>     return 0
> else:
>     raise
> 
> In general I'm not a big fan of trying to do all kinds of guessing about
> how to handle random objects in object arrays, the kind that ends up
> with a big chain of type checks and fallback behaviors. Pretty soon we
> find ourselves trying to extend the language with our own generic
> dispatch system for arbitrary python types, just for object arrays. (The
> current hack where for object arrays np.log will try calling obj.log()
> is particularly horrible. There is no rule in python that "log" is a
> reserved method name for "logarithm" on arbitrary objects. Ditto for the
> other ufuncs that implement this hack.)
> 
> Plus we hope that many use cases for object arrays will soon be
> supplanted by better dtype support, so now may not be the best time to
> invest heavily in making object arrays complicated and powerful.

Even though I submitted the PR to make object arrays more powerful, this
makes a lot of sense to me.

Let's say we finish a new dtype system, in which (I imagine) each dtype
specifies how to calculate each ufunc elementwise for that type. What
are the remaining use cases for generic object arrays? The only one I
see is having an array with elements of different types, which seems
like a dubious idea anyway. (Nested ndarrays of varying length could be
implemented as a dtype, as could the PyFloatDtype Sebastian mentioned,
without need for a generic 'object' dtype which has to figure out how
to call ufuncs on individual objects of different type).

Allan
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Commit rights for Allan Haldane

2015-09-22 Thread Allan Haldane
Thanks all. I'm very happy to contribute back to a project which has
been so useful to me over many years!

On 09/22/2015 04:53 PM, Travis Oliphant wrote:
> Excellent news!   Welcome Allan.  
> 
> -Travis
> 
> 
> On Tue, Sep 22, 2015 at 1:54 PM, Charles R Harris
> <charlesr.har...@gmail.com> wrote:
> 
> Hi All,
> 
> Allan Haldane has been given commit rights. Here's to the new member
> of the team.
> 
> Chuck
> 
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> https://mail.scipy.org/mailman/listinfo/numpy-discussion
> 
> 
> 
> 
> -- 
> Travis Oliphant
> Co-founder and CEO
> 
> @teoliphant
> 512-222-5440
> http://www.continuum.io
> 
> 
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> https://mail.scipy.org/mailman/listinfo/numpy-discussion
> 

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] new ufunc implementations for object arrays

2015-09-18 Thread Allan Haldane
Hello all,

I've just submitted a PR which overhauls the implementation of ufuncs
for object types.
https://github.com/numpy/numpy/pull/6320

The motivation for this is that many ufuncs (eg all transcendental
functions) can't handle objects. This PR will also make object arrays
more customizable, as all ufuncs can now be overridden by the user by
defining methods on the object.
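
To illustrate the kind of override this enables (a made-up toy class of
mine, not something from the PR; this relies on the method-name fallback
the PR keeps for object arrays):

import numpy as np

class Interval:
    # toy object for illustration only
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi
    def exp(self):
        # np.exp on an object array calls this on each element
        import math
        return Interval(math.exp(self.lo), math.exp(self.hi))
    def __repr__(self):
        return "Interval(%g, %g)" % (self.lo, self.hi)

arr = np.array([Interval(0, 1), Interval(1, 2)], dtype=object)
print(np.exp(arr))   # with the PR: [Interval(1, 2.71828) Interval(2.71828, 7.38906)]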

I'd like to see if there's enough interest/agreement for me to finish
it, and also to have more eyes look over the pure-python ufunc
implementations since I am sure I have missed some edge cases. I would
love suggestions, and I need help for a few cases I couldn't figure out
(eg np.nextafter, np.spacing, np.cbrt. Also, np.expm1 np.signbit and
others are a bit suspect).

Allan
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] np.sign and object comparisons

2015-09-03 Thread Allan Haldane
On 08/31/2015 12:09 AM, Jaime Fernández del Río wrote:
> There are three ways of fixing this that I see:
> 
>  1. Arbitrarily choose a value to set the return to. This is equivalent
> to choosing a default return for `cmp` for comparisons. This
> preserves behavior, but feels wrong.
>  2. Similarly to how np.sign of a floating point array with nans returns
> nan for those values, return e,g, None for these cases. This is my
> preferred option.
>  3. Raise an error, along the lines of the TypeError: unorderable types
> that 3.x produces for some comparisons.

I think np.sign on nan object arrays should raise the error

AttributeError: 'float' object has no attribute 'sign'


If I've understood correctly, currently object arrays work like this:

If a ufunc has an equivalent pure-python func (eg, PyNumber_Add for
np.add, PyNumber_Absolute for np.abs, < for np.greater_than) then numpy
calls that for objects. Otherwise, if the object defines a method with
the same name as the ufunc, numpy calls that method. For example, arccos
is a ufunc that has no pure python equivalent, so you get the following
behavior

>>> a = np.array([-1], dtype='O')
>>> np.abs(a)
array([1], dtype=object)
>>> np.arccos(a)
AttributeError: 'int' object has no attribute 'arccos'
>>> class MyClass:
...     def arccos(self):
...         return 1
>>> b = np.array([MyClass()], dtype='O')
>>> np.arccos(b)
array([1], dtype=object)

Now, most comparison operators (eg, greater_than) are treated a little
specially in loops.c. For some reason, sign is treated just like the
other comparison operators, even though technically there is no
pure-python equivalent to sign.

I think that because there is no pure-python 'sign', numpy should
attempt to call obj.sign, and in most cases this should fail with the
error above. See also
http://stackoverflow.com/questions/1986152/why-doesnt-python-have-a-sign-function

I think the fix for sign is that the 'sign' ufunc in generate_umath.py
should look more like the arccos one, and we should get rid of
OBJECT_sign in loops.c. I'm not 100% sure about this since I haven't
followed all of how generate_umath.py works yet.

---

By the way, based on some comments I saw somewhere (apologies, I forget
who by!) I wrote up a vision for how ufuncs could work for objects,
here: https://gist.github.com/ahaldane/c3f9bcf1f62d898be7c7
I'm a little unsure the ideas there are a good idea since they might be
made obsolete by the big dtype subclassing improvements being discussed
in the numpy roadmap thread.

Allan

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Development workflow (not git tutorial)

2015-08-14 Thread Allan Haldane
On 08/13/2015 11:52 AM, Anne Archibald wrote:
 Hi,
 
 What is a sensible way to work on (modify, compile, and test) numpy? 
 
 There is documentation about contributing to numpy at:
 http://docs.scipy.org/doc/numpy-dev/dev/index.html
 and:
 http://docs.scipy.org/doc/numpy-dev/dev/gitwash/development_workflow.html
 but these are entirely focused on using git. I have no problem with that
 aspect. It is building and testing that I am looking for the Right Way
 to do.

Related to this, does anyone know how to debug numpy in gdb with proper
symbols/source lines, like I can do with other C extensions? I've tried
modifying numpy distutils to try to add the right compiler/linker flags,
without success.

Allan

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Development workflow (not git tutorial)

2015-08-14 Thread Allan Haldane
On 08/14/2015 01:52 PM, Pauli Virtanen wrote:
 14.08.2015, 20:45, Allan Haldane wrote:
 [clip]
 Related to this, does anyone know how to debug numpy in gdb with proper
 symbols/source lines, like I can do with other C extensions? I've tried
 modifying numpy distutils to try to add the right compiler/linker flags,
 without success.
 
 runtests.py --help
 
 gdb --args python runtests.py -g --python script.py
 
 grep env runtests.py

Oh! Thank you, I missed that.
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] improving structured array assignment

2015-08-06 Thread Allan Haldane
Hello all,

I've written up a tentative PR which tidies up structured array assignment,

https://github.com/numpy/numpy/pull/6053

It has a backward incompatible change which I'd especially like to get 
some feedback on: Structure assignment now always works by field 
position instead of by field name. Consider the following assignment:

>>> v1 = np.array([(1,2,3)],
...               dtype=[('a', 'i4'), ('b', 'i4'), ('c', 'i4')])
>>> v2 = np.array([(4,5,6)],
...               dtype=[('b', 'i4'), ('a', 'i4'), ('c', 'i4')])
>>> v1[:] = v2

Previously, v1 would be set to (5,4,6) but with the PR it is set to 
(4,5,6).
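
For code that relies on matching fields by name, an explicit field-by-field
copy still gives the old result (just a sketch of a workaround, not something
the PR itself adds):

>>> for name in v2.dtype.names:
...     v1[name] = v2[name]
>>> v1
array([(5, 4, 6)], dtype=[('a', '<i4'), ('b', '<i4'), ('c', '<i4')])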

This might seem like negligible improvement, but assignment by field 
name has lots of inconsistent/broken edge cases which I've listed in 
the PR, which disappear with assignment by field position. The PR 
doesn't seem to break much of anything in scipy, pandas, and astropy.

If possible, I'd like to try getting a deprecation warning for this 
change into 1.10.

I also changed a few more minor things about structure assignment, 
expanded the docs on structured arrays, and made a multi-field index 
(arr[['f1', 'f0']]) return a view instead of a copy, which had been 
planned for 1.10 but didn't get in because of the strange behavior of 
structure assignment.

Allan
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Proposal: Deprecate np.int, np.float, etc.?

2015-08-03 Thread Allan Haldane
On 08/03/2015 12:25 PM, Chris Barker wrote:
 2) The vagaries of the standard C types:  int, long, etc (spelled
 np.intc, which is an int32 on my machine, anyway)
 [NOTE: is there a C long dtype? I can't find it at the moment...]

Numpy does define the platform-dependent C integer types short, long,
longlong and their unsigned versions according to the docs. size_t is
the same size as intp (the pointer-sized integer), not intc.

Even though float and double are virtually always IEEE single and double
precision, maybe for consistency we should also define np.floatc,
np.doublec and np.longdoublec?
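
For reference, it is easy to check what the C type aliases map to on a
given platform (a quick sketch; this output is from a typical 64-bit Linux
machine and will differ elsewhere):

>>> np.dtype(np.short), np.dtype(np.intc), np.dtype(np.int_), np.dtype(np.longlong)
(dtype('int16'), dtype('int32'), dtype('int64'), dtype('int64'))
>>> np.dtype(np.single), np.dtype(np.double), np.dtype(np.longdouble)
(dtype('float32'), dtype('float64'), dtype('float128'))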

Allan
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] constructing record dtypes from the c-api

2015-07-27 Thread Allan Haldane
Hi Jason,

As I understand it, numpy has been set up to mirror C structs as long as
you use the 'align' flag. For example, your struct can be represented as

  np.dtype('3f8,3f4,i4,u8', align=True)

(assuming 32-bit floats). The offsets of the fields should be exactly
the offsets of the members of the struct. In C you can create the dtype
using PyArray_DescrAlignConverter.
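
You can check the resulting layout from Python (a sketch using the field
names from your struct; on a typical 64-bit platform with natural alignment
these offsets should match the C offsetof() values):

>>> dt = np.dtype([('position', 'f8', 3), ('velocity', 'f4', 3),
...                ('d0', 'i4'), ('time_of_receipt', 'u8')], align=True)
>>> [(name, dt.fields[name][1]) for name in dt.names]
[('position', 0), ('velocity', 24), ('d0', 36), ('time_of_receipt', 40)]
>>> dt.itemsize            # should equal sizeof(struct datum)
48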

If you want to get the field offsets given a numpy dtype in C, you need 
to iterate through the 'names' and 'fields' attributes of the dtype, eg


    PyObject *key, *tup, *offobj;
    Py_ssize_t i, n, offset;

    n = PyTuple_Size(descr->names);
    for (i = 0; i < n; i++) {
        key = PyTuple_GetItem(descr->names, i);
        tup = PyDict_GetItem(descr->fields, key);
        offobj = PyTuple_GetItem(tup, 1);
        offset = PyInt_AsSsize_t(offobj);
        /* do something with offset */
    }

(error checks & DECREFs might be needed)

Allan

On 07/25/2015 12:26 AM, Jason Newton wrote:
 After drilling through the sources a second time, I found it was
 numpy/core/src/multiarray/descriptor.c was the file to consult with
 the primary routine being PyArray_DescrConverter and _convert_from_*
 functions being the most interesting to read and glean the
 capabilities of this with.

 So in particular it looks like the _convert_from_dict based path is
 the way to go to allow custom field offsets to safely and
 transparently map C POD structs with the offset information generated
 at compile time to hopefully keep dtype's in perfect sync with C
 sources vs declaring on the python source side.  I plan on building a
 helper class to generate the dictionaries for this subroutine since
 something akin to the list dtype specification is more user-friendly
 (even towards me).

 -Jason

 On Thu, Jul 23, 2015 at 7:55 PM, Jason Newton nev...@gmail.com wrote:
 Hi folks,

 The moderator for the ML approved my subscription so I can now post
 this back in the numpy list rather than scipy.  Apologies for the
 duplicate/cross posting.


 I was trying to figure out how to make a dtype for a c-struct on the
 c-side and storing that in some boost python libraries I'm making.

 Imagine the following c-struct, greatly simplified of course from the
 real ones I need to expose:

 struct datum{
  double position[3];
  float velocity[3];
  int32_t d0;
  uint64_t time_of_receipt;
 };


 How would you make the dtype/PyArray_Descr for  this?

 I have as a point of reference compound types in HDF for similar
 struct descriptions (h5py makes these nearly 1:1 and converts back and
 forth to dtypes and hdf5 types, it uses Cython to accomplish this) -
 but I don't want to bring in hdf for this task - I'm not sure how well
 the offsets would go over in that translation to h5py too.

 Proper/best solutions would make use of offsetof as we insert entries
 to the dtype/PyArray_Descr.  It's fine if you do this in straight C -
 I just need to figure out how to accomplish this in a practical way.

 The language I'm working in is C++11.  The end goal is probably going
 to be to create a helper infrastructure to allow this to be made
 automatically-ish provided implementation of a [static] visitor
 pattern in C++.  The idea is to make numpy compatible c++ POD types
 rather than use Boost.Python wrapped proxies for every object which
 will cut down on some complicated and time consuming code (both of
 computer and developer) when ndarrays are what's called for.

 Related - would one work with record dtypes passed to C?  How would
 one lookup fields and offsets within a structure?

 Thanks for any advisement!

 -Jason
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] Should ndarray subclasses support the keepdims arg?

2015-05-05 Thread Allan Haldane
Hello all,

A question:

Many ndarray methods (eg sum, mean, any, min) have a keepdims keyword
argument, but ndarray subclass methods sometimes don't. The 'matrix'
subclass doesn't, and numpy functions like 'np.sum' intentionally
drop/ignore the keepdims argument when called with an ndarray subclass
as first argument.

This means you can't always use ndarray subclasses as 'drop in'
replacement for ndarrays if the code uses keepdims (even indirectly),
and it means code that deals with keepdims (eg np.sum and more) has to
detect ndarray subclasses and drop keepdims even if the subclass
supports it (since there is no good way to detect support). It seems to
me that if we are going to use inheritance, subclass methods should keep
the signature of the parent class methods. What does the list think?

 Details: 

This problem comes up in a PR I'm working on (#5706) to add the keepdims
arg to masked array methods. In order to support masked matrices (which
a lot of unit tests check), I would have to detect and drop the keepdims
arg to avoid an exception. This would be solved if the matrix class
supported keepdims (plus an update to np.sum). Similarly,
`np.sum(mymaskedarray, keepdims=True)` does not respect keepdims, but it
could work if all subclasses supported keepdims.

I do not foresee immediate problems with adding keepdims to the matrix
methods, except that it would be an unused argument. Modifying `np.sum`
to always pass on the keepdims arg is trickier, since it would break any
code that tried to np.sum a subclass that doesn't support keepdims, eg
pandas.DataFrame. **kwargs tricks might work. But if it's permissible I
think it would be better to require subclasses to support all the
keyword args ndarray supports.

Allan
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Should ndarray subclasses support the keepdims arg?

2015-05-05 Thread Allan Haldane
That makes sense. I think it's the way to go, thanks.

The downside is that using **kwargs instead of an explicit keepdims arg
gives a more obscure signature, but using the new __signature__ attribute
this could be hidden for Python 3 users and for Python 2 users on IPython 3+.
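
A minimal sketch of what Nathaniel suggests below (hypothetical names, not
numpy's actual implementation): use a sentinel default and only forward
keepdims when the caller gave it explicitly.

_NoValue = object()   # hypothetical sentinel

def sum(a, axis=None, dtype=None, out=None, keepdims=_NoValue):
    kwargs = {}
    if keepdims is not _NoValue:
        # only pass keepdims on if it was explicitly given, so subclasses
        # that don't accept it keep working
        kwargs['keepdims'] = keepdims
    return a.sum(axis=axis, dtype=dtype, out=out, **kwargs)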

On 05/05/2015 01:55 PM, Nathaniel Smith wrote:
 AFAICT the only real solution here is for np.sum and friends to
 propagate the keepdims argument if and only if it was explicitly passed
 to them (or maybe the slightly different, if and only if it has a
 non-default value). If we just started requiring code to handle it and
 passing it unconditionally, then as soon as someone upgraded numpy all
 their existing code might break for no good reason.
 
 On May 5, 2015 8:13 AM, Allan Haldane allanhald...@gmail.com wrote:
 
 Hello all,
 
 A question:
 
 Many ndarray methods (eg sum, mean, any, min) have a keepdims keyword
 argument, but ndarray subclass methods sometimes don't. The 'matrix'
 subclass doesn't, and numpy functions like 'np.sum' intentionally
 drop/ignore the keepdims argument when called with an ndarray subclass
 as first argument.
 
 This means you can't always use ndarray subclasses as 'drop in'
 replacement for ndarrays if the code uses keepdims (even indirectly),
 and it means code that deals with keepdims (eg np.sum and more) has to
 detect ndarray subclasses and drop keepdims even if the subclass
 supports it (since there is no good way to detect support). It seems to
 me that if we are going to use inheritance, subclass methods should keep
 the signature of the parent class methods. What does the list think?
 
  Details: 
 
 This problem comes up in a PR I'm working on (#5706) to add the keepdims
 arg to masked array methods. In order to support masked matrices (which
 a lot of unit tests check), I would have to detect and drop the keepdims
 arg to avoid an exception. This would be solved if the matrix class
 supported keepdims (plus an update to np.sum). Similarly,
 `np.sum(mymaskedarray, keepdims=True)` does not respect keepdims, but it
 could work if all subclasses supported keepdims.
 
 I do not foresee immediate problems with adding keepdims to the matrix
 methods, except that it would be an unused argument. Modifying `np.sum`
 to always pass on the keepdims arg is trickier, since it would break any
 code that tried to np.sum a subclass that doesn't support keepdims, eg
 pandas.DataFrame. **kwargs tricks might work. But if it's permissible I
 think it would be better to require subclasses to support all the
 keyword args ndarray supports.
 
 Allan
 ___
 NumPy-Discussion mailing list
  NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion
 
 
 
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion
 

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Rename arguments to np.clip and np.put

2015-03-31 Thread Allan Haldane
On 03/30/2015 07:16 PM, Jaime Fernández del Río wrote:
 On Mon, Mar 30, 2015 at 3:59 PM, Allan Haldane allanhald...@gmail.com wrote:
 
 Hello everyone,
 
 What does the list think of renaming the arguments of np.clip and np.put
 to match those of ndarray.clip/put? Currently the signatures are
 
 np.clip(a, a_min, a_max, out=None)
 ndarray.clip(a, min=None, max=None, out=None)
 
 np.put(a, ind, v, mode='raise')
 ndarray.put(indices, values, mode='raise')
 
 (The docstring for ndarray.clip is incorrect, too).
 
 I suggest the signatures might be changed to this:
 
 np.clip(a, min=None, max=None, out=None, **kwargs)
 np.put(a, indices, values, mode='raise')
 
 We can still take care of the old argument names for np.clip using
 **kwargs, while showing a deprecation warning. I think that would be
 fully back-compatible. Note this makes np.clip more flexible as only one
 of min or max are needed now, just like ndarray.clip.
 
 np.put is trickier to keep back-compatible as it has two positional
 arguments. Someone who called `np.put(a, v=0, ind=1)` would be in
 trouble with this change, although I didn't find anyone on github doing
 so. I suppose to maintain back-compatibility we could make indices and
 values keyword args, and use the same kwargs trick as in np.clip, but
 that might be confusing since they're both required args.
 
 
 Ideally we would want the signature to show as you describe it in the
 documentation, but during the deprecation period be something like e.g.
 
 np.put(a, indices=None, values=None, mode='raise', **kwargs)
 
 Not sure if that is even possible, maybe with functools.update_wrapper?
 
 Jaime


In Python 3 the __signature__ attribute does exactly what we want.
IPython 3.0.0 has also backported this for its docstrings internally for
Python 2. But that still leaves out Python 2 users who don't read
docstrings in IPython3+.

functools.update_wrapper uses the inspect module (as does IPython), and
the inspect module looks up func.func_code.co_varnames which is a
readonly python internals object, so that doesn't work.

How about making indices/values be keyword args, use __signature__ to
take care of Python 3, and then add a note in the docstring for Python 2
users that they are only temporarily keyword args during the deprecation
period?
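
A rough sketch of the __signature__ part (hypothetical code, just to show
the mechanism; on Python 3, inspect/help would then display the clean
signature even though the def still accepts **kwargs):

import inspect

def put(a, indices=None, values=None, mode='raise', **kwargs):
    ...  # handle the old 'ind'/'v' names from kwargs here, with a warning

_params = [inspect.Parameter(n, inspect.Parameter.POSITIONAL_OR_KEYWORD)
           for n in ('a', 'indices', 'values')]
_params.append(inspect.Parameter('mode',
                                 inspect.Parameter.POSITIONAL_OR_KEYWORD,
                                 default='raise'))
put.__signature__ = inspect.Signature(_params)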



 Any opinions or suggestions?
 
 Allan
 ___
 NumPy-Discussion mailing list
  NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion
 
 
 
 
 -- 
 (\__/)
 ( O.o)
 (  ) This is Bunny. Copy Bunny into your signature and help him with his
 plans for world domination.
 
 
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion
 

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] Rename arguments to np.clip and np.put

2015-03-30 Thread Allan Haldane
Hello everyone,

What does the list think of renaming the arguments of np.clip and np.put
to match those of ndarray.clip/put? Currently the signatures are

np.clip(a, a_min, a_max, out=None)
ndarray.clip(a, min=None, max=None, out=None)

np.put(a, ind, v, mode='raise')
ndarray.put(indices, values, mode='raise')

(The docstring for ndarray.clip is incorrect, too).

I suggest the signatures might be changed to this:

np.clip(a, min=None, max=None, out=None, **kwargs)
np.put(a, indices, values, mode='raise')

We can still take care of the old argument names for np.clip using
**kwargs, while showing a deprecation warning. I think that would be
fully back-compatible. Note this makes np.clip more flexible as only one
of min or max are needed now, just like ndarray.clip.
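
As a sketch of what I have in mind for clip (illustrative only, not the
final implementation):

import warnings
import numpy as np

def clip(a, min=None, max=None, out=None, **kwargs):
    # accept the old names during a deprecation period
    if 'a_min' in kwargs:
        warnings.warn("'a_min' is deprecated, use 'min'", DeprecationWarning)
        min = kwargs.pop('a_min')
    if 'a_max' in kwargs:
        warnings.warn("'a_max' is deprecated, use 'max'", DeprecationWarning)
        max = kwargs.pop('a_max')
    if kwargs:
        raise TypeError("unexpected keyword arguments %r" % list(kwargs))
    return np.asanyarray(a).clip(min, max, out=out)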

np.put is trickier to keep back-compatible as it has two positional
arguments. Someone who called `np.put(a, v=0, ind=1)` would be in
trouble with this change, although I didn't find anyone on github doing
so. I suppose to maintain back-compatibility we could make indices and
values keyword args, and use the same kwargs trick as in np.clip, but
that might be confusing since they're both required args.

Any opinions or suggestions?

Allan
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] should views into structured arrays be reversible?

2015-03-17 Thread Allan Haldane
Hello all,

I've introduced PR 5548 https://github.com/numpy/numpy/pull/5548
which, through more careful safety checks, allows views of object
arrays. However, I had to make 'partial views' into structured arrays
irreversible, and I want to check with the list that that's ok.

With the PR, if you only view certain fields of an array you cannot take
a 'reverse' view of the resulting object to get back the original array:

>>> arr = np.array([(1,2),(4,5)], dtype=[('A', 'i'), ('B', 'i')])
>>> varr = arr.view({'names': ['A'], 'formats': ['i'],
...                  'itemsize': arr.dtype.itemsize})
>>> varr.view(arr.dtype)
TypeError: view would access data parent array doesn't own

Ie., with this PR you can only take views into parts of an array that
have fields.

This was necessary in order to guarantee that we never interpret memory
containing a python Object as another type, which could cause a
segfault. I have a more extensive discussion  motivation in the PR,
including an alternative idea.

So does this limitation seem reasonable?

Cheers,
Allan

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Views of a different dtype

2015-01-29 Thread Allan Haldane
Hi Jamie,

I'm not sure whether to reply here or on github!

I have a comment on the condition There must be the same total number
of objects before and after.

I originally looked into this to solve
https://github.com/numpy/numpy/issues/3256, which involves taking a view
of a subset of the fields of a structured array. Such a view causes the
final number objects to be less than the original number.

For example, say you have an array with three fields a,b,c, and you only
want a and c. Numpy currently does the equivalent of this (in
_index_fields in numpy/core/_internals.py):

>>> a = np.array([(1,2,3), (4,5,6)],
...              dtype=[('a', 'i8'), ('b', 'i8'), ('c', 'i8')])
>>> a.view(np.dtype({'names': ['x', 'y'],
...                  'formats': ['i8', 'i8'],
...                  'offsets': [0, 16]}))
array([(1, 3), (4, 6)],
      dtype={'names':['x','y'], 'formats':['<i8','<i8'],
             'offsets':[0,16], 'itemsize':24})

It works although the repr is a bit wordy. However, it currently does
not work if we have 'O' instead of 'i8' since we can't take views of
object arrays. If this was an object array, the resulting view only has
two objects, not the original three.

This does mean that the view can't be undone since the view object no
longer knows there is an object at offset 8. But that does not seem
bad to me compared to the usefulness of such a view.

I think to fix this, you only need to remove the test for this case in
dtype_view_is_safe, and then return all(o in old_offsets for o in
new_offsets). (and btw, I like the much cleaner logic you have in that
section!)
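
Concretely, the check I have in mind would look something like this (a
hypothetical helper; get_object_offsets is assumed to be the routine from
your gist that collects the byte offsets of object fields):

def object_offsets_compatible(old_dtype, new_dtype):
    # every offset that holds an object in the new dtype must already
    # hold an object in the old dtype; extra objects in the old dtype
    # are fine, they simply become invisible in the view
    old = set(get_object_offsets(old_dtype))
    new = set(get_object_offsets(new_dtype))
    return new.issubset(old)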

Allan

On 01/28/2015 07:56 PM, Jaime Fernández del Río wrote:
 HI all,
 
 There has been some recent discussion going on on the limitations that
 numpy imposes to taking views of an array with a different dtype.
 
 As of right now, you can basically only take a view of an array if it
 has no Python objects and neither the old nor the new dtype are
 structured. Furthermore, the array has to be either C or Fortran contiguous.
 
 This seem to be way too strict, but the potential for disaster getting a
 loosening of the restrictions wrong is big, so it should be handled with
 care.
 
 Allan Haldane and myself have been looking into this separately and
 discussing some of the details over at github, and we both think that
 the only true limitation that has to be imposed is that the offsets of
 Python objects within the new and old dtypes remain compatible. I have
 expanded Allan's work from here:
 
 https://github.com/ahaldane/numpy/commit/e9ca367
 
 to make it as flexible as I have been able. An implementation of the
 algorithm in Python, with a few tests, can be found here:
 
 https://gist.github.com/jaimefrio/b4dae59fa09fccd9638c#file-dtype_compat-py
 
 I would appreciate getting some eyes on it for correctness, and to make
 sure that it won't break with some weird dtype.
 
 I am also trying to figure out what the ground rules for stride and
 shape conversions when taking a view with a different dtype should be. I
 submitted a PR (gh-5508) a couple of days ago working on that, but I am
 not so sure that the logic is completely sound. Again, to get more eyes
 on it, I am going to reproduce my thoughts here on the hope of getting
 some feedback.
 
 The objective would be to always allow a view of a different dtype
 (given that the dtypes be compatible as described above) to be taken if:
 
   * The itemsize of the dtype doesn't change.
   * The itemsize changes, but the array being viewed is the result of
 slicing and transposing axes of a contiguous array, and it is still
 contiguous, defined as stride == dtype.itemsize, along its
 smallest-strided dimension, and the itemsize of the newtype exactly
 divides the size of that dimension.
   * Ideally taking a view should be a reversible process, i.e. if
 oldtype = arr.dtype, then doing arr.view(newtype).view(oldtype)
 should give you back a view of arr with the same original shape,
 strides and dtype.
 
 This last point can get tricky if the minimal stride dimension has size
 1, as there could be several of those, e.g.:
 
 >>> a = np.ones((3, 4, 1), dtype=float)[:, :2, :].transpose(0, 2, 1)
 >>> a.flags.contiguous
 False
 >>> a.shape
 (3, 1, 2)
 >>> a.strides  # the stride of the size 1 dimension could be anything, ignore it!
 (32, 8, 8)

 >>> b = a.view(complex)  # this fails right now, but should work
 >>> b.flags.contiguous
 False
 >>> b.shape
 (3, 1, 1)
 >>> b.strides  # the stride of the size 1 dimensions could be anything, ignore them!
 (32, 16, 16)

 >>> c = b.view(float)  # which of the two size 1 dimensions should we expand?
 
 
 In the face of ambiguity refuse the temptation to guess dictates that
 last view should raise an error, unless we agree and document some
 default. Any thoughts?
 
 Then there is the endless complication one could get into with arrays
 created with as_strided. I'm not smart enough

Re: [Numpy-discussion] Views of a different dtype

2015-01-29 Thread Allan Haldane
Hello again,

I also have a minor code comment:

In get_object_offsets you iterate over dtype.fields.values(). Be
careful, because dtype.fields also includes the field titles. For
example this fails:

dta = np.dtype([(('a', 'title'), 'O'), ('b', 'O'), ('c', 'i1')])
dtb = np.dtype([('a', 'O'), ('b', 'O'), ('c', 'i1')])
assert dtype_view_is_safe(dta, dtb)

I've seen two strategies in the numpy code to work around this. One is
to to skip entries that are titles, like this:

for key, field in dtype.fields.iteritems():
    if len(field) == 3 and field[2] == key:  # detect titles
        continue
    # do something

You can find all examples that do this by grepping NPY_TITLE_KEY in the
numpy source.

The other (more popular) strategy is to iterate over dtype.names. You
can find all examples of this by grepping for names_size.
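
For example, the field offsets could be collected using names instead (a
sketch):

offsets = []
for name in dtype.names:        # dtype.names never includes titles
    offset = dtype.fields[name][1]
    offsets.append(offset)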

I don't know the history of it, but it looks to me like titles in
dtypes are an obsolete feature. Are they actually used anywhere?

Allan

On 01/28/2015 07:56 PM, Jaime Fernández del Río wrote:
 HI all,
 
 There has been some recent discussion going on on the limitations that
 numpy imposes to taking views of an array with a different dtype.
 
 As of right now, you can basically only take a view of an array if it
 has no Python objects and neither the old nor the new dtype are
 structured. Furthermore, the array has to be either C or Fortran contiguous.
 
 This seem to be way too strict, but the potential for disaster getting a
 loosening of the restrictions wrong is big, so it should be handled with
 care.
 
 Allan Haldane and myself have been looking into this separately and
 discussing some of the details over at github, and we both think that
 the only true limitation that has to be imposed is that the offsets of
 Python objects within the new and old dtypes remain compatible. I have
 expanded Allan's work from here:
 
 https://github.com/ahaldane/numpy/commit/e9ca367
 
 to make it as flexible as I have been able. An implementation of the
 algorithm in Python, with a few tests, can be found here:
 
 https://gist.github.com/jaimefrio/b4dae59fa09fccd9638c#file-dtype_compat-py
 
 I would appreciate getting some eyes on it for correctness, and to make
 sure that it won't break with some weird dtype.
 
 I am also trying to figure out what the ground rules for stride and
 shape conversions when taking a view with a different dtype should be. I
 submitted a PR (gh-5508) a couple of days ago working on that, but I am
 not so sure that the logic is completely sound. Again, to get more eyes
 on it, I am going to reproduce my thoughts here on the hope of getting
 some feedback.
 
 The objective would be to always allow a view of a different dtype
 (given that the dtypes be compatible as described above) to be taken if:
 
   * The itemsize of the dtype doesn't change.
   * The itemsize changes, but the array being viewed is the result of
 slicing and transposing axes of a contiguous array, and it is still
 contiguous, defined as stride == dtype.itemsize, along its
 smallest-strided dimension, and the itemsize of the newtype exactly
 divides the size of that dimension.
   * Ideally taking a view should be a reversible process, i.e. if
 oldtype = arr.dtype, then doing arr.view(newtype).view(oldtype)
 should give you back a view of arr with the same original shape,
 strides and dtype.
 
 This last point can get tricky if the minimal stride dimension has size
 1, as there could be several of those, e.g.:
 
 >>> a = np.ones((3, 4, 1), dtype=float)[:, :2, :].transpose(0, 2, 1)
 >>> a.flags.contiguous
 False
 >>> a.shape
 (3, 1, 2)
 >>> a.strides  # the stride of the size 1 dimension could be anything, ignore it!
 (32, 8, 8)

 >>> b = a.view(complex)  # this fails right now, but should work
 >>> b.flags.contiguous
 False
 >>> b.shape
 (3, 1, 1)
 >>> b.strides  # the stride of the size 1 dimensions could be anything, ignore them!
 (32, 16, 16)

 >>> c = b.view(float)  # which of the two size 1 dimensions should we expand?
 
 
 In the face of ambiguity refuse the temptation to guess dictates that
 last view should raise an error, unless we agree and document some
 default. Any thoughts?
 
 Then there is the endless complication one could get into with arrays
 created with as_strided. I'm not smart enough to figure when and when
 not those could work, but am willing to retake the discussion if someone
 wiser is interested.
 
 With all these in mind, my proposal for the new behavior is that taking
 a view of an array with a different dtype would require:
 
  1. That the newtype and oldtype be compatible, as defined by the
 algorithm checking object offsets linked above.
  2. If newtype.itemsize == oldtype.itemsize no more checks are needed,
 make it happen!
  3. If the array is C/Fortran contiguous, check that the size in bytes
 of the last/first dimension is evenly divided by newtype.itemsize

[Numpy-discussion] structured arrays, recarrays, and record arrays

2015-01-18 Thread Allan Haldane
Hello all,

Documentation of recarrays is poor and I'd like to improve it. In order 
to do this I've been looking at core/records.py, and I would appreciate 
some feedback on my plan.

Let me start by describing what I see. In the docs there is some 
confusion about 'structured arrays' vs 'record arrays' vs 'recarrays' - 
the docs use them often interchangeably. They also refer to structured 
dtypes alternately as 'struct data types', 'record data types' or simply 
'records' (eg, see the reference/arrays.dtypes and 
reference/arrays.indexing doc pages).

But by my reading of the code there are really three (or four) distinct 
types of arrays with structure. Here's a possible nomenclature:
  * Structured arrays are simply ndarrays with structured dtypes. That
is, the data type is subdivided into fields of different type.
  * recarrays are a subclass of ndarrays that allow access to the
fields by attribute.
  * Record arrays are recarrays where the elements have additionally
been converted to 'numpy.core.records.record' type such that each
data element is an object with field attributes.
  * (it is also possible to create arrays with dtype.type of
numpy.core.records.record, but which are not recarrays. However I
have never seen this done.)

Here's code demonstrating the creation of the different types of array 
(in order: structured array, recarray, ???, record array).

 >>> arr = np.array([(1,'a'), (2,'b')],
 ...                dtype=[('foo', int), ('bar', 'S1')])
 >>> recarr = arr.view(type=np.recarray)
 >>> noname = arr.view(dtype=np.dtype((np.record, arr.dtype)))
 >>> recordarr = arr.view(dtype=np.dtype((np.record, arr.dtype)),
 ...                      type=np.recarray)

 >>> type(arr), arr.dtype.type
 (numpy.ndarray, numpy.void)
 >>> type(recarr), recarr.dtype.type
 (numpy.core.records.recarray, numpy.void)
 >>> type(recordarr), recordarr.dtype.type
 (numpy.core.records.recarray, numpy.core.records.record)

Note that the functions numpy.rec.array, numpy.rec.fromrecords, 
numpy.rec.fromarrays, and np.recarray.__new__ create record arrays. 
However, in the docs you can see examples of the creation of recarrays, 
eg in the recarray and ndarray.view doctrings and in 
http://www.scipy.org/Cookbook/Recarray. The files 
numpy/lib/recfunctions.py and numpy/lib/npyio.py (and possibly masked 
arrays, but I haven't looked yet) make extensive use of recarrays (but 
not record arrays).

The main functional difference between recarrays and record arrays is 
field access on individual elements:

 >>> recordarr[0].foo
 1
 >>> recarr[0].foo
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
 AttributeError: 'numpy.void' object has no attribute 'foo'

Also, note that recarrays have a small performance penalty relative to 
structured arrays, and record arrays have another one relative to 
recarrays because of the additional python logic.

So my first goal in updating the docs is to use the right terms in the 
right place. In almost all cases, references to 'records' (eg 'record 
types') should be replaced with 'structured' (eg 'structured types'), 
with the exception of docs that deal specifically with record arrays. 
It's my guess that in the distant past structured datatypes were 
intended to always be of type numpy.core.records.record (thus the 
description in reference/arrays.dtypes) but that 
numpy.core.records.record became generally obsolete without updates to 
the docs. doc/records.rst.txt seems to document the transition.

I've made a preliminary pass of the docs, which you can see here
https://github.com/ahaldane/numpy/commit/d87633b228dabee2ddfe75d1ee9e41ba7039e715
Mostly I renamed 'record type' to 'structured type', and added a very 
rough draft to numpy/doc/structured_arrays.py.

I would love to hear from those more knowledgeable than myself on 
whether this works!

Cheers,
Allan
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] structured arrays, recarrays, and record arrays

2015-01-18 Thread Allan Haldane

In light of my previous message I'd like to bring up 
https://github.com/numpy/numpy/issues/3581, as it is now clearer to me 
what is happening. In the example on that page the user creates a 
recarray and a record array (in my nomenclature) without realizing that 
they are slightly different types of beast. This is probably because the 
str() or repr() representations of these two objects are identical. To 
distinguish them you have to look at their dtype.type. Using the setup 
from my last message:

 >>> print repr(recarr)
 rec.array([(1, 'a'), (2, 'b')],
       dtype=[('foo', '<i8'), ('bar', 'S1')])
 >>> print repr(recordarr)
 rec.array([(1, 'a'), (2, 'b')],
       dtype=[('foo', '<i8'), ('bar', 'S1')])
 >>> print repr(recarr.dtype)
 dtype([('foo', '<i8'), ('bar', 'S1')])
 >>> print repr(recordarr.dtype)
 dtype([('foo', '<i8'), ('bar', 'S1')])
 >>> print recarr.dtype.type
 <type 'numpy.void'>
 >>> print recordarr.dtype.type
 <class 'numpy.core.records.record'>

Based on this, it occurs to me that the repr of a dtype should list 
dtype.type if it is not numpy.void. This might be nice to see:

 >>> print repr(recarr.dtype)
 dtype([('foo', '<i8'), ('bar', 'S1')])
 >>> print repr(recordarr.dtype)
 dtype((numpy.core.records.record, [('foo', '<i8'), ('bar', 'S1')]))

I could easily implement this by redefining __repr__ for the 
numpy.core.records.record class, but this does not solve the problem for 
any other cases of overridden base_dtype. So perhaps modifications 
should be made to the original repr function of dtype (in the functions 
arraydescr_struct_str and arraydescr_struct_repr in 
numpy/core/src/multiarray/descriptor.c). However, also note that the doc 
http://docs.scipy.org/doc/numpy/reference/arrays.dtypes.html says that 
creating dtypes using the form dtype((base_dtype, new_dtype)) is 
discouraged (near the bottom).

Another possibility is to discourage recarrays, and only document record 
arrays (or vv). However, many people's code already depends on both of 
these types.

Is any of this at all reasonable? It would require a change to dtype str 
and repr, which could affect a lot of things.

Cheers,
Allan

On 01/18/2015 11:36 PM, Allan Haldane wrote:
 Hello all,

 Documentation of recarrays is poor and I'd like to improve it. In order
 to do this I've been looking at core/records.py, and I would appreciate
 some feedback on my plan.

 Let me start by describing what I see. In the docs there is some
 confusion about 'structured arrays' vs 'record arrays' vs 'recarrays' -
 the docs use them often interchangeably. They also refer to structured
 dtypes alternately as 'struct data types', 'record data types' or simply
 'records' (eg, see the reference/arrays.dtypes and
 reference/arrays.indexing doc pages).

 But by my reading of the code there are really three (or four) distinct
 types of arrays with structure. Here's a possible nomenclature:
   * Structured arrays are simply ndarrays with structured dtypes. That
 is, the data type is subdivided into fields of different type.
   * recarrays are a subclass of ndarrays that allow access to the
 fields by attribute.
   * Record arrays are recarrays where the elements have additionally
 been converted to 'numpy.core.records.record' type such that each
 data element is an object with field attributes.
   * (it is also possible to create arrays with dtype.type of
 numpy.core.records.record, but which are not recarrays. However I
 have never seen this done.)

 Here's code demonstrating the creation of the different types of array
 (in order: structured array, recarray, ???, record array).

   arr = np.array([(1,'a'), (2,'b')],
 dtype=[('foo', int), ('bar', 'S1')])
   recarr = arr.view(type=np.recarray)
   noname = arr.view(dtype=np.dtype((np.record, arr.dtype)))
   recordarr = arr.view(dtype=np.dtype((np.record, arr.dtype)),
   type=np.recarray)

   type(arr), arr.dtype.type
  (numpy.ndarray, numpy.void)
   type(recarr), recarr.dtype.type
  (numpy.core.records.recarray, numpy.void)
   type(recordarr), recordarr.dtype.type
  (numpy.core.records.recarray, numpy.core.records.record)

 Note that the functions numpy.rec.array, numpy.rec.fromrecords,
 numpy.rec.fromarrays, and np.recarray.__new__ create record arrays.
 However, in the docs you can see examples of the creation of recarrays,
 eg in the recarray and ndarray.view doctrings and in
 http://www.scipy.org/Cookbook/Recarray. The files
 numpy/lib/recfunctions.py and numpy/lib/npyio.py (and possibly masked
 arrays, but I haven't looked yet) make extensive use of recarrays (but
 not record arrays).

 The main functional difference between recarrays and record arrays is
 field access on individual elements:

   recordarr[0].foo
  1
   recarr[0].foo
  Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
  AttributeError

[Numpy-discussion] proposed change to recarray access

2015-01-14 Thread Allan Haldane
Hello all,

I've submitted a pull request on github which changes how string values
in recarrays are returned, which may break old code.

https://github.com/numpy/numpy/pull/5454
See also: https://github.com/numpy/numpy/issues/3993

Previously, recarray fields of type 'S' or 'U' (ie, strings) would be
returned as chararrays when accessed by attribute, but ndarrays when
accessed by indexing:

>>> arr = np.array([('abc ', 1), ('abc', 2)],
...                dtype=[('str', 'S4'), ('id', int)])
>>> arr = arr.view(np.recarray)
>>> type(arr.str)
numpy.core.defchararray.chararray
>>> type(arr['str'])
numpy.core.records.recarray

Chararray is deprecated, and furthermore this led to bugs in my code
since chararrays trim trailing whitespace but ndarrays do not (and I
was not aware of the conversion to chararray). For example:

>>> arr.str[0] == arr.str[1]
True
>>> arr['str'][0] == arr['str'][1]
False

In the pull request I have changed recarray attribute access so ndarrays
are always returned. I think this is a sensible thing to do but it may
break code which depends on chararray features (including the trimmed
whitespace).

Does this sound reasonable?

Best,
Allan
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion