Re: Attn: Wes, Re: Masked Arrays

Joris Van den Bossche Tue, 31 Mar 2020 05:52:54 -0700

Note that pandas is starting to use a notion of "masked arrays" as well,
for example for its nullable integer data type. It does not use the np.ma
masked array for this, though, but a custom implementation (for technical
reasons this was easier in pandas).
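
As a rough illustration of the nullable integer type mentioned above, here
is a minimal sketch using only public pandas API (`pd.array` with the
"Int64" dtype). It shows the mask-based behaviour next to np.ma; it says
nothing about how pandas stores the mask internally.

```python
import numpy as np
import pandas as pd

# Nullable integer array: note the capital "I" in "Int64".
arr = pd.array([1, None, 3], dtype="Int64")

# Missing entries are tracked separately from the values, so the values
# stay integers instead of being upcast to float to make room for NaN.
mask = arr.isna()        # NumPy boolean array, True where a value is missing
total = arr.sum()        # missing entries are skipped by default

# The same data as an np.ma masked array, for comparison.
marr = np.ma.masked_array([1, 0, 3], mask=[False, True, False])
ma_total = marr.sum()    # masked entries are ignored as well

print(arr.dtype, mask.tolist(), total, ma_total)
```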


Also, there was quite a bit of discussion last year in numpy about a
possible re-implementation of a MaskedArray, but using numpy's protocols
(`__array_ufunc__`, `__array_function__` etc.), instead of being a subclass
like np.ma is now. See e.g.
https://mail.python.org/pipermail/numpy-discussion/2019-June/079681.html
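
To make the protocol idea concrete, here is a minimal sketch of a masked
container that is not an ndarray subclass and only implements
`__array_ufunc__`. The class name `SketchMasked` and its `data`/`mask`
attributes are made up for this example; it is not the design proposed in
the linked discussion and it only handles plain elementwise ufunc calls.

```python
import numpy as np


class SketchMasked(np.lib.mixins.NDArrayOperatorsMixin):
    """Hypothetical container: a data array plus a boolean mask (True = missing)."""

    def __init__(self, data, mask):
        self.data = np.asarray(data)
        self.mask = np.asarray(mask, dtype=bool)

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        # Only plain single-output calls such as np.add(a, b) are handled;
        # reductions, out=, where= and so on are left out of this sketch.
        if method != "__call__" or kwargs or ufunc.nout != 1:
            return NotImplemented
        datas, mask = [], False
        for x in inputs:
            if isinstance(x, SketchMasked):
                datas.append(x.data)
                mask = mask | x.mask   # a result is missing if any input is
            else:
                datas.append(x)
        result = ufunc(*datas)
        full_mask = np.broadcast_to(mask, np.shape(result)).copy()
        return SketchMasked(result, full_mask)


a = SketchMasked([1, 2, 3], [False, True, False])
b = a + 10   # the mixin routes the operator to np.add, then to __array_ufunc__
c = np.multiply(a, SketchMasked([1, 1, 1], [True, False, False]))
```

The mixin only maps Python operators onto ufuncs; all of the mask handling
lives in `__array_ufunc__`, which is the part the protocol-based approach
relies on.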

Joris

On Mon, 30 Mar 2020 at 18:57, Daniel Nugent <[email protected]> wrote:

> Ok. That actually aligns closely to what I'm familiar with. Good to know.
>
> Thanks again for taking the time to respond,
>
> -Dan Nugent
>
>
> On Mon, Mar 30, 2020 at 12:38 PM Wes McKinney <[email protected]> wrote:
>
>> Social and technical reasons I guess. Empirically it's just not used much.
>>
>> You can see my comments about numpy.ma in my 2010 paper about pandas:
>>
>> https://conference.scipy.org/proceedings/scipy2010/pdfs/mckinney.pdf
>>
>> At least in 2010, there were notable performance problems when using
>> MaskedArray for computations:
>>
>> "We chose to use NaN as opposed to using NumPy MaskedArrays for
>> performance reasons (which are beyond the scope of this paper), as NaN
>> propagates in floating-point operations in a natural way and can be
>> easily detected in algorithms."
>>
>> On Mon, Mar 30, 2020 at 11:20 AM Daniel Nugent <[email protected]> wrote:
>> >
>> > Thanks! Since I'm just using it to jump to Arrow, I think I'll stick
>> > with it.
>> >
>> > Do you have any feelings about why Numpy's masked arrays didn't gain
>> > favor when many data representation formats explicitly support nullity
>> > (including Arrow)? Is it just that not carrying nulls in computations
>> > forward is preferable (that is, early filtering/value filling was easier)?
>> >
>> > -Dan Nugent
>> >
>> >
>> > On Mon, Mar 30, 2020 at 11:40 AM Wes McKinney <[email protected]> wrote:
>> >>
>> >> On Mon, Mar 30, 2020 at 8:31 AM Daniel Nugent <[email protected]> wrote:
>> >> >
>> >> > Didn't want to follow up on this on the Jira issue earlier since
>> >> > it's sort of tangential to that bug and more of a usage question. You said:
>> >> >
>> >> > > I wouldn't recommend building applications based on them nowadays
>> >> > > since the level of support / compatibility in other projects is low.
>> >> >
>> >> > In my case, I am using them since it seemed like a straightforward
>> >> > representation of my data that has nulls, the format I'm converting from
>> >> > has zero cost numpy representations, and converting from an internal format
>> >> > into Arrow in memory structures appears zero cost (or close to it) as well.
>> >> > I guess I can just provide the mask as an explicit argument, but my
>> >> > original desire to use it came from being able to exploit
>> >> > numpy.ma.concatenate in a way that saved some complexity in implementation.
>> >> >
>> >> > Since Arrow itself supports masking values with a bitfield, is there
>> >> > something intrinsic to the notion of array masks that is not well
>> >> > supported? Or do you just mean the specific numpy MaskedArray class?
>> >> >
>> >>
>> >> I mean just the numpy.ma module. Not many Python computing projects
>> >> nowadays treat MaskedArray objects as first class citizens. Depending
>> >> on what you need it may or may not be a problem. pyarrow supports
>> >> ingesting from MaskedArray as a convenience, but it would not be
>> >> common in my experience for a library's APIs to return MaskedArrays.
>> >>
>> >> > If this is too much of a numpy question rather than an arrow
>> >> > question, could you point me to where I can read up on masked array support,
>> >> > or maybe to the right place to ask the numpy community whether what
>> >> > I'm doing is appropriate or not?
>> >> >
>> >> > Thanks,
>> >> >
>> >> >
>> >> > -Dan Nugent
>>
>
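
For the workflow described in the quoted thread (combining masked pieces
with numpy.ma.concatenate, then handing the mask over explicitly), a small
sketch might look like the following. The sample arrays are invented for
illustration. The resulting `values` and `mask` could then be passed to a
consumer that accepts an explicit mask, for instance the `mask` argument of
`pyarrow.array`, where True marks a null.

```python
import numpy as np

# Two masked chunks; True in the mask means "missing".
a = np.ma.masked_array([1, 2, 3], mask=[False, True, False])
b = np.ma.masked_array([4, 5], mask=[True, False])

# numpy.ma.concatenate joins the data and the masks together.
combined = np.ma.concatenate([a, b])

# Split the result into a plain ndarray of values and a full boolean mask.
values = np.ma.getdata(combined)
mask = np.ma.getmaskarray(combined)   # always a full array, even if nothing is masked

print(values.tolist(), mask.tolist())
```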
