Note that pandas is also starting to use a notion of "masked arrays", for example for its nullable integer data type. It does not use the np.ma masked array for this either, but a custom implementation (for technical reasons, this was easier in pandas).
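To make the comparison concrete, here is a minimal sketch that puts the nullable integer type next to an np.ma array holding the same values. It assumes a reasonably recent pandas where the "Int64" dtype and `pd.array` are available; the variable names are only for illustration.

```python
import numpy as np
import pandas as pd

# pandas nullable integer array: the missing entry is tracked by pandas'
# own mask, not by a numpy.ma.MaskedArray.
arr = pd.array([1, None, 3], dtype="Int64")

# The same values as a numpy.ma masked array, for comparison. The 0 is
# only a placeholder stored under the mask.
marr = np.ma.masked_array([1, 0, 3], mask=[False, True, False])

# Both expose a boolean view of which entries are missing.
pandas_missing = arr.isna()
numpy_missing = np.ma.getmaskarray(marr)

# Reductions skip the missing entry in both cases (pandas: skipna=True
# is the default).
pandas_total = arr.sum()
numpy_total = marr.sum()

print(pandas_missing, numpy_missing)
print(pandas_total, numpy_total)
```

So the user-facing behaviour is similar in this small case, even though the two implementations are separate.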
Also, there has been quite some discussion last year in numpy about a possible re-implementation of a MaskedArray, but using numpy's protocols (`__array_ufunc__`, `__array_function__` etc), instead of being a subclass like np.ma now is. See eg https://mail.python.org/pipermail/numpy-discussion/2019-June/079681.html.

Joris

On Mon, 30 Mar 2020 at 18:57, Daniel Nugent <[email protected]> wrote:

> Ok. That actually aligns closely to what I'm familiar with. Good to know.
>
> Thanks again for taking the time to respond,
>
> -Dan Nugent
>
>
> On Mon, Mar 30, 2020 at 12:38 PM Wes McKinney <[email protected]> wrote:
>
>> Social and technical reasons I guess. Empirically it's just not used much.
>>
>> You can see my comments about numpy.ma in my 2010 paper about pandas
>>
>> https://conference.scipy.org/proceedings/scipy2010/pdfs/mckinney.pdf
>>
>> At least in 2010, there were notable performance problems when using
>> MaskedArray for computations
>>
>> "We chose to use NaN as opposed to using NumPy MaskedArrays for
>> performance reasons (which are beyond the scope of this paper), as NaN
>> propagates in floating-point operations in a natural way and can be
>> easily detected in algorithms."
>>
>> On Mon, Mar 30, 2020 at 11:20 AM Daniel Nugent <[email protected]> wrote:
>> >
>> > Thanks! Since I'm just using it to jump to Arrow, I think I'll stick with it.
>> >
>> > Do you have any feelings about why Numpy's masked arrays didn't gain favor when many data representation formats explicitly support nullity (including Arrow)? Is it just that not carrying nulls in computations forward is preferable (that is, early filtering/value filling was easier)?
>> >
>> > -Dan Nugent
>> >
>> >
>> > On Mon, Mar 30, 2020 at 11:40 AM Wes McKinney <[email protected]> wrote:
>> >>
>> >> On Mon, Mar 30, 2020 at 8:31 AM Daniel Nugent <[email protected]> wrote:
>> >> >
>> >> > Didn’t want to follow up on this on the Jira issue earlier since it's sort of tangential to that bug and more of a usage question. You said:
>> >> >
>> >> > > I wouldn't recommend building applications based on them nowadays since the level of support / compatibility in other projects is low.
>> >> >
>> >> > In my case, I am using them since it seemed like a straightforward representation of my data that has nulls, the format I’m converting from has zero cost numpy representations, and converting from an internal format into Arrow in memory structures appears zero cost (or close to it) as well. I guess I can just provide the mask as an explicit argument, but my original desire to use it came from being able to exploit numpy.ma.concatenate in a way that saved some complexity in implementation.
>> >> >
>> >> > Since Arrow itself supports masking values with a bitfield, is there something intrinsic to the notion of array masks that is not well supported? Or do you just mean the specific numpy MaskedArray class?
>> >> >
>> >>
>> >> I mean just the numpy.ma module. Not many Python computing projects
>> >> nowadays treat MaskedArray objects as first class citizens. Depending
>> >> on what you need it may or may not be a problem. pyarrow supports
>> >> ingesting from MaskedArray as a convenience, but it would not be
>> >> common in my experience for a library's APIs to return MaskedArrays.
>> >>
>> >> > If this is too much of a numpy question rather than an arrow question, could you point me to where I can read up on masked array support or maybe what the right place to ask the numpy community about whether what I'm doing is appropriate or not.
>> >> >
>> >> > Thanks,
>> >> >
>> >> >
>> >> > -Dan Nugent
>> >
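
For reference, the two approaches mentioned in the quoted thread (using numpy.ma.concatenate, or providing the mask as an explicit argument) can be combined. The following is only a sketch with made-up sample values, not the code from the original use case: it concatenates two masked arrays and then splits the result into a plain values array and a plain boolean mask.

```python
import numpy as np

# Two hypothetical chunks of data with nulls, held as masked arrays.
a = np.ma.masked_array([1, 2, 3], mask=[False, True, False])
b = np.ma.masked_array([4, 5], mask=[True, False])

# numpy.ma.concatenate joins the values and the masks together.
combined = np.ma.concatenate([a, b])

# Split into plain ndarrays, so the mask can be handed over explicitly.
values = np.ma.getdata(combined)      # underlying data, masked slots included
mask = np.ma.getmaskarray(combined)   # True where the value is null

# Assumption, not executed here so the block only needs NumPy: per the
# pyarrow documentation, pyarrow.array(values, mask=mask) accepts such a
# boolean mask, with True marking a null.
print(values, mask)
```

With this split, the MaskedArray only exists as an intermediate step and never has to cross a library boundary.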
