Does anyone have an opinion (or links) on bitpattern vs. masked-array
NA implementations? The NumPy community seems to have discussed this
in 2012
(https://numpy.org/neps/nep-0026-missing-data-summary.html), without an
apparent result.
Summary of the Summary:
- The bitpattern approach reserves one bit pattern of each type as NA.
Integers are the only types without a spare bit pattern, so for them
this shrinks the usable range by one value. R takes this approach, and
in 2012 it was regarded as the more performant option.
- The mask approach was deemed more flexible, since it would allow
"degrees of missingness", and also cleaner and easier to implement.
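To make the two options concrete, here is a minimal NumPy sketch of the same logical column [1, NA, 3] in both layouts. The name NA_INT32 is made up for this illustration; using the smallest int32 as the sentinel mirrors what R does for its integer NA, but nothing here is NumPy or Arrow API for missing data.

```python
import numpy as np

# Bitpattern (sentinel) style: reserve one int32 value as NA.
# NA_INT32 is a hypothetical name used only in this sketch.
NA_INT32 = np.iinfo(np.int32).min

sentinel = np.array([1, NA_INT32, 3], dtype=np.int32)
is_na_sentinel = sentinel == NA_INT32  # NA check is a value comparison

# Mask style: values and missingness are stored separately, so the full
# int32 range stays usable at the cost of extra storage for the mask.
values = np.array([1, 0, 3], dtype=np.int32)
is_na_mask = np.array([False, True, False])

# Both layouts describe the same data; summing the valid entries agrees.
sum_sentinel = int(sentinel[~is_na_sentinel].sum())
sum_mask = int(values[~is_na_mask].sum())
```

In the sentinel layout every kernel has to know the reserved value; in the mask layout the values buffer can hold anything in the masked slot.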
Since bitpattern checks would probably disrupt SIMD, I suspect some
calculations (e.g. mean) would actually benefit more from setting NA
values to zero, proceeding as if they were not there, and using the
number of NAs stored in the metadata to adjust the result. Of course,
this does not work when two columns are involved (e.g. scalar product),
which is probably the more important case.
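A rough sketch of the zero-fill idea for the mean, assuming the NA slots already hold 0.0 and the NA count is available as metadata. The function name mean_zero_filled and the na_count parameter are invented for this example.

```python
import numpy as np

def mean_zero_filled(values, na_count):
    """Mean of a column whose NA slots already hold 0.0.

    The sum runs over the whole buffer without any per-element NA check;
    only the divisor is corrected using the NA count from the metadata.
    """
    n_valid = values.size - na_count
    if n_valid == 0:
        return float("nan")  # no valid values at all
    return float(values.sum()) / n_valid

# Logical column: [2.0, NA, 4.0, NA, 6.0], with NAs stored as 0.0.
col = np.array([2.0, 0.0, 4.0, 0.0, 6.0])
result = mean_zero_filled(col, na_count=2)

# Reference: mean over the valid values only.
reference = float(np.mean([2.0, 4.0, 6.0]))
```

With two columns the divisor would have to be the number of positions valid in both, and that cannot be derived from the two NA counts alone, so the masks (or sentinels) are needed again.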
Was using bitmasks in Arrow a conscious performance decision? Or was the
decision based only on the fact that R, and bitpattern implementations
in general, are a niche, which makes bitmasks more compatible
with other languages?
I am curious about this topic because the "lack of proper NA support" was
cited as the reason why Python would never replace R in statistics.
Thanks,
Felix
On 31.03.20 14:52, Joris Van den Bossche wrote:
Note that pandas is starting to use a notion of "masked arrays" as
well, for example for its nullable integer data type. It does not use
the np.ma masked array for this, but a custom implementation (for
technical reasons this was easier in pandas).
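For readers who have not seen it, a short sketch of the nullable integer type mentioned above, using the public pandas "Int64" extension dtype. This only shows user-facing behaviour, not how pandas stores the mask internally.

```python
import pandas as pd

# "Int64" (capital I) is the nullable integer extension dtype.
arr = pd.array([1, None, 3], dtype="Int64")

na_flags = arr.isna()  # boolean array marking the missing slot
total = arr.sum()      # missing values are skipped by default
shifted = arr + 1      # the missing slot stays missing
```

The values keep an integer dtype even though one entry is missing, which a plain NumPy int64 array cannot express.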
Also, there was quite some discussion last year in numpy about a
possible re-implementation of a MaskedArray using numpy's
protocols (`__array_ufunc__`, `__array_function__` etc.) instead of
being a subclass like np.ma is now. See e.g.
https://mail.python.org/pipermail/numpy-discussion/2019-June/079681.html
Joris
On Mon, 30 Mar 2020 at 18:57, Daniel Nugent wrote:
Ok. That actually aligns closely to what I'm familiar with. Good
to know.
Thanks again for taking the time to respond,
-Dan Nugent
On Mon, Mar 30, 2020 at 12:38 PM Wes McKinney wrote:
Social and technical reasons I guess. Empirically it's just
not used much.
You can see my comments about numpy.ma in my 2010 paper about pandas:
https://conference.scipy.org/proceedings/scipy2010/pdfs/mckinney.pdf
At least in 2010, there were notable performance problems when using
MaskedArray for computations:
"We chose to use NaN as opposed to using NumPy MaskedArrays for
performance reasons (which are beyond the scope of this paper), as NaN
propagates in floating-point operations in a natural way and can be
easily detected in algorithms."
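A small side-by-side of the two behaviours the quote contrasts, using only NumPy. It shows semantics, not performance; the array contents are arbitrary.

```python
import numpy as np

a = np.array([1.0, np.nan, 3.0])
b = np.array([10.0, 20.0, 30.0])

# NaN approach: plain float arithmetic carries the NaN through, and
# np.isnan detects it afterwards.
nan_sum = a + b
nan_flags = np.isnan(nan_sum)
nan_mean = float(np.nanmean(a))  # NaN-aware reduction

# MaskedArray approach: missingness lives in a separate boolean mask
# that np.ma operations combine alongside the data.
m = np.ma.masked_invalid(a)
masked_sum = m + b
masked_flags = np.ma.getmaskarray(masked_sum)
masked_mean = float(m.mean())
```

Both give the same answer here; the difference is that the NaN route only works for floating-point data, while the mask route works for any dtype.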
On Mon, Mar 30, 2020 at 11:20 AM Daniel Nugent wrote:
>
> Thanks! Since I'm just using it to jump to Arrow, I think I'll stick
> with it.
>
> Do you have any feelings about why Numpy's masked arrays didn't gain
> favor when many data representation formats explicitly support
> nullity (including Arrow)? Is it just that not carrying nulls in
> computations forward is preferable (that is, early filtering/value
> filling was easier)?
>
> -Dan Nugent
>
>
> On Mon, Mar 30, 2020 at 11:40 AM Wes McKinney wrote:
>>
>> On Mon, Mar 30, 2020 at 8:31 AM Daniel Nugent wrote:
>> >
>> > Didn’t want to follow up on this on the Jira issue earlier since
>> > it's sort of tangential to that bug and more of a usage question.
>> > You said:
>> >
>> > > I wouldn't recommend building applications based on them
>> > > nowadays since the level of support / compatibility in other
>> > > projects is low.
>> >
>> > In my case, I am using them since it seemed like a straightforward
>> > representation of my data that has nulls, the format I’m converting
>> > from has zero cost numpy representations, and converting from an
>> > internal format into Arrow in memory structures appears zero cost
>> > (or close to it) as well. I guess I can just provide the mask as an
>> > explicit argument, but my original desire to use it came from
>> > being able to exploit numpy.ma.concatenate in a way that saved some
>> > complexity in implementation.
>> >
>> > Since Arrow itself supports masking values with a bitfield, is
>> > there something intrinsic to the notion of array masks that is not
>> > well supported? Or do you just mean the specific numpy MaskedArray
>> > class?
>> >
>>
>> I mean just the numpy.ma module. Not many Python computing projects
>> nowadays treat MaskedArray objects as first class citizens. Depending
>> on what you need it may or may not be a problem. pyarrow supports
>> ingesting from MaskedArray as a convenience, but it would not be
>> common in my experience for a library's APIs to return MaskedArrays.
>>
>> > If this is too much of a numpy question rather than an arrow
>> > question, could you point me to where I can read up on masked
>> > array support, or maybe to the right place to ask the numpy
>> > community whether what I'm doing is appropriate?
>> >
>> > Thanks,
>> >
>> >
>> > -Dan Nugent
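For reference, a minimal sketch of the workflow described in the oldest message above: combining masked arrays with numpy.ma.concatenate and then splitting the result into a plain values array plus an explicit boolean mask. The sample data is made up, and the pyarrow call appears only as a comment because it is not exercised here.

```python
import numpy as np

first = np.ma.masked_array([1, 2, 3], mask=[False, True, False])
second = np.ma.masked_array([4, 5], mask=[True, False])

# np.ma.concatenate joins the data and the masks together.
combined = np.ma.concatenate([first, second])

# Split into the two pieces a mask-based consumer would need.
values = np.ma.getdata(combined)     # plain ndarray of the stored values
mask = np.ma.getmaskarray(combined)  # full-length boolean array, True = missing

# With pyarrow installed, the explicit-mask route mentioned in the
# thread would look like: pyarrow.array(values, mask=mask)
```

np.ma.getmaskarray is used rather than the .mask attribute because it always returns a full boolean array, even when nothing is masked.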