Re: Attn: Wes, Re: Masked Arrays

Wes McKinney Sun, 05 Apr 2020 13:32:25 -0700

As I recall the contents "underneath" have been discussed before and
the consensus was that the contents are not specified. If you'd like
to make a proposal to change something I would suggest raising it on
[email protected]


On Sun, Apr 5, 2020 at 1:56 PM Felix Benning <[email protected]> wrote:
>
> Follow up: Do you think it would make sense to have an `na_are_zero` flag?
> It appears that the baseline (naively assuming there are no null
> values) is still a bit faster than equally optimized null-handling
> algorithms. So you might want to assume that all null values
> are set to zero in the array (instead of undefined). This would allow for
> very fast means, scalar products and thus matrix multiplication which ignore 
> nas. And in case of matrix multiplication, you might prefer sacrificing an 
> O(n^2) effort to set all null entries to zero before multiplying. And 
> assuming you do not overwrite this data, you would be able to reuse that 
> assumption in later computations with such a flag.
> In some use cases, you might even be able to utilize unused computing 
> resources for this task. I.e. clean up the nulls while the computer is not 
> used, preparing for the next query.
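
To make the `na_are_zero` idea concrete, here is a minimal sketch in plain Python. It is not Arrow code: `MaskedColumn`, `zero_nulls` and `dot_ignoring_na` are hypothetical names invented for illustration, and the validity mask is a simple list of booleans rather than a packed bitmap.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MaskedColumn:
    """Hypothetical container: values plus a validity list (True = present)."""
    values: List[float]
    valid: List[bool]
    na_are_zero: bool = False  # hypothetical flag from the proposal

    def zero_nulls(self) -> None:
        """One pass that overwrites the slots under nulls with 0.0."""
        if not self.na_are_zero:
            self.values = [v if ok else 0.0
                           for v, ok in zip(self.values, self.valid)]
            self.na_are_zero = True  # later calls can skip the cleanup


def dot_ignoring_na(a: MaskedColumn, b: MaskedColumn) -> float:
    """Scalar product that ignores positions where either side is null."""
    a.zero_nulls()
    b.zero_nulls()
    # A null on either side now contributes 0 to the sum, so the loop
    # needs no per-element validity check.
    return sum(x * y for x, y in zip(a.values, b.values))


a = MaskedColumn([1.0, 99.0, 3.0], [True, False, True])
b = MaskedColumn([4.0, 5.0, -7.0], [True, True, False])
result = dot_ignoring_na(a, b)  # only position 0 is valid on both sides
```

The flag only stays trustworthy as long as nothing writes into the null slots afterwards, which is the "assuming you do not overwrite this data" condition in the proposal.
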
>
>
> On Sun, 5 Apr 2020 at 18:34, Felix Benning <[email protected]> wrote:
>>
>> Awesome, that was exactly what I was looking for, thank you!
>>
>> On Sun, 5 Apr 2020 at 00:40, Wes McKinney <[email protected]> wrote:
>>>
>>> I wrote a blog post a couple of years ago about this
>>>
>>> https://wesmckinney.com/blog/bitmaps-vs-sentinel-values/
>>>
>>> Pasha Stetsenko did a follow-up analysis that showed that my
>>> "sentinel" code could be significantly improved, see:
>>>
>>> https://github.com/st-pasha/microbench-nas/blob/master/README.md
>>>
>>> Generally speaking in Apache Arrow we've been happy to have a uniform
>>> representation of nullness across all types, both primitive (booleans,
>>> numbers, or strings) and nested (lists, structs, unions, etc.). Many
>>> computational operations (like elementwise functions) need not concern
>>> themselves with the nulls at all, for example, since the bitmap from
>>> the input array can be passed along (with zero copy even) to the
>>> output array.
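
As a rough illustration of the bitmap pass-through, here is a toy sketch in plain Python. It does not use the Arrow libraries: `BitmapArray` and `elementwise` are hypothetical names, and the only thing borrowed from Arrow is the least-significant-bit-first numbering of the validity bitmap.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class BitmapArray:
    """Hypothetical toy array: values plus a packed validity bitmap."""
    values: List[float]
    validity: bytes  # bit i (least-significant bit first) set => slot i valid

    def is_valid(self, i: int) -> bool:
        return bool((self.validity[i // 8] >> (i % 8)) & 1)


def elementwise(arr: BitmapArray, fn: Callable[[float], float]) -> BitmapArray:
    # The kernel runs over every slot, null or not, and never reads the bitmap.
    out_values = [fn(v) for v in arr.values]
    # The output reuses the same bitmap object: no copy, no recomputation.
    return BitmapArray(out_values, arr.validity)


arr = BitmapArray([1.0, 2.0, 3.0], bytes([0b00000101]))  # slot 1 is null
doubled = elementwise(arr, lambda v: v * 2.0)
```

Whatever the kernel computes in slot 1 is irrelevant, because the shared bitmap still marks that slot as null in the output.
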
>>>
>>> On Sat, Apr 4, 2020 at 4:39 PM Felix Benning <[email protected]> 
>>> wrote:
>>> >
>>> > Does anyone have an opinion (or links) about Bitpattern vs Masked Arrays 
>>> > for NA implementations? There seems to have been a discussion about that 
>>> > in the numpy community in 2012 
>>> > https://numpy.org/neps/nep-0026-missing-data-summary.html without an 
>>> > apparent result.
>>> >
>>> > Summary of the Summary:
>>> > - The Bitpattern approach reserves one bitpattern of each type as na. The 
>>> > only types without spare bitpatterns are integers, so this 
>>> > decreases their range by one. This approach is taken by R and was 
>>> > regarded as more performant in 2012.
>>> > - The Mask approach was deemed more flexible, since it would allow 
>>> > "degrees of missingness", and also cleaner/easier to implement.
>>> >
>>> > Since bitpattern checks would probably disrupt SIMD, I feel like some 
>>> > calculations (e.g. mean) would actually benefit more from setting na 
>>> > values to zero, proceeding as if they were not there, and using the 
>>> > number of nas in the metadata to adjust the result. This of course does 
>>> > not work if two columns are used (e.g. scalar product), which is probably 
>>> > more important.
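
The mean trick described here can be sketched in a few lines of plain Python. This is only an illustration under the stated assumption that every na slot has already been set to 0.0 and that the number of nas is known from metadata; `mean_with_zeroed_na` is a hypothetical helper, not an existing API.

```python
from typing import List


def mean_with_zeroed_na(values: List[float], null_count: int) -> float:
    """Mean over non-null entries, assuming every null slot already holds 0.0."""
    n_valid = len(values) - null_count
    if n_valid == 0:
        raise ValueError("mean is undefined when every entry is null")
    # Plain sum with no per-element null check; the zeros add nothing to it.
    return sum(values) / n_valid


# Slots 1 and 3 are null and were set to 0.0 beforehand.
column = [2.0, 0.0, 4.0, 0.0, 6.0]
mean = mean_with_zeroed_na(column, null_count=2)  # (2 + 4 + 6) / 3
```

Only the divisor needs adjusting, which is why a single column works so easily; with two columns the zeroing has to be applied to both before multiplying.
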
>>> >
>>> > Was using Bitmasks in Arrow a conscious performance decision? Or was the 
>>> > decision only based on the fact that R and Bitpattern implementations in 
>>> > general are a niche, which means that Bitmasks are more compatible with 
>>> > other languages?
>>> >
>>> > I am curious about this topic, since the "lack of proper na support" was 
>>> > cited as the reason why Python would never replace R in statistics.
>>> >
>>> > Thanks,
>>> >
>>> > Felix
>>> >
>>> >
>>> > On 31.03.20 14:52, Joris Van den Bossche wrote:
>>> >
>>> > Note that pandas is starting to use a notion of "masked arrays" as well, 
>>> > for example for its nullable integer data type, though it uses a custom 
>>> > implementation rather than the np.ma masked array (for technical reasons 
>>> > in pandas this was easier).
>>> >
>>> > Also, there has been quite some discussion last year in numpy about a 
>>> > possible re-implementation of a MaskedArray, but using numpy's protocols 
>>> > (`__array_ufunc__`, `__array_function__` etc), instead of being a 
>>> > subclass like np.ma now is. See eg 
>>> > https://mail.python.org/pipermail/numpy-discussion/2019-June/079681.html.
>>> >
>>> > Joris
>>> >
>>> > On Mon, 30 Mar 2020 at 18:57, Daniel Nugent <[email protected]> wrote:
>>> >>
>>> >> Ok. That actually aligns closely to what I'm familiar with. Good to know.
>>> >>
>>> >> Thanks again for taking the time to respond,
>>> >>
>>> >> -Dan Nugent
>>> >>
>>> >>
>>> >> On Mon, Mar 30, 2020 at 12:38 PM Wes McKinney <[email protected]> 
>>> >> wrote:
>>> >>>
>>> >>> Social and technical reasons I guess. Empirically it's just not used 
>>> >>> much.
>>> >>>
>>> >>> You can see my comments about numpy.ma in my 2010 paper about pandas
>>> >>>
>>> >>> https://conference.scipy.org/proceedings/scipy2010/pdfs/mckinney.pdf
>>> >>>
>>> >>> At least in 2010, there were notable performance problems when using
>>> >>> MaskedArray for computations
>>> >>>
>>> >>> "We chose to use NaN as opposed to using NumPy MaskedArrays for
>>> >>> performance reasons (which are beyond the scope of this paper), as NaN
>>> >>> propagates in floating-point operations in a natural way and can be
>>> >>> easily detected in algorithms."
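
A small plain-Python illustration of the two properties named in that quote, propagation and easy detection. It uses only the standard `math` module and is not taken from the paper.

```python
import math

values = [1.0, math.nan, 3.0]

# NaN propagates: any arithmetic that touches it yields NaN.
total = sum(values)

# NaN is easy to detect: it is the only float that is not equal to itself,
# and math.isnan() tests for it directly.
valid = [v for v in values if not math.isnan(v)]
mean_of_valid = sum(valid) / len(valid)
```

The cost of this convention is that it only exists for floating-point types; integers and booleans have no NaN to spare.
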
>>> >>>
>>> >>> On Mon, Mar 30, 2020 at 11:20 AM Daniel Nugent <[email protected]> wrote:
>>> >>> >
>>> >>> > Thanks! Since I'm just using it to jump to Arrow, I think I'll stick 
>>> >>> > with it.
>>> >>> >
>>> >>> > Do you have any feelings about why Numpy's masked arrays didn't gain 
>>> >>> > favor when many data representation formats explicitly support 
>>> >>> > nullity (including Arrow)? Is it just that not carrying nulls in 
>>> >>> > computations forward is preferable (that is, early filtering/value 
>>> >>> > filling was easier)?
>>> >>> >
>>> >>> > -Dan Nugent
>>> >>> >
>>> >>> >
>>> >>> > On Mon, Mar 30, 2020 at 11:40 AM Wes McKinney <[email protected]> 
>>> >>> > wrote:
>>> >>> >>
>>> >>> >> On Mon, Mar 30, 2020 at 8:31 AM Daniel Nugent <[email protected]> 
>>> >>> >> wrote:
>>> >>> >> >
>>> >>> >> > Didn’t want to follow up on this on the Jira issue earlier since 
>>> >>> >> > it's sort of tangential to that bug and more of a usage question. 
>>> >>> >> > You said:
>>> >>> >> >
>>> >>> >> > > I wouldn't recommend building applications based on them 
>>> >>> >> > > nowadays since the level of support / compatibility in other 
>>> >>> >> > > projects is low.
>>> >>> >> >
>>> >>> >> > In my case, I am using them since it seemed like a straightforward 
>>> >>> >> > representation of my data that has nulls, the format I’m 
>>> >>> >> > converting from has zero cost numpy representations, and 
>>> >>> >> > converting from an internal format into Arrow in memory structures 
>>> >>> >> > appears zero cost (or close to it) as well. I guess I can just 
>>> >>> >> > provide the mask as an explicit argument, but my original desire 
>>> >>> >> > to use it came from being able to exploit numpy.ma.concatenate in 
>>> >>> >> > a way that saved some complexity in implementation.
>>> >>> >> >
>>> >>> >> > Since Arrow itself supports masking values with a bitfield, is 
>>> >>> >> > there something intrinsic to the notion of array masks that is not 
>>> >>> >> > well supported? Or do you just mean the specific numpy MaskedArray 
>>> >>> >> > class?
>>> >>> >> >
>>> >>> >>
>>> >>> >> I mean just the numpy.ma module. Not many Python computing projects
>>> >>> >> nowadays treat MaskedArray objects as first class citizens. Depending
>>> >>> >> on what you need it may or may not be a problem. pyarrow supports
>>> >>> >> ingesting from MaskedArray as a convenience, but it would not be
>>> >>> >> common in my experience for a library's APIs to return MaskedArrays.
>>> >>> >>
>>> >>> >> > If this is too much of a numpy question rather than an arrow 
>>> >>> >> > question, could you point me to where I can read up on masked 
>>> >>> >> > array support, or maybe to the right place to ask the numpy 
>>> >>> >> > community whether what I'm doing is appropriate?
>>> >>> >> >
>>> >>> >> > Thanks,
>>> >>> >> >
>>> >>> >> >
>>> >>> >> > -Dan Nugent
