Awesome, that was exactly what I was looking for, thank you!

On Sun, 5 Apr 2020 at 00:40, Wes McKinney <[email protected]> wrote:
> I wrote a blog post a couple of years ago about this
>
> https://wesmckinney.com/blog/bitmaps-vs-sentinel-values/
>
> Pasha Stetsenko did a follow-up analysis that showed that my
> "sentinel" code could be significantly improved, see:
>
> https://github.com/st-pasha/microbench-nas/blob/master/README.md
>
> Generally speaking in Apache Arrow we've been happy to have a uniform
> representation of nullness across all types, both primitive (booleans,
> numbers, or strings) and nested (lists, structs, unions, etc.). Many
> computational operations (like elementwise functions) need not concern
> themselves with the nulls at all, for example, since the bitmap from
> the input array can be passed along (with zero copy even) to the
> output array.
>
> On Sat, Apr 4, 2020 at 4:39 PM Felix Benning <[email protected]> wrote:
> >
> > Does anyone have an opinion (or links) about Bitpattern vs Masked Arrays
> > for NA implementations? There seems to have been a discussion about that
> > in the numpy community in 2012
> > https://numpy.org/neps/nep-0026-missing-data-summary.html without an
> > apparent result.
> >
> > Summary of the Summary:
> > - The Bitpattern approach reserves one bitpattern of each type as NA.
> >   Integers are the only types without spare bitpatterns, so for them
> >   this decreases the range by one. This approach is taken by R and was
> >   regarded as more performant in 2012.
> > - The Mask approach was deemed more flexible, since it would allow
> >   "degrees of missingness", and also a cleaner/easier implementation.
> >
> > Since bitpattern checks would probably disrupt SIMD, I feel like some
> > calculations (e.g. mean) would actually benefit more from setting NA
> > values to zero, proceeding as if they were not there, and using the
> > number of NAs in the metadata to adjust the result. This of course does
> > not work if two columns are used (e.g. scalar product), which is
> > probably more important.
> >
> > Was using Bitmasks in Arrow a conscious performance decision?
> > Or was the decision only based on the fact that R and Bitpattern
> > implementations in general are a niche, which means that Bitmasks are
> > more compatible with other languages?
> >
> > I am curious about this topic, since the "lack of proper NA support"
> > was cited as the reason why Python would never replace R in statistics.
> >
> > Thanks,
> >
> > Felix
> >
> >
> > On 31.03.20 14:52, Joris Van den Bossche wrote:
> >
> > Note that pandas is starting to use a notion of "masked arrays" as well,
> > for example for its nullable integer data type, but also not using the
> > np.ma masked array, but a custom implementation (for technical reasons
> > in pandas this was easier).
> >
> > Also, there has been quite some discussion last year in numpy about a
> > possible re-implementation of a MaskedArray, but using numpy's protocols
> > (`__array_ufunc__`, `__array_function__` etc), instead of being a
> > subclass like np.ma now is. See e.g.
> > https://mail.python.org/pipermail/numpy-discussion/2019-June/079681.html.
> >
> > Joris
> >
> > On Mon, 30 Mar 2020 at 18:57, Daniel Nugent <[email protected]> wrote:
> >>
> >> Ok. That actually aligns closely to what I'm familiar with. Good to
> >> know.
> >>
> >> Thanks again for taking the time to respond,
> >>
> >> -Dan Nugent
> >>
> >>
> >> On Mon, Mar 30, 2020 at 12:38 PM Wes McKinney <[email protected]> wrote:
> >>>
> >>> Social and technical reasons I guess. Empirically it's just not used
> >>> much.
> >>>
> >>> You can see my comments about numpy.ma in my 2010 paper about pandas
> >>>
> >>> https://conference.scipy.org/proceedings/scipy2010/pdfs/mckinney.pdf
> >>>
> >>> At least in 2010, there were notable performance problems when using
> >>> MaskedArray for computations
> >>>
> >>> "We chose to use NaN as opposed to using NumPy MaskedArrays for
> >>> performance reasons (which are beyond the scope of this paper), as NaN
> >>> propagates in floating-point operations in a natural way and can be
> >>> easily detected in algorithms."
> >>>
> >>> On Mon, Mar 30, 2020 at 11:20 AM Daniel Nugent <[email protected]> wrote:
> >>> >
> >>> > Thanks! Since I'm just using it to jump to Arrow, I think I'll stick
> >>> > with it.
> >>> >
> >>> > Do you have any feelings about why Numpy's masked arrays didn't gain
> >>> > favor when many data representation formats explicitly support
> >>> > nullity (including Arrow)? Is it just that not carrying nulls in
> >>> > computations forward is preferable (that is, early filtering/value
> >>> > filling was easier)?
> >>> >
> >>> > -Dan Nugent
> >>> >
> >>> >
> >>> > On Mon, Mar 30, 2020 at 11:40 AM Wes McKinney <[email protected]> wrote:
> >>> >>
> >>> >> On Mon, Mar 30, 2020 at 8:31 AM Daniel Nugent <[email protected]> wrote:
> >>> >> >
> >>> >> > Didn’t want to follow up on this on the Jira issue earlier since
> >>> >> > it's sort of tangential to that bug and more of a usage question.
> >>> >> > You said:
> >>> >> >
> >>> >> > > I wouldn't recommend building applications based on them
> >>> >> > > nowadays since the level of support / compatibility in other
> >>> >> > > projects is low.
> >>> >> >
> >>> >> > In my case, I am using them since it seemed like a
> >>> >> > straightforward representation of my data that has nulls, the
> >>> >> > format I’m converting from has zero cost numpy representations,
> >>> >> > and converting from an internal format into Arrow in memory
> >>> >> > structures appears zero cost (or close to it) as well. I guess I
> >>> >> > can just provide the mask as an explicit argument, but my
> >>> >> > original desire to use it came from being able to exploit
> >>> >> > numpy.ma.concatenate in a way that saved some complexity in
> >>> >> > implementation.
> >>> >> >
> >>> >> > Since Arrow itself supports masking values with a bitfield, is
> >>> >> > there something intrinsic to the notion of array masks that is
> >>> >> > not well supported? Or do you just mean the specific numpy
> >>> >> > MaskedArray class?
> >>> >> >
> >>> >>
> >>> >> I mean just the numpy.ma module. Not many Python computing projects
> >>> >> nowadays treat MaskedArray objects as first class citizens.
> >>> >> Depending on what you need it may or may not be a problem. pyarrow
> >>> >> supports ingesting from MaskedArray as a convenience, but it would
> >>> >> not be common in my experience for a library's APIs to return
> >>> >> MaskedArrays.
> >>> >>
> >>> >> > If this is too much of a numpy question rather than an arrow
> >>> >> > question, could you point me to where I can read up on masked
> >>> >> > array support, or maybe to the right place to ask the numpy
> >>> >> > community whether what I'm doing is appropriate or not?
> >>> >> >
> >>> >> > Thanks,
> >>> >> >
> >>> >> >
> >>> >> > -Dan Nugent
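
For readers following the thread, here is a minimal sketch of two ideas discussed above: Felix's suggestion of computing a mean by zero-filling the missing slots and adjusting by the null count, and Wes's point that an elementwise function can pass the input bitmap along to its output untouched. It is plain Python and not code from Arrow, pandas, or anyone in this thread. The helper names (`pack_validity`, `is_valid`, `mean_zero_filled`, `mean_sentinel`, `negate`) are hypothetical. The bitmap layout (least-significant bit first, 1 meaning "valid") follows the Arrow columnar format, and the NaN sentinel stands in for the bitpattern approach.

```python
import math


def pack_validity(valid):
    """Pack a list of booleans into a validity bitmap (LSB first, 1 = valid)."""
    bitmap = bytearray((len(valid) + 7) // 8)
    for i, ok in enumerate(valid):
        if ok:
            bitmap[i // 8] |= 1 << (i % 8)
    return bytes(bitmap)


def is_valid(bitmap, i):
    """Return True if slot i is marked valid in the bitmap."""
    return (bitmap[i // 8] >> (i % 8)) & 1 == 1


def mean_zero_filled(values, bitmap):
    """Mask approach: zero-fill null slots, sum everything, then divide by
    the number of valid slots instead of the total length."""
    zero_filled = [v if is_valid(bitmap, i) else 0.0 for i, v in enumerate(values)]
    null_count = sum(1 for i in range(len(values)) if not is_valid(bitmap, i))
    n_valid = len(values) - null_count
    return sum(zero_filled) / n_valid if n_valid else math.nan


def mean_sentinel(values):
    """Sentinel approach (NaN as NA): every element must be inspected."""
    kept = [v for v in values if not math.isnan(v)]
    return sum(kept) / len(kept) if kept else math.nan


def negate(values, bitmap):
    """Elementwise op: compute on every slot and reuse the bitmap as-is."""
    return [-v for v in values], bitmap


# Four slots, the second one is missing.
valid = [True, False, True, True]
bitmap = pack_validity(valid)

masked_values = [1.0, 123.0, 3.0, 5.0]       # null slot holds arbitrary data
sentinel_values = [1.0, math.nan, 3.0, 5.0]  # null slot holds the sentinel

masked_mean = mean_zero_filled(masked_values, bitmap)
sentinel_mean = mean_sentinel(sentinel_values)
negated, negated_bitmap = negate(masked_values, bitmap)
```

Both means agree here because the same three slots are valid in each representation. As Felix notes, the zero-fill trick does not carry over directly to operations on two columns, where the validity of both inputs has to be combined first.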
