As I recall, the contents "underneath" have been discussed before, and the consensus was that the contents are not specified. If you'd like to make a proposal to change something, I would suggest raising it on [email protected]
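For illustration only, here is a minimal sketch in plain Python (not Arrow code) of what "not specified" means in practice and of the zero-fill idea proposed in the quoted message below. It assumes an Arrow-style layout: a values buffer plus a validity bitmap with least-significant-bit numbering, where a set bit means "valid". The helper names `is_valid`, `zero_null_slots` and `mean_ignoring_nulls` are hypothetical and do not exist in any Arrow library.

```python
from array import array


def is_valid(bitmap: bytes, i: int) -> bool:
    """Return True if slot i is non-null (bit i set, LSB-first within each byte)."""
    return (bitmap[i // 8] >> (i % 8)) & 1 == 1


def zero_null_slots(values: array, bitmap: bytes) -> None:
    """Hypothetical normalization pass: overwrite whatever sits under a null with 0.0."""
    for i in range(len(values)):
        if not is_valid(bitmap, i):
            values[i] = 0.0


def mean_ignoring_nulls(values: array, null_count: int):
    """Mean over valid slots; only correct once the null slots are known to hold 0.0."""
    n_valid = len(values) - null_count
    if n_valid == 0:
        return None
    # No per-element null check here: zeros under the nulls do not change the sum.
    return sum(values) / n_valid


# Slots 1 and 3 are null; the numbers stored there are arbitrary leftovers.
values = array("d", [1.0, 123.456, 3.0, -999.0, 5.0])
bitmap = bytes([0b00010101])  # bits 0, 2 and 4 set, so slots 0, 2 and 4 are valid
null_count = 2

zero_null_slots(values, bitmap)
result = mean_ignoring_nulls(values, null_count)
print(result)
```

Without the `zero_null_slots` pass the sum would pick up the leftover values, which is why a reader cannot rely on what is stored under a null unless the format guarantees it.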
On Sun, Apr 5, 2020 at 1:56 PM Felix Benning <[email protected]> wrote:
>
> Follow up: Do you think it would make sense to have an `na_are_zero` flag?
> Since it appears that the baseline (naively assuming there are no null
> values) is still a bit faster than equally optimized null value handling
> algorithms. So you might want to make the assumption, that all null values
> are set to zero in the array (instead of undefined). This would allow for
> very fast means, scalar products and thus matrix multiplication which ignore
> nas. And in case of matrix multiplication, you might prefer sacrificing an
> O(n^2) effort to set all null entries to zero before multiplying. And
> assuming you do not overwrite this data, you would be able to reuse that
> assumption in later computations with such a flag.
> In some use cases, you might even be able to utilize unused computing
> resources for this task. I.e. clean up the nulls while the computer is not
> used, preparing for the next query.
>
>
> On Sun, 5 Apr 2020 at 18:34, Felix Benning <[email protected]> wrote:
>>
>> Awesome, that was exactly what I was looking for, thank you!
>>
>> On Sun, 5 Apr 2020 at 00:40, Wes McKinney <[email protected]> wrote:
>>>
>>> I wrote a blog post a couple of years ago about this
>>>
>>> https://wesmckinney.com/blog/bitmaps-vs-sentinel-values/
>>>
>>> Pasha Stetsenko did a follow-up analysis that showed that my
>>> "sentinel" code could be significantly improved, see:
>>>
>>> https://github.com/st-pasha/microbench-nas/blob/master/README.md
>>>
>>> Generally speaking in Apache Arrow we've been happy to have a uniform
>>> representation of nullness across all types, both primitive (booleans,
>>> numbers, or strings) and nested (lists, structs, unions, etc.). Many
>>> computational operations (like elementwise functions) need not concern
>>> themselves with the nulls at all, for example, since the bitmap from
>>> the input array can be passed along (with zero copy even) to the
>>> output array.
>>>
>>> On Sat, Apr 4, 2020 at 4:39 PM Felix Benning <[email protected]>
>>> wrote:
>>> >
>>> > Does anyone have an opinion (or links) about Bitpattern vs Masked Arrays
>>> > for NA implementations? There seems to have been a discussion about that
>>> > in the numpy community in 2012
>>> > https://numpy.org/neps/nep-0026-missing-data-summary.html without an
>>> > apparent result.
>>> >
>>> > Summary of the Summary:
>>> > - The Bitpattern approach reserves one bitpattern of any type as na, the
>>> > only type not having spare bitpatterns are integers which means this
>>> > decreases their range by one. This approach is taken by R and was
>>> > regarded as more performant in 2012.
>>> > - The Mask approach was deemed more flexible, since it would allow
>>> > "degrees of missingness", and also cleaner/easier implementation.
>>> >
>>> > Since bitpattern checks would probably disrupt SIMD, I feel like some
>>> > calculations (e.g. mean) would actually benefit more, from setting na
>>> > values to zero, proceeding as if they were not there, and using the
>>> > number of nas in the metadata to adjust the result. This of course does
>>> > not work if two columns are used (e.g. scalar product), which is probably
>>> > more important.
>>> >
>>> > Was using Bitmasks in Arrow a conscious performance decision? Or was the
>>> > decision only based on the fact, that R and Bitpattern implementations in
>>> > general are a niche, which means that Bitmasks are more compatible with
>>> > other languages?
>>> >
>>> > I am curious about this topic, since the "lack of proper na support" was
>>> > cited as the reason, why Python would never replace R in statistics.
>>> >
>>> > Thanks,
>>> >
>>> > Felix
>>> >
>>> >
>>> > On 31.03.20 14:52, Joris Van den Bossche wrote:
>>> >
>>> > Note that pandas is starting to use a notion of "masked arrays" as well,
>>> > for example for its nullable integer data type, but also not using the
>>> > np.ma masked array, but a custom implementation (for technical reasons in
>>> > pandas this was easier).
>>> >
>>> > Also, there has been quite some discussion last year in numpy about a
>>> > possible re-implementation of a MaskedArray, but using numpy's protocols
>>> > (`__array_ufunc__`, `__array_function__` etc), instead of being a
>>> > subclass like np.ma now is. See eg
>>> > https://mail.python.org/pipermail/numpy-discussion/2019-June/079681.html.
>>> >
>>> > Joris
>>> >
>>> > On Mon, 30 Mar 2020 at 18:57, Daniel Nugent <[email protected]> wrote:
>>> >>
>>> >> Ok. That actually aligns closely to what I'm familiar with. Good to know.
>>> >>
>>> >> Thanks again for taking the time to respond,
>>> >>
>>> >> -Dan Nugent
>>> >>
>>> >>
>>> >> On Mon, Mar 30, 2020 at 12:38 PM Wes McKinney <[email protected]>
>>> >> wrote:
>>> >>>
>>> >>> Social and technical reasons I guess. Empirically it's just not used
>>> >>> much.
>>> >>>
>>> >>> You can see my comments about numpy.ma in my 2010 paper about pandas
>>> >>>
>>> >>> https://conference.scipy.org/proceedings/scipy2010/pdfs/mckinney.pdf
>>> >>>
>>> >>> At least in 2010, there were notable performance problems when using
>>> >>> MaskedArray for computations
>>> >>>
>>> >>> "We chose to use NaN as opposed to using NumPy MaskedArrays for
>>> >>> performance reasons (which are beyond the scope of this paper), as NaN
>>> >>> propagates in floating-point operations in a natural way and can be
>>> >>> easily detected in algorithms."
>>> >>>
>>> >>> On Mon, Mar 30, 2020 at 11:20 AM Daniel Nugent <[email protected]> wrote:
>>> >>> >
>>> >>> > Thanks! Since I'm just using it to jump to Arrow, I think I'll stick
>>> >>> > with it.
>>> >>> >
>>> >>> > Do you have any feelings about why Numpy's masked arrays didn't gain
>>> >>> > favor when many data representation formats explicitly support
>>> >>> > nullity (including Arrow)? Is it just that not carrying nulls in
>>> >>> > computations forward is preferable (that is, early filtering/value
>>> >>> > filling was easier)?
>>> >>> >
>>> >>> > -Dan Nugent
>>> >>> >
>>> >>> >
>>> >>> > On Mon, Mar 30, 2020 at 11:40 AM Wes McKinney <[email protected]>
>>> >>> > wrote:
>>> >>> >>
>>> >>> >> On Mon, Mar 30, 2020 at 8:31 AM Daniel Nugent <[email protected]>
>>> >>> >> wrote:
>>> >>> >> >
>>> >>> >> > Didn’t want to follow up on this on the Jira issue earlier since
>>> >>> >> > it's sort of tangential to that bug and more of a usage question.
>>> >>> >> > You said:
>>> >>> >> >
>>> >>> >> > > I wouldn't recommend building applications based on them
>>> >>> >> > > nowadays since the level of support / compatibility in other
>>> >>> >> > > projects is low.
>>> >>> >> >
>>> >>> >> > In my case, I am using them since it seemed like a straightforward
>>> >>> >> > representation of my data that has nulls, the format I’m
>>> >>> >> > converting from has zero cost numpy representations, and
>>> >>> >> > converting from an internal format into Arrow in memory structures
>>> >>> >> > appears zero cost (or close to it) as well. I guess I can just
>>> >>> >> > provide the mask as an explicit argument, but my original desire
>>> >>> >> > to use it came from being able to exploit numpy.ma.concatenate in
>>> >>> >> > a way that saved some complexity in implementation.
>>> >>> >> >
>>> >>> >> > Since Arrow itself supports masking values with a bitfield, is
>>> >>> >> > there something intrinsic to the notion of array masks that is not
>>> >>> >> > well supported? Or do you just mean the specific numpy MaskedArray
>>> >>> >> > class?
>>> >>> >> >
>>> >>> >>
>>> >>> >> I mean just the numpy.ma module.
>>> >>> >> Not many Python computing projects
>>> >>> >> nowadays treat MaskedArray objects as first class citizens. Depending
>>> >>> >> on what you need it may or may not be a problem. pyarrow supports
>>> >>> >> ingesting from MaskedArray as a convenience, but it would not be
>>> >>> >> common in my experience for a library's APIs to return MaskedArrays.
>>> >>> >>
>>> >>> >> > If this is too much of a numpy question rather than an arrow
>>> >>> >> > question, could you point me to where I can read up on masked
>>> >>> >> > array support or maybe what the right place to ask the numpy
>>> >>> >> > community about whether what I'm doing is appropriate or not.
>>> >>> >> >
>>> >>> >> > Thanks,
>>> >>> >> >
>>> >>> >> >
>>> >>> >> > -Dan Nugent
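
As a rough illustration of the two options discussed in the oldest message above (relying on `numpy.ma.concatenate` versus passing the mask explicitly), the following NumPy-only sketch concatenates two masked arrays and then separates the result into a plain data array and a boolean mask. The sample values are made up. The final hand-off to pyarrow is left as a comment and is an assumption about how one might use it, not something taken from the thread.

```python
import numpy as np

# Two chunks with nulls, represented as numpy masked arrays (True = masked).
a = np.ma.masked_array([1.0, 2.0, 3.0], mask=[False, True, False])
b = np.ma.masked_array([4.0, 5.0], mask=[True, False])

# numpy.ma.concatenate carries the masks along with the data.
combined = np.ma.concatenate([a, b])

# Split into plain ndarrays so the mask can be passed explicitly
# to a consumer that does not understand MaskedArray.
data = np.ma.getdata(combined)
mask = np.ma.getmaskarray(combined)

print(data)
print(mask)

# Assumed hand-off (not executed here, requires pyarrow):
#   import pyarrow as pa
#   arr = pa.array(data, mask=mask)  # in pyarrow, True in `mask` marks a null
```

Keeping the data and mask as two ordinary arrays avoids depending on `MaskedArray` support in downstream libraries, at the cost of concatenating the two pieces yourself if you do not go through `numpy.ma`.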
