Re: [Rd] [External] Re: 1954 from NA
Greg, I am curious what they suggest you use multiple NaN values for. Or is it simply like how text messages on your phone started: standard-size packets were bigger than what some uses required, so they piggy-backed messages on the "empty" space?

If by NaN you include the various flavors of NA such as NA_integer_ and NA_complex_, I have sometimes wondered whether they are slightly different bit patterns or all the same but interpreted by programs as being the right kind for their context. Sounds like maybe they are different, and there is one for pretty much each basic type except perhaps raw. But if you add more, in that case, will it be seen as the right NA for the environment it is in? Heck, if R adds yet another basic type (like a quaternion, or a nibble), could it use the same bits you took without asking, breaking your application?

It does sound like some suggest you use a method with existing abilities and tightly control that all functions used to manipulate the data will behave and preserve those attributes. I am not so sure the clients using it will obey. I have seen plenty of people say to use some tidyverse functions for various purposes, then use something more base-R like complete.cases() or rbind() that may, but also may not, preserve what they want. And once lost, ...

Now, of course, you could write wrapper functions that take the data, copy the attributes, allow whatever changes, and carefully put them back before returning. This may not be trivial, though, if you want to do something like delete lots of rows: you might need to first identify which rows will be kept, then adjust the vector of attributes accordingly before returning it. Sorting is another such annoyance. Many things do conversions, such as making copies or converting a copy to a factor, that may mess things up. If it has already been done and people have experience, great. If not, good luck.
-----Original Message-----
From: Gregory Warnes
Sent: Tuesday, May 25, 2021 9:13 PM
To: Avi Gross
Cc: r-devel
Subject: Re: [Rd] [External] Re: 1954 from NA

As a side note, for floating point values, the IEEE 754 standard provides for a large set of NaN values, making it possible to have multiple types of NAs for floating point values...

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
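Avi's question above, whether the NA flavours really are distinct bit patterns, can be checked from R itself. A small sketch (the byte order shown assumes a little-endian IEEE 754 platform, which covers most current builds; the payload value is where the "1954" in the subject line comes from):

```r
# NA_real_ is an IEEE 754 NaN whose low word carries the payload 1954
b <- writeBin(NA_real_, raw())    # 8 bytes, least significant byte first
readBin(b[1:4], "integer")        # low word of the NaN: 1954

# R's plain NaN carries no such payload, so the two bit patterns differ
identical(writeBin(NA_real_, raw()), writeBin(NaN, raw()))   # FALSE

# NA_integer_ is a different animal entirely: INT_MIN at the C level
writeBin(NA_integer_, raw())      # 00 00 00 80 on a little-endian build
```

So the flavours are not one bit pattern reinterpreted per context; each basic type has its own sentinel.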
Re: [Rd] [External] Re: 1954 from NA
You've already been told how to solve this: just add attributes to the objects. Use the standard NA to indicate that there is some kind of missingness, and the attribute to describe exactly what it is. Stick a class on those objects and define methods so that subsetting and arithmetic preserve the extra info you've added. If you do some operation that turns those NAs into NaNs, big deal: the attribute will still be there, and is.na(NaN) still returns TRUE.

Base R doesn't need anything else. You complained that users shouldn't need to know about attributes, and they won't: you, as the author of the package that does this, will handle all those details. Working in your subject area you know all the different kinds of NAs that people care about, and how they code them in input data, so you can make it all totally transparent. If you do it well, someone in some other subject area with a completely different set of kinds of missingness will be able to adapt your code to their use.

I imagine this has all been done in one of the thousands of packages on CRAN, but if it hasn't been done well enough for you, do it better.

Duncan Murdoch

On 25/05/2021 7:01 p.m., Adrian Dușa wrote:
> [...]
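Duncan's recipe (standard NAs carry the missingness, a classed attribute says what kind, and methods keep the two aligned) can be sketched in a few lines. The class name `miss` and its `reason` attribute are hypothetical names for illustration, not an existing package API:

```r
# Hypothetical class "miss": values use ordinary NAs, and a parallel
# "reason" attribute records the kind of missingness for each element.
miss <- function(x, reason = rep(NA_character_, length(x))) {
  stopifnot(length(reason) == length(x))
  structure(x, reason = reason, class = "miss")
}

# A subsetting method keeps the reasons aligned with the values
`[.miss` <- function(x, i) miss(unclass(x)[i], attr(x, "reason")[i])

x <- miss(c(1, NA, 3, NA), c(NA, "refused", NA, "not applicable"))

is.na(x)                # FALSE TRUE FALSE TRUE: standard semantics intact
attr(x[2:4], "reason")  # "refused" NA "not applicable": survives subsetting
```

A real package would add methods for arithmetic, `c()`, `print()` and friends, but the mechanism is the same throughout: base R's NA does the signalling, the attribute carries the detail.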
Re: [Rd] [External] Re: 1954 from NA
Dear Avi,

That was quite a lengthy email... What you write makes sense, of course. I try hard not to deviate from base R, and thought my solution does just that, but apparently no such luck.

I suspect, however, that something will eventually have to change: since one of the R building blocks (such as the NA) is questioned by compilers, it is serious enough to attract attention from the R core team and maintainers. And if that happens, my fingers are crossed that the solution will allow users to declare existing values as missing. The importance of that, for the social sciences, cannot be stressed enough.

Best wishes, thanks once again to everyone,
Adrian

On Tue, May 25, 2021 at 10:03 PM Avi Gross via R-devel <r-devel@r-project.org> wrote:
> [...]
Re: [Rd] [External] Re: 1954 from NA
That helps get more understanding of what you want to do, Adrian. Getting anyone to switch is always a challenge, but changing R enough to tempt them may be a bigger challenge. This is an old story. I was the first adopter for C++ in my area and at first had to have my code built within an all-C project, making me reinvent some wheels so the same “make” system knew how to build the two compatibly and link them. Of course, they all eventually had to join me in a later release, but I had moved forward by then.

I have changed (or more accurately added) lots of languages in my life and continue to do so. The biggest challenge is not just to adapt and use a language similarly to the previous ones already mastered, but to understand WHY someone designed it this way and what kinds of idioms are common and useful, even if that means a new way of thinking. But, of course, any “older” language has evolved and often drifted in multiple directions. Many now borrow heavily from others even when the philosophy is different, and often the results are not pretty. Making major changes in R might have serious impacts on existing programs, including just by making them fail as they run out of memory.

If you look at R, there is plenty you can do in base R, sometimes by standing on your head. Yet you see package after package coming along that offers not just new things but sometimes a reworking and even remodeling of old things. R has a base graphics system I now rarely use, and another called lattice I have no reason to use again because I can do so much quite easily in ggplot. Similarly, the evolving tidyverse group of packages approaches things from an interesting direction, to the point where many people mainly use it and not base R. So if they were to teach a class in how to gather your data, analyze it and draw pretty pictures, the students might walk away thinking they had learned R but actually have learned these packages.

Your scenario seems related to a common one: how can we have values that signal beyond some range, in an out-of-band manner? Years ago we had functions in languages like C that would return a -1 on failure when only non-negative results were otherwise possible. That can work fine, but fails in cases where any possible value in the range can be returned. We have languages that deal with this kind of thing using error-handling constructs like exceptions. Sometimes you bundle up multiple items into a structure and return that, with one element of the structure holding some kind of return status and another holding the payload. A variation on this theme, as in languages like Go, is to have functions that return multiple values, one of them containing nil on success and an error structure on failure.

The situation here that seems to concern you is that you would like each item in a structure to have attributes that are recognized and propagated as it is being processed. Older languages tended not to even have such a concept, so basic types simply existed, and two instances of the number 5 might even be the same underlying one, or two strings with the same contents, and so on. You could of course play the game of making a struct, as mentioned above, but then you needed your own code to do all the handling, as nothing else knew it contained multiple items and which ones had which purpose.

R did add generalized attributes, and some are fairly well integrated, or at least partially. “Names” were discussed as not being easy to keep around. Factors use their own tagging method that seems to work fairly well, but probably not everywhere. But what you want may be more general and not built on similar foundations.

I look at languages like Python that are arguably more object-oriented now than R is, and in some ways can be extended better, albeit not in others. If I wanted to create an object to hold the number 5, I could add methods that allow it to participate in various ways with other objects, sometimes using the visible payload and sometimes the hidden one. I might pair it with the string “five”, but also with dozens of other strings for the word representing 5 in many languages. So I might have it act like a number in numerical situations and like text when someone is using it in writing a novel in any of many languages.

You seem to want to have the original text visible that gives a reason something is missing (or something like that), but have the software TREAT it like it is missing in calculations. In effect, you want is.na() to be a bit more like is.numeric() or is.character() and care more about the TYPE of what is being stored. An item may contain a 999 and yet not be seen as a number but as an NA. The problem I see is that you also may want the item to be a string like “DELETED” and yet include it in a vector that R insists can only hold integers. R does have a built-in data structure called
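The behaviour Avi sketches above, a visible code like 999 that is nevertheless reported as missing, can be mocked up with an S3 method for is.na(). The class name `coded` and its `na_codes` attribute are hypothetical, chosen for illustration:

```r
# Hypothetical class "coded": sentinel values like 999 stay visible
# in the data, but is.na() reports them as missing.
coded <- function(x, na_codes = c(998, 999)) {
  structure(x, na_codes = na_codes, class = "coded")
}

# is.na() is an internal generic, so an S3 method is dispatched
is.na.coded <- function(x) {
  unclass(x) %in% attr(x, "na_codes") | is.na(unclass(x))
}

x <- coded(c(12, 999, 37))
is.na(x)       # FALSE TRUE FALSE: the 999 is treated as missing
unclass(x)[2]  # 999: the original code is still there to inspect
```

Note the caveat from the surrounding discussion still applies: internal code such as `mean(x, na.rm = TRUE)` does not consult this method, so a package taking this route has to wrap or method-ize every operation its users will reach for.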
[Rd] Should all.equal.POSIXt respect check.attributes?
Hello,

Since bugzilla #17277 (https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17277) was resolved, all.equal.POSIXt now compares timezone attributes. Comment 4 (https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17277#c4) in that ticket suggests that both the argument check.tz (which appears to have actually been implemented as check.tzone) and check.attributes should disable this checking. However, looking at the implementation (and the behavior of a devel version of R), I find that check.attributes does not disable the timezone checks. Should the more general check.attributes disable this check, in addition to check.tzone being able to specifically disable only the timezone check?

-Jon
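The behavior Jon describes can be reproduced with two POSIXct values that denote the same instant but carry different "tzone" attributes. A sketch, assuming a version of R recent enough to have the check.tzone argument that landed with the #17277 fix:

```r
# Same numeric time-since-epoch, different "tzone" attribute
t1 <- .POSIXct(1621987200, tz = "UTC")
t2 <- .POSIXct(1621987200, tz = "Europe/Bucharest")

isTRUE(all.equal(t1, t2))                       # FALSE: tzone attributes differ
isTRUE(all.equal(t1, t2, check.tzone = FALSE))  # TRUE: same underlying instant
```

Jon's question is whether passing `check.attributes = FALSE` should have the same effect as `check.tzone = FALSE` here, which it did not in the devel build he tested.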
Re: [Rd] [External] Re: 1954 from NA
On Tue, May 25, 2021 at 4:14 PM wrote:
> [...]
>
> Yes, it should be discarded.
>
> You can of course do what you like in code you keep to yourself. But
> please do not distribute code that does this, via CRAN or any other
> means. It will only create problems for those maintaining R.
>
> > After all, the NA is nothing but a tagged NaN.
>
> And we are now paying a price for what was, in hindsight, an
> unfortunate decision.

I (only now) understand that. That code is based on the R sources and (mind you) an almost identical one from package haven. Regardless, it was not the code I was trying to show, but the vignette: the end result, the functionality of the software. That is, automatically treating declared missing values as NAs, without users being required to explicitly deal with attributes.

Now that I think about it, there might be a way to do this without tagging NAs, so back to square one.

Best wishes,
Adrian
Re: [Rd] [External] Re: 1954 from NA
On Tue, 25 May 2021, Adrian Dușa wrote:
> [...]
>
> Once that will be solved, and despite the current advice discouraging
> this route, I believe tagging NAs is a valuable idea that should not be
> discarded.

Yes, it should be discarded.

You can of course do what you like in code you keep to yourself. But please do not distribute code that does this, via CRAN or any other means. It will only create problems for those maintaining R.

> After all, the NA is nothing but a tagged NaN.

And we are now paying a price for what was, in hindsight, an unfortunate decision.

Best,

luke

> [...]
Re: [Rd] [External] Re: 1954 from NA
Dear Avi,

Thank you so much for the extended messages; I read them carefully. While partially offering a solution (I've already been there), it creates additional work for the user, and some of that is unnecessary. What I am trying to achieve is best described in this draft vignette:

devtools::install_github("dusadrian/mixed")
vignette("mixed")

Once a value is declared to be missing, the user should not have to do anything else about it. Despite being present, the value should automatically be treated as missing by the software. That is the way it's done in all major statistical packages like SAS, Stata and even SPSS. My end goal is to make R attractive for my faculty peers (and beyond), almost all of whom are heavy users of SPSS and sometimes Stata. But in order to convince them to (finally) make the switch, I need to provide similar functionality, not additional work.

Re. the first part of your message, I am definitely not trying to change the R internals. The NA will still be NA, exactly as currently defined. My initial proposal was based on the observation that the 1954 payload is stored as an unsigned int (thus occupying 32 bits) when it obviously doesn't need more than 16. That was the only proposed modification; everything else stays the same.

I have now learned, thanks to all contributors on this list, that building something around that payload is risky because we do not know exactly what compilers will do. One possible solution I can think of, while (still) maintaining the current functionality around the NA, is to use a different high word for the NA that would not trigger compilation issues. But I have absolutely no idea what that implies for the other inner workings of R. I very much trust that R core will eventually find a robust solution; they've solved much more complicated problems than this. I just hope the current thread will put the idea of tagged NAs on the table for when they discuss this.
Once that is solved, and despite the current advice discouraging this route, I believe tagging NAs is a valuable idea that should not be discarded. After all, the NA is nothing but a tagged NaN.

All the best,
Adrian

On Tue, May 25, 2021 at 7:05 AM Avi Gross via R-devel wrote:

> I was thinking about how one does things in a language that is properly
> object-oriented versus R, which makes various half-assed attempts at being
> such.
>
> Clearly in some such languages you can make an object that is a wrapper
> allowing you to save an item that is the main payload as well as anything
> else you want. You might need a way to convince everything else to let you
> make lists, vectors and other collections of these objects, and perhaps
> automatically unbox them for many purposes. As an example, in a language
> like Python you might provide methods so that adding A and B actually gets
> the values out of A and/or B and adds them properly. But there may be too
> many edge cases to handle, and some software may not pay attention to what
> you want, including some libraries written in other languages.
>
> I mention Python for the odd reason that it is now possible to combine
> Python and R in the same program and sort of switch back and forth between
> data representations. This may provide some openings for preserving and
> accessing metadata when needed.
>
> Realistically, if R were being designed from scratch TODAY, many things
> might be done differently. But I recall it being developed at Bell Labs
> for purposes where it was sort of revolutionary at the time (back when it
> was S), designed to do things in a vectorized way, and probably primarily
> for the kinds of scientific and mathematical operations where a single NA
> (of several types depending on the data) was enough when augmented by a
> few things like a NaN, Inf and -Inf. I doubt they seriously saw a need for
> an unlimited number of NAs that were all the same AND also all different
> that they felt had to be built in. As noted, had they had a reason to make
> it fully object-oriented too, and to make the base types such as integer
> into full-fledged objects with room for additional metadata, then things
> might be different.
>
> I note I have seen languages which have both a data type called integer in
> lower case and Integer in upper case. One of them is regularly boxed and
> unboxed automagically when used in a context that needs the other. As far
> as efficiency goes, this invisibly adds many steps. So do languages that
> sometimes take a variable that is a pointer and invisibly dereference it
> to provide the underlying field rather than make you do extra typing and
> so on.
>
> So is there any reason only an NA should have such metadata? Why not have
> reasons associated with Inf stating it was an Inf because you asked for
> one or the result of a calculation such as dividing by zero (albeit maybe
> that might be a NaN) and so on. Maybe I could annotate integers with
> whether they are prime or e