Re: [Rd] 1954 from NA
+1 Avi Gross via R-devel wrote: > Arguably, R was not developed to satisfy some needs in the way intended. > > When I have had to work with datasets from some of the social sciences I have > had to adapt to subtleties in how they did things with software like SPSS in > which an NA was done using an out of bounds marker like 999 or "." or even a > blank cell. The problem is that R has a concept where data such as integers > or floating point numbers is not stored as text normally but in their own > formats and a vector by definition can only contain ONE data type. So the > various forms of NA as well as Nan and Inf had to be grafted on to be > considered VALID to share the same storage area as if they sort of were an > integer or floating point number or text or whatever. > > It does strike me as possible to simply have a column that is something like > a factor that can contain as many NA excuses as you wish such as "NOT > ANSWERED" to "CANNOT READ THE SQUIGLE" to "NOT SURE" to "WILL BE FILLED IN > LATER" to "I DON'T SPEAK ENGLISH AND CANNOT ANSWER STUPID QUESTIONS". This > additional column would presumably only have content when the other column > has an NA. Your queries and other changes would work on something like a > data.frame where both such columns coexisted. > > Note reading in data with multiple NA reasons may take extra work. If your > errors codes are text, it will all become text. If the errors are 999 and 998 > and 997, it may all be treated as numeric and you may not want to convert all > such codes to an NA immediately. Rather, you would use the first > vector/column to make the second vector and THEN replace everything that > should be an NA with an actual NA and reparse the entire vector to become > properly numeric unless you like working with text and will convert to > numbers as needed on the fly. > > Now this form of annotation may not be pleasing but I suggest that an > implementation that does allow annotation may use up space too. 
Of course, if > your NA values are rare and space is only used then, you might save space. > But if you could make a factor column and have it use the smallest int it can > get as a basis, it may be a way to save on space. > > People who have done work with R, especially those using the tidyverse, are > quite used to using one column to explain another. So if you are asked to say > tabulate what percent of missing values are due to reasons A/B/C then the > added columns works fine for that calculation too. > > > -Original Message- > From: R-devel On Behalf Of Adrian Du?a > Sent: Sunday, May 23, 2021 2:04 PM > To: Tomas Kalibera > Cc: r-devel > Subject: Re: [Rd] 1954 from NA > > Dear Tomas, > > I understand that perfectly, but that is fine. > The payload is not going to be used in any computations anyways, it is > strictly an information carrier that differentiates between different types > of (tagged) NA values. > > Having only one NA value in R is extremely limiting for the social sciences, > where multiple missing values may exist, because respondents: > - did not know what to respond, or > - did not want to respond, or perhaps > - the question did not apply in a given situation etc. > > All of these need to be captured, stored, and most importantly treated as if > they would be regular missing values. Whether the payload might be lost in > computations makes no difference: they were supposed to be "missing values" > anyways. > > The original question is how the payload is currently stored: as an unsigned > int of 32 bits, or as an unsigned short of 16 bits. If the R internals would > not be affected (and I see no reason why they would be), it would allow an > entire universe for the social sciences that is not currently available and > which all other major statistical packages do offer. > > Thank you very much, your attention is greatly appreciated, Adrian > > On Sun, May 23, 2021 at 7:59 PM Tomas Kalibera > wrote: > > > TLDR: tagging R NAs is not possible. 
> > > > External software should not depend on how R currently implements NA, > > this may change at any time. Tagging of NA is not supported in R (if > > it were, it would have been documented). It would not be possible to > > implement such tagging reliably with the current implementation of NA in R. > > > > NaN payload propagation is not standardized. Compilers are free to and > > do optimize code not preserving/achieving any specific propagation. > > CPUs/FPUs differ in how they propagate in binary operations, some zero > > the payload on any operation. Virtualized environments, binary > > translations, etc, may not preserve it in any way, either. ?NA has > > disclaimers about this, an NA may become NaN (payload lost) even in > > unary operations and also in binary operations not involving other NaN/NAs. > > > > Writing any new software that would depend on that anything specific > > happens to the NaN payloads would
Re: [Rd] 1954 from NA
Arguably, R was not developed to satisfy some needs in the way intended.

When I have had to work with datasets from some of the social sciences, I have had to adapt to subtleties in how they did things with software like SPSS, in which an NA was done using an out-of-bounds marker like 999 or "." or even a blank cell. The problem is that R has a concept where data such as integers or floating point numbers are not normally stored as text but in their own formats, and a vector by definition can only contain ONE data type. So the various forms of NA, as well as NaN and Inf, had to be grafted on to be considered VALID to share the same storage area as if they sort of were an integer or floating point number or text or whatever.

It does strike me as possible to simply have a column that is something like a factor that can contain as many NA excuses as you wish, such as "NOT ANSWERED" to "CANNOT READ THE SQUIGGLE" to "NOT SURE" to "WILL BE FILLED IN LATER" to "I DON'T SPEAK ENGLISH AND CANNOT ANSWER STUPID QUESTIONS". This additional column would presumably only have content when the other column has an NA. Your queries and other changes would work on something like a data.frame where both such columns coexisted.

Note that reading in data with multiple NA reasons may take extra work. If your error codes are text, it will all become text. If the error codes are 999 and 998 and 997, it may all be treated as numeric, and you may not want to convert all such codes to an NA immediately. Rather, you would use the first vector/column to make the second vector and THEN replace everything that should be an NA with an actual NA and reparse the entire vector to become properly numeric, unless you like working with text and will convert to numbers as needed on the fly.

Now this form of annotation may not be pleasing, but I suggest that an implementation that does allow annotation may use up space too. Of course, if your NA values are rare and space is only used then, you might save space. But if you could make a factor column and have it use the smallest int it can get as a basis, it may be a way to save on space.

People who have done work with R, especially those using the tidyverse, are quite used to using one column to explain another. So if you are asked to, say, tabulate what percent of missing values are due to reasons A/B/C, then the added column works fine for that calculation too.

-----Original Message-----
From: R-devel On Behalf Of Adrian Dușa
Sent: Sunday, May 23, 2021 2:04 PM
To: Tomas Kalibera
Cc: r-devel
Subject: Re: [Rd] 1954 from NA

Dear Tomas,

I understand that perfectly, but that is fine. The payload is not going to be used in any computations anyways, it is strictly an information carrier that differentiates between different types of (tagged) NA values.

Having only one NA value in R is extremely limiting for the social sciences, where multiple missing values may exist, because respondents:
- did not know what to respond, or
- did not want to respond, or perhaps
- the question did not apply in a given situation etc.

All of these need to be captured, stored, and most importantly treated as if they would be regular missing values. Whether the payload might be lost in computations makes no difference: they were supposed to be "missing values" anyways.

The original question is how the payload is currently stored: as an unsigned int of 32 bits, or as an unsigned short of 16 bits. If the R internals would not be affected (and I see no reason why they would be), it would allow an entire universe for the social sciences that is not currently available and which all other major statistical packages do offer.

Thank you very much, your attention is greatly appreciated,
Adrian

On Sun, May 23, 2021 at 7:59 PM Tomas Kalibera wrote:
> TLDR: tagging R NAs is not possible.
>
> External software should not depend on how R currently implements NA,
> this may change at any time.
Tagging of NA is not supported in R (if > it were, it would have been documented). It would not be possible to > implement such tagging reliably with the current implementation of NA in R. > > NaN payload propagation is not standardized. Compilers are free to and > do optimize code not preserving/achieving any specific propagation. > CPUs/FPUs differ in how they propagate in binary operations, some zero > the payload on any operation. Virtualized environments, binary > translations, etc, may not preserve it in any way, either. ?NA has > disclaimers about this, an NA may become NaN (payload lost) even in > unary operations and also in binary operations not involving other NaN/NAs. > > Writing any new software that would depend on that anything specific > happens to the NaN payloads would not be a good idea. One can only > reliably use the NaN payload bits for storage, that is if one avoids > any computation at all, avoids passing the values to any external code > unaware of such tagging (including R), etc
Re: [Rd] 1954 from NA
On 5/23/21 8:04 PM, Adrian Dușa wrote:
> Dear Tomas,
>
> I understand that perfectly, but that is fine.
> The payload is not going to be used in any computations anyways, it is
> strictly an information carrier that differentiates between different
> types of (tagged) NA values.

Good, but unfortunately the delineation between computation and non-computation is not always transparent. Even if an operation doesn't look like "computation" at a high level, it may internally involve computation - so, really, an R NA can become R NaN and vice versa, at any point (this is not a "feature", but it is how things are now).

> Having only one NA value in R is extremely limiting for the social
> sciences, where multiple missing values may exist, because respondents:
> - did not know what to respond, or
> - did not want to respond, or perhaps
> - the question did not apply in a given situation etc.
>
> All of these need to be captured, stored, and most importantly treated
> as if they would be regular missing values. Whether the payload might
> be lost in computations makes no difference: they were supposed to be
> "missing values" anyways.

Ok, then I would probably keep the meta-data on the missing values on the side to implement such missing values in such code, and treat them explicitly in supported operations. But, in principle, you can use the floating-point NaN payloads, and you can pass such values to R. You just need to be prepared that you would lose not only your payloads/tags, but also the difference between R NA and R NaNs. Thanks to the value semantics of R, you would not lose the tags in input values with proper reference counts (e.g. marked immutable), because those values will not be modified.

Best
Tomas

> The original question is how the payload is currently stored: as an
> unsigned int of 32 bits, or as an unsigned short of 16 bits.
If the R > internals would not be affected (and I see no reason why they would > be), it would allow an entire universe for the social sciences that is > not currently available and which all other major statistical packages > do offer. > > Thank you very much, your attention is greatly appreciated, > Adrian > > On Sun, May 23, 2021 at 7:59 PM Tomas Kalibera > mailto:tomas.kalib...@gmail.com>> wrote: > > TLDR: tagging R NAs is not possible. > > External software should not depend on how R currently implements NA, > this may change at any time. Tagging of NA is not supported in R > (if it > were, it would have been documented). It would not be possible to > implement such tagging reliably with the current implementation of > NA in R. > > NaN payload propagation is not standardized. Compilers are free to > and > do optimize code not preserving/achieving any specific propagation. > CPUs/FPUs differ in how they propagate in binary operations, some > zero > the payload on any operation. Virtualized environments, binary > translations, etc, may not preserve it in any way, either. ?NA has > disclaimers about this, an NA may become NaN (payload lost) even in > unary operations and also in binary operations not involving other > NaN/NAs. > > Writing any new software that would depend on that anything specific > happens to the NaN payloads would not be a good idea. One can only > reliably use the NaN payload bits for storage, that is if one > avoids any > computation at all, avoids passing the values to any external code > unaware of such tagging (including R), etc. If such software wants > any > NaN to be understood as NA by R, it would have to use the > documented R > API for this (so essentially translating) - but given the problems > mentioned above, there is really no point in doing that, because such > NAs become NaNs at any time. 
> > Best > Tomas > > On 5/23/21 9:56 AM, Adrian Dușa wrote: > > Dear R devs, > > > > I am probably missing something obvious, but still trying to > understand why > > the 1954 from the definition of an NA has to fill 32 bits when > it normally > > doesn't need more than 16. > > > > Wouldn't the code below achieve exactly the same thing? > > > > typedef union > > { > > double value; > > unsigned short word[4]; > > } ieee_double; > > > > > > #ifdef WORDS_BIGENDIAN > > static CONST int hw = 0; > > static CONST int lw = 3; > > #else /* !WORDS_BIGENDIAN */ > > static CONST int hw = 3; > > static CONST int lw = 0; > > #endif /* WORDS_BIGENDIAN */ > > > > > > static double R_ValueOfNA(void) > > { > > volatile ieee_double x; > > x.word[hw] = 0x7ff0; > > x.word[lw] = 1954; > > return x.value; > > } > > > > This question has to do with the tagged NA values from package > haven, on > > which I want to improve. Every available bit co
Re: [Rd] 1954 from NA
Dear Tomas, I understand that perfectly, but that is fine. The payload is not going to be used in any computations anyways, it is strictly an information carrier that differentiates between different types of (tagged) NA values. Having only one NA value in R is extremely limiting for the social sciences, where multiple missing values may exist, because respondents: - did not know what to respond, or - did not want to respond, or perhaps - the question did not apply in a given situation etc. All of these need to be captured, stored, and most importantly treated as if they would be regular missing values. Whether the payload might be lost in computations makes no difference: they were supposed to be "missing values" anyways. The original question is how the payload is currently stored: as an unsigned int of 32 bits, or as an unsigned short of 16 bits. If the R internals would not be affected (and I see no reason why they would be), it would allow an entire universe for the social sciences that is not currently available and which all other major statistical packages do offer. Thank you very much, your attention is greatly appreciated, Adrian On Sun, May 23, 2021 at 7:59 PM Tomas Kalibera wrote: > TLDR: tagging R NAs is not possible. > > External software should not depend on how R currently implements NA, > this may change at any time. Tagging of NA is not supported in R (if it > were, it would have been documented). It would not be possible to > implement such tagging reliably with the current implementation of NA in R. > > NaN payload propagation is not standardized. Compilers are free to and > do optimize code not preserving/achieving any specific propagation. > CPUs/FPUs differ in how they propagate in binary operations, some zero > the payload on any operation. Virtualized environments, binary > translations, etc, may not preserve it in any way, either. 
?NA has > disclaimers about this, an NA may become NaN (payload lost) even in > unary operations and also in binary operations not involving other NaN/NAs. > > Writing any new software that would depend on that anything specific > happens to the NaN payloads would not be a good idea. One can only > reliably use the NaN payload bits for storage, that is if one avoids any > computation at all, avoids passing the values to any external code > unaware of such tagging (including R), etc. If such software wants any > NaN to be understood as NA by R, it would have to use the documented R > API for this (so essentially translating) - but given the problems > mentioned above, there is really no point in doing that, because such > NAs become NaNs at any time. > > Best > Tomas > > On 5/23/21 9:56 AM, Adrian Dușa wrote: > > Dear R devs, > > > > I am probably missing something obvious, but still trying to understand > why > > the 1954 from the definition of an NA has to fill 32 bits when it > normally > > doesn't need more than 16. > > > > Wouldn't the code below achieve exactly the same thing? > > > > typedef union > > { > > double value; > > unsigned short word[4]; > > } ieee_double; > > > > > > #ifdef WORDS_BIGENDIAN > > static CONST int hw = 0; > > static CONST int lw = 3; > > #else /* !WORDS_BIGENDIAN */ > > static CONST int hw = 3; > > static CONST int lw = 0; > > #endif /* WORDS_BIGENDIAN */ > > > > > > static double R_ValueOfNA(void) > > { > > volatile ieee_double x; > > x.word[hw] = 0x7ff0; > > x.word[lw] = 1954; > > return x.value; > > } > > > > This question has to do with the tagged NA values from package haven, on > > which I want to improve. Every available bit counts, especially if > > multi-byte characters are going to be involved. > > > > Best wishes, > > __ > R-devel@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-devel > [[alternative HTML version deleted]] __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
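A minimal sketch of the "keep the reason on the side" alternative raised in this thread, for the three kinds of missingness listed above. It is not code from R or haven: the enum, the struct, and both function names are hypothetical, and the missing value itself is an ordinary NaN, so nothing depends on a payload surviving.

```c
#include <math.h>
#include <stddef.h>

/* Hypothetical reason codes; the names are illustrative only. */
enum missing_reason {
    NOT_MISSING = 0,
    DID_NOT_KNOW,
    DID_NOT_WANT,
    NOT_APPLICABLE
};

/* One observation: the value (NaN when missing) plus a side-channel reason. */
struct observation {
    double value;
    enum missing_reason reason;
};

/* Count the observations that are missing for one particular reason. */
static size_t count_reason(const struct observation *obs, size_t n,
                           enum missing_reason r)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (isnan(obs[i].value) && obs[i].reason == r)
            count++;
    }
    return count;
}

/* Mean of the observed values. Every missing value is skipped, whatever
   its reason. Returns NaN when nothing was observed. */
static double mean_observed(const struct observation *obs, size_t n)
{
    double sum = 0.0;
    size_t seen = 0;
    for (size_t i = 0; i < n; i++) {
        if (!isnan(obs[i].value)) {
            sum += obs[i].value;
            seen++;
        }
    }
    return seen > 0 ? sum / (double)seen : NAN;
}
```

The computation treats all missing values alike, while the tabulation by reason reads only the side field, so the reasons cannot be lost by arithmetic on the values.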
Re: [Rd] 1954 from NA
TLDR: tagging R NAs is not possible.

External software should not depend on how R currently implements NA, this may change at any time. Tagging of NA is not supported in R (if it were, it would have been documented). It would not be possible to implement such tagging reliably with the current implementation of NA in R.

NaN payload propagation is not standardized. Compilers are free to, and do, optimize code without preserving/achieving any specific propagation. CPUs/FPUs differ in how they propagate in binary operations, some zero the payload on any operation. Virtualized environments, binary translations, etc, may not preserve it in any way, either. ?NA has disclaimers about this, an NA may become NaN (payload lost) even in unary operations and also in binary operations not involving other NaN/NAs.

Writing any new software that depends on anything specific happening to the NaN payloads would not be a good idea. One can only reliably use the NaN payload bits for storage, that is if one avoids any computation at all, avoids passing the values to any external code unaware of such tagging (including R), etc. If such software wants any NaN to be understood as NA by R, it would have to use the documented R API for this (so essentially translating) - but given the problems mentioned above, there is really no point in doing that, because such NAs become NaNs at any time.

Best
Tomas

On 5/23/21 9:56 AM, Adrian Dușa wrote:

Dear R devs,

I am probably missing something obvious, but still trying to understand why the 1954 from the definition of an NA has to fill 32 bits when it normally doesn't need more than 16.

Wouldn't the code below achieve exactly the same thing?

    typedef union
    {
        double value;
        unsigned short word[4];
    } ieee_double;

    #ifdef WORDS_BIGENDIAN
    static CONST int hw = 0;
    static CONST int lw = 3;
    #else  /* !WORDS_BIGENDIAN */
    static CONST int hw = 3;
    static CONST int lw = 0;
    #endif /* WORDS_BIGENDIAN */

    static double R_ValueOfNA(void)
    {
        volatile ieee_double x;
        x.word[hw] = 0x7ff0;
        x.word[lw] = 1954;
        return x.value;
    }

This question has to do with the tagged NA values from package haven, on which I want to improve. Every available bit counts, especially if multi-byte characters are going to be involved.

Best wishes,
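One detail of the proposed code can be checked with plain integer arithmetic. With four 16-bit words, the two middle words also have to be set, otherwise they stay uninitialized. The sketch below is an illustration only, not R's source: both function names are hypothetical, and the 32-bit high word 0x7ff00000 is an assumption about the two-word layout that is not spelled out in the messages here.

```c
#include <stdint.h>

/* 64-bit pattern built from two 32-bit words, high word first. */
static uint64_t pattern_from_ints(uint32_t high, uint32_t low)
{
    return ((uint64_t)high << 32) | (uint64_t)low;
}

/* The same pattern built from four 16-bit words, most significant first.
   All four words are passed explicitly: in the union version, any word
   that is never assigned holds an indeterminate value. */
static uint64_t pattern_from_shorts(uint16_t w3, uint16_t w2,
                                    uint16_t w1, uint16_t w0)
{
    return ((uint64_t)w3 << 48) | ((uint64_t)w2 << 32) |
           ((uint64_t)w1 << 16) | (uint64_t)w0;
}
```

Under those assumptions, `pattern_from_shorts(0x7ff0, 0, 0, 1954)` and `pattern_from_ints(0x7ff00000u, 1954u)` describe the same 64 bits, provided the two middle words are zero.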
Re: [Rd] 1954 from NA
> On Sunday, May 23, 2021, 10:45:22 AM EDT, Adrian Dușa wrote:
>
> On Sun, May 23, 2021 at 4:33 PM brodie gaslam via R-devel wrote:
> > I should add, I don't know that you can rely on this
> > particular encoding of R's NA. If I were trying to restore
> > an NA from some external format, I would just generate an
> > R NA via e.g NA_real_ in the R session I'm restoring the
> > external data into, and not try to hand assemble one.
>
> Thanks for your answer, Brodie, especially on Sunday (much appreciated).

Maybe I shouldn't answer on Sunday given I've said several wrong things...

> The aim is not to reconstruct an NA, but to "tag" an NA (and yes, I was
> referring to an NA_real_ of course), as seen in action here:
> https://github.com/tidyverse/haven/blob/master/src/tagged_na.c
>
> That code:
> - preserves the first part 0x7ff0
> - preserves the last part 1954
> - adds one additional byte to store (tag) a character provided in the SEXP
> vector
>
> That is precisely my understanding, that doubles starting with 0x7ff are
> all NaNs. My question was related to the additional part 1954 from the
> low bits: why does it need 32 bits?

It probably doesn't need 32 bits. The code is trying to set all 64 bits. It seems natural to do the high 32 bits, and then the low. But I'm not R Core so don't listen to me too closely.

> The binary value of 1954 is 11110100010, which is represented by 11 bits
> occupying at most 2 bytes... So why does it need 4 bytes?
>
> Re. the possible overflow, I am not sure: 0x7ff0 is the decimal 32752,
> or the binary 111111111110000.

You are right, I had a moment and wrongly counted hex digits as bytes instead of half-bytes.

> That is just about enough to fit in the available 16 bits (actually 15
> to leave one for the sign bit), so I don't really understand why it
> would. And in any case, the union definition uses an unsigned short
> which (if my understanding is correct) should certainly not overflow:
>
> typedef union
> {
>     double value;
>     unsigned short word[4];
> } ieee_double;
>
> What is gained with this proposal: 16 additional bits to do something
> with. For the moment, only 16 are available (from the lower part of the
> high 32 bits). If the value 1954 would be checked as a short instead of
> an int, the other 16 bits would become available. And those bits could
> be extremely valuable to tag multi-byte characters, for instance, but
> also higher numbers than 32767.

Note that the stability of the payload portion of NaNs is questionable:

https://developer.r-project.org/Blog/public/2020/11/02/will-r-work-on-apple-silicon/#nanan-payload-propagation

Also, if I understand correctly, you would be asking R core to formalize the internal representation of the R NA, which I don't think is public? So that you can use those internal bits for your own purposes with a guarantee that R will not disturb them? Obviously only they can answer that.

Apologies for confusing the issue.

B,

PS: the other obviously wrong thing I said was the NA was 0x7ff0 & 1954 when it is really 0x7ff00000 & 1954.
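The bit counts in the exchange above are easy to check mechanically. The sketch below is only an illustration: the helper names `bit_length` and `fits_unsigned_short` are hypothetical. It counts the significant bits of a value and tests it against `USHRT_MAX`, which the C standard requires to be at least 65535.

```c
#include <limits.h>

/* Number of significant bits in v (0 for v == 0). */
static int bit_length(unsigned long v)
{
    int n = 0;
    while (v != 0) {
        n++;
        v >>= 1;
    }
    return n;
}

/* Nonzero if v can be stored in an unsigned short on this platform. */
static int fits_unsigned_short(unsigned long v)
{
    return v <= USHRT_MAX;
}
```

Each hex digit is four bits, so the four digits of 0x7ff0 take two bytes at most. `bit_length(0x7ff0)` reports 15 significant bits and `bit_length(1954)` reports 11, so both values fit in an unsigned short.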
Re: [Rd] 1954 from NA
On Sun, May 23, 2021 at 4:33 PM brodie gaslam via R-devel <r-devel@r-project.org> wrote:

> I should add, I don't know that you can rely on this
> particular encoding of R's NA. If I were trying to restore
> an NA from some external format, I would just generate an
> R NA via e.g NA_real_ in the R session I'm restoring the
> external data into, and not try to hand assemble one.

Thanks for your answer, Brodie, especially on Sunday (much appreciated).

The aim is not to reconstruct an NA, but to "tag" an NA (and yes, I was referring to an NA_real_ of course), as seen in action here:
https://github.com/tidyverse/haven/blob/master/src/tagged_na.c

That code:
- preserves the first part 0x7ff0
- preserves the last part 1954
- adds one additional byte to store (tag) a character provided in the SEXP vector

That is precisely my understanding, that doubles starting with 0x7ff are all NaNs. My question was related to the additional part 1954 from the low bits: why does it need 32 bits?

The binary value of 1954 is 11110100010, which is represented by 11 bits occupying at most 2 bytes... So why does it need 4 bytes?

Re. the possible overflow, I am not sure: 0x7ff0 is the decimal 32752, or the binary 111111111110000. That is just about enough to fit in the available 16 bits (actually 15 to leave one for the sign bit), so I don't really understand why it would. And in any case, the union definition uses an unsigned short which (if my understanding is correct) should certainly not overflow:

    typedef union
    {
        double value;
        unsigned short word[4];
    } ieee_double;

What is gained with this proposal: 16 additional bits to do something with. For the moment, only 16 are available (from the lower part of the high 32 bits). If the value 1954 would be checked as a short instead of an int, the other 16 bits would become available. And those bits could be extremely valuable to tag multi-byte characters, for instance, but also higher numbers than 32767.

Best wishes,
Adrian
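A rough sketch of the kind of tagging described above, kept purely as storage. It is not haven's code and not R's API. The names `make_tagged`, `get_tag` and `is_na_like` are hypothetical, the tag is assumed to sit in the lowest byte of the high 32-bit word, and the 0x7ff00000 / 1954 pattern is an assumption about the current NA layout that, as noted elsewhere in this thread, should not be relied on.

```c
#include <stdint.h>
#include <string.h>

/* Assumed layout: high word 0x7ff00000, low word 1954, with the tag
   stored in the lowest byte of the high word (bits 39-32). */
#define NA_HIGH 0x7FF00000u
#define NA_LOW  1954u

/* Build a NaN carrying a one-byte tag. */
static double make_tagged(unsigned char tag)
{
    uint64_t bits = ((uint64_t)(NA_HIGH | tag) << 32) | NA_LOW;
    double value;
    memcpy(&value, &bits, sizeof value);
    return value;
}

/* Read the tag byte back out of the bit pattern. */
static unsigned char get_tag(double value)
{
    uint64_t bits;
    memcpy(&bits, &value, sizeof bits);
    return (unsigned char)((bits >> 32) & 0xFFu);
}

/* Nonzero if the exponent bits are all ones and the low word is 1954. */
static int is_na_like(double value)
{
    uint64_t bits;
    memcpy(&bits, &value, sizeof bits);
    return ((bits >> 52) & 0x7FFu) == 0x7FFu &&
           (uint32_t)(bits & 0xFFFFFFFFu) == NA_LOW;
}
```

The tag is only meaningful while the value is stored and copied. Once the value goes through arithmetic, nothing here guarantees that `get_tag` still returns what was stored.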
Re: [Rd] 1954 from NA
I wrote about this once over here: http://www.markvanderloo.eu/yaRb/2012/07/08/representation-of-numerical-nas-in-r-and-the-1954-enigma/ -M Op zo 23 mei 2021 15:33 schreef brodie gaslam via R-devel < r-devel@r-project.org>: > I should add, I don't know that you can rely on this > particular encoding of R's NA. If I were trying to restore > an NA from some external format, I would just generate an > R NA via e.g NA_real_ in the R session I'm restoring the > external data into, and not try to hand assemble one. > > Best, > > B. > > > On Sunday, May 23, 2021, 9:23:54 AM EDT, brodie gaslam via R-devel < > r-devel@r-project.org> wrote: > > > > > > This is because the NA in question is NA_real_, which > is encoded in double precision IEEE-754, which uses > 64 bits. The "1954" is just part of the NA. The NA > must also conform to the NaN encoding for double precision > numbers, which requires that the "beginning" portion of > the number be "0x7ff0" (well, I think it should be "0x7ff8" > but that's a different story), as you can see here: > > x.word[hw] = 0x7ff0; > x.word[lw] = 1954; > > Both those components are part of the same double precision > value. They are just accessed this way to make it easy to > set the high bits (63-32) and the low bits (31-0). > > So NA is not just 1954, its 0x7ff0 & 1954 (note I'm > mixing hex and decimals here). > > In IEEE 754 double precision encoding numbers that start > with 0x7ff are all NaNs. The rest of the number except for > the first bit which designates "quiet" vs "signaling" NaNs can > be anything. R has taken advantage of that to designate the > R NA by setting the lower bits to be 1954. > > Note I'm being pretty loose about endianess, etc. here, but > hopefully this conveys the problem. > > In terms of your proposal, I'm not entirely sure what you gain. > You're still attempting to generate a 64 bit representation > in the end. 
If all you need is to encode the fact that there > was an NA, and restore it later as a 64 bit NA, then you can do > whatever you want so long as the end result conforms to the > expected encoding. > > In terms of using 'short' here (which again, I don't see the > need for as you're using it to generate the final 64 bit encoding), > I see two possible problems. You're adding the dependency that > short will be 16 bits. We already have the (implicit) assumption > in R that double is 64 bits, and explicit that int is 32 bits. > But I think you'd be going a bit on a limb assuming that short > is 16 bits (not sure). More important, if short is indeed 16 bits, > I think in: > > x.word[hw] = 0x7ff0; > > You overflow short. > > Best, > > B. > > > > On Sunday, May 23, 2021, 8:56:18 AM EDT, Adrian Dușa < > dusa.adr...@unibuc.ro> wrote: > > > > > > Dear R devs, > > I am probably missing something obvious, but still trying to understand why > the 1954 from the definition of an NA has to fill 32 bits when it normally > doesn't need more than 16. > > Wouldn't the code below achieve exactly the same thing? > > typedef union > { > double value; > unsigned short word[4]; > } ieee_double; > > > #ifdef WORDS_BIGENDIAN > static CONST int hw = 0; > static CONST int lw = 3; > #else /* !WORDS_BIGENDIAN */ > static CONST int hw = 3; > static CONST int lw = 0; > #endif /* WORDS_BIGENDIAN */ > > > static double R_ValueOfNA(void) > { > volatile ieee_double x; > x.word[hw] = 0x7ff0; > x.word[lw] = 1954; > return x.value; > } > > This question has to do with the tagged NA values from package haven, on > which I want to improve. Every available bit counts, especially if > multi-byte characters are going to be involved. > > Best wishes, > -- > Adrian Dusa > University of Bucharest > Romanian Social Data Archive > Soseaua Panduri nr. 
90-92 > 050663 Bucharest sector 5 > Romania > https://adriandusa.eu > > [[alternative HTML version deleted]] > > __ > R-devel@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-devel > > > __ > R-devel@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-devel > > __ > R-devel@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-devel > [[alternative HTML version deleted]] __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] 1954 from NA
I should add, I don't know that you can rely on this particular encoding of R's NA. If I were trying to restore an NA from some external format, I would just generate an R NA via e.g. NA_real_ in the R session I'm restoring the external data into, and not try to hand-assemble one.

Best,

B.
Re: [Rd] 1954 from NA
This is because the NA in question is NA_real_, which is encoded as an IEEE-754 double precision number, which uses 64 bits. The "1954" is just part of the NA. The NA must also conform to the NaN encoding for double precision numbers, which requires that the "beginning" portion of the number be "0x7ff0" (well, I think it should be "0x7ff8" but that's a different story), as you can see here:

    x.word[hw] = 0x7ff0;
    x.word[lw] = 1954;

Both those components are part of the same double precision value. They are just accessed this way to make it easy to set the high bits (63-32) and the low bits (31-0). So NA is not just 1954, it's 0x7ff0 in the high bits together with 1954 in the low bits (note I'm mixing hex and decimal here).

In IEEE 754 double precision encoding, numbers whose high bits are 0x7ff (all exponent bits set) are NaNs, as long as the remaining bits are not all zero (that pattern is Inf). Apart from that, the rest of the number, except for the first mantissa bit, which designates "quiet" vs "signaling" NaNs, can be anything. R has taken advantage of that to designate the R NA by setting the lower bits to 1954. Note I'm being pretty loose about endianness, etc. here, but hopefully this conveys the problem.

In terms of your proposal, I'm not entirely sure what you gain. You're still attempting to generate a 64 bit representation in the end. If all you need is to encode the fact that there was an NA, and restore it later as a 64 bit NA, then you can do whatever you want so long as the end result conforms to the expected encoding.

In terms of using 'short' here (which again, I don't see the need for, as you're using it to generate the final 64 bit encoding), the main problem I see is that you're adding the dependency that short will be 16 bits. We already have the (implicit) assumption in R that double is 64 bits, and the explicit one that int is 32 bits. But I think you'd be going out on a limb assuming that short is 16 bits (not sure). I also wondered whether

    x.word[hw] = 0x7ff0;

overflows a 16-bit short, but 0x7ff0 is 32752, which fits in 16 bits, so that part looks fine.

Best,

B.
[Rd] 1954 from NA
Dear R devs,

I am probably missing something obvious, but still trying to understand why the 1954 from the definition of an NA has to fill 32 bits when it normally doesn't need more than 16.

Wouldn't the code below achieve exactly the same thing?

    typedef union
    {
        double value;
        unsigned short word[4];
    } ieee_double;

    #ifdef WORDS_BIGENDIAN
    static CONST int hw = 0;
    static CONST int lw = 3;
    #else  /* !WORDS_BIGENDIAN */
    static CONST int hw = 3;
    static CONST int lw = 0;
    #endif /* WORDS_BIGENDIAN */

    static double R_ValueOfNA(void)
    {
        volatile ieee_double x;
        x.word[hw] = 0x7ff0;
        x.word[lw] = 1954;
        return x.value;
    }

This question has to do with the tagged NA values from package haven, on which I want to improve. Every available bit counts, especially if multi-byte characters are going to be involved.

Best wishes,
--
Adrian Dusa
University of Bucharest
Romanian Social Data Archive
Soseaua Panduri nr. 90-92
050663 Bucharest sector 5
Romania
https://adriandusa.eu