Re: [Rd] 1954 from NA

2021-05-24 Thread Avi Gross via R-devel
Adrian,

 

This is an aside. I note in many machine-learning algorithms they actually do 
something along the lines being discussed. They may take an item like a 
paragraph of words or an email message  and add thousands of columns with each 
one being a Boolean specifying if a particular word is in or not in that item. 
They may then run an analysis trying to heuristically match known SPAM items so 
as to be able to predict if new items might be SPAM. Some may even have a 
column for words taken two or more at a time such as “must” followed by “have” 
or “Your”, “last”, “chance”, resulting in even more columns. The software that 
does the analysis can work on remarkably large such collections, including in 
some cases taking multiple approaches to the same problem and choosing among 
them in some way.
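
To make the word-flag idea concrete, a tiny base R sketch (the messages and 
word list here are invented for illustration):

msgs  <- c("must have your last chance", "meeting moved to noon")
words <- c("must", "have", "chance")
# one logical column per word: TRUE if that word occurs in the message
flags <- sapply(words, function(w) grepl(paste0("\\b", w, "\\b"), msgs))
flags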

 

In your case, yes, adding lots of columns seems like added work. But in data 
science, often the easiest way to do complex things is to loop over selected 
existing columns and create multiple sets of additional columns that simplify 
later calculations, using these stored values rather than some multi-line 
complex condition. As an example, I have run statistical analyses where I kept 
a Boolean column recording whether the analysis failed (as in, I caught it 
using try() or else it would have killed my process), another recording whether 
I was told it did not converge properly, and yet another recording whether it 
failed some post-tests. That simplified queries which excluded rows where any 
one of the above was TRUE. I also stored columns for metrics like RMSEA and 
chi-squared values, sometimes dozens. And for each of the above, I actually had 
a set of columns for various models, such as linear versus quadratic and more. 
Worse, as the analysis continued, more derived columns were added as various 
measures of the above results were compared to each other, so the different 
models could be compared, as in how often each was better. Careful choices of 
naming conventions and nice features of the tidyverse made it fairly simple to 
operate on many columns in the same way, such as all columns whose names start 
with a given string or end with …

 

And, yes, for some efficiency, I often made a narrower version of the above 
with just the fields I needed and was careful not to remove what I might need 
later.

 

So it can be done, and fairly trivially if you know what you are doing. If the 
names of all your original columns that behave this way look like *.orig and 
the others look different, you can ask for a function to be applied to just 
those, producing another set with the same prefixes but named *.converted, and 
yet another called *.annotation, and so on. You may want to remove the 
originals to save space, but you get the idea. The fact that there are six 
hundred columns means little with such a design, as the above can be done to 
all of them at once in probably a dozen lines of code.
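
A minimal sketch of that step using dplyr's across() (assumes dplyr >= 1.0; 
the column names and the conversion function are invented, and the new columns 
are suffixed rather than renamed, for simplicity):

library(dplyr)

df <- tibble(a.orig = c(1, 999, 3), b.orig = c(999, 2, 5))
convert <- function(x) ifelse(x == 999, NA, x)   # hypothetical conversion

# add a converted copy of every *.orig column in one statement
df <- mutate(df, across(ends_with(".orig"), convert,
                        .names = "{.col}.converted"))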

 

For me, the above is way less complex than what you want to do and can have 
benefits. For example, if you make a graph of points from my larger 
tibble/data.frame using ggplot(), you can specify the color of a point using a 
variable that contains the reason the data was missing (albeit that assumes 
the missing part is not what is being graphed), or add text giving the reason 
just above each such point. Your method of hiding multiple meanings inside 
what YOU claim is an NA may not make the above example doable.
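
A minimal ggplot2 sketch of the first idea (all column names invented):

library(ggplot2)

d <- data.frame(x = 1:4, y = c(2.1, 3.4, 1.7, 2.9),
                na_reason = c(NA, "refused", NA, "did not know"))
# colour each point by the recorded reason; rows without a reason
# (na_reason is NA) are the ordinary observed values
ggplot(d, aes(x, y, colour = na_reason)) +
  geom_point() +
  geom_text(aes(label = na_reason), nudge_y = 0.1, na.rm = TRUE)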

 

From: Adrian Dușa <dusa.adr...@unibuc.ro>
Sent: Monday, May 24, 2021 8:18 AM
To: Greg Minshall <minsh...@umich.edu>
Cc: Avi Gross <avigr...@verizon.net>; r-devel <r-devel@r-project.org>
Subject: Re: [Rd] 1954 from NA

 

On Mon, May 24, 2021 at 2:11 PM Greg Minshall <minsh...@umich.edu> wrote:

[...]
if you have 500 columns of possibly-NA'd variables, you could have one
column of 500 "bits", where each bit has one of N values, N being the
number of explanations the corresponding column has for why the NA
exists.

 

The mere thought of implementing something like that gives me shivers. Not to 
mention such a solution should also be robust when subsetting, splitting, 
column and row binding, etc. and everything can be lost if the user deletes 
that particular column without realising its importance.

 

Social science datasets are much more alive and complex than one might first 
think: there are multi-wave studies spanning tens of countries, and aggregating 
such data is already a complex process without adding even more complexity on 
top of it.

 

As undocumented as they may be, or even subject to change, I think the R 
internals are much more reliable than this.

 

Best wishes,

Adrian

 

-- 

Adrian Dusa
University of Bucharest
Romanian Social Data Archive

Re: [Rd] 1954 from NA

2021-05-24 Thread Nicholas Tierney
Hi all,

When I first heard about ALTREP, I wondered how it might be used to store
special missing value information - how can we learn more about
implementing ALTREP classes? The idea of carrying around a "meaning
of my NAs" vector, as Gabe said, would be very interesting!

I've done a bit on creating "special missing values", as done in SPSS, SAS,
and STATA, here:
http://naniar.njtierney.com/articles/special-missing-values.html  (Note
this approach requires carrying a duplicated dataframe of missing data
around with the data - which I argue makes it easier to reason with, at the
cost of storage. However this is just my approach, and there are others out
there).
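
Stripped of naniar's API, the core idea is a parallel "shadow" data frame -
a minimal base R sketch (column names invented):

df     <- data.frame(income = c(1200, NA, NA))
shadow <- data.frame(income_na = c("!NA", "refused", "did not know"))
cbind(df, shadow)   # the values and the reasons travel together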

Best,

Nick

On Tue, 25 May 2021 at 01:16, Adrian Dușa  wrote:

> On Mon, May 24, 2021 at 5:47 PM Gabriel Becker 
> wrote:
>
> > Hi Adrian,
> >
> > I had the same thought as Luke. It is possible that you can develop an
> > ALTREP that carries around the tagging information you're looking for in
> a
> > way that is more persistent (in some cases) than R-level attributes and
> > more hidden than additional user-visible columns.
> >
> > The downsides to this, of course, is that you'll in some sense be doing
> > the same "extra vector for each vector you want tagged NA-s within" under
> > the hood, and that only custom machinery you write will recognize things
> as
> > something other than bog-standard NAs/NaNs.  You'll also have some
> problems
> > with the fact that data in ALTREPs isn't currently modifiable without
> > losing ALTREPness. That said, ALTREPs are allowed to carry around
> arbitrary
> > persistent information with them, so from that perspective making an
> ALTREP
> > that carries around a "meaning of my NAs" vector of tags in its metadata
> > would be pretty straightforward.
> >
>
> Oh... now that is extremely interesting.
> It is the first time I came across the ALTREP concept, so I need to study
> the way it works before saying anything, but definitely something to
> consider.
>
> Thanks so much for the pointer,
> Adrian
>
> [[alternative HTML version deleted]]
>
> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>


__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] 1954 from NA

2021-05-24 Thread Gabriel Becker
Hi All,

So there is a not particularly active, but closely curated (i.e. everything
on there should be good in terms of principled examples) GitHub
organization of ALTREP examples: https://github.com/ALTREP-examples.
Currently there are two examples by Luke (including a package version of
the memory map ALTREP he wrote) and one by me.

To elaborate a bit more: it looks like you could have read-only vectors with
tagged NAs because, despite my incorrect recollection, Extract_subset IS
hooked up, so subsetting an ALTREP can, depending on the
ALTREP class, give you another ALTREP.

They would effectively be subsettable but not mutable, though,
because setting elements in an ALTREP vector still wipes its ALTREPness.
This is unfortunate, but it is an intentional design decision that itself
currently appears immutable, if you'll excuse the pun, last I heard.

I understand that that is a relatively sizable caveat, but c'est la vie.

Assuming that things would be useful with that caveat, I can try to put a
proof-of-concept example into that organization that could work as a
springboard for a deeper collaboration soon. I think I have in my head a
way to approach it.

~G

On Mon, May 24, 2021 at 3:00 PM Nicholas Tierney 
wrote:

> Hi all,
>
> When I first heard about ALTREP, I wondered how it might be used to store
> special missing value information - how can we learn more
> about implementing ALTREP classes? The idea of carrying around a "meaning
> of my NAs" vector, as Gabe said, would be very interesting!
>
> I've done a bit on creating "special missing values", as done in SPSS,
> SAS, and STATA, here:
> http://naniar.njtierney.com/articles/special-missing-values.html  (Note
> this approach requires carrying a duplicated dataframe of missing data
> around with the data - which I argue makes it easier to reason with, at the
> cost of storage. However this is just my approach, and there are others out
> there).
>
> Best,
>
> Nick
>
> On Tue, 25 May 2021 at 01:16, Adrian Dușa  wrote:
>
>> On Mon, May 24, 2021 at 5:47 PM Gabriel Becker 
>> wrote:
>>
>> > Hi Adrian,
>> >
>> > I had the same thought as Luke. It is possible that you can develop an
>> > ALTREP that carries around the tagging information you're looking for
>> in a
>> > way that is more persistent (in some cases) than R-level attributes and
>> > more hidden than additional user-visible columns.
>> >
>> > The downsides to this, of course, is that you'll in some sense be doing
>> > the same "extra vector for each vector you want tagged NA-s within"
>> under
>> > the hood, and that only custom machinery you write will recognize
>> things as
>> > something other than bog-standard NAs/NaNs.  You'll also have some
>> problems
>> > with the fact that data in ALTREPs isn't currently modifiable without
>> > losing ALTREPness. That said, ALTREPs are allowed to carry around
>> arbitrary
>> > persistent information with them, so from that perspective making an
>> ALTREP
>> > that carries around a "meaning of my NAs" vector of tags in its metadata
>> > would be pretty straightforward.
>> >
>>
>> Oh... now that is extremely interesting.
>> It is the first time I came across the ALTREP concept, so I need to study
>> the way it works before saying anything, but definitely something to
>> consider.
>>
>> Thanks so much for the pointer,
>> Adrian
>>
>> [[alternative HTML version deleted]]
>>
>> __
>> R-devel@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>
>


__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] 1954 from NA

2021-05-24 Thread Adrian Dușa
On Mon, May 24, 2021 at 5:47 PM Gabriel Becker 
wrote:

> Hi Adrian,
>
> I had the same thought as Luke. It is possible that you can develop an
> ALTREP that carries around the tagging information you're looking for in a
> way that is more persistent (in some cases) than R-level attributes and
> more hidden than additional user-visible columns.
>
> The downsides to this, of course, is that you'll in some sense be doing
> the same "extra vector for each vector you want tagged NA-s within" under
> the hood, and that only custom machinery you write will recognize things as
> something other than bog-standard NAs/NaNs.  You'll also have some problems
> with the fact that data in ALTREPs isn't currently modifiable without
> losing ALTREPness. That said, ALTREPs are allowed to carry around arbitrary
> persistent information with them, so from that perspective making an ALTREP
> that carries around a "meaning of my NAs" vector of tags in its metadata
> would be pretty straightforward.
>

Oh... now that is extremely interesting.
It is the first time I came across the ALTREP concept, so I need to study
the way it works before saying anything, but definitely something to
consider.

Thanks so much for the pointer,
Adrian


__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] 1954 from NA

2021-05-24 Thread Gabriel Becker
Hi Adrian,

I had the same thought as Luke. It is possible that you can develop an
ALTREP that carries around the tagging information you're looking for in a
way that is more persistent (in some cases) than R-level attributes and
more hidden than additional user-visible columns.

The downsides to this, of course, is that you'll in some sense be doing the
same "extra vector for each vector you want tagged NA-s within" under the
hood, and that only custom machinery you write will recognize things as
something other than bog-standard NAs/NaNs.  You'll also have some problems
with the fact that data in ALTREPs isn't currently modifiable without
losing ALTREPness. That said, ALTREPs are allowed to carry around arbitrary
persistent information with them, so from that perspective making an ALTREP
that carries around a "meaning of my NAs" vector of tags in its metadata
would be pretty straightforward.

Best,
~G

On Mon, May 24, 2021 at 7:30 AM Adrian Dușa  wrote:

> Hi Taras,
>
> On Mon, May 24, 2021 at 4:20 PM Taras Zakharko 
> wrote:
>
> > Hi Adrian,
> >
> > Have a look at vctrs package — they have low-level primitives that might
> > simplify your life a bit. I think you can get quite far by creating a
> > custom type that stores NAs in an attribute and utilizes vctrs proxy
> > functionality to preserve these attributes across different operations.
> > Going that route will likely give you a much more flexible and robust
> > solution.
> >
>
> Yes, I am well aware of the primitives from package vctrs, since package
> haven itself uses the vctrs_vctr class.
> They're doing very interesting work, albeit not a solution for this
> particular problem.
>
> A.
>
> [[alternative HTML version deleted]]
>
> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>


__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] 1954 from NA

2021-05-24 Thread Adrian Dușa
On Mon, May 24, 2021 at 4:40 PM Bertram, Alexander via R-devel <
r-devel@r-project.org> wrote:

> Dear Adrian,
> SPSS and other packages handle this problem in a very similar way to what I
> described: they store additional metadata for each variable. You can see
> this in the way that SPSS organizes its file format: each "variable" has
> additional metadata that indicate how specific values of the variable,
> encoded as an integer or a floating point should be handled in analysis.
> Before you actually run a crosstab in SPSS, the metadata is (presumably)
> applied to the raw data to arrive at an in memory buffer on which the
> actual model is fitted, etc.
>

As far as I am aware, SAS and Stata use "very high" and "very low" values
to signal a missing value. Basically, the same solution using a different
sign bit (not creating attributes metadata, though).

Something similar to the IEEE-754 representation for the NaN:
0x7ff0

only using some other "high" word:
0x7fe0

If I understand this correctly, compilers are likely to mess around with
the payload from the 0x7ff0... stuff, which endangers even the most basic R
structure like a real NA.
Perhaps using a different high word such as 0x7fe would be stable, since
compilers won't confuse it with a NaN. And then any payload would be "safe"
for any specific purpose.
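
For what it is worth, the current NA bit pattern (high word 0x7ff00000, low
word 1954) can be reproduced from the R level - a base R sketch that assumes
a little-endian platform such as x86-64, and no further arithmetic on the
value:

bits <- writeBin(c(1954L, 0x7FF00000L), raw())   # low word, then high word
x <- readBin(bits, what = "double", n = 1)
is.na(x)   # TRUE: this is exactly R's NA_real_ pattern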

Not sure how SPSS manages its internals, but if they do it that way, they
manage it in a standard procedural way. Now, since R's NA payload is at
risk, and if your solution is "good" for specific social science missing
data, would you recommend that the R creators adopt it for a regular NA...?

We're looking for a general purpose solution that would create as little
additional work as possible for the end users. Your solution is already
implemented in the package "labelled", whose function user_na_to_na() is
applied before doing any statistical analysis. That still requires users to
pay attention to details which the software should take care of automatically.

Best,
Adrian

> The 20 line solution in R looks like this:
>
>
> df <- data.frame(q1 = c(1, 10, 50, 999), q2 = c("Yes", "No", "Don't know",
> "Interviewer napping"), stringsAsFactors = FALSE)
> attr(df$q1, 'missing') <- 999
> attr(df$q2, 'missing') <- c("Don't know", "Interviewer napping")
>
> excludeMissing <- function(df) {
>   for(q in names(df)) {
>     v <- df[[q]]
>     mv <- attr(v, 'missing')
>     if(!is.null(mv)) {
>       df[[q]] <- ifelse(v %in% mv, NA, v)
>     }
>   }
>   df
> }
>
> table(excludeMissing(df))
>
> If you want to preserve the missing attribute when subsetting the vectors
> then you will have to take the example further by adding a class and
> `[.withMissing` functions. This might bring the whole project to a few
> hundred lines, but the rules that apply here are well defined and well
> understood, giving you a proper basis on which to build. And perhaps the
> vctrs package might make this even simpler, take a look.
>
> Best,
> Alex
>
>


__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] 1954 from NA

2021-05-24 Thread Adrian Dușa
Hi Taras,

On Mon, May 24, 2021 at 4:20 PM Taras Zakharko 
wrote:

> Hi Adrian,
>
> Have a look at vctrs package — they have low-level primitives that might
> simplify your life a bit. I think you can get quite far by creating a
> custom type that stores NAs in an attribute and utilizes vctrs proxy
> functionality to preserve these attributes across different operations.
> Going that route will likely give you a much more flexible and robust
> solution.
>

Yes, I am well aware of the primitives from package vctrs, since package
haven itself uses the vctrs_vctr class.
They're doing very interesting work, albeit not a solution for this
particular problem.

A.


__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] 1954 from NA

2021-05-24 Thread Bertram, Alexander via R-devel
Dear Adrian,
SPSS and other packages handle this problem in a very similar way to what I
described: they store additional metadata for each variable. You can see
this in the way that SPSS organizes its file format: each "variable" has
additional metadata that indicate how specific values of the variable,
encoded as an integer or a floating point should be handled in analysis.
Before you actually run a crosstab in SPSS, the metadata is (presumably)
applied to the raw data to arrive at an in memory buffer on which the
actual model is fitted, etc.

The 20 line solution in R looks like this:


df <- data.frame(q1 = c(1, 10, 50, 999), q2 = c("Yes", "No", "Don't know",
"Interviewer napping"), stringsAsFactors = FALSE)
attr(df$q1, 'missing') <- 999
attr(df$q2, 'missing') <- c("Don't know", "Interviewer napping")

excludeMissing <- function(df) {
  for(q in names(df)) {
    v <- df[[q]]
    mv <- attr(v, 'missing')
    if(!is.null(mv)) {
      df[[q]] <- ifelse(v %in% mv, NA, v)
    }
  }
  df
}

table(excludeMissing(df))

If you want to preserve the missing attribute when subsetting the vectors
then you will have to take the example further by adding a class and
`[.withMissing` functions. This might bring the whole project to a few
hundred lines, but the rules that apply here are well defined and well
understood, giving you a proper basis on which to build. And perhaps the
vctrs package might make this even simpler, take a look.
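
A minimal sketch of that subsetting method, under the same 'missing'
attribute convention as above:

`[.withMissing` <- function(x, ...) {
  out <- NextMethod()                          # default subsetting
  attr(out, "missing") <- attr(x, "missing")   # re-attach the metadata
  class(out) <- class(x)
  out
}

q1 <- structure(c(1, 10, 50, 999), missing = 999, class = "withMissing")
attr(q1[2:4], "missing")   # still 999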

Best,
Alex

On Mon, May 24, 2021 at 3:20 PM Taras Zakharko 
wrote:

> Hi Adrian,
>
> Have a look at vctrs package — they have low-level primitives that might
> simplify your life a bit. I think you can get quite far by creating a
> custom type that stores NAs in an attribute and utilizes vctrs proxy
> functionality to preserve these attributes across different operations.
> Going that route will likely give you a much more flexible and robust
> solution.
>
> Best,
>
> Taras
>
> > On 24 May 2021, at 15:09, Adrian Dușa  wrote:
> >
> > Dear Alex,
> >
> > Thanks for piping in, I am learning with each new message.
> > The problem is clear, the solution escapes me though. I've already tried
> > the attributes route: it is going to triple the data size: along with the
> > additional (logical) variable that specifies which level is missing, one
> > also needs to store an index such that sorting the data would still
> > maintain the correct information.
> >
> > One also needs to think about subsetting (subset the attributes as well),
> > splitting (the same), aggregating multiple datasets (even more
> attention),
> > creating custom vectors out of multiple variables... complexity quickly
> > grows towards infinity.
> >
> > R factors are nice indeed, but:
> > - there are numerical variables which can hold multiple missing values
> (for
> > instance income)
> > - factors convert the original questionnaire values: if a missing value
> was
> > coded 999, turning that into a factor would convert that value into
> > something else
> >
> > I really, and wholeheartedly, do appreciate all advice: but please be
> > assured that I have been thinking about this for more than 10 years and
> > still haven't found a satisfactory solution.
> >
> > Which makes it even more intriguing, since other software like SAS or
> Stata
> > have solved this for decades: what is their implementation, and how come
> > they don't seem to be affected by the new M1 architecture?
> > When package "haven" introduced the tagged NA values I said: ah-haa... so
> > that is how it's done... only to learn that implementation is just as
> > fragile as the R internals.
> >
> > There really should be a robust solution for this seemingly mundane
> > problem, but apparently it is far from mundane...
> >
> > Best wishes,
> > Adrian
> >
> >
> > On Mon, May 24, 2021 at 3:29 PM Bertram, Alexander <
> a...@bedatadriven.com>
> > wrote:
> >
> >> Dear Adrian,
> >> I just wanted to pipe in and underscore Thomas' point: the payload bits
> of
> >> IEEE 754 floating point values are no place to store data that you care
> >> about or need to keep. That is not only related to the R APIs, but also
> how
> >> processors handle floating point values and signaling and non-signaling
> >> NaNs. It is very difficult to reason about when and under which
> >> circumstances these bits are preserved. I spent a lot of time working on
> >> Renjin's handling of these values and I can assure that any such scheme
> >> will end in tears.
> >>
> >> A far, far better option is to use R's attributes to store this kind of
> >> metadata. This is exactly what this language feature is for. There is
> >> already a standard 'levels' attribute that holds the labels of factors
> like
> >> "Yes", "No" , "Refused", "Interviewer error'' etc. In the past, I've
> worked
> >> on projects where we stored an additional attribute like "missingLevels"
> >> that stores extra metadata on which levels should be used in which kind
> of
> >> analysis. That way, you can preserve all the 

Re: [Rd] 1954 from NA

2021-05-24 Thread Taras Zakharko
Hi Adrian, 

Have a look at vctrs package — they have low-level primitives that might 
simplify your life a bit. I think you can get quite far by creating a custom 
type that stores NAs in an attribute and utilizes vctrs proxy functionality to 
preserve these attributes across different operations. Going that route will 
likely give you a much more flexible and robust solution.

Best, 

Taras

> On 24 May 2021, at 15:09, Adrian Dușa  wrote:
> 
> Dear Alex,
> 
> Thanks for piping in, I am learning with each new message.
> The problem is clear, the solution escapes me though. I've already tried
> the attributes route: it is going to triple the data size: along with the
> additional (logical) variable that specifies which level is missing, one
> also needs to store an index such that sorting the data would still
> maintain the correct information.
> 
> One also needs to think about subsetting (subset the attributes as well),
> splitting (the same), aggregating multiple datasets (even more attention),
> creating custom vectors out of multiple variables... complexity quickly
> grows towards infinity.
> 
> R factors are nice indeed, but:
> - there are numerical variables which can hold multiple missing values (for
> instance income)
> - factors convert the original questionnaire values: if a missing value was
> coded 999, turning that into a factor would convert that value into
> something else
> 
> I really, and wholeheartedly, do appreciate all advice: but please be
> assured that I have been thinking about this for more than 10 years and
> still haven't found a satisfactory solution.
> 
> Which makes it even more intriguing, since other software like SAS or Stata
> have solved this for decades: what is their implementation, and how come
> they don't seem to be affected by the new M1 architecture?
> When package "haven" introduced the tagged NA values I said: ah-haa... so
> that is how it's done... only to learn that implementation is just as
> fragile as the R internals.
> 
> There really should be a robust solution for this seemingly mundane
> problem, but apparently it is far from mundane...
> 
> Best wishes,
> Adrian
> 
> 
> On Mon, May 24, 2021 at 3:29 PM Bertram, Alexander 
> wrote:
> 
>> Dear Adrian,
>> I just wanted to pipe in and underscore Thomas' point: the payload bits of
>> IEEE 754 floating point values are no place to store data that you care
>> about or need to keep. That is not only related to the R APIs, but also how
>> processors handle floating point values and signaling and non-signaling
>> NaNs. It is very difficult to reason about when and under which
>> circumstances these bits are preserved. I spent a lot of time working on
>> Renjin's handling of these values and I can assure that any such scheme
>> will end in tears.
>> 
>> A far, far better option is to use R's attributes to store this kind of
>> metadata. This is exactly what this language feature is for. There is
>> already a standard 'levels' attribute that holds the labels of factors like
>> "Yes", "No" , "Refused", "Interviewer error'' etc. In the past, I've worked
>> on projects where we stored an additional attribute like "missingLevels"
>> that stores extra metadata on which levels should be used in which kind of
>> analysis. That way, you can preserve all the information, and then write a
>> utility function which automatically applies certain logic to a whole
>> dataframe just before passing the data to an analysis function. This is
>> also important because in surveys like this, different values should be
>> excluded at different times. For example, you might want to include all
>> responses in a data quality report, but exclude interviewer error and
>> refusals when conducting a PCA or fitting a model.
>> 
>> Best,
>> Alex
>> 
>> On Mon, May 24, 2021 at 2:03 PM Adrian Dușa  wrote:
>> 
>>> On Mon, May 24, 2021 at 1:31 PM Tomas Kalibera 
>>> wrote:
>>> 
 [...]
 
 For the reasons I explained, I would be against such a change. Keeping
>>> the
 data on the side, as also recommended by others on this list, would
>>> allow
 you for a reliable implementation. I don't want to support fragile
>>> package
 code building on unspecified R internals, and in this case particularly
 internals that themselves have not stood the test of time, so are at
>>> high
 risk of change.
 
>>> I understand, and it makes sense.
>>> We'll have to wait for the R internals to settle (this really is
>>> surprising, I wonder how other software have solved this). In the
>>> meantime,
>>> I will probably go ahead with NaNs.
>>> 
>>> Thank you again,
>>> Adrian
>>> 
>>>[[alternative HTML version deleted]]
>>> 
>>> __
>>> R-devel@r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>> 
>> 
>> 
>> --
>> Alexander Bertram
>> Technical Director
>> *BeDataDriven BV*
>> 
>> Web: http://bedatadriven.com
>> Email: a...@bedatadriven.com
>> Tel. 

Re: [Rd] 1954 from NA

2021-05-24 Thread Adrian Dușa
Dear Alex,

Thanks for piping in, I am learning with each new message.
The problem is clear; the solution escapes me though. I've already tried
the attributes route: it is going to triple the data size, because along with
the additional (logical) variable that specifies which level is missing, one
also needs to store an index so that sorting the data would still
maintain the correct information.

One also needs to think about subsetting (subset the attributes as well),
splitting (the same), aggregating multiple datasets (even more attention),
creating custom vectors out of multiple variables... complexity quickly
grows towards infinity.

R factors are nice indeed, but:
- there are numerical variables which can hold multiple missing values (for
instance income)
- factors convert the original questionnaire values: if a missing value was
coded 999, turning that into a factor would convert that value into
something else

I really, and wholeheartedly, do appreciate all advice: but please be
assured that I have been thinking about this for more than 10 years and
still haven't found a satisfactory solution.

Which makes it even more intriguing, since other software like SAS or Stata
have solved this for decades: what is their implementation, and how come
they don't seem to be affected by the new M1 architecture?
When package "haven" introduced the tagged NA values I said: ah-haa... so
that is how it's done... only to learn that implementation is just as
fragile as the R internals.

There really should be a robust solution for this seemingly mundane
problem, but apparently it is far from mundane...

Best wishes,
Adrian


On Mon, May 24, 2021 at 3:29 PM Bertram, Alexander 
wrote:

> Dear Adrian,
> I just wanted to pipe in and underscore Thomas' point: the payload bits of
> IEEE 754 floating point values are no place to store data that you care
> about or need to keep. That is not only related to the R APIs, but also how
> processors handle floating point values and signaling and non-signaling
> NaNs. It is very difficult to reason about when and under which
> circumstances these bits are preserved. I spent a lot of time working on
> Renjin's handling of these values and I can assure that any such scheme
> will end in tears.
>
> A far, far better option is to use R's attributes to store this kind of
> metadata. This is exactly what this language feature is for. There is
> already a standard 'levels' attribute that holds the labels of factors like
> "Yes", "No" , "Refused", "Interviewer error'' etc. In the past, I've worked
> on projects where we stored an additional attribute like "missingLevels"
> that stores extra metadata on which levels should be used in which kind of
> analysis. That way, you can preserve all the information, and then write a
> utility function which automatically applies certain logic to a whole
> dataframe just before passing the data to an analysis function. This is
> also important because in surveys like this, different values should be
> excluded at different times. For example, you might want to include all
> responses in a data quality report, but exclude interviewer error and
> refusals when conducting a PCA or fitting a model.
>
> Best,
> Alex
>
> On Mon, May 24, 2021 at 2:03 PM Adrian Dușa  wrote:
>
>> On Mon, May 24, 2021 at 1:31 PM Tomas Kalibera 
>> wrote:
>>
>> > [...]
>> >
>> > For the reasons I explained, I would be against such a change. Keeping
>> the
>> > data on the side, as also recommended by others on this list, would
>> allow
>> > you for a reliable implementation. I don't want to support fragile
>> package
>> > code building on unspecified R internals, and in this case particularly
>> > internals that themselves have not stood the test of time, so are at
>> high
>> > risk of change.
>> >
>> I understand, and it makes sense.
>> We'll have to wait for the R internals to settle (this really is
>> surprising, I wonder how other software have solved this). In the
>> meantime,
>> I will probably go ahead with NaNs.
>>
>> Thank you again,
>> Adrian
>>
>> [[alternative HTML version deleted]]
>>
>> __
>> R-devel@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>
>
>
> --
> Alexander Bertram
> Technical Director
> *BeDataDriven BV*
>
> Web: http://bedatadriven.com
> Email: a...@bedatadriven.com
> Tel. Nederlands: +31(0)647205388
> Skype: akbertram
>


__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] 1954 from NA

2021-05-24 Thread Adrian Dușa
On Mon, May 24, 2021 at 2:11 PM Greg Minshall  wrote:

> [...]
> if you have 500 columns of possibly-NA'd variables, you could have one
> column of 500 "bits", where each bit has one of N values, N being the
> number of explanations the corresponding column has for why the NA
> exists.
>

The mere thought of implementing something like that gives me shivers. Not
to mention such a solution should also be robust when subsetting,
splitting, column and row binding, etc. and everything can be lost if the
user deletes that particular column without realising its importance.

Social science datasets are much more alive and complex than one might
first think: there are multi-wave studies spanning tens of countries, and
aggregating such data is already a complex process without adding even more
complexity on top of it.

As undocumented as they may be, or even subject to change, I think the R
internals are much more reliable than this.

Best wishes,
Adrian

-- 
Adrian Dusa
University of Bucharest
Romanian Social Data Archive
Soseaua Panduri nr. 90-92
050663 Bucharest sector 5
Romania
https://adriandusa.eu


__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] 1954 from NA

2021-05-24 Thread Bertram, Alexander via R-devel
Dear Adrian,
I just wanted to pipe in and underscore Thomas' point: the payload bits of
IEEE 754 floating point values are no place to store data that you care
about or need to keep. That is not only related to the R APIs, but also how
processors handle floating point values and signaling and non-signaling
NaNs. It is very difficult to reason about when and under which
circumstances these bits are preserved. I spent a lot of time working on
Renjin's handling of these values and I can assure that any such scheme
will end in tears.

A far, far better option is to use R's attributes to store this kind of
metadata. This is exactly what this language feature is for. There is
already a standard 'levels' attribute that holds the labels of factors like
"Yes", "No" , "Refused", "Interviewer error'' etc. In the past, I've worked
on projects where we stored an additional attribute like "missingLevels"
that stores extra metadata on which levels should be used in which kind of
analysis. That way, you can preserve all the information, and then write a
utility function which automatically applies certain logic to a whole
dataframe just before passing the data to an analysis function. This is
also important because in surveys like this, different values should be
excluded at different times. For example, you might want to include all
responses in a data quality report, but exclude interviewer error and
refusals when conducting a PCA or fitting a model.

Best,
Alex

On Mon, May 24, 2021 at 2:03 PM Adrian Dușa  wrote:

> On Mon, May 24, 2021 at 1:31 PM Tomas Kalibera 
> wrote:
>
> > [...]
> >
> > For the reasons I explained, I would be against such a change. Keeping
> the
> > data on the side, as also recommended by others on this list, would allow
> > you for a reliable implementation. I don't want to support fragile
> package
> > code building on unspecified R internals, and in this case particularly
> > internals that themselves have not stood the test of time, so are at high
> > risk of change.
> >
> I understand, and it makes sense.
> We'll have to wait for the R internals to settle (this really is
> surprising, I wonder how other software have solved this). In the meantime,
> I will probably go ahead with NaNs.
>
> Thank you again,
> Adrian
>
> [[alternative HTML version deleted]]
>
> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>


-- 
Alexander Bertram
Technical Director
*BeDataDriven BV*

Web: http://bedatadriven.com
Email: a...@bedatadriven.com
Tel. Nederlands: +31(0)647205388
Skype: akbertram


__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] 1954 from NA

2021-05-24 Thread Adrian Dușa
On Mon, May 24, 2021 at 1:31 PM Tomas Kalibera 
wrote:

> [...]
>
> For the reasons I explained, I would be against such a change. Keeping the
> data on the side, as also recommended by others on this list, would allow
> you for a reliable implementation. I don't want to support fragile package
> code building on unspecified R internals, and in this case particularly
> internals that themselves have not stood the test of time, so are at high
> risk of change.
>
I understand, and it makes sense.
We'll have to wait for the R internals to settle (this really is
surprising, I wonder how other software have solved this). In the meantime,
I will probably go ahead with NaNs.

Thank you again,
Adrian


__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] 1954 from NA

2021-05-24 Thread Greg Minshall
Adrian,

> If it was only one column then your solution is neat. But with 5-600
> variables, each of which can contain multiple missing values, to
> double this number of variables just to describe NA values seems to me
> excessive.  Not to mention we should be able to quickly convert /
> import / export from one software package to another. This would imply
> maintaining some sort of metadata reference of which explanatory
> additional factor describes which original variable.

one thing *i* should keep in mind is the old saying: "The difference
between theory and practice is that in theory there is no difference,
but in practice, there is."

but, in theory:

if you have 500 columns of possibly-NA'd variables, you could have one
column of 500 "bits", where each bit has one of N values, N being the
number of explanations the corresponding column has for why the NA
exists.

i guess the CS'y thing that comes to my mind here is that one thing is
the *semantics* of what you are trying to convey, and the other is how
those semantics are *encoded* in whatever representation you are using.
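
in R terms, one purely illustrative encoding would be a single integer
matrix of reason codes kept alongside the data (codes invented):

values  <- data.frame(q1 = c(1, NA, 5), q2 = c(NA, 2, 3))
# 0 = observed, 1 = "did not know", 2 = "refused"
reasons <- matrix(c(0L, 2L, 0L,
                    1L, 0L, 0L), ncol = 2,
                  dimnames = list(NULL, names(values)))
reasons[2, "q1"]   # why is q1 missing in row 2?  -> 2, "refused"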

cheers, Greg

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] 1954 from NA

2021-05-24 Thread Adrian Dușa
Hmm...
If it was only one column then your solution is neat. But with 5-600
variables, each of which can contain multiple missing values, to double
this number of variables just to describe NA values seems to me excessive.
Not to mention we should be able to quickly convert / import / export from
one software package to another. This would imply maintaining some sort of
metadata reference of which explanatory additional factor describes which
original variable.

All of this strikes me as a lot of hassle compared to storing some
information within a tagged NA value... I just need a little bit more bits
to play with.

Best wishes,
Adrian

On Sun, May 23, 2021 at 10:21 PM Avi Gross via R-devel <
r-devel@r-project.org> wrote:

> Arguably, R was not developed to satisfy some needs in the way intended.
>
> When I have had to work with datasets from some of the social sciences I
> have had to adapt to subtleties in how they did things with software like
> SPSS in which an NA was done using an out of bounds marker like 999 or "."
> or even a blank cell. The problem is that R has a concept where data such
> as integers or floating point numbers is not stored as text normally but in
> their own formats and a vector by definition can only contain ONE data
> type. So the various forms of NA as well as NaN and Inf had to be grafted
> on to be considered VALID to share the same storage area as if they sort of
> were an integer or floating point number or text or whatever.
>
> It does strike me as possible to simply have a column that is something
> like a factor that can contain as many NA excuses as you wish such as "NOT
> ANSWERED" to "CANNOT READ THE SQUIGLE" to "NOT SURE" to "WILL BE FILLED IN
> LATER" to "I DON'T SPEAK ENGLISH AND CANNOT ANSWER STUPID QUESTIONS". This
> additional column would presumably only have content when the other column
> has an NA. Your queries and other changes would work on something like a
> data.frame where both such columns coexisted.
>
> Note reading in data with multiple NA reasons may take extra work. If your
> error codes are text, it will all become text. If the errors are 999 and
> 998 and 997, it may all be treated as numeric and you may not want to
> convert all such codes to an NA immediately. Rather, you would use the
> first vector/column to make the second vector and THEN replace everything
> that should be an NA with an actual NA and reparse the entire vector to
> become properly numeric unless you like working with text and will convert
> to numbers as needed on the fly.
>
> Now this form of annotation may not be pleasing but I suggest that an
> implementation that does allow annotation may use up space too. Of course,
> if your NA values are rare and space is only used then, you might save
> space. But if you could make a factor column and have it use the smallest
> int it can get as a basis, it may be a way to save on space.
>
> People who have done work with R, especially those using the tidyverse,
> are quite used to using one column to explain another. So if you are asked
> to say tabulate what percent of missing values are due to reasons A/B/C
> then the added columns works fine for that calculation too.
>

-- 
Adrian Dusa
University of Bucharest
Romanian Social Data Archive
Soseaua Panduri nr. 90-92
050663 Bucharest sector 5
Romania
https://adriandusa.eu


__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] 1954 from NA

2021-05-24 Thread Tomas Kalibera


On 5/24/21 11:46 AM, Adrian Dușa wrote:
> On Sun, May 23, 2021 at 10:14 PM Tomas Kalibera 
> <tomas.kalib...@gmail.com> wrote:
>
> [...]
>
> Good, but unfortunately the delineation between computation and
> non-computation is not always transparent. Even if an operation
> doesn't look like "computation" on the high-level, it may
> internally involve computation - so, really, an R NA can become R
> NaN and vice versa, at any point (this is not a "feature", but it
> is how things are now).
>
>
> I see.
> Well, this is a risk we'll have to consider when the time comes. For 
> the moment, storing some metadata within the payload seems to work.
>
>> [...]
>
> Ok, then I would probably keep the meta-data on the missing values
> on the side to implement such missing values in such code, and
> treat them explicitly in supported operations.
>
> But, in principle, you can use the floating-point NaN payloads,
> and you can pass such values to R. You just need to be prepared
> that not only would you lose your payloads/tags, but also the
> difference between R NA and R NaNs. Thanks to value semantics of
> R, you would not lose the tags in input values with proper
> reference counts (e.g. marked immutable), because those values
> will not be modified.
>
> NaNs are fine of course, but then some (social science?) users might 
> get confused about the difference between NAs and NaNs, and for this 
> reason only I would still like to preserve the 1954 payload.
> If at all possible, however, the extra 16 bits from this payload would 
> make a whole lot of a difference.
>
> Please forgive my persistence, but would it be possible to use an 
> unsigned short instead of an unsigned int for the 1954 payload?
> That is, if it doesn't break anything, but I don't really see what it 
> could. The corresponding check function seems to work just fine and it 
> doesn't need to be changed at all:
>
> int R_IsNA(double x)
> {
>     if (isnan(x)) {
>         ieee_double y;
>         y.value = x;
>         return (y.word[lw] == 1954);
>     }
>     return 0;
> }

For the reasons I explained, I would be against such a change. Keeping 
the data on the side, as also recommended by others on this list, would 
allow you for a reliable implementation. I don't want to support fragile 
package code building on unspecified R internals, and in this case 
particularly internals that themselves have not stood the test of time, 
so are at high risk of change.

Best
Tomas

>
> Best wishes,
> Adrian
>
>
>


__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] 1954 from NA

2021-05-24 Thread Adrian Dușa
On Sun, May 23, 2021 at 10:14 PM Tomas Kalibera 
wrote:

> [...]
>
> Good, but unfortunately the delineation between computation and
> non-computation is not always transparent. Even if an operation doesn't
> look like "computation" on the high-level, it may internally involve
> computation - so, really, an R NA can become R NaN and vice versa, at any
> point (this is not a "feature", but it is how things are now).
>

I see.
Well, this is a risk we'll have to consider when the time comes. For the
moment, storing some metadata within the payload seems to work.



> [...]
>
> Ok, then I would probably keep the meta-data on the missing values on the
> side to implement such missing values in such code, and treat them
> explicitly in supported operations.
>
> But, in principle, you can use the floating-point NaN payloads, and you
> can pass such values to R. You just need to be prepared that not only would
> you lose your payloads/tags, but also the difference between R NA and R
> NaNs. Thanks to value semantics of R, you would not lose the tags in input
> values with proper reference counts (e.g. marked immutable), because those
> values will not be modified.
>
NaNs are fine of course, but then some (social science?) users might get
confused about the difference between NAs and NaNs, and for this reason
only I would still like to preserve the 1954 payload.
If at all possible, however, the extra 16 bits from this payload would make
a whole lot of a difference.

Please forgive my persistence, but would it be possible to use an unsigned
short instead of an unsigned int for the 1954 payload?
That is, if it doesn't break anything, but I don't really see what it
could. The corresponding check function seems to work just fine and it
doesn't need to be changed at all:

int R_IsNA(double x)
{
    if (isnan(x)) {
        ieee_double y;
        y.value = x;
        return (y.word[lw] == 1954);
    }
    return 0;
}
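
The same check can be approximated from the R level by inspecting the raw
bytes of NA_real_ - a sketch assuming a little-endian platform:

words <- readBin(writeBin(NA_real_, raw()), what = "integer", n = 2)
words[1]                      # 1954, the payload in the low word
sprintf("0x%08X", words[2])   # "0x7FF00000", the NaN exponent bits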

Best wishes,
Adrian


__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] 1954 from NA

2021-05-23 Thread Greg Minshall
+1

Avi Gross via R-devel  wrote:

> Arguably, R was not developed to satisfy some needs in the way intended.
> 
> When I have had to work with datasets from some of the social sciences I have 
> had to adapt to subtleties in how they did things with software like SPSS in 
> which an NA was done using an out of bounds marker like 999 or "." or even a 
> blank cell. The problem is that R has a concept where data such as integers 
> or floating point numbers is not stored as text normally but in their own 
> formats and a vector by definition can only contain ONE data type. So the 
> various forms of NA as well as NaN and Inf had to be grafted on to be 
> considered VALID to share the same storage area as if they sort of were an 
> integer or floating point number or text or whatever.
> 
> It does strike me as possible to simply have a column that is something like 
> a factor that can contain as many NA excuses as you wish such as "NOT 
> ANSWERED" to "CANNOT READ THE SQUIGLE" to "NOT SURE" to "WILL BE FILLED IN 
> LATER" to "I DON'T SPEAK ENGLISH AND CANNOT ANSWER STUPID QUESTIONS". This 
> additional column would presumably only have content when the other column 
> has an NA. Your queries and other changes would work on something like a 
> data.frame where both such columns coexisted.
> 
> Note reading in data with multiple NA reasons may take extra work. If your 
> error codes are text, it will all become text. If the errors are 999 and 998 
> and 997, it may all be treated as numeric and you may not want to convert all 
> such codes to an NA immediately. Rather, you would use the first 
> vector/column to make the second vector and THEN replace everything that 
> should be an NA with an actual NA and reparse the entire vector to become 
> properly numeric unless you like working with text and will convert to 
> numbers as needed on the fly.
> 
> Now this form of annotation may not be pleasing but I suggest that an 
> implementation that does allow annotation may use up space too. Of course, if 
> your NA values are rare and space is only used then, you might save space. 
> But if you could make a factor column and have it use the smallest int it can 
> get as a basis, it may be a way to save on space.
> 
> People who have done work with R, especially those using the tidyverse, are 
> quite used to using one column to explain another. So if you are asked to say 
> tabulate what percent of missing values are due to reasons A/B/C then the 
> added columns works fine for that calculation too.
> 
> 
> -Original Message-
From: R-devel  On Behalf Of Adrian Dușa
> Sent: Sunday, May 23, 2021 2:04 PM
> To: Tomas Kalibera 
> Cc: r-devel 
> Subject: Re: [Rd] 1954 from NA
> 
> Dear Tomas,
> 
> I understand that perfectly, but that is fine.
> The payload is not going to be used in any computations anyways, it is 
> strictly an information carrier that differentiates between different types 
> of (tagged) NA values.
> 
> Having only one NA value in R is extremely limiting for the social sciences, 
> where multiple missing values may exist, because respondents:
> - did not know what to respond, or
> - did not want to respond, or perhaps
> - the question did not apply in a given situation etc.
> 
> All of these need to be captured, stored, and most importantly treated as if 
> they would be regular missing values. Whether the payload might be lost in 
> computations makes no difference: they were supposed to be "missing values" 
> anyways.
> 
> The original question is how the payload is currently stored: as an unsigned 
> int of 32 bits, or as an unsigned short of 16 bits. If the R internals would 
> not be affected (and I see no reason why they would be), it would allow an 
> entire universe for the social sciences that is not currently available and 
> which all other major statistical packages do offer.
> 
> Thank you very much, your attention is greatly appreciated, Adrian
> 
> On Sun, May 23, 2021 at 7:59 PM Tomas Kalibera 
> wrote:
> 
> > TLDR: tagging R NAs is not possible.
> >
> > External software should not depend on how R currently implements NA, 
> > this may change at any time. Tagging of NA is not supported in R (if 
> > it were, it would have been documented). It would not be possible to 
> > implement such tagging reliably with the current implementation of NA in R.
> >
> > NaN payload propagation is not standardized. Compilers are free to and 
> > do optimize code not preserving/achieving any specific propagation.
> > CPUs/FPUs differ in how they propagate in binary operations, some zero 
> > the payload on any operation. Virtual

Re: [Rd] 1954 from NA

2021-05-23 Thread Avi Gross via R-devel
Arguably, R was not developed to satisfy some needs in the way intended.

When I have had to work with datasets from some of the social sciences I have 
had to adapt to subtleties in how they did things with software like SPSS in 
which an NA was done using an out of bounds marker like 999 or "." or even a 
blank cell. The problem is that R has a concept where data such as integers or 
floating point numbers is not stored as text normally but in their own formats 
and a vector by definition can only contain ONE data type. So the various forms 
of NA as well as NaN and Inf had to be grafted on to be considered VALID to 
share the same storage area as if they sort of were an integer or floating 
point number or text or whatever.

It does strike me as possible to simply have a column that is something like a 
factor that can contain as many NA excuses as you wish such as "NOT ANSWERED" 
to "CANNOT READ THE SQUIGLE" to "NOT SURE" to "WILL BE FILLED IN LATER" to "I 
DON'T SPEAK ENGLISH AND CANNOT ANSWER STUPID QUESTIONS". This additional column 
would presumably only have content when the other column has an NA. Your 
queries and other changes would work on something like a data.frame where both 
such columns coexisted.

Note reading in data with multiple NA reasons may take extra work. If your 
error codes are text, it will all become text. If the errors are 999 and 998 
and 997, it may all be treated as numeric and you may not want to convert all 
such codes to an NA immediately. Rather, you would use the first vector/column 
to make the second vector and THEN replace everything that should be an NA with 
an actual NA and reparse the entire vector to become properly numeric unless 
you like working with text and will convert to numbers as needed on the fly.
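
A minimal sketch of that two-step read-in (codes and labels invented):

raw_q1 <- c(4, 999, 2, 998)                       # as read from the file
labels <- c("997" = "NOT SURE", "998" = "REFUSED", "999" = "NOT ANSWERED")
q1_reason <- factor(labels[as.character(raw_q1)], levels = labels)
q1 <- ifelse(raw_q1 %in% 997:999, NA, raw_q1)     # now a clean numeric
data.frame(q1, q1_reason)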

Now this form of annotation may not be pleasing but I suggest that an 
implementation that does allow annotation may use up space too. Of course, if 
your NA values are rare and space is only used then, you might save space. But 
if you could make a factor column and have it use the smallest int it can get 
as a basis, it may be a way to save on space.

People who have done work with R, especially those using the tidyverse, are 
quite used to using one column to explain another. So if you are asked to say 
tabulate what percent of missing values are due to reasons A/B/C then the added 
columns works fine for that calculation too.


-Original Message-
From: R-devel  On Behalf Of Adrian Dușa
Sent: Sunday, May 23, 2021 2:04 PM
To: Tomas Kalibera 
Cc: r-devel 
Subject: Re: [Rd] 1954 from NA

Dear Tomas,

I understand that perfectly, but that is fine.
The payload is not going to be used in any computations anyways, it is strictly 
an information carrier that differentiates between different types of (tagged) 
NA values.

Having only one NA value in R is extremely limiting for the social sciences, 
where multiple missing values may exist, because respondents:
- did not know what to respond, or
- did not want to respond, or perhaps
- the question did not apply in a given situation etc.

All of these need to be captured, stored, and most importantly treated as if 
they would be regular missing values. Whether the payload might be lost in 
computations makes no difference: they were supposed to be "missing values" 
anyways.

The original question is how the payload is currently stored: as an unsigned 
int of 32 bits, or as an unsigned short of 16 bits. If the R internals would 
not be affected (and I see no reason why they would be), it would allow an 
entire universe for the social sciences that is not currently available and 
which all other major statistical packages do offer.

Thank you very much, your attention is greatly appreciated, Adrian

On Sun, May 23, 2021 at 7:59 PM Tomas Kalibera 
wrote:

> TLDR: tagging R NAs is not possible.
>
> External software should not depend on how R currently implements NA, 
> this may change at any time. Tagging of NA is not supported in R (if 
> it were, it would have been documented). It would not be possible to 
> implement such tagging reliably with the current implementation of NA in R.
>
> NaN payload propagation is not standardized. Compilers are free to and 
> do optimize code not preserving/achieving any specific propagation.
> CPUs/FPUs differ in how they propagate in binary operations, some zero 
> the payload on any operation. Virtualized environments, binary 
> translations, etc, may not preserve it in any way, either. ?NA has 
> disclaimers about this, an NA may become NaN (payload lost) even in 
> unary operations and also in binary operations not involving other NaN/NAs.
>
> Writing any new software that would depend on that anything specific 
> happens to the NaN payloads would not be a good idea. One can only 
> reliably use the NaN payload bits for storage, that is if one avoids any 
> computation at all, avoids passing the values to any external code 
> unaware of such tagging (including R), etc.

Re: [Rd] 1954 from NA

2021-05-23 Thread Tomas Kalibera


On 5/23/21 8:04 PM, Adrian Dușa wrote:
> Dear Tomas,
>
> I understand that perfectly, but that is fine.
> The payload is not going to be used in any computations anyways, it is 
> strictly an information carrier that differentiates between different 
> types of (tagged) NA values.
Good, but unfortunately the delineation between computation and 
non-computation is not always transparent. Even if an operation doesn't 
look like "computation" on the high-level, it may internally involve 
computation - so, really, an R NA can become R NaN and vice versa, at 
any point (this is not a "feature", but it is how things are now).
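
A small illustration of the distinction at stake (current observed 
behaviour, not a documented guarantee):

is.na(NA_real_)   # TRUE
is.nan(NA_real_)  # FALSE -- R reports its NA as NA, not as NaN
is.na(NaN)        # TRUE  -- but a plain NaN also counts as missing
is.nan(NaN)       # TRUE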
> Having only one NA value in R is extremely limiting for the social 
> sciences, where multiple missing values may exist, because respondents:
> - did not know what to respond, or
> - did not want to respond, or perhaps
> - the question did not apply in a given situation etc.
>
> All of these need to be captured, stored, and most importantly treated 
> as if they would be regular missing values. Whether the payload might 
> be lost in computations makes no difference: they were supposed to be 
> "missing values" anyways.

Ok, then to implement such missing values I would probably keep the 
meta-data about them on the side, and treat them explicitly in supported 
operations.

But, in principle, you can use the floating-point NaN payloads, and you 
can pass such values to R. You just need to be prepared that you would lose 
not only your payloads/tags, but also the difference between R NA 
and R NaNs. Thanks to the value semantics of R, you would not lose the tags 
in input values with proper reference counts (e.g. marked immutable), 
because those values will not be modified.

Best
Tomas

> The original question is how the payload is currently stored: as an 
> unsigned int of 32 bits, or as an unsigned short of 16 bits. If the R 
> internals would not be affected (and I see no reason why they would 
> be), it would allow an entire universe for the social sciences that is 
> not currently available and which all other major statistical packages 
> do offer.

>
> Thank you very much, your attention is greatly appreciated,
> Adrian
>
> On Sun, May 23, 2021 at 7:59 PM Tomas Kalibera 
> mailto:tomas.kalib...@gmail.com>> wrote:
>
> TLDR: tagging R NAs is not possible.
>
> External software should not depend on how R currently implements NA,
> this may change at any time. Tagging of NA is not supported in R
> (if it
> were, it would have been documented). It would not be possible to
> implement such tagging reliably with the current implementation of
> NA in R.
>
> NaN payload propagation is not standardized. Compilers are free to
> and
> do optimize code not preserving/achieving any specific propagation.
> CPUs/FPUs differ in how they propagate in binary operations, some
> zero
> the payload on any operation. Virtualized environments, binary
> translations, etc, may not preserve it in any way, either. ?NA has
> disclaimers about this, an NA may become NaN (payload lost) even in
> unary operations and also in binary operations not involving other
> NaN/NAs.
>
> Writing any new software that would depend on that anything specific
> happens to the NaN payloads would not be a good idea. One can only
> reliably use the NaN payload bits for storage, that is if one
> avoids any
> computation at all, avoids passing the values to any external code
> unaware of such tagging (including R), etc. If such software wants
> any
> NaN to be understood as NA by R, it would have to use the
> documented R
> API for this (so essentially translating) - but given the problems
> mentioned above, there is really no point in doing that, because such
> NAs become NaNs at any time.
>
> Best
> Tomas
>
> On 5/23/21 9:56 AM, Adrian Dușa wrote:
> > Dear R devs,
> >
> > I am probably missing something obvious, but still trying to
> understand why
> > the 1954 from the definition of an NA has to fill 32 bits when
> it normally
> > doesn't need more than 16.
> >
> > Wouldn't the code below achieve exactly the same thing?
> >
> > typedef union
> > {
> >      double value;
> >      unsigned short word[4];
> > } ieee_double;
> >
> >
> > #ifdef WORDS_BIGENDIAN
> > static CONST int hw = 0;
> > static CONST int lw = 3;
> > #else  /* !WORDS_BIGENDIAN */
> > static CONST int hw = 3;
> > static CONST int lw = 0;
> > #endif /* WORDS_BIGENDIAN */
> >
> >
> > static double R_ValueOfNA(void)
> > {
> >      volatile ieee_double x;
> >      x.word[hw] = 0x7ff0;
> >      x.word[lw] = 1954;
> >      return x.value;
> > }
> >
> > This question has to do with the tagged NA values from package
> haven, on
> > which I want to improve. Every available bit counts, especially if 
> > multi-byte characters are going to be involved.

Re: [Rd] 1954 from NA

2021-05-23 Thread Adrian Dușa
Dear Tomas,

I understand that perfectly, but that is fine.
The payload is not going to be used in any computations anyways, it is
strictly an information carrier that differentiates between different types
of (tagged) NA values.

Having only one NA value in R is extremely limiting for the social
sciences, where multiple missing values may exist, because respondents:
- did not know what to respond, or
- did not want to respond, or perhaps
- the question did not apply in a given situation etc.

All of these need to be captured, stored, and most importantly treated as
if they would be regular missing values. Whether the payload might be lost
in computations makes no difference: they were supposed to be "missing
values" anyways.

The original question is how the payload is currently stored: as an
unsigned int of 32 bits, or as an unsigned short of 16 bits. If the R
internals would not be affected (and I see no reason why they would be), it
would allow an entire universe for the social sciences that is not
currently available and which all other major statistical packages do offer.

Thank you very much, your attention is greatly appreciated,
Adrian

On Sun, May 23, 2021 at 7:59 PM Tomas Kalibera 
wrote:

> TLDR: tagging R NAs is not possible.
>
> External software should not depend on how R currently implements NA,
> this may change at any time. Tagging of NA is not supported in R (if it
> were, it would have been documented). It would not be possible to
> implement such tagging reliably with the current implementation of NA in R.
>
> NaN payload propagation is not standardized. Compilers are free to and
> do optimize code not preserving/achieving any specific propagation.
> CPUs/FPUs differ in how they propagate in binary operations, some zero
> the payload on any operation. Virtualized environments, binary
> translations, etc, may not preserve it in any way, either. ?NA has
> disclaimers about this, an NA may become NaN (payload lost) even in
> unary operations and also in binary operations not involving other NaN/NAs.
>
> Writing any new software that would depend on that anything specific
> happens to the NaN payloads would not be a good idea. One can only
> reliably use the NaN payload bits for storage, that is if one avoids any
> computation at all, avoids passing the values to any external code
> unaware of such tagging (including R), etc. If such software wants any
> NaN to be understood as NA by R, it would have to use the documented R
> API for this (so essentially translating) - but given the problems
> mentioned above, there is really no point in doing that, because such
> NAs become NaNs at any time.
>
> Best
> Tomas
>
> On 5/23/21 9:56 AM, Adrian Dușa wrote:
> > Dear R devs,
> >
> > I am probably missing something obvious, but still trying to understand
> why
> > the 1954 from the definition of an NA has to fill 32 bits when it
> normally
> > doesn't need more than 16.
> >
> > Wouldn't the code below achieve exactly the same thing?
> >
> > typedef union
> > {
> >  double value;
> >  unsigned short word[4];
> > } ieee_double;
> >
> >
> > #ifdef WORDS_BIGENDIAN
> > static CONST int hw = 0;
> > static CONST int lw = 3;
> > #else  /* !WORDS_BIGENDIAN */
> > static CONST int hw = 3;
> > static CONST int lw = 0;
> > #endif /* WORDS_BIGENDIAN */
> >
> >
> > static double R_ValueOfNA(void)
> > {
> >  volatile ieee_double x;
> >  x.word[hw] = 0x7ff0;
> >  x.word[lw] = 1954;
> >  return x.value;
> > }
> >
> > This question has to do with the tagged NA values from package haven, on
> > which I want to improve. Every available bit counts, especially if
> > multi-byte characters are going to be involved.
> >
> > Best wishes,
>
> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>

[[alternative HTML version deleted]]

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] 1954 from NA

2021-05-23 Thread Tomas Kalibera

TLDR: tagging R NAs is not possible.

External software should not depend on how R currently implements NA, 
this may change at any time. Tagging of NA is not supported in R (if it 
were, it would have been documented). It would not be possible to 
implement such tagging reliably with the current implementation of NA in R.


NaN payload propagation is not standardized. Compilers are free to and 
do optimize code not preserving/achieving any specific propagation. 
CPUs/FPUs differ in how they propagate in binary operations, some zero 
the payload on any operation. Virtualized environments, binary 
translations, etc, may not preserve it in any way, either. ?NA has 
disclaimers about this, an NA may become NaN (payload lost) even in 
unary operations and also in binary operations not involving other NaN/NAs.


Writing any new software that would depend on that anything specific 
happens to the NaN payloads would not be a good idea. One can only 
reliably use the NaN payload bits for storage, that is if one avoids any 
computation at all, avoids passing the values to any external code 
unaware of such tagging (including R), etc. If such software wants any 
NaN to be understood as NA by R, it would have to use the documented R 
API for this (so essentially translating) - but given the problems 
mentioned above, there is really no point in doing that, because such 
NAs become NaNs at any time.


Best
Tomas

On 5/23/21 9:56 AM, Adrian Dușa wrote:

Dear R devs,

I am probably missing something obvious, but still trying to understand why
the 1954 from the definition of an NA has to fill 32 bits when it normally
doesn't need more than 16.

Wouldn't the code below achieve exactly the same thing?

typedef union
{
 double value;
 unsigned short word[4];
} ieee_double;


#ifdef WORDS_BIGENDIAN
static CONST int hw = 0;
static CONST int lw = 3;
#else  /* !WORDS_BIGENDIAN */
static CONST int hw = 3;
static CONST int lw = 0;
#endif /* WORDS_BIGENDIAN */


static double R_ValueOfNA(void)
{
 volatile ieee_double x;
 x.word[hw] = 0x7ff0;
 x.word[lw] = 1954;
 return x.value;
}

This question has to do with the tagged NA values from package haven, on
which I want to improve. Every available bit counts, especially if
multi-byte characters are going to be involved.

Best wishes,


__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] 1954 from NA

2021-05-23 Thread brodie gaslam via R-devel
> On Sunday, May 23, 2021, 10:45:22 AM EDT, Adrian Dușa  
> wrote:
>
> On Sun, May 23, 2021 at 4:33 PM brodie gaslam via R-devel 
>  wrote:
> > I should add, I don't know that you can rely on this
> > particular encoding of R's NA.  If I were trying to restore
> > an NA from some external format, I would just generate an
> > R NA via e.g NA_real_ in the R session I'm restoring the
> > external data into, and not try to hand assemble one.
>
> Thanks for your answer, Brodie, especially on Sunday (much appreciated).

Maybe I shouldn't answer on Sunday given I've said several wrong things...

> The aim is not to reconstruct an NA, but to "tag" an NA (and yes, I was
> referring to an NA_real_ of course), as seen in action here:
> https://github.com/tidyverse/haven/blob/master/src/tagged_na.c
>
> That code:
> - preserves the first part 0x7ff0
> - preserves the last part 1954
> - adds one additional byte to store (tag) a character provided in the SEXP 
> vector
>
> That is precisely my understanding, that doubles starting with 0x7ff are
> all NaNs. My question was related to the additional part 1954 from the
> low bits: why does it need 32 bits?

It probably doesn't need 32 bits.  The code is trying to set all 64 bits.
It seems natural to do the high 32 bits, and then the low.  But I'm not R
Core so don't listen to me too closely.

> The binary value of 1954 is 11110100010, which is represented by 11 bits
> occupying at most 2 bytes... So why does it need 4 bytes?
>
> Re. the possible overflow, I am not sure: 0x7ff0 is the decimal 32752,
> or the binary 111111111110000.

You are right, I had a moment and wrongly counted hex digits as bytes
instead of half-bytes.

> That is just about enough to fit in the available 16 bits (actually 15
> to leave one for the sign bit), so I don't really understand why it
> would. And in any case, the union definition uses an unsigned short
> which (if my understanding is correct) should certainly not overflow:
>
> typedef union
> {
> double value;
> unsigned short word[4];
> } ieee_double;
>
> What is gained with this proposal: 16 additional bits to do something
> with. For the moment, only 16 are available (from the lower part of the
> high 32 bits). If the value 1954 would be checked as a short instead of
> an int, the other 16 bits would become available. And those bits could
> be extremely valuable to tag multi-byte characters, for instance, but
> also higher numbers than 32767.

Note that the stability of the payload portion of NaNs is questionable:

https://developer.r-project.org/Blog/public/2020/11/02/will-r-work-on-apple-silicon/#nanan-payload-propagation

Also, if I understand correctly, you would be asking R core to formalize
the internal representation of the R NA, which I don't think is public?
So that you can use those internal bits for your own purposes with a
guarantee that R will not disturb them?  Obviously only they can answer
that.

Apologies for confusing the issue.

B,

PS: the other obviously wrong thing I said was that the NA was 0x7ff0 & 1954 
packed into 32 bits, when it is really 0x7ff0 in the high 16 bits and 1954 
in the low 32 bits of the full 64-bit value.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] 1954 from NA

2021-05-23 Thread Adrian Dușa
On Sun, May 23, 2021 at 4:33 PM brodie gaslam via R-devel <
r-devel@r-project.org> wrote:

> I should add, I don't know that you can rely on this
> particular encoding of R's NA.  If I were trying to restore
> an NA from some external format, I would just generate an
> R NA via e.g NA_real_ in the R session I'm restoring the
> external data into, and not try to hand assemble one.
>

Thanks for your answer, Brodie, especially on Sunday (much appreciated).
The aim is not to reconstruct an NA, but to "tag" an NA (and yes, I was
referring to an NA_real_ of course), as seen in action here:
https://github.com/tidyverse/haven/blob/master/src/tagged_na.c

That code:
- preserves the first part 0x7ff0
- preserves the last part 1954
- adds one additional byte to store (tag) a character provided in the SEXP
vector
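
For context, the user-facing side of that mechanism (assuming the CRAN 
version of haven; tagged_na() and na_tag() behave as below at the time of 
writing):

library(haven)
x <- c(1, tagged_na("a"), tagged_na("b"), NA)
is.na(x)    # FALSE TRUE TRUE TRUE -- the tags ride along invisibly
na_tag(x)   # NA "a" "b" NA -- recover the single-character tags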

That is precisely my understanding, that doubles starting with 0x7ff are
all NaNs. My question was related to the additional part 1954 from the low
bits: why does it need 32 bits?

The binary value of 1954 is 11110100010, which is represented by 11 bits
occupying at most 2 bytes... So why does it need 4 bytes?

Re. the possible overflow, I am not sure: 0x7ff0 is the decimal 32752, or
the binary 111111111110000.
That is just about enough to fit in the available 16 bits (actually 15 to
leave one for the sign bit), so I don't really understand why it would. And
in any case, the union definition uses an unsigned short which (if my
understanding is correct) should certainly not overflow:

typedef union
{
double value;
unsigned short word[4];
} ieee_double;

What is gained with this proposal: 16 additional bits to do something with.
For the moment, only 16 are available (from the lower part of the high 32
bits). If the value 1954 would be checked as a short instead of an int, the
other 16 bits would become available. And those bits could be extremely
valuable to tag multi-byte characters, for instance, but also higher
numbers than 32767.

Best wishes,
Adrian

[[alternative HTML version deleted]]

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] 1954 from NA

2021-05-23 Thread Mark van der Loo
I wrote about this once over here:
http://www.markvanderloo.eu/yaRb/2012/07/08/representation-of-numerical-nas-in-r-and-the-1954-enigma/

-M



On Sun, 23 May 2021, 15:33 brodie gaslam via R-devel <
r-devel@r-project.org> wrote:

> I should add, I don't know that you can rely on this
> particular encoding of R's NA.  If I were trying to restore
> an NA from some external format, I would just generate an
> R NA via e.g NA_real_ in the R session I'm restoring the
> external data into, and not try to hand assemble one.
>
> Best,
>
> B.
>
>
> On Sunday, May 23, 2021, 9:23:54 AM EDT, brodie gaslam via R-devel <
> r-devel@r-project.org> wrote:
>
>
>
>
>
> This is because the NA in question is NA_real_, which
> is encoded in double precision IEEE-754, which uses
> 64 bits.  The "1954" is just part of the NA.  The NA
> must also conform to the NaN encoding for double precision
> numbers, which requires that the "beginning" portion of
> the number be "0x7ff0" (well, I think it should be "0x7ff8"
> but that's a different story), as you can see here:
>
> x.word[hw] = 0x7ff0;
> x.word[lw] = 1954;
>
> Both those components are part of the same double precision
> value.  They are just accessed this way to make it easy to
> set the high bits (63-32) and the low bits (31-0).
>
> So NA is not just 1954, it's 0x7ff0 & 1954 (note I'm
> mixing hex and decimals here).
>
> In IEEE 754 double precision encoding numbers that start
> with 0x7ff are all NaNs.  The rest of the number except for
> the first bit which designates "quiet" vs "signaling" NaNs can
> be anything.  R has taken advantage of that to designate the
> R NA by setting the lower bits to be 1954.
>
> Note I'm being pretty loose about endianness, etc. here, but
> hopefully this conveys the problem.
>
> In terms of your proposal, I'm not entirely sure what you gain.
> You're still attempting to generate a 64 bit representation
> in the end.  If all you need is to encode the fact that there
> was an NA, and restore it later as a 64 bit NA, then you can do
> whatever you want so long as the end result conforms to the
> expected encoding.
>
> In terms of using 'short' here (which again, I don't see the
> need for as you're using it to generate the final 64 bit encoding),
> I see two possible problems.  You're adding the dependency that
> short will be 16 bits.  We already have the (implicit) assumption
> in R that double is 64 bits, and explicit that int is 32 bits.
> But I think you'd be going a bit on a limb assuming that short
> is 16 bits (not sure).  More important, if short is indeed 16 bits,
> I think in:
>
> x.word[hw] = 0x7ff0;
>
> You overflow short.
>
> Best,
>
> B.
>
>
>
> On Sunday, May 23, 2021, 8:56:18 AM EDT, Adrian Dușa <
> dusa.adr...@unibuc.ro> wrote:
>
>
>
>
>
> Dear R devs,
>
> I am probably missing something obvious, but still trying to understand why
> the 1954 from the definition of an NA has to fill 32 bits when it normally
> doesn't need more than 16.
>
> Wouldn't the code below achieve exactly the same thing?
>
> typedef union
> {
> double value;
> unsigned short word[4];
> } ieee_double;
>
>
> #ifdef WORDS_BIGENDIAN
> static CONST int hw = 0;
> static CONST int lw = 3;
> #else  /* !WORDS_BIGENDIAN */
> static CONST int hw = 3;
> static CONST int lw = 0;
> #endif /* WORDS_BIGENDIAN */
>
>
> static double R_ValueOfNA(void)
> {
> volatile ieee_double x;
> x.word[hw] = 0x7ff0;
> x.word[lw] = 1954;
> return x.value;
> }
>
> This question has to do with the tagged NA values from package haven, on
> which I want to improve. Every available bit counts, especially if
> multi-byte characters are going to be involved.
>
> Best wishes,
> --
> Adrian Dusa
> University of Bucharest
> Romanian Social Data Archive
> Soseaua Panduri nr. 90-92
> 050663 Bucharest sector 5
> Romania
> https://adriandusa.eu
>
> [[alternative HTML version deleted]]
>
> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>
>
> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>
> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>

[[alternative HTML version deleted]]

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] 1954 from NA

2021-05-23 Thread brodie gaslam via R-devel
I should add, I don't know that you can rely on this
particular encoding of R's NA.  If I were trying to restore
an NA from some external format, I would just generate an
R NA via e.g NA_real_ in the R session I'm restoring the 
external data into, and not try to hand assemble one.

Best,

B.


On Sunday, May 23, 2021, 9:23:54 AM EDT, brodie gaslam via R-devel 
 wrote: 





This is because the NA in question is NA_real_, which
is encoded in double precision IEEE-754, which uses
64 bits.  The "1954" is just part of the NA.  The NA
must also conform to the NaN encoding for double precision
numbers, which requires that the "beginning" portion of
the number be "0x7ff0" (well, I think it should be "0x7ff8"
but that's a different story), as you can see here:

    x.word[hw] = 0x7ff0;
    x.word[lw] = 1954;

Both those components are part of the same double precision
value.  They are just accessed this way to make it easy to
set the high bits (63-32) and the low bits (31-0).

So NA is not just 1954, it's 0x7ff0 & 1954 (note I'm
mixing hex and decimals here).

In IEEE 754 double precision encoding numbers that start
with 0x7ff are all NaNs.  The rest of the number except for
the first bit which designates "quiet" vs "signaling" NaNs can
be anything.  R has taken advantage of that to designate the
R NA by setting the lower bits to be 1954.

Note I'm being pretty loose about endianness, etc. here, but
hopefully this conveys the problem.

In terms of your proposal, I'm not entirely sure what you gain.
You're still attempting to generate a 64 bit representation
in the end.  If all you need is to encode the fact that there
was an NA, and restore it later as a 64 bit NA, then you can do
whatever you want so long as the end result conforms to the
expected encoding.

In terms of using 'short' here (which again, I don't see the
need for as you're using it to generate the final 64 bit encoding),
I see two possible problems.  You're adding the dependency that
short will be 16 bits.  We already have the (implicit) assumption
in R that double is 64 bits, and explicit that int is 32 bits.
But I think you'd be going a bit on a limb assuming that short
is 16 bits (not sure).  More important, if short is indeed 16 bits,
I think in:

    x.word[hw] = 0x7ff0;

You overflow short.

Best,

B.



On Sunday, May 23, 2021, 8:56:18 AM EDT, Adrian Dușa  
wrote: 





Dear R devs,

I am probably missing something obvious, but still trying to understand why
the 1954 from the definition of an NA has to fill 32 bits when it normally
doesn't need more than 16.

Wouldn't the code below achieve exactly the same thing?

typedef union
{
    double value;
    unsigned short word[4];
} ieee_double;


#ifdef WORDS_BIGENDIAN
static CONST int hw = 0;
static CONST int lw = 3;
#else  /* !WORDS_BIGENDIAN */
static CONST int hw = 3;
static CONST int lw = 0;
#endif /* WORDS_BIGENDIAN */


static double R_ValueOfNA(void)
{
    volatile ieee_double x;
    x.word[hw] = 0x7ff0;
    x.word[lw] = 1954;
    return x.value;
}

This question has to do with the tagged NA values from package haven, on
which I want to improve. Every available bit counts, especially if
multi-byte characters are going to be involved.

Best wishes,
-- 
Adrian Dusa
University of Bucharest
Romanian Social Data Archive
Soseaua Panduri nr. 90-92
050663 Bucharest sector 5
Romania
https://adriandusa.eu

    [[alternative HTML version deleted]]

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] 1954 from NA

2021-05-23 Thread brodie gaslam via R-devel
This is because the NA in question is NA_real_, which
is encoded in double precision IEEE-754, which uses
64 bits.  The "1954" is just part of the NA.  The NA
must also conform to the NaN encoding for double precision
numbers, which requires that the "beginning" portion of
the number be "0x7ff0" (well, I think it should be "0x7ff8"
but that's a different story), as you can see here:

    x.word[hw] = 0x7ff0;
    x.word[lw] = 1954;

Both those components are part of the same double precision
value.  They are just accessed this way to make it easy to
set the high bits (63-32) and the low bits (31-0).

So NA is not just 1954, it's 0x7ff0 & 1954 (note I'm
mixing hex and decimals here).

In IEEE 754 double precision encoding numbers that start
with 0x7ff are all NaNs.  The rest of the number except for
the first bit which designates "quiet" vs "signaling" NaNs can
be anything.  R has taken advantage of that to designate the
R NA by setting the lower bits to be 1954.
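
One quick way to see those bits from an R session (a sketch: writeBin() 
emits bytes in platform order, so the rev() below assumes a little-endian 
machine, and none of this is documented API):

rev(writeBin(NA_real_, raw()))
# [1] 7f f0 00 00 00 00 07 a2
#     0x7ff0 high word ... 0x07a2 == 1954 in the low bits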

Note I'm being pretty loose about endianness, etc. here, but
hopefully this conveys the problem.

In terms of your proposal, I'm not entirely sure what you gain.
You're still attempting to generate a 64 bit representation
in the end.  If all you need is to encode the fact that there
was an NA, and restore it later as a 64 bit NA, then you can do
whatever you want so long as the end result conforms to the
expected encoding.

In terms of using 'short' here (which again, I don't see the
need for as you're using it to generate the final 64 bit encoding),
I see two possible problems.  You're adding the dependency that
short will be 16 bits.  We already have the (implicit) assumption
in R that double is 64 bits, and explicit that int is 32 bits.
But I think you'd be going a bit on a limb assuming that short
is 16 bits (not sure).  More important, if short is indeed 16 bits,
I think in:

    x.word[hw] = 0x7ff0;

You overflow short.

Best,

B.



On Sunday, May 23, 2021, 8:56:18 AM EDT, Adrian Dușa  
wrote: 





Dear R devs,

I am probably missing something obvious, but still trying to understand why
the 1954 from the definition of an NA has to fill 32 bits when it normally
doesn't need more than 16.

Wouldn't the code below achieve exactly the same thing?

typedef union
{
    double value;
    unsigned short word[4];
} ieee_double;


#ifdef WORDS_BIGENDIAN
static CONST int hw = 0;
static CONST int lw = 3;
#else  /* !WORDS_BIGENDIAN */
static CONST int hw = 3;
static CONST int lw = 0;
#endif /* WORDS_BIGENDIAN */


static double R_ValueOfNA(void)
{
    volatile ieee_double x;
    x.word[hw] = 0x7ff0;
    x.word[lw] = 1954;
    return x.value;
}

This question has to do with the tagged NA values from package haven, on
which I want to improve. Every available bit counts, especially if
multi-byte characters are going to be involved.

Best wishes,
-- 
Adrian Dusa
University of Bucharest
Romanian Social Data Archive
Soseaua Panduri nr. 90-92
050663 Bucharest sector 5
Romania
https://adriandusa.eu

    [[alternative HTML version deleted]]

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel