Re: [Rd] [External] Re: New pipe operator

2020-12-05 Thread Avi Gross via R-devel
Luke and others,

Can anyone comment on how this new pipe operator will interoperate with 
existing pipe methods or packages like the tidyverse that currently do things 
using them?

What differences might it make for efficiency? For example, making an anonymous 
function just so you can call another function and pass along the results to 
somewhere other than the first argument sounds like extra overhead. But the 
anonymous function does provide some interesting scenarios that allow things 
like a sort of tee functionality that may print/graph a mid-stream result as 
well as pass it along the pipeline or do amusing things like apply multiple 
steps to the data and perhaps concatenate some of the results in the output. Of 
course, that can be done now with a non-anonymous function.
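The tee-like stage just described can be sketched with the forthcoming base pipe and an anonymous function. This is a sketch only: `tee` is a hypothetical helper, and the `|>` and `\(x)` syntax assumes R >= 4.1.

```r
# Hypothetical tee() helper: apply a side-effect function f to the value,
# then pass the value along the pipeline unchanged.
tee <- function(x, f = print) { f(x); invisible(x) }

fit <- mtcars |>
  subset(cyl == 4) |>
  tee(\(d) cat("rows mid-stream:", nrow(d), "\n")) |>
  (\(d) lm(mpg ~ disp, data = d))()
```

The anonymous-function call at the end is the non-placeholder way to pass the left-hand side somewhere other than the first argument.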

Perhaps we should name one version (or the other) a pipette or a pipe dream 😉

The name "pipe" feels better in the new version as "|>" has the UNIX pipe 
symbol "|" in it. 


-Original Message-
From: R-devel  On Behalf Of 
luke-tier...@uiowa.edu
Sent: Friday, December 4, 2020 9:11 PM
To: Duncan Murdoch 
Cc: r-devel@r-project.org
Subject: Re: [Rd] [External] Re: New pipe operator

On Sat, 5 Dec 2020, Duncan Murdoch wrote:

> On 04/12/2020 2:26 p.m., luke-tier...@uiowa.edu wrote:
>> On Fri, 4 Dec 2020, Dénes Tóth wrote:
>> 
>>> 
>>> On 12/4/20 3:05 PM, Duncan Murdoch wrote:
 ...
 
 It's tempting to suggest it should allow something like

 mtcars |> subset(cyl == 4) |> lm(mpg ~ disp, data = .)
 
 which would be expanded to something equivalent to the other versions: 
 but
 that makes it quite a bit more complicated.  (Maybe _ or \. should 
 be used instead of ., since those are not legal variable names.)
>>> 
>>> I support the idea of using an underscore (_) as the placeholder symbol.
>> 
>> I strongly oppose adding a placeholder. Allowing for an optional 
>> placeholder significantly complicates both implementing and 
>> explaining the semantics. For a simple syntax transformation to be 
>> viable it would also require some restrictions, such as only allowing 
>> a placeholder as a top level argument and only once. Checking that 
>> these restrictions are met, and accurately signaling when they are 
>> not with reasonable error messages, is essentially an unsolvable 
>> problem given R's semantics.
>
> I don't think you read my suggestion, but that's okay:  you're 
> maintaining it, not me.

I thought I did but maybe I missed something. You are right that supporting a 
placeholder makes things a lot more complicated. For being able to easily 
recognize the non-standard cases _ is better than . but for me at least not by 
much.
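For concreteness, the restrictions under discussion (a placeholder allowed only as a top-level argument, and only once) would separate forms like these. This is illustrative only; no placeholder existed in base R at the time of this thread, though R 4.2.0 later added `_` with essentially these restrictions.

```r
# Allowed under the discussed restrictions: placeholder appears once,
# as a top-level named argument (valid in R >= 4.2).
mtcars |> subset(cyl == 4) |> lm(mpg ~ disp, data = _)

# A form the restrictions would rule out:
# mtcars |> lm(mpg ~ disp, data = head(_))   # nested, not top level
```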

We did try a number of variations; the code is in the R-syntax branch.
At the root of that branch are two .md files with some notes as of around 
useR20. Once things settle down I may update those and look into turning them 
into a blog post.

Best,

luke

>
> Duncan Murdoch
>
>> 
>> The case where the LHS is to be passed as something other than the 
>> first argument is unusual. For me, having that case stand out by 
>> using a function expression makes it much easier to see and so makes 
>> the code easier to understand. As a wearer of progressive bifocals 
>> and someone whose screen is not always free of small dust particles, 
>> having to spot the non-standard pipe stages by seeing a placeholder, 
>> especially a . placeholder, would be a bug, not a feature.
>> 
>> Best,
>> 
>> luke
>> 
>>> Syntactic sugars work the best if 1) they require fewer 
>>> keystrokes and/or
>>> 2) are easier to read compared to the "normal" syntax, and 3) can 
>>> not lead to unexpected bugs (which is a major problem with the 
>>> magrittr pipe). Using '_'
>>> fulfills all of these criteria since '_' can not clash with any 
>>> variable in the environment.
>>> 
>>> Denes
>>> 
>>> __
>>> R-devel@r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>> 
>> 
>
>

--
Luke Tierney
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa  Phone: 319-335-3386
Department of Statistics and    Fax:   319-335-3017
Actuarial Science
241 Schaeffer Hall  email:   luke-tier...@uiowa.edu
Iowa City, IA 52242 WWW:  http://www.stat.uiowa.edu


Re: [Rd] New pipe operator

2020-12-06 Thread Avi Gross via R-devel
Naming is another whole topic.

I have seen suggestions that the current pipeline symbol used be phrased as 
THEN so

data %>% f1 %>% f2()

would be said as something like:
take data then apply f1 then f2

or some variants.

There are words other than pipe or pipeline, such as "assembly line" or 
"conveyor belt", that might fit some kinds of pipelining better than others. My 
original exposure to UNIX in the early 80's used a 
pipeline of multiple processes whose standard input and/or standard output (and 
sometimes also standard error) were redirected to an anonymous "pipe" device 
that buffered whatever (usually) text that was thrown at it and the processes 
reading and writing from it were paused and restarted as needed when data was 
ready. Problems often could be decomposed into multiple parts that had a 
solution using some program and it was not unusual to do something like:

cat *.c | grep -v ... | grep ... | sed ... | cut ... >output

Of course something like the above was often rewritten to be done within a 
single awk script or perl or whatever. You could view the above though from the 
perspective of "data" in some form, often text, being passed from one 
function(ality) to another and changing a bit each step of the way. A very 
common use of this form of pipeline was used to deal with embedded text in a 
different language in typesetting:

tbl filename | eqn | pic | troff | ...

The above would open a file, pass through all lines except those between 
markers that specified a table starting and ending. Those lines would be 
processed and transformed into the troff language equivalent. The old plus new 
lines now went to eqn which found and transformed equations similarly then to 
pic which transformed instructions it knew to image descriptions in troff and 
finally troff processed the whole mess and then off to the printer.

Clearly the above can be seen as a data pipeline using full processes as nodes.

The way R is using the pipeline may just use functions but you can imagine it 
as having similarities and differences. Current implementations may be linear 
with lazy evaluation and with every part running to completion before the next 
part starts. Every "object" is fully made, then used, then often removed as a 
temporary object. There is no buffering. But in principle, you can make 
UNIX-like pipelines using parallelism within a process too. 

Would there be scenarios where phrases like "assembly line" or "conveyor belt" 
make sense to describe the method properly? The word pipe suggests a linearity 
to some whereas conveyor belts these days also can be used to selectively shunt 
things one way or another as in assembling all parts of your order from 
different parts of a warehouse and arranging they all end up in the same 
delivery area. Making applications do that dynamically may have other names. 
Think flowchart!

Time to go do something useful.

-Original Message-
From: R-devel  On Behalf Of Hiroaki Yutani
Sent: Saturday, December 5, 2020 10:29 PM
To: Abby Spurdle 
Cc: r-devel 
Subject: Re: [Rd] New pipe operator

It is common practice to call |> a pipe (or pipeline operator) in many 
languages, including ones that recently introduced it as an experimental 
feature. Pipelines are a common feature of functional programming, not just of 
"data pipelines."
F#: 
https://docs.microsoft.com/en-us/dotnet/fsharp/language-reference/symbol-and-operator-reference/
Elixir: https://hexdocs.pm/elixir/operators.html#general-operators
Typescript:
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Operators/Pipeline_operator
Ruby: https://bugs.ruby-lang.org/issues/15799

(This blog post about the history of pipe operator might be
interesting: 
https://mamememo.blogspot.com/2019/06/a-brief-history-of-pipeline-operator.html
)

I agree this is a bit confusing for those who are familiar with other "pipe" 
concepts, but there's no other appropriate term to call |>.

2020年12月6日(日) 12:22 Gregory Warnes :
>
> If we’re being mathematically pedantic, the “pipe” operator is 
> actually function composition.
>
> That being said, pipes are a simple and well-known idiom. While being less
> than mathematically exact, it seems a reasonable label for the (very
> useful) behavior.
>
> On Sat, Dec 5, 2020 at 9:43 PM Abby Spurdle  wrote:
>
> > > This is a good addition
> >
> > I can't understand why so many people are calling this a "pipe".
> > Pipes connect processes, via their I/O streams.
> > Arguably, a more general interpretation would include sockets and files.
> >
> > https://en.wikipedia.org/wiki/Pipeline_(Unix)
> > https://en.wikipedia.org/wiki/Named_pipe
> > https://en.wikipedia.org/wiki/Anonymous_pipe
> >
> > As far as I can tell, the magrittr-like operators are functions (not 
> > pipes), with nonstandard syntax.
> > This is not consistent with R's original design philosophy, building 
> > on C, Lisp and S, along with lots of *import

Re: [Rd] New pipe operator

2020-12-06 Thread Avi Gross via R-devel
Topic is more about anonymous functions but also pipes.

Rui thought the proposed syntax was a bit ugly. I assume the \(x) ... was what 
he meant, not the function(x)... version.

Many current languages have played games on adding some form of anonymous 
function that is defined and used in place. Some go to great pains to make 
various parts optional to the point where there are many valid way to create a 
function that takes no arguments so you can leave out almost everything else as 
optional.

I admit having to type "lambda" all the time (in some languages) is not 
preferable, but in English something shorter like fun(...) or func(...) instead 
of function(...) might be more readable than the weird choice of \(. Yes, you 
can view the combination as drawing attention to the fact that the "(" is meant 
not as any old parenthesis but specifically for function invocation/definition 
purposes. But the overuse of the backslash to mean other things, such as in 
regular expressions, and of parentheses for so many purposes, makes parsing 
harder for humans. So does "|>" for the new pipe symbol, as it can also look 
like "or greater than", and since some humans omit spaces to make code even 
shorter, it can be a challenge to rapidly see a line of code as tokens.
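For comparison, the two spellings of an anonymous function are interchangeable, and spacing around |> helps keep the tokens legible (assumes R >= 4.1):

```r
# The same anonymous function, long and short forms (R >= 4.1):
sapply(1:5, function(x) x^2)  # 1 4 9 16 25
sapply(1:5, \(x) x^2)         # identical result

# With spaces, |> reads as one token rather than "or greater than":
c(3, 1, 2) |> sort() |> (\(x) x * 10)()  # 10 20 30
```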

If programming were being invented today with a larger set of symbols, it might 
use more of them and perhaps look more like APL. We might have all of the 
language's built-in tokens be single symbols, including real arrows instead of 
-> and a not-equals symbol like ≠ instead of the != or ~= that some languages 
use. In that system, what might the pipe symbol look like?

ǂ

But although making things concise is nice, sometimes there is clarity in using 
enough room to make things clear, or we might as well code in binary.

-Original Message-
From: R-devel  On Behalf Of Rui Barradas
Sent: Sunday, December 6, 2020 2:51 AM
To: Gregory Warnes ; Abby Spurdle 
Cc: r-devel 
Subject: Re: [Rd] New pipe operator

Hello,

If Hilbert liked beer, I like "pipe".

More seriously, a new addition like this one is going to cause problems yet 
unknown. But it's a good idea to have a pipe operator available. As someone 
used to magrittr's data pipelines, I will play with this base one before making 
up my mind. I don't expect its behavior to be exactly like magrittr "%>%" (and 
it's not). For the moment all I can say is that it is something R users are 
used to and that it now avoids loading a package.
As for the new way to define anonymous functions, I am less sure. Too much 
syntactic sugar? Or am I finding the syntax ugly?

Hope this helps,

Rui Barradas


Às 03:22 de 06/12/20, Gregory Warnes escreveu:
> If we’re being mathematically pedantic, the “pipe” operator is
> actually function composition.
> That being said, pipes are a simple and well-known idiom. While being
> less than mathematically exact, it seems a reasonable label for the
> (very useful) behavior.
> 
> On Sat, Dec 5, 2020 at 9:43 PM Abby Spurdle  wrote:
> 
>>> This is a good addition
>>
>> I can't understand why so many people are calling this a "pipe".
>> Pipes connect processes, via their I/O streams.
>> Arguably, a more general interpretation would include sockets and files.
>>
>> https://en.wikipedia.org/wiki/Pipeline_(Unix)
>> https://en.wikipedia.org/wiki/Named_pipe
>> https://en.wikipedia.org/wiki/Anonymous_pipe
>>
>> As far as I can tell, the magrittr-like operators are functions (not 
>> pipes), with nonstandard syntax.
>> This is not consistent with R's original design philosophy, building 
>> on C, Lisp and S, along with lots of *important* math and stats.
>>
>> It's possible that some parties are interested in creating a kind of 
>> "data pipeline".
>> I'm interested in this myself, and I think we could discuss this more.
>> But I'm not convinced the magrittr-like operators help to achieve 
>> this goal.
>> Which, in my opinion, would require one to model programs as directed 
>> graphs, along with some degree of asynchronous input.
>>
>> Presumably, these operators will be added to R anyway, and (almost) 
>> no one will listen to me.
>>
>> So, I would like to make one suggestion:
>> Is it possible for these operators to *not* be named:
>>  The R Pipe
>>  The S Pipe
>>  Or anything with a similar meaning.
>>
>> Maybe tidy pipe, or something else that links it to its proponents?
>>
>> __
>> R-devel@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>



Re: [Rd] New pipe operator and gg plotz

2020-12-06 Thread Avi Gross via R-devel
As someone who switches back and forth between using standard R methods and 
those of the tidyverse, depending on the problem, my mood and whether Jupiter 
aligns with Saturn in the new age of Aquarius, I have a question about the 
forthcoming built-in pipe. Will it motivate anyone to eventually change or 
enhance the ggplot functionality to have a version that gets rid of the odd use 
of the addition symbol?

I mean I now sometimes have a pipeline that looks like:

Data %>%
Do_this %>%
Do_that(whatever) %>%
ggplot(...) +
geom_whatever(...) +
...

My understanding is this is a bit of a historical anomaly that might someday be 
modified back.

As I understand it, the call to ggplot() creates a partially filled-in object 
that holds all kinds of useful info. The additional calls to geom_point() and 
so on will add/change that hidden object. Nothing much happens till the object 
is implicitly or explicitly given to print() which switches to the print 
function for objects of that type and creates a graph based on the contents of 
the object at that time. So, in theory, you could have a pipelined version of 
ggplot where the first function accepts something like a data.frame or tibble 
as the default first argument and at the end returns the object we have been 
describing. All additional functions would then accept such an object as the 
(hidden?) first argument and return the modified object. The final function in 
the pipe would either have the value captured in a variable for later use or 
print implicitly generating a graph.
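The wrapper idea sketched above could look like the following; `gg_point` and `gg_labs` are hypothetical names, and underneath they still use ggplot2's real `+` operator (assumes ggplot2 is installed and R >= 4.1):

```r
library(ggplot2)

# Hypothetical pipe-friendly wrappers: each takes the plot object as its
# first argument and returns the modified object, as described above.
gg_point <- function(p, ...) p + geom_point(...)
gg_labs  <- function(p, ...) p + labs(...)

p <- mtcars |>
  ggplot(aes(disp, mpg)) |>
  gg_point() |>
  gg_labs(title = "mpg vs disp")
p  # printing the object draws the graph
```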

So the above silly example might become:

Data %>%
Do_this %>%
Do_that(whatever) %>%
ggplot(...) %>%
geom_whatever(...) %>%
...

Or, am I missing something here? 

The language and extensions such as are now in the tidyverse might be more 
streamlined and easier to read when using consistent notation. If we now build 
a reasonable version of the pipeline in, might we encourage other users to 
gradually migrate back closer to the mainstream?

-Original Message-
From: R-devel  On Behalf Of Rui Barradas
Sent: Sunday, December 6, 2020 2:51 AM
To: Gregory Warnes ; Abby Spurdle 
Cc: r-devel 
Subject: Re: [Rd] New pipe operator

Hello,

If Hilbert liked beer, I like "pipe".

More seriously, a new addition like this one is going to cause problems yet 
unknown. But it's a good idea to have a pipe operator available. As someone 
used to magrittr's data pipelines, I will play with this base one before making 
up my mind. I don't expect its behavior to be exactly like magrittr "%>%" (and 
it's not). For the moment all I can say is that it is something R users are 
used to and that it now avoids loading a package.
As for the new way to define anonymous functions, I am less sure. Too much 
syntactic sugar? Or am I finding the syntax ugly?

Hope this helps,

Rui Barradas


Às 03:22 de 06/12/20, Gregory Warnes escreveu:
> If we’re being mathematically pedantic, the “pipe” operator is
> actually function composition.
> That being said, pipes are a simple and well-known idiom. While being
> less than mathematically exact, it seems a reasonable label for the
> (very useful) behavior.
> 
> On Sat, Dec 5, 2020 at 9:43 PM Abby Spurdle  wrote:
> 
>>> This is a good addition
>>
>> I can't understand why so many people are calling this a "pipe".
>> Pipes connect processes, via their I/O streams.
>> Arguably, a more general interpretation would include sockets and files.
>>
>> https://en.wikipedia.org/wiki/Pipeline_(Unix)
>> https://en.wikipedia.org/wiki/Named_pipe
>> https://en.wikipedia.org/wiki/Anonymous_pipe
>>
>> As far as I can tell, the magrittr-like operators are functions (not 
>> pipes), with nonstandard syntax.
>> This is not consistent with R's original design philosophy, building 
>> on C, Lisp and S, along with lots of *important* math and stats.
>>
>> It's possible that some parties are interested in creating a kind of 
>> "data pipeline".
>> I'm interested in this myself, and I think we could discuss this more.
>> But I'm not convinced the magrittr-like operators help to achieve 
>> this goal.
>> Which, in my opinion, would require one to model programs as directed 
>> graphs, along with some degree of asynchronous input.
>>
>> Presumably, these operators will be added to R anyway, and (almost) 
>> no one will listen to me.
>>
>> So, I would like to make one suggestion:
>> Is it possible for these operators to *not* be named:
>>  The R Pipe
>>  The S Pipe
>>  Or anything with a similar meaning.
>>
>> Maybe tidy pipe, or something else that links it to its proponents?
>>
>> __
>> R-devel@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>


Re: [Rd] New pipe operator and gg plotz

2020-12-06 Thread Avi Gross via R-devel
Thanks, Duncan. That answers my question fairly definitively.

Although it can be DONE it likely won't be for the reasons Hadley mentioned 
until we get some other product that replaces it entirely. There are some 
interesting work-arounds mentioned. 

I was thinking of one approach that has overhead but might be a pain. Hadley 
mentioned a slight variant. The first argument to a function now is expected to 
be the data argument; the second might be the mapping. Now if the function is 
called with a new first argument that is a ggplot object, it could be possible 
to test the type and, if it is a ggplot object, then slide over carefully any 
additional matched arguments that were not explicitly named. Not sure that is 
at all easy to do.

Alternately, you can ask that when used in such a pipeline that the user call 
all other arguments using names like data=whatever, mapping=aes(whatever) so no 
other args need to be adjusted by position.

But all this is academic and I concede will likely not be done. I can live with 
the plus signs.


-Original Message-
From: Duncan Murdoch  
Sent: Sunday, December 6, 2020 2:50 PM
To: Avi Gross ; 'r-devel' 
Subject: Re: [Rd] New pipe operator and gg plotz

Hadley's answer (#7 here: 
https://community.rstudio.com/t/why-cant-ggplot2-use/4372) makes it pretty 
clear that he thinks it would have been nice now if he had made that choice 
when ggplot2 came out, but it's not worth the effort now to change it.

Duncan Murdoch

On 06/12/2020 2:34 p.m., Avi Gross via R-devel wrote:
> As someone who switches back and forth between using standard R methods and 
> those of the tidyverse, depending on the problem, my mood and whether Jupiter 
> aligns with Saturn in the new age of Aquarius, I have a question about the 
> forthcoming built-in pipe. Will it motivate anyone to eventually change or 
> enhance the ggplot functionality to have a version that gets rid of the odd 
> use of the addition symbol?
> 
> I mean I now sometimes have a pipeline that looks like:
> 
> Data %>%
>   Do_this %>%
>   Do_that(whatever) %>%
>   ggplot(...) +
>   geom_whatever(...) +
>   ...
> 
> My understanding is this is a bit of a historical anomaly that might someday 
> be modified back.
> 
> As I understand it, the call to ggplot() creates a partially filled-in object 
> that holds all kinds of useful info. The additional calls to geom_point() and 
> so on will add/change that hidden object. Nothing much happens till the 
> object is implicitly or explicitly given to print() which switches to the 
> print function for objects of that type and creates a graph based on the 
> contents of the object at that time. So, in theory, you could have a 
> pipelined version of ggplot where the first function accepts something like a 
>  data.frame or tibble as the default first argument and at the end returns 
> the object we have been describing. All additional functions would then 
> accept such an object as the (hidden?) first argument and return the modified 
> object. The final function in the pipe would either have the value captured 
> in a variable for later use or print implicitly generating a graph.
> 
> So the above silly example might become:
> 
> Data %>%
>   Do_this %>%
>   Do_that(whatever) %>%
>   ggplot(...) %>%
>   geom_whatever(...) %>%
>   ...
> 
> Or, am I missing something here?
> 
> The language and extensions such as are now in the tidyverse might be more 
> streamlined and easier to read when using consistent notation. If we now 
> build a reasonable version of the pipeline in, might we encourage other uses 
> to gradually migrate back closer to the mainstream?
> 
> -Original Message-
> From: R-devel  On Behalf Of Rui 
> Barradas
> Sent: Sunday, December 6, 2020 2:51 AM
> To: Gregory Warnes ; Abby Spurdle 
> 
> Cc: r-devel 
> Subject: Re: [Rd] New pipe operator
> 
> Hello,
> 
> If Hilbert liked beer, I like "pipe".
> 
> More seriously, a new addition like this one is going to cause problems yet 
> unknown. But it's a good idea to have a pipe operator available. As someone 
> used to magrittr's data pipelines, I will play with this base one before 
> making up my mind. I don't expect its behavior to be exactly like magrittr 
> "%>%" (and it's not). For the moment all I can say is that it is something R 
> users are used to and that it now avoids loading a package.
> As for the new way to define anonymous functions, I am less sure. Too much 
> syntactic sugar? Or am I finding the syntax ugly?
> 
> Hope this helps,
> 
> Rui Barradas
> 
> 
> Às 03:22 de 06/12/20, Gregory Warnes escreveu:
>> If we’re being mathema

[Rd] sequential chained operator thoughts

2020-12-07 Thread Avi Gross via R-devel
It has been very enlightening watching the discussion not only about the
existing and proposed variations of a data "pipe" operator in R but also
cognates in many other languages.

So I am throwing out a QUESTION that just asks if the pipeline as done is
pretty much what could also be done without the need for an operator, using a
sort of one-time bracketed construct where you call a function with a
sequence of operations you want performed and just have it handle the
in-between parts.

I mean something like:

return_val <- do_chain_sequence( { initial_data,
function1(_VAL_);
function2(_VAL_, more_args);
function3(args, 2 * _VAL_, more_args);
...
function_n(_VAL_)
})

The above is not meant to be taken literally. I don't care if the symbol is
_VAL_ or you use semi-colon characters between statements. There are many
possible variants such as each step being in its own curly braces. The idea
is to hand over one or more unevaluated blocks of code. There are such
functions in use in R already.

And yes, it can be written with explicit BEFORE/AFTER clauses to handle
things but those are implementation details and I want to focus on a
concept.

The point is you can potentially write a function that, given such a series
of arguments, delays evaluation of them until each is needed or used. About
all it might need to do is set the value of something like _VAL_ from the
first argument if present, then take each subsequent argument in turn,
evaluate it, and save the result back into _VAL_, returning the last _VAL_
at the end. Along the way, of course, the temporary values stored each
time in _VAL_ would disappear.
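A minimal sketch of such a chaining function, using non-standard evaluation; `do_chain_sequence` is a made-up name, and `.VAL` stands in for `_VAL_` since names beginning with an underscore are not syntactically valid in R:

```r
# Sketch: evaluate each expression in turn with .VAL bound to the
# previous result; intermediate values vanish with the environment.
do_chain_sequence <- function(init, ...) {
  exprs <- as.list(substitute(list(...)))[-1]  # capture steps unevaluated
  env <- new.env(parent = parent.frame())
  env$.VAL <- init
  for (e in exprs) env$.VAL <- eval(e, env)
  env$.VAL
}

do_chain_sequence(1:10,
  sum(.VAL),   # 55
  .VAL * 2,    # 110
  .VAL + 1)    # returns 111
```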

Is something like this any improvement over this done by the user:

Initial <- whatever
Temp1 <- function1(Initial)
Temp2 <- function2(Temp1, ...)
rm(Temp1)
...

Well, maybe not much. But it does hide some details and allows you to insert
or delete steps without worrying about pesky details like variable names
being in the right sequence or not over-riding other things in your
namespace. It makes your intent clear.

Now obviously being evaluated inside a function is not necessarily the same
as remaining in the original environment so having something like this as a
built-in running in place might be a better idea.

I admit the details of how to get one piece at a time as some unevaluated
form and recognize clearly what each piece is takes some careful thought. If
you want to automatically throw in a first argument of _VAL_ after the first
parenthesis found or inserted in new parens if just the name of a function
was presented, or other such manipulations as already seem to happen with
the magrittr pipe where a period is the placeholder, that can be delicate
work and also fail for some lines of code.  There may be many reasons
various versions of this proposal can fail for some cases. But functionally,
it would be a way to specify in a linear fashion that a sequence of steps is
to be connected with data being passed along as it changes.

I can also imagine how this kind of method might allow twists like asking
for _VAL_$second or other changes such as sorted(_VAL_) or minmax(_VAL_)
that would shrink the sequence.

This general idea looks like something that some programming language may
already do in some form and functionally and is a bit like the pipe idea,
albeit with different overhead.

And do note many languages already support this in subtle ways. R has a
variable called ".Last.value" that always holds the result of the last
statement evaluated. If the template above is used properly, that alone
might work, albeit be a bit wordy. But it may be more transient in some
cases such as a multi-part statement where it ends up being reset within the
statement.

I am NOT asking for a new feature in R, or any language. I am just asking if
the various pipeline ideas  used could be done in a general way like I
describe as a sequence where the statements are chained as described and
intermediate results are transient. But, yes, some implementations might
require some changes to the language to be implemented properly and it might
not satisfy people used to thinking a certain way.

I end by saying that R is a language that only returns one (sometimes
complex) return value. Other languages allow multiple return values and
pipelines there might be hard to implement or have weird features that allow
various of the returns to be captured or even a more general graph of
command sequences  rather than just a linear pipeline. My thoughts here are
for R alone. And I shudder at what happens if you allow exceptions and other
kinds of breaks/returns out of such a sequential grouping in mid-stride. I
view most such additions and changes as needing careful thought to make sure
they have the functionality most people want, are as

Re: [Rd] quantile() names

2020-12-14 Thread Avi Gross via R-devel
Question: is what Ed Merkle is asking about simply the change in the
expected NAME associated with the output?

He changed a sort of global parameter affecting how many digits he wants any
compliant function to display. So when he asked for a named vector, the
chosen name was based on his request and limited when possible to two
digits.

x <- 1:1000
temp <- quantile(x, .975)

If you examine temp, you will see it is a vector containing (as it happens)
a single numeric item (as it happens a double) with the value of 975. But
the name associated is a character string with a "%" appended as shown
below:

str(temp)
Named num 975
- attr(*, "names")= chr "98%"

If you do not want a name attached to the vector, add an option:

quantile(x, .975, names=FALSE)

If you want the name to be longer or different, you can do that after. 

names(temp)
[1] "98%"

So change it yourself:

temp
98% 
975 
names(temp) <- paste(round(temp, 3), "%", sep="")
temp
975.025% 
975

The above is for illustration with tabs inserted to show what is in the
output. You probably do not need a name for your purposes and if you ask for
multiple quantiles you might need to adjust the above. 
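For multiple quantiles the same renaming idea applies; one way (a sketch) is to build the names from the probabilities rather than from the computed values:

```r
# Rename several quantiles at once, deriving names from the probabilities:
x <- 1:1000
probs <- c(.95, .975, .99)
q <- quantile(x, probs, names = FALSE)
names(q) <- paste0(100 * probs, "%")
q
#     95%   97.5%     99%
# 950.050 975.025 990.010
```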

Of course if you wanted another non-default "type" of calculation, what Abby
offered may also apply. 

-Original Message-
From: R-devel  On Behalf Of Abby Spurdle
Sent: Monday, December 14, 2020 4:48 PM
To: Merkle, Edgar C. 
Cc: r-devel@r-project.org
Subject: Re: [Rd] quantile() names

The "value" is *not* 975.
It's 975.025.

The results that you're observing, are merely the byproduct of formatting.

Maybe, you should try:

quantile (x, .975, type=4)

Which perhaps, using default options, produces the result you're expecting?


On Tue, Dec 15, 2020 at 8:55 AM Merkle, Edgar C. 
wrote:
>
> All,
>
> Consider the code below
>
> options(digits=2)
> x <- 1:1000
> quantile(x, .975)
>
> The value returned is 975 (the 97.5th percentile), but the name has been
shortened to "98%" due to the digits option. Is this intended? I would have
expected the name to also be "97.5%" here. Alternatively, the returned value
might be 980 in order to match the name of "98%".
>
> Best,
> Ed
>
>
> [[alternative HTML version deleted]]
>
> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel



Re: [Rd] quantile() names

2020-12-15 Thread Avi Gross via R-devel
Thank you for explaining, Ed. It makes looking at the issue raised much
easier.

 

As I understand it, you are not really asking about something fully in your
control. You are asking how any function like quantile() should behave when
a user has altered something global or at least global within a package,
such as this:

 

> quantile(x, c(.95, .975, .99000))

95%   97.5% 99% 

950.050 975.025 990.010 

> dig.it <- options(digits=2)

> dig.it

$digits

[1] 7

 

I did it that way so I could re-set it!

 

I looked to see whether quantile() is written in base R; it turns out to be a
generic whose method I would have to hunt down, so I stopped for now.

 

Here is what I get BEFORE changing the option for digits:

 

> x <- 1:1000

> quantile(x, probs=c(.95, .975, .99000))

95%   97.5% 99% 

950.050 975.025 990.010

 

Note I used the fuller version asking for multiple thresholds so I could see
what happened if I used more zeroes. Trailing zeroes are not shown in the
name of the third element of the vector, which suggests the function is not
working from the unevaluated text of the call but from the values of the
vector. Now I set the number of digits to 2, globally, and repeat:

 

> quantile(x, probs=c(.95, .975, .99000))

95% 98% 99% 

950 975 990

 

I notice several things, as others have pointed out. The values shown appear
truncated, with nothing displayed past the decimal point. But maybe it is not
truncation, since adding an argument of 1/3 gives 334 rather than 333.

 

> quantile(x, probs=c(.95, .975, .99000, 1/3))

95% 98% 99% 33% 

950 975 990 334

 

Now the names are apparently rounded as discussed, with the percent symbol
appended.
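
For what it is worth, the observed behaviour is consistent with the names being built by running the probabilities through formatC() under the current digits option and then pasting a "%" on the end. The helper below is a hedged reconstruction from the output, not a claim about the exact code inside quantile():

```r
# Reconstructs names the way quantile() appears to: format the percentage
# with formatC() honouring getOption("digits"), then append "%".
make_names <- function(p, digits = max(2L, getOption("digits"))) {
  paste0(formatC(100 * p, format = "fg", width = 1, digits = digits), "%")
}
make_names(c(0.95, 0.975, 0.99))              # "95%"  "97.5%" "99%"
make_names(c(0.95, 0.975, 0.99), digits = 2)  # "95%"  "98%"   "99%"
```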

 

So what would you propose? Within the function there seem to be two parts
dealing with displaying the result, and it looks as if the original number
loses precision, since handing the above to round(., 7) shows no change. So
are you asking it to format the name differently from the value even though
a global option specifying the desired digits has been set?

 

If it really mattered, I suggest one solution may be to allow one or two
additional arguments to a function like quantile(), for example:

 

quantile(x, ., digits=5, names=c("95%", "97.5%", .) )

 

So if a user really wanted to live in their own world of fewer digits they
could specify what labels they wanted and could ask for "high", "Higher" and
"HIGHEST" or whatever makes them happy. But, as noted, any user wanting that
level of control can change the labels afterward. You are correct, though,
that a package using quantile() and picking out the results individually by
name will not be able to use that technique consistently and reliably. But
can the names be used that way now? I tried variations such as
quantile(x, c(.95, .975, .99000))$`95%` and they fail, and the same goes for
[] notation. These identifiers were not chosen to be used this way.
You can get the elements positionally:

 

> quantile(x, c(.95, .975, .99000))[1]

95% 

950 

> quantile(x, c(.95, .975, .99000))[2]

98% 

975

 

If you convert the darn output from a vector to a list, though, it works,
using backticks:

 

> as.list(quantile(x, c(.95, .975, .99000)))$`98%`

[1] 975

 

So, I doubt many would play games like me to find some way to select by
name. Odds are they might use position or get one at a time. The name is
more for humans to read, I would think.

 

 

Just my two cents. When an instruction impacts multiple places, it can be
ambiguous and changing global variables is, well, global.

 

Which raises another question: why did the designers choose names that are
all numeric with maybe a decimal point, ending in a character like % that
has other uses? A cousin of quantile() is fivenum(), which returns Tukey's
five-number summary, useful in making boxplots:

 

> fivenum(x)

[1]    1  250  500  750 1000

 

This returned a vector with no names. You can only index it by number,
albeit the columns are always in a fixed order and you know what to expect
in each. Another cousin returns a more complex structure:

 

> boxplot.stats(x)

$stats

[1]    1  250  500  750 1000

 

$n

[1] 1000

 

$conf

[1] 476 525

 

$out

integer(0)

 

> boxplot.stats(x)$stats

[1]    1  250  500  750 1000

 

That is a list of items but the first item is a vector with no names that is
the same as for fivenum().

 

Would it make more sense if the names of the output looked more like:

 

> temp <- quantile(x, c(.95, .975, .99000))

> names(temp) <- c("perc95", "perc98", "perc99")

> temp

perc95 perc98 perc99 

   950    975    990

 

So you could do this to a vector:

 

> temp["perc98"]

perc98 

   975

Or do even more to a list:

 

> as.list(temp)$perc98

[1] 975

 

My feeling is some things are not really bugs but more like FEATURES you
normally live with and, if it matters, work around. I had trouble a while
ago with a lavaan() case I ran where very rarely the program simply broke.
When in a big loo

Re: [Rd] `merge()` not consistent in how it treats list columns

2021-01-02 Thread Avi Gross via R-devel
Antoine,

Have you considered converting the non-list to a list explicitly so this
does not matter?

For a long time, few people used lists in this context, albeit in the
tidyverse it is now better supported and probably more common.

This is an area many have found annoying when you have implicit conversions.
What if one ID field was character and the other was numeric? In some
languages the conversion always goes to character (as in R) but in some it
might go numeric in one direction and in some it may refuse and demand you
convert it yourself. 

Do you suggest that a unique solution exists for complex cases so that the
software should know you want to convert a vector to list? What if one side
is a list containing a list containing a list, many levels deep and the
other has no or fewer or more levels. Is it obvious to take the deepest case
and change all others to match? Do you lose things in the process?

When things may not work, sure, you can suggest that someone change the
software, but you can also consider it a case where YOU should make sure the
types are compatible before a merge.
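
As a concrete, hedged sketch of "make the types compatible yourself": flatten the list column to an atomic vector before merging. This assumes every list element has length one, as in the example quoted below:

```r
df1 <- data.frame(a = 1); df1$id <- "ID"
df2 <- data.frame(b = 2); df2$id <- list("ID")
# Convert the list column to an atomic character column before merging:
df2$id <- vapply(df2$id, identity, character(1))
merge(df2, df1)  # now works in either order
merge(df1, df2)
```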



-Original Message-
From: R-devel  On Behalf Of Antoine Fabri
Sent: Saturday, January 2, 2021 2:16 PM
To: R-devel 
Subject: [Rd] `merge()` not consistent in how it treats list columns

Dear R-devel,

When trying to merge 2 data frames by an "id" column, with this column a
character in one of them, and a list of character in the other, merge
behaves differently depending which is given first.

Example :

```
df1 <- data.frame(a=1)
df2 <- data.frame(b=2)
df1$id <- "ID"
df2$id <- list("ID")

# these print in a similar way, so the upcoming error will be hard to
diagnose
df1
#>   a id
#> 1 1 ID
df2
#>   b id
#> 1 2 ID

# especially as this works well, df2$id is treated as an atomic vector
merge(df1, df2)
#>   id a b
#> 1 ID 1 2

# But this fails with a cryptic error message
merge(df2, df1)
#> Error in sort.list(bx[m$xi]): 'x' must be atomic for 'sort.list', method "shell" and "quick"
#> Have you called 'sort' on a list?
```

I believe that if we let it work one way it should work the other, and that
if it works neither way, an explicit error mentioning that we can't join by
a list column would be helpful.

Many thanks and happy new year to all the R community,

Antoine

[[alternative HTML version deleted]]

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] brief update on the pipe operator in R-devel

2021-01-15 Thread Avi Gross via R-devel
Gabor,

Although it might be nice if all imagined cases worked, there are many ways to 
work around and get the results you want. 

You may want to consider that it is easier to recognize the symbol you use (x 
in the examples) if it stands alone, is used exactly once, and appears in the 
list of function arguments. If you want the x used multiple times, you can 
make a function that accepts the x once and then invokes another function, 
reusing the x as often as needed. Similarly for 1+x. 

I do not know if the above choice was made to make it easier and faster to 
apply the above, or to avoid possible bad edge cases. Have you tested other 
ideas like:

3 |> x => f(x=5)
Or
3 |> x => f(x, y=x)

I mean ones where a default is supplied, not that it makes much sense here?

I am thinking of the concept of substitution as is often done for text or 
symbols. Often the substitution is done for the first instance found unless you 
specify you want a global change. In your examples, if only the first use of x 
would be replaced, the second naked x being left alone would be an error. If 
all instances were changed, what anomalies might happen? Giving a vector of 
length 1 containing the number 3 seems harmless enough to duplicate. But the 
pipeline can send all kinds of interesting data structures through including 
data.frames and arbitrary objects. 
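
A hedged sketch of the named-function workaround described above (requires R 4.1 or later for the |> operator; the function names are made up):

```r
f <- function(x, y) x + 10 * y
# Reuse the piped value as often as needed inside an ordinary function:
use_twice <- function(x) f(x, x)
3 |> use_twice()        # 33
# Likewise for expressions such as 1 + x in a non-first argument:
plus1_first <- function(x) f(1 + x, 1)
3 |> plus1_first()      # 14
```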


-Original Message-
From: R-devel  On Behalf Of Gabor Grothendieck
Sent: Friday, January 15, 2021 7:28 AM
To: Tierney, Luke 
Cc: R-devel@r-project.org
Subject: Re: [Rd] brief update on the pipe operator in R-devel

These are documented but still seem like serious deficiencies:

> f <- function(x, y) x + 10*y
> 3 |> x => f(x, x)
Error in f(x, x) : pipe placeholder may only appear once

> 3 |> x => f(1+x, 1)
Error in f(1 + x, 1) :
  pipe placeholder must only appear as a top-level argument in the RHS call

Also note:

 ?"=>"
No documentation for ‘=>’ in specified packages and libraries:
you could try ‘??=>’

On Tue, Dec 22, 2020 at 5:28 PM  wrote:
>
> It turns out that allowing a bare function expression on the 
> right-hand side (RHS) of a pipe creates opportunities for confusion 
> and mistakes that are too risky. So we will be dropping support for 
> this from the pipe operator.
>
> The case of a RHS call that wants to receive the LHS result in an 
> argument other than the first can be handled with just implicit first 
> argument passing along the lines of
>
>  mtcars |> subset(cyl == 4) |> (\(d) lm(mpg ~ disp, data = d))()
>
> It was hoped that allowing a bare function expression would make this 
> more convenient, but it has issues as outlined below. We are exploring 
> some alternatives, and will hopefully settle on one soon after the 
> holidays.
>
> The basic problem, pointed out in a comment on Twitter, is that in 
> expressions of the form
>
>  1 |> \(x) x + 1 -> y
>  1 |> \(x) x + 1 |> \(y) x + y
>
> everything after the \(x) is parsed as part of the body of the 
> function.  So these are parsed along the lines of
>
>  1 |> \(x) { x + 1 -> y }
>  1 |> \(x) { x + 1 |> \(y) x + y }
>
> In the first case the result is assigned to a (useless) local 
> variable.  Someone writing this is more likely to have intended to 
> assign the result to a global variable, as this would:
>
>  (1 |> \(x) x + 1) -> y
>
> In the second case the 'x' in 'x + y' refers to the local variable 'x'
> in the first RHS function. Someone writing this is more likely to have 
> meant
>
>  (1 |> \(x) x + 1) |> \(y) x + y
>
> with 'x' in 'x + y' now referring to a global variable:
>
>  > x <- 2
>  > 1 |> \(x) x + 1 |> \(y) x + y
>  [1] 3
>  > (1 |> \(x) x + 1) |> \(y) x + y
>  [1] 4
>
> These issues arise with any approach in R that allows a bare function 
> expression on the RHS of a pipe operation. It also arises in other 
> languages with pipe operators. For example, here is the last example 
> in Julia:
>
>  julia> x = 2
>  2
>  julia> 1 |> x -> x + 1 |> y -> x + y
>  3
>  julia> ( 1 |> x -> x + 1 ) |> y -> x + y
>  4
>
> Even though proper use of parentheses can work around these issues, 
> the likelihood of making mistakes that are hard to track down is too 
> high. So we will disallow the use of bare function expressions on the 
> right hand side of a pipe.
>
> Best,
>
> luke
>
> --
> Luke Tierney
> Ralph E. Wareham Professor of Mathematical Sciences
> University of Iowa  Phone: 319-335-3386
> Department of Statistics andFax:   319-335-3017
> Actuarial Science
> 241 Schaeffer Hall  email:   luke-tier...@uiowa.edu
> Iowa City, IA 52242 WWW:  http://www.stat.uiowa.edu
>
> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel



--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at g

Re: [Rd] Unexpected behavior of '[' in an apply instruction

2021-02-12 Thread Avi Gross via R-devel
Just to be different, the premise was that you do not know how many dimensions 
the array had. But that is easily available using dim() including how many 
items are in each dimension. So, in principle, you can use a normal indexing 
method perhaps in a loop to get what you want. Not sexy but doable. You can 
treat the array x as a vector just like lower level R does and access the 
contents using the formula it uses.
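
A sketch of that idea: with R's column-major storage, the element at index (i1, i2, ..., ik) of an array with dimensions d sits at linear position 1 + sum((i - 1) * stride), where the strides are cumprod(c(1, d[-k])). Written out (the function name is made up):

```r
x <- array(1:60, dim = c(10, 2, 3))
d <- dim(x)
# Map a multi-dimensional index onto the underlying vector position,
# for any number of dimensions:
linear_index <- function(idx, d) {
  stride <- cumprod(c(1, d[-length(d)]))
  1 + sum((idx - 1) * stride)
}
x[linear_index(c(4, 2, 3), d)]  # identical to x[4, 2, 3]
```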

-Original Message-
From: R-devel  On Behalf Of Sokol Serguei
Sent: Friday, February 12, 2021 5:50 PM
To: r-devel@r-project.org
Subject: Re: [Rd] Unexpected behavior of '[' in an apply instruction

Le 12/02/2021 à 22:23, Rui Barradas a écrit :
> Hello,
>
> Yes, although there is an accepted solution, I believe you should post 
> this solution there. It's a base R solution, what the question asks for.
>
> And thanks, I would have never reminded myself of slice.index.

There is another approach -- produce a call to `[`() by programmatically 
putting the "required number of commas in their proper places". 
Even if it does not lead to a very readable expression, I think it merits a 
mention.

   x <- array(1:60, dim = c(10, 2, 3))
   ld=length(dim(x))
   i=1 # i.e. the first row but can be a slice 1:5, whatever
   do.call(`[`, c(alist(x, i), alist(,)[rep(1,ld-1)], alist(drop=FALSE)))

Best,
Serguei.

>
> Rui Barradas
>
> Às 20:45 de 12/02/21, robin hankin escreveu:
>> Rui
>>
>>  > x <- array(runif(60), dim = c(10, 2, 3))
>>  > array(x[slice.index(x,1) %in% 1:5],c(5,dim(x)[-1]))
>>
>> (I don't see this on stackoverflow; should I post this there too?) 
>> Most of the magic package is devoted to handling arrays of arbitrary 
>> dimensions and this functionality might be good to include if anyone 
>> would find it useful.
>>
>> HTH
>>
>> Robin
>>
>>
>> 
>>
>>
>> On Sat, Feb 13, 2021 at 12:26 AM Rui Barradas > > wrote:
>>
>> Hello,
>>
>> This came up in this StackOverflow post [1].
>>
>> If x is an array with n dimensions, how to subset by just one 
>> dimension?
>> If n is known, it's simple, add the required number of commas in 
>> their
>> proper places.
>> But what if the user doesn't know the value of n?
>>
>> The example below has n = 3, and subsets by the 1st dim. The 
>> apply loop
>> solves the problem as expected but note that the index i has
>> length(i) > 1.
>>
>>
>> x <- array(1:60, dim = c(10, 2, 3))
>>
>> d <- 1L
>> i <- 1:5
>> apply(x, MARGIN = -d, '[', i)
>> x[i, , ]
>>
>>
>> If length(i) == 1, argument drop = FALSE doesn't work as I 
>> expected it
>> to work, only the other way does:
>>
>>
>> i <- 1L
>> apply(x, MARGIN = -d, '[', i, drop = FALSE)
>> x[i, , drop = FALSE]
>>
>>
>> What am I missing?
>>
>> [1]
>> https://stackoverflow.com/questions/66168564/is-there-a-native-r-synt
>> ax-to-extract-rows-of-an-array
>>
>> Thanks in advance,
>>
>> Rui Barradas
>>
>> __
>> R-devel@r-project.org  mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>
>
> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] 1954 from NA

2021-05-23 Thread Avi Gross via R-devel
Arguably, R was not developed to satisfy some needs in the way intended.

When I have had to work with datasets from some of the social sciences I have 
had to adapt to subtleties in how they did things with software like SPSS in 
which an NA was done using an out of bounds marker like 999 or "." or even a 
blank cell. The problem is that R has a concept where data such as integers or 
floating point numbers is not stored as text normally but in their own formats 
and a vector by definition can only contain ONE data type. So the various forms 
of NA as well as NaN and Inf had to be grafted on to be considered VALID to 
share the same storage area as if they sort of were an integer or floating 
point number or text or whatever.

It does strike me as possible to simply have a column that is something like a 
factor that can contain as many NA excuses as you wish such as "NOT ANSWERED" 
to "CANNOT READ THE SQUIGGLE" to "NOT SURE" to "WILL BE FILLED IN LATER" to "I 
DON'T SPEAK ENGLISH AND CANNOT ANSWER STUPID QUESTIONS". This additional column 
would presumably only have content when the other column has an NA. Your 
queries and other changes would work on something like a data.frame where both 
such columns coexisted.

Note reading in data with multiple NA reasons may take extra work. If your 
error codes are text, it will all become text. If the error codes are 999 and 998 
and 997, it may all be treated as numeric and you may not want to convert all 
such codes to an NA immediately. Rather, you would use the first vector/column 
to make the second vector and THEN replace everything that should be an NA with 
an actual NA and reparse the entire vector to become properly numeric unless 
you like working with text and will convert to numbers as needed on the fly.

Now this form of annotation may not be pleasing but I suggest that an 
implementation that does allow annotation may use up space too. Of course, if 
your NA values are rare and space is only used then, you might save space. But 
if you could make a factor column and have it use the smallest int it can get 
as a basis, it may be a way to save on space.

People who have done work with R, especially those using the tidyverse, are 
quite used to using one column to explain another. So if you are asked to say 
tabulate what percent of missing values are due to reasons A/B/C then the added 
columns works fine for that calculation too.
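
A minimal sketch of the companion-column idea, assuming hypothetical numeric codes 997-999 mark the different kinds of missingness:

```r
raw <- c(5, 999, 7, 998, 2, 997)
codes <- c(`999` = "NOT ANSWERED", `998` = "NOT SURE", `997` = "DID NOT APPLY")
reason <- factor(codes[as.character(raw)])   # NA wherever the value is real
value  <- replace(raw, raw %in% 997:999, NA) # real NA in the data column
data.frame(value, reason)
# Percent of missing values attributable to each reason:
prop.table(table(reason))
```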


-Original Message-
From: R-devel  On Behalf Of Adrian Du?a
Sent: Sunday, May 23, 2021 2:04 PM
To: Tomas Kalibera 
Cc: r-devel 
Subject: Re: [Rd] 1954 from NA

Dear Tomas,

I understand that perfectly, but that is fine.
The payload is not going to be used in any computations anyways, it is strictly 
an information carrier that differentiates between different types of (tagged) 
NA values.

Having only one NA value in R is extremely limiting for the social sciences, 
where multiple missing values may exist, because respondents:
- did not know what to respond, or
- did not want to respond, or perhaps
- the question did not apply in a given situation etc.

All of these need to be captured, stored, and most importantly treated as if 
they would be regular missing values. Whether the payload might be lost in 
computations makes no difference: they were supposed to be "missing values" 
anyways.

The original question is how the payload is currently stored: as an unsigned 
int of 32 bits, or as an unsigned short of 16 bits. If the R internals would 
not be affected (and I see no reason why they would be), it would allow an 
entire universe for the social sciences that is not currently available and 
which all other major statistical packages do offer.

Thank you very much, your attention is greatly appreciated, Adrian

On Sun, May 23, 2021 at 7:59 PM Tomas Kalibera 
wrote:

> TLDR: tagging R NAs is not possible.
>
> External software should not depend on how R currently implements NA, 
> this may change at any time. Tagging of NA is not supported in R (if 
> it were, it would have been documented). It would not be possible to 
> implement such tagging reliably with the current implementation of NA in R.
>
> NaN payload propagation is not standardized. Compilers are free to and 
> do optimize code not preserving/achieving any specific propagation.
> CPUs/FPUs differ in how they propagate in binary operations, some zero 
> the payload on any operation. Virtualized environments, binary 
> translations, etc, may not preserve it in any way, either. ?NA has 
> disclaimers about this, an NA may become NaN (payload lost) even in 
> unary operations and also in binary operations not involving other NaN/NAs.
>
> Writing any new software that would depend on that anything specific 
> happens to the NaN payloads would not be a good idea. One can only 
> reliably use the NaN payload bits for storage, that is if one avoids 
> any computation at all, avoids passing the values to any external code 
> unaware of such tagging (including R), etc

[Rd] FW: 1954 from NA

2021-05-24 Thread Avi Gross via R-devel
To: Avi Gross 
Cc: r-devel 
Subject: Re: [Rd] 1954 from NA

 

Hmm...

If it was only one column then your solution is neat. But with 5-600 variables, 
each of which can contain multiple missing values, to double this number of 
variables just to describe NA values seems to me excessive.

Not to mention we should be able to quickly convert / import / export from one 
software package to another. This would imply maintaining some sort of metadata 
reference of which explanatory additional factor describes which original 
variable.

 

All of this strikes me as a lot of hassle compared to storing some information 
within a tagged NA value... I just need a little bit more bits to play with.

 

Best wishes,

Adrian

 

On Sun, May 23, 2021 at 10:21 PM Avi Gross via R-devel  wrote:

Arguably, R was not developed to satisfy some needs in the way intended.

When I have had to work with datasets from some of the social sciences I have 
had to adapt to subtleties in how they did things with software like SPSS in 
which an NA was done using an out of bounds marker like 999 or "." or even a 
blank cell. The problem is that R has a concept where data such as integers or 
floating point numbers is not stored as text normally but in their own formats 
and a vector by definition can only contain ONE data type. So the various forms 
of NA as well as NaN and Inf had to be grafted on to be considered VALID to 
share the same storage area as if they sort of were an integer or floating 
point number or text or whatever.

It does strike me as possible to simply have a column that is something like a 
factor that can contain as many NA excuses as you wish such as "NOT ANSWERED" 
to "CANNOT READ THE SQUIGGLE" to "NOT SURE" to "WILL BE FILLED IN LATER" to "I 
DON'T SPEAK ENGLISH AND CANNOT ANSWER STUPID QUESTIONS". This additional column 
would presumably only have content when the other column has an NA. Your 
queries and other changes would work on something like a data.frame where both 
such columns coexisted.

Note reading in data with multiple NA reasons may take extra work. If your 
error codes are text, it will all become text. If the error codes are 999 and 998 
and 997, it may all be treated as numeric and you may not want to convert all 
such codes to an NA immediately. Rather, you would use the first vector/column 
to make the second vector and THEN replace everything that should be an NA with 
an actual NA and reparse the entire vector to become properly numeric unless 
you like working with text and will convert to numbers as needed on the fly.

Now this form of annotation may not be pleasing but I suggest that an 
implementation that does allow annotation may use up space too. Of course, if 
your NA values are rare and space is only used then, you might save space. But 
if you could make a factor column and have it use the smallest int it can get 
as a basis, it may be a way to save on space.

People who have done work with R, especially those using the tidyverse, are 
quite used to using one column to explain another. So if you are asked to say 
tabulate what percent of missing values are due to reasons A/B/C then the added 
columns works fine for that calculation too.

 

-- 

Adrian Dusa
University of Bucharest
Romanian Social Data Archive
Soseaua Panduri nr. 90-92
050663 Bucharest sector 5
Romania

https://adriandusa.eu


[[alternative HTML version deleted]]

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] 1954 from NA

2021-05-24 Thread Avi Gross via R-devel
Adrian,

 

This is an aside. I note in many machine-learning algorithms they actually do 
something along the lines being discussed. They may take an item like a 
paragraph of words or an email message  and add thousands of columns with each 
one being a Boolean specifying if a particular word is in or not in that item. 
They may then run an analysis trying to heuristically match known SPAM items so 
as to be able to predict if new items might be SPAM. Some may even have a 
column for words taken two or more at a time such as “must” followed by “have” 
or “Your”, “last”, “chance”, resulting in even more columns. The software 
that does the analysis can work on remarkably large such collections, 
including in some cases taking multiple approaches at the same problem and 
choosing among them in some way.

 

In your case, yes, adding lots of columns seems like added work. But in data 
science, often the easiest way to do some complex things is to loop over 
selected existing columns and create multiple sets of additional columns that 
simplify later calculations by just using these values rather than some 
multi-line complex condition. I have as an example run statistical analyses 
where I have a Boolean column if the analysis failed (as in I caught it using 
try() or else it would kill my process) and another if I was told it did not 
converge properly and yet another column if it failed some post-tests. It 
simplified some queries that excluded rows where any one of the above was TRUE. 
I also stored columns for metrics like RMSEA and chi-squared values, sometimes 
dozens. And for each of the above, I actually had a set of columns for various 
models such as linear versus quadratic and more. Worse, as the analysis 
continued, more derived columns were added as various measures of the above 
results were compared to each other so the different models could be compared 
as in how often each was better. Careful choices of naming conventions and nice 
features of the tidyverse made it fairly simple to operate on many columns in 
the same way fairly easily such as all columns whose names start with a string 
or end with …

 

And, yes, for some efficiency, I often made a narrower version of the above 
with just the fields I needed and was careful not to remove what I might need 
later.

 

So it can be done and fairly trivially if you know what you are doing. If the 
names of all your original columns that behave this way look like *.orig and 
others look different, you can ask for a function to be applied to just those 
that produces another set with the same prefixes but named *.converted and yet 
another called *.annotation and so on. You may want to remove the originals to 
save space but you get the idea. The fact there are six hundred means little 
with such a design as the above can be done in probably a dozen lines of code 
to all of them at once.
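
For instance, that dozen-line loop could look something like this in base R (the column names and the 997-999 codes here are hypothetical):

```r
df <- data.frame(a.orig = c(5, 999, 2), b.orig = c(998, 1, 999))
# For every *.orig column, add *.converted (codes replaced by NA) and
# *.annotation (the reason kept as text) siblings:
for (nm in grep("\\.orig$", names(df), value = TRUE)) {
  stem <- sub("\\.orig$", "", nm)
  df[[paste0(stem, ".converted")]]  <-
    replace(df[[nm]], df[[nm]] %in% 997:999, NA)
  df[[paste0(stem, ".annotation")]] <-
    ifelse(df[[nm]] %in% 997:999, paste0("code ", df[[nm]]), NA)
}
names(df)
```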

 

For me, the above is way less complex than what you want to do and can have 
benefits. For example, if you make a graph of points from my larger 
tibble/data.frame using ggplot(), you can do things like specify what color to 
use for a point using a variable that contains the reason the data was missing 
(albeit that assumes the missing part is not what is being graphed) or add text 
giving the reason just above each such point. Your method of faking multiple 
things YOU claim are an NA may not make it doable in the above example.

 

From: Adrian Dușa 
Sent: Monday, May 24, 2021 8:18 AM
To: Greg Minshall 
Cc: Avi Gross ; r-devel 
Subject: Re: [Rd] 1954 from NA

 

On Mon, May 24, 2021 at 2:11 PM Greg Minshall  wrote:

[...]
if you have 500 columns of possibly-NA'd variables, you could have one
column of 500 "bits", where each bit has one of N values, N being the
number of explanations the corresponding column has for why the NA
exists.

 

The mere thought of implementing something like that gives me shivers. Not to 
mention such a solution should also be robust when subsetting, splitting, 
column and row binding, etc. and everything can be lost if the user deletes 
that particular column without realising its importance.

 

Social science datasets are much more alive and complex than one might first 
think: there are multi-wave studies with tens of countries, and aggregating 
such data is already a complex process to add even more complexity on top of 
that.

 

As undocumented as they may be, or even subject to change, I think the R 
internals are much more reliable that this.

 

Best wishes,

Adrian

 

-- 

Adrian Dusa
University of Bucharest
Romanian Social Data Archive
Soseaua Panduri nr. 90-92
050663 Bucharest sector

Re: [Rd] [External] Re: 1954 from NA

2021-05-24 Thread Avi Gross via R-devel
I was thinking about how one does things in a language that is properly 
object-oriented versus R, which makes various half-assed attempts at being such.

Clearly in some such languages you can make an object that is a wrapper that 
allows you to save an item that is the main payload as well as anything else 
you want. You might need a way to convince everything else to allow you to make 
things like lists and vectors and other collections of the objects and perhaps 
automatically unbox them for many purposes. As an example in a language like 
Python, you might provide methods so that adding A and B actually gets the 
value out of A and/or B and adds them properly.  But there may be too many edge 
cases to handle and some software may not pay attention to what you want 
including some libraries written in other languages.

I mention Python for the odd reason that it is now possible to combine Python 
and R in the same program and sort of switch back and forth between data 
representations. This may provide some openings for preserving and accessing 
metadata when needed.

Realistically, if R was being designed from scratch TODAY, many things might be 
done differently. But I recall it being developed at Bell Labs for purposes 
where it was sort of revolutionary at the time (back when it was S) and 
designed to do things in a vectorized way and probably primarily for the kinds 
of scientific and mathematical operations where a single NA (of several types 
depending on the data) was enough when augmented by a few things like a Nan and 
Inf and -Inf. I doubt they seriously saw a need for an unlimited number of NA 
that were all the same AND also all different that they felt had to be 
built-in. As noted, had they had a reason to make it fully object-oriented too 
and made the base types such as integer into full-fledged objects with room for 
additional metadata, then things may be different. I note I have seen languages 
which have both a data type called integer as lower case and Integer as upper 
case. One of them is regularly boxed and unboxed automagically when used in a 
context that needs the other. As far as efficiency goes, this invisibly adds 
many steps. So do languages that sometimes take a variable that is a pointer 
and invisibly reference it to provide the underlying field rather than make you 
do extra typing and so on.

So is there any reason only an NA should have such meta-data? Why not have 
reasons associated with Inf stating it was an Inf because you asked for one or 
the result of a calculation such as dividing by Zero (albeit maybe that might 
be a NaN) and so on. Maybe I could annotate integers with whether they are 
prime, or even versus odd, or a factor of 144, or anything else I can imagine. 
But at some point, the overhead from allowing all this can become substantial. 
I was amused at how python allows a function to be annotated including by 
itself since it is an object. So it can store such metadata perhaps in an 
attached dictionary so a complex costly calculation can have the results cached 
and when you ask for the same thing in the same session, it checks if it has 
done it and just returns the result in linear time. But after a while, how many 
cached results can there be?

-Original Message-
From: R-devel  On Behalf Of 
luke-tier...@uiowa.edu
Sent: Monday, May 24, 2021 9:15 AM
To: Adrian Dușa 
Cc: Greg Minshall ; r-devel 
Subject: Re: [Rd] [External] Re: 1954 from NA

On Mon, 24 May 2021, Adrian Dușa wrote:

> On Mon, May 24, 2021 at 2:11 PM Greg Minshall  wrote:
>
>> [...]
>> if you have 500 columns of possibly-NA'd variables, you could have 
>> one column of 500 "bits", where each bit has one of N values, N being 
>> the number of explanations the corresponding column has for why the 
>> NA exists.
>>

PLEASE DO NOT DO THIS!

It will not work reliably, as has been explained to you ad nauseam in this 
thread.

If you distribute code that does this it will only lead to bug reports on R 
that will waste R-core time.

As Alex explained, you can use attributes for this. If you need operations to 
preserve attributes across subsetting you can define subsetting methods that do 
that.
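A minimal sketch of that attribute-based approach (the class and attribute names here are hypothetical, not an existing API):

```r
# Store a per-element "reason" in an attribute and define a subsetting
# method that keeps the attribute aligned with the data.
tagged <- function(x, reason = rep(NA_character_, length(x))) {
  structure(x, reason = reason, class = "tagged")
}
`[.tagged` <- function(x, i) {
  tagged(unclass(x)[i], reason = attr(x, "reason")[i])
}

v <- tagged(c(1, NA, 3), reason = c(NA, "refused to answer", NA))
w <- v[2:3]
attr(w, "reason")   # "refused to answer" NA -- survives the subsetting
```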

If you are dead set on doing something in C you can try to develop an ALTREP 
class that provides augmented missing value information.

Best,

luke



>
> The mere thought of implementing something like that gives me shivers. 
> Not to mention such a solution should also be robust when subsetting, 
> splitting, column and row binding, etc. and everything can be lost if 
> the user deletes that particular column without realising its importance.
>
> Social science datasets are much more alive and complex than one might 
> first think: there are multi-wave studies with tens of countries, and 
> aggregating such data is already a complex process without adding even more 
> complexity on top of that.
>
> As undocumented as they may be, or even subject to change, I think the 
> R internals are much more 

Re: [Rd] [External] Re: 1954 from NA

2021-05-25 Thread Avi Gross via R-devel
 job done and be fairly certain others accept the 
results and then do other activities they are better suited for, or at least 
think they are.

 

There are intermediates I have used where I let them do various kinds of 
processing on SPSS and save the result in some format I can read into R for 
additional processing. The latter may not be stuff that requires keeping track 
of multiple NA equivalents. Of course, if you want to save the results and move 
them back, that is a challenge. Hybrid approaches may tempt them to try 
something and maybe later do more and more and move over.

 

From: Adrian Dușa  
Sent: Tuesday, May 25, 2021 2:17 AM
To: Avi Gross 
Cc: r-devel 
Subject: Re: [Rd] [External] Re: 1954 from NA

 

Dear Avi,

 

Thank you so much for the extended messages, I read them carefully.

While partially offering a solution (I've already been there), it creates 
additional work for the user, and some of that is unnecessary.

 

What I am trying to achieve is best described in this draft vignette:

 

devtools::install_github("dusadrian/mixed")

vignette("mixed")

 

Once a value is declared to be missing, the user should not need to do anything 
else about it. Despite being present, the value should automatically be treated as 
missing by the software. That is the way it's done in all major statistical 
packages like SAS, Stata and even SPSS.
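That behavior can be approximated in R today with an S3 class whose is.na() method treats declared codes as missing; this is only a hypothetical sketch, not the actual implementation in the package above:

```r
# Values such as -91 ("refused to answer") stay visible in the data,
# but is.na() automatically reports them as missing.
declared <- function(x, na_values) {
  structure(x, na_values = na_values, class = "declared")
}
is.na.declared <- function(x) {
  v <- unclass(x)
  is.na(v) | v %in% attr(x, "na_values")
}

ans <- declared(c(1, 2, -91), na_values = -91)
is.na(ans)   # FALSE FALSE TRUE
```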

 

My end goal is to make R attractive for my faculty peers (and beyond), almost 
all of whom are massively using SPSS and sometimes Stata. But in order to 
convince them to (finally) make the switch, I need to provide similar 
functionality, not additional work.

 

Regarding the first part of your message, I am definitely not trying to change the R 
internals. The NA will still be NA, exactly as currently defined.

My initial proposal was based on the observation that the 1954 payload was 
stored as an unsigned int (thus occupying 32 bits) when it is obvious it 
doesn't need more than 16. That was the only proposed modification, and 
everything else stays the same.

 

I now learned, thanks to all contributors in this list, that building something 
around that payload is risky because we do not know exactly what the compilers 
will do. One possible solution that I can think of, while (still) maintaining 
the current functionality around the NA, is to use a different high word for 
the NA that would not trigger compilation issues. But I have absolutely no idea 
what that implies for the other inner workings of R.

 

I very much trust the R core will eventually find a robust solution; they've 
solved much more complicated problems than this. I just hope the current thread 
will put the idea of tagged NAs on the table for when they discuss this.

 

Once that is solved, and despite the current advice discouraging this 
route, I believe tagging NAs is a valuable idea that should not be discarded.

After all, the NA is nothing but a tagged NaN.

 

All the best,

Adrian

 

 

On Tue, May 25, 2021 at 7:05 AM Avi Gross via R-devel  wrote:

I was thinking about how one does things in a language that is properly 
object-oriented versus R that makes various half-assed attempts at being such.

Clearly in some such languages you can make an object that is a wrapper that 
allows you to save an item that is the main payload as well as anything else 
you want. You might need a way to convince everything else to allow you to make 
things like lists and vectors and other collections of the objects and perhaps 
automatically unbox them for many purposes. As an example in a language like 
Python, you might provide methods so that adding A and B actually gets the 
value out of A and/or B and adds them properly.  But there may be too many edge 
cases to handle and some software may not pay attention to what you want 
including some libraries written in other languages.

I mention Python for the odd reason that it is now possible to combine Python 
and R in the same program and sort of switch back and forth between data 
representations. This may provide some openings for preserving and accessing 
metadata when needed.


Re: [Rd] [External] Re: 1954 from NA

2021-05-25 Thread Avi Gross via R-devel
Greg,

I am curious what they suggest you use multiple NaN values for. Or, is it 
simply like how text messages on your phone started: standard-size 
packets were bigger than what some uses required, so they piggy-backed messages 
on the "empty" space.

If by NaN you include the various flavors of NA such as the logical NA and 
NA_complex_, I have sometimes wondered if they are slightly different bitstreams 
or all the same but interpreted by programs as being the right kind for their 
context. Sounds like maybe they are different and there is one for pretty much 
each basic type except perhaps raw.

But if you add more, in that case, will it be seen as the right NA for the 
environment it is in? Heck, if R adds yet another basic type (like a 
quaternion) or a nibble, could they use the same bits you took without asking 
for your application?

It does sound like some suggest you use a method with existing abilities and 
tightly control that all functions used to manipulate the data will behave and 
preserve those attributes. I am not so sure the clients using it will obey. I 
have seen plenty of people say use some tidyverse functions for various 
purposes then use something more base-R like complete.cases() or rbind() that 
may, but also may not, preserve what they want. And once lost, ...

Now, of course, you could write wrapper functions that will take the data, copy 
the attributes, allow whatever changes, and carefully put them back before 
returning. This may not be trivial though if you want to do something like 
delete lots of rows as you might need to first identify what rows will be kept, 
then adjust the vector of attributes accordingly before returning it. Sorting 
is another such annoyance. Many things do conversions such as making copies or 
converting a copy to a factor, that may mess things up. If it has already been 
done and people have experience, great. If not, good luck.
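A sketch of the row-deletion case just described (hypothetical helper; base data-frame subsetting drops custom attributes, which is exactly the hazard mentioned):

```r
# Filter rows of a data frame while keeping a row-aligned "reason"
# attribute in sync with the rows that survive.
filter_keep_attr <- function(df, keep) {
  reasons <- attr(df, "reason")          # one entry per row, or NULL
  out <- df[keep, , drop = FALSE]        # this alone would lose the attribute
  if (!is.null(reasons)) attr(out, "reason") <- reasons[keep]
  out
}

d <- data.frame(x = c(1, NA, 3))
attr(d, "reason") <- c(NA, "not measured", NA)
d2 <- filter_keep_attr(d, !is.na(d$x))
attr(d2, "reason")   # the entry for the dropped row is gone too
```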

-Original Message-
From: Gregory Warnes  
Sent: Tuesday, May 25, 2021 9:13 PM
To: Avi Gross 
Cc: r-devel 
Subject: Re: [Rd] [External] Re: 1954 from NA

As a side note, for floating point values, the IEEE 754 standard provides for a 
large set of NaN values, making it possible to have multiple types of NAs for 
floating point values...

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] order of operations

2021-08-27 Thread Avi Gross via R-devel
Does anyone have a case where this construct has a valid use? 

Didn't Python add a := operator recently that might be intended more for
such uses as compared to using the standard assignment operators? I wonder
if that has explicit guarantees of what happens in such cases, but that is
outside what this forum cares about. Just for the heck of it, I tried the
example there:

>>> (x := 1) * (x := 2)
2
>>> x
2

Back to R, ...

The constructs can get arbitrarily complex as in:

(x <- (x <- 0) + 1) * (x <- (x <-2) + 1)

My impression is that when evaluation is left to right and also innermost
parentheses before outer ones, then something like the above goes in stages.
The first of two parenthetical expressions is evaluated first.

(x <- (x <- 0) + 1)

The inner parenthesis sets x to zero, then the outer one increments x to 1.
The full sub-expression evaluates to 1 and that value is set aside for a
later multiplication.

But then the second parenthesis evaluates similarly, from inside out:

(x <- (x <-2) + 1)

It clearly resets x to 2 then increments it by 1 to 3 and returns a value of
3. That is multiplied by the first sub-expression to result in 3.

So for simple addition, even though it is commutative, is there any reason
any compiler or interpreter should not follow rules like the above?
Obviously with something like matrices, some operations are not abelian and
require more strict interpretation in the right order.

And note the expressions like the above can run into more complex quandaries
such as when you have a conditional with OR or AND parts that may be
short-circuited and in some cases, a variable you expected to be set, may
remain unset or ...
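A two-line R illustration of the short-circuit point:

```r
# The right-hand side of || is never evaluated when the left side is TRUE,
# so the embedded assignment never happens.
x <- 0
if (TRUE || (x <- 1) > 0) invisible(NULL)
x   # still 0
```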

This reminds me a bit of languages that allow pre/post increment/decrement
operators like ++ and -- and questions about what order things happen.
Ideally, anything in which a deterministic order is not guaranteed should be
flagged by the language at compile time (or when interpreted) and refuse to
go on. 

All I can say is that with computer languages adding ever more features, 
greater power comes with greater responsibility and often greater
confusion.


-Original Message-
From: R-devel  On Behalf Of Gabor
Grothendieck
Sent: Friday, August 27, 2021 11:32 AM
To: Thierry Onkelinx 
Cc: r-devel@r-project.org
Subject: Re: [Rd] order of operations

I agree and personally never do this but I would still like to know if it is
guaranteed behavior or not.

On Fri, Aug 27, 2021 at 11:28 AM Thierry Onkelinx 
wrote:

> IMHO this is just bad practice. Whether the result is guaranteed or 
> not, doesn't matter.
>
> ir. Thierry Onkelinx
> Statisticus / Statistician
>
> Vlaamse Overheid / Government of Flanders INSTITUUT VOOR NATUUR- EN 
> BOSONDERZOEK / RESEARCH INSTITUTE FOR NATURE AND FOREST Team Biometrie 
> & Kwaliteitszorg / Team Biometrics & Quality Assurance 
> thierry.onkel...@inbo.be Havenlaan 88 bus 73, 1000 Brussel www.inbo.be
>
>
> //
> To call in the statistician after the experiment is done may be no more
> than asking him to perform a post-mortem examination: he may be able to
> say what the experiment died of. ~ Sir Ronald Aylmer Fisher
> The plural of anecdote is not data. ~ Roger Brinner
> The combination of some data and an aching desire for an answer does not
> ensure that a reasonable answer can be extracted from a given body of data.
> ~ John Tukey
> //
>
> 
>
>
> Op vr 27 aug. 2021 om 17:18 schreef Gabor Grothendieck <
> ggrothendi...@gmail.com>:
>
>> Are there any guarantees of whether x will equal 1 or 2 after this is
run?
>>
>> (x <- 1) * (x <- 2)
>> ## [1] 2
>> x
>> ## [1] 2
>>
>> --
>> Statistics & Software Consulting
>> GKX Group, GKX Associates Inc.
>> tel: 1-877-GKX-GROUP
>> email: ggrothendieck at gmail.com
>>
>> __
>> R-devel@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>
>

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com



Re: [Rd] order of operations

2021-08-27 Thread Avi Gross via R-devel
Running things in various forms of parallel opens up all kinds of issues. 
Currently, programs that use forms like "threads" often need to carefully 
protect any variables that can be changed using things like locks.

So what would they do in the scenario being discussed? Would they need to 
analyze the entire part of the program before splitting off parts and add code 
to protect not only from simultaneous access to the variable but set up a 
guarantee so that one of multiple threads would get to change it first and 
others freeze until it is their turn?

Strikes me as a bit too complex given the scenario does not look like one that 
is likely to have serious uses. 

I understand the question is more academic and there are multiple reasonable 
answers with tradeoffs. And one answer is to make it totally deterministic even 
if that precludes any ability to speed things up.  Another is to simply declare 
such use to be either illegal or unsupported.

And, perhaps, there can be support for ways to do this kind of thing more 
safely. Clearly, the methods of parallelism vary from threads within a program 
running on the same processor that just interleave, to running on multiple 
processors and even multiple machines across the world. Darned if I know what 
issues would come up on quantum computers, which have yet other aspects of the 
concept of parallelism.


-Original Message-
From: Gabor Grothendieck  
Sent: Friday, August 27, 2021 1:58 PM
To: Avi Gross 
Cc: r-devel@r-project.org
Subject: Re: [Rd] order of operations

It could be that the two sides of * are run in parallel in the future and maybe 
not having a guarantee would simplify implementation?


On Fri, Aug 27, 2021 at 12:35 PM Avi Gross via R-devel  
wrote:
>
> Does anyone have a case where this construct has a valid use?
>
> Didn't Python  add a := operator recently that might be intended more 
> for such uses as compared to using the standard assignment operators? 
> I wonder if that has explicit guarantees of what happens in such 
> cases, but that is outside what this forum cares about. Just for the 
> heck of it, I tried the example there:
>
> >>> (x := 1) * (x := 2)
> 2
> >>> x
> 2
>
> Back to R, ...
>
> The constructs can get arbitrarily complex as in:
>
> (x <- (x <- 0) + 1) * (x <- (x <-2) + 1)
>
> My impression is that when evaluation is left to right and also 
> innermost parentheses before outer ones, then something like the above goes 
> in stages.
> The first of two parenthetical expressions is evaluated first.
>
> (x <- (x <- 0) + 1)
>
> The inner parenthesis set x to zero then the outer one increments x to 1.
> The full sub-expression evaluates to 1 and that value is set aside for 
> a later multiplication.
>
> But then the second parenthesis evaluates similarly, from inside out:
>
> (x <- (x <-2) + 1)
>
> It clearly resets x to 2 then increments it by 1 to 3 and returns a 
> value of 3. That is multiplied by the first sub-expression to result in 3.
>
> So for simple addition, even though it is commutative, is there any 
> reason any compiler or interpreter should not follow rules like the above?
> Obviously with something like matrices, some operations are not 
> abelian and require more strict interpretation in the right order.
>
> And note the expressions like the above can run into more complex 
> quandaries such as when you have a conditional with OR or AND parts 
> that may be short-circuited and in some cases, a variable you expected 
> to be set, may remain unset or ...
>
> This reminds me a bit of languages that allow pre/post 
> increment/decrement operators like ++ and -- and questions about what order 
> things happen.
> Ideally, anything in which a deterministic order is not guaranteed 
> should be flagged by the language at compile time (or when 
> interpreted) and refuse to go on.
>
> All I can say with computer languages and adding ever more features,
> with greater power comes greater responsibility and often 
> greater confusion.
>
>
> -Original Message-
> From: R-devel  On Behalf Of Gabor 
> Grothendieck
> Sent: Friday, August 27, 2021 11:32 AM
> To: Thierry Onkelinx 
> Cc: r-devel@r-project.org
> Subject: Re: [Rd] order of operations
>
> I agree and personally never do this but I would still like to know if 
> it is guaranteed behavior or not.
>
> On Fri, Aug 27, 2021 at 11:28 AM Thierry Onkelinx 
> 
> wrote:
>
> > IMHO this is just bad practice. Whether the result is guaranteed or 
> > not, doesn't matter.
> >
> > ir. Thierry Onkelinx
> > St

Re: [Rd] WISH: set.seed(seed) to produce error if length(seed) != 1 (now silent)

2021-09-17 Thread Avi Gross via R-devel
R wobbles a bit as there is no normal datatype that is a singleton variable.  
Saying x <- 5 just creates a vector of current length 1. It is perfectly legal 
to then write x[2] <- 6 and so on. The vector lengthens. You can truncate it 
back to 1, if you wish: length(x) <- 1

So the question here is what happens if you supply more info than is needed? If 
it is an integer vector of length greater than one, should it ignore everything 
but the first entry? I note it happily accepts not-quite integers like TRUE and 
FALSE. It also accepts floating-point numbers like 1.23 or 1.2e5. 

The goal seems to be to set a unique starting point, rounded or transformed if 
needed. The visible part of the function does not even look at the seed before 
calling the internal representation. So although superficially choosing the 
first integer in a vector makes some sense, it can be a problem if a program 
assumes the entire vector is consumed and perhaps hashed in some way to make a 
seed. If the program later changes parts of the vector other than the first 
entry, it may assume re-setting the seed gets something else and yet it may be 
exactly the same.

So, yes, I suspect it is an ERROR to take anything that cannot be coerced by 
something like as.integer() into a vector of length 1.
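A strict front end along those lines is short to write; this is a hypothetical helper, not a proposal for set.seed() itself:

```r
# Refuse anything but a single whole number instead of coercing silently.
strict_set_seed <- function(seed) {
  if (!is.numeric(seed) || length(seed) != 1L || is.na(seed) ||
      seed != trunc(seed)) {
    stop("seed must be a single whole number")
  }
  set.seed(seed)
}

strict_set_seed(42)              # fine
try(strict_set_seed(c(1, 2)))    # error, instead of silently using one element
```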

I have noted other places in R where I may get a warning when giving a longer 
vector that only the first element will be used. Are they all problems that 
need to be addressed?

Here is a short one:

> x <- c(1:3)
> if (x > 2) y <- TRUE
Warning message:
  In if (x > 2) y <- TRUE :
  the condition has length > 1 and only the first element will be used
> y
Error: object 'y' not found

The above is not vectorized; the if() condition uses only the first element 
(x[1] == 1, so x > 2 is FALSE) and thus does not set y.

Now a vectorized variant works as expected, making a vector of length 3 for y:

> x
[1] 1 2 3

> y <- ifelse(x > 2, TRUE, FALSE)
> y
[1] FALSE FALSE  TRUE

I have no doubt fixing lots of this stuff, if indeed it is a fix, can break 
lots of existing code. Sure, it is not harmful to ask a programmer to always 
say x[1] to guarantee they are getting what they want, or to add a function 
like first(x) that does the same. 

R has some compromises or features I sometimes wonder about. If it had a 
concept of a numeric scalar, then some things that now happen might start being 
an error.

What happens when you multiply a vector by a scalar as in 5*x is that every 
component of x is multiplied by 5. But x*x does componentwise multiplication. 
So, say x is c(1:3): what should this do, using a twosome times a threesome?

x[1:2]*x
[1] 1 4 3
Warning message:
  In x[1:2] * x :
  longer object length is not a multiple of shorter object length

Is it recycling to get a 1 in pseudo-position 3?

Yep, after lengthening x (to 1:9 here), this shows recycling:

> x[1:2]*x
[1]  1  4  3  8  5 12  7 16  9
Warning message:
  In x[1:2] * x :
  longer object length is not a multiple of shorter object length

You do get a warning but not telling you what it did.

In essence, the earlier case of 5*x arguably recycled the 5 as many times as 
needed but with no warning. 

My point is that many languages, especially older ones, were designed a certain 
way and have been updated but we may be stuck with what we have. A brand new 
language might come up with a new way that includes vectorizing the heck out of 
things but allowing and even demanding that you explicitly convert things to a 
scalar in a context that needs it or to explicitly asking for recycling when 
you want it or ...




-Original Message-
From: R-devel  On Behalf Of Henrik Bengtsson
Sent: Friday, September 17, 2021 8:39 AM
To: GILLIBERT, Andre 
Cc: R-devel 
Subject: Re: [Rd] WISH: set.seed(seed) to produce error if length(seed) != 1 
(now silent)

> I’m curious, other than proper programming practice, why?

Life's too short for troubleshooting silent mistakes - mine or others.

While at it, searching the interwebs for uses of set.seed() gives 
mistakes/misunderstandings like calling set.seed() with a non-integer seed, e.g.

> set.seed(6.1); sum(.Random.seed)
[1] 73930104
> set.seed(6.2); sum(.Random.seed)
[1] 73930104

which clearly is not what the user expected. There are also a few cases of 
passing a string to set.seed(), e.g.

> set.seed("42"); sum(.Random.seed)
[1] -2119381568
> set.seed(42); sum(.Random.seed)
[1] -2119381568

which works just because as.numeric("42") is used.

/Henrik

On Fri, Sep 17, 2021 at 12:55 PM GILLIBERT, Andre 
 wrote:
>
> Hello,
>
> A vector with a length >= 2 to set.seed would probably be a bug. An error 
> message will help the user to fix his R code. The bug may be accidental or 
> due to bad understanding of the set.seed function. For instance, a user may 
> think that the whole state of the PRNG can be passed to set.seed.
>
> The "if" instruction, emits a warning when the condition has length >= 2, 
> because it is often a bug. I would expect a warning or error with set.seed().
>
> Validating inputs and emitting errors early is a good practice.
>
> Just my 2 cents.
>
> Sincer

Re: [Rd] string concatenation operator (revisited)

2021-12-04 Thread Avi Gross via R-devel
Grant,

One nit to consider is that the default behavior of paste() to include a space 
as a separator would not be a perfect choice for the usual meaning of plus. 

I would prefer a+b to be "helloworld" in your example and to get what you say 
would be 

a + " " + b

Which I assume would put in a space where you want it and not where you don't.

As I am sure you have been told, you already can make an operator like this:

`%+%` <- function(x, y) paste0(x, y)

And then use:

a %+% b

And to do it this way, you might have two such functions where %+% does NOT add 
a space but the odd version with a space in it, % +% or %++% does add a space!

`%+%` <- function(x, y) paste0(x, y)
`%++%` <- function(x, y) paste0(x, " ", y)
`% +%` <- function(x, y) paste0(x, " ", y)

Now testing it with:

a = "hello"; b = "world" # NOTE I removed the trailing space you had in "a".

> a %+% b
[1] "helloworld"
> a %++% b
[1] "hello world"
> a % +% b
[1] "hello world"

It also seems to work with multiple units mixed in a row as shown below:

> a %+% b % +% a %++% b
[1] "helloworld hello world"

And it sort of works with vectors of strings or numbers using string 
concatenation:

> a <- letters[1:3]
> b <- seq(from=101, to = 301, by = 100)
> a %+% b %+% a
[1] "a101a" "b201b" "c301c"

But are you asking for a naked "+" sign to be vectorized like that?

And what if someone accidentally types something like:

a = "text"
a = a + 1

The addition now looks like adding an integer to a text string. In many 
languages, like Perl, this results in implicit conversion to make "text1" the 
result. My work-around does that:

> a = a %+% 1
> a
[1] "text1"

BUT what you are asking for is for R to do normal addition if a and b are both 
numeric and presumably do (as languages like Python do) text concatenation when 
they are both text. What do you suggest happen if one is numeric and the other 
is text or perhaps some arbitrary data type? 

I checked to see what Python version 3.9 does:

>>> 5 + 4
9
>>> "5" + "4"
'54'
>>> "5" + 4
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
"5" + 4
TypeError: can only concatenate str (not "int") to str

It is clear it does not normally support such mixed methods, albeit I can 
probably easily create an object sub-class where I create a dunder method that 
perhaps checks if one of the two things being added can be coerced into a 
string or into a number as needed to convert so the two types match.

But this is about R.

As others have said, the underlying early philosophy of R being created as a 
language did not head the same way as some other languages and R is mainly not 
the same kind of object-oriented as some others and thus some things are not 
trivially done but can be done using other ways like the %+% technique above.

But R also allows weird things like this: 
# VERY CAREFULLY as overwriting "+" means you cannot use it in your other ...
# So not a suggested idea but if done you must preserve the original meaning of 
plus elsewhere like I do.

flexible_plus <- function(first, second) {
  # use base::`+` here so the function still works after `+` is rebound below
  if (all(is.numeric(first), is.numeric(second))) return(base::`+`(first, second))
  if (all(is.character(first), is.character(second))) return(paste0(first, second))
  # If you reach here, there is an error
  print("ERROR: both arguments must be numeric or both character")
  return(NULL)
}

Now, carefully, make something like the function flexible_plus I created become 
the MEANING of a naked plus sign. But note it will now be used in other ways and 
places in any code that does addition, so it is not an ideal solution. It does 
sort of work, FWIW.

`%+++%` <- `+`
`+` <- flexible_plus

Finally some testing:

> 5 %+++% 3
[1] 8
> flexible_plus(5, 3)
[1] 8
> 5 + 3
[1] 8
> "hello" + "world"
[1] "helloworld"
> "hello" + 5
[1] "ERROR: both arguments must be numeric or both character"
NULL

It does seem to do approximately what I said it would do but also does some 
vectorized things as well as long as all are the same type:

> c(1,2,3) + 4
[1] 5 6 7
> c(1,2,3) + c(4,5,6)
[1] 5 7 9
> c("word1", "word2", "word3") + "more"
[1] "word1more" "word2more" "word3more"
> c("word1", "word2", "word3") + c("more", "snore")
[1] "word1more"  "word2snore" "word3more"

Again, the above code is for illustration purposes only. I would be beyond 
shocked if the above did not break something somewhere and it certainly is not 
as efficient as the built-in adder. As an exercise, it looks reasonable. LOL!


-Original Message-
From: R-devel  On Behalf Of Grant McDermott
Sent: Saturday, December 4, 2021 5:37 PM
To: r-devel@r-project.org
Subject: [Rd] string concatenation operator (revisited)

Hi all,

I wonder if the R Core team might reconsider an old feature request, as 
detailed in this 2005 thread: 
https://stat.ethz.ch/pipermail/r-help/2005-February/thread.html#66698

The TL;DR version is base R support for a `+.character` method. This would 
essentially provide a shortcut to `paste​0`, in much the same w

Re: [Rd] string concatenation operator (revisited)

2021-12-06 Thread Avi Gross via R-devel
After seeing what others are saying, it is clear that you need to carefully
think things out before designing any implementation of a more native
concatenation operator whether it is called "+' or anything else. There may
not be any ONE right solution but unlike a function version like paste()
there is nowhere to place any options that specify what you mean.

You can obviously expand paste() to accept arguments like replace.NA=""
and similar arguments on what to do if you see a NaN, an
Inf or -Inf, a NULL or even an NA_character_ and so on. Heck, you might tell it
to make other substitutions as in substitute=list(`100`=99, D=F) or any other
nonsense you can come up with.

But you have nowhere to put options when saying:

c <- a + b

Sure, you could set various global options before the addition and maybe
reset them after, but that is not a way I like to go for something this
basic.

And enough such tinkering makes me wonder if it is easier to ask a user to
use a slightly different function like this:

paste.no.na <- function(...) do.call(paste, Filter(Negate(is.na),
list(...)))

The above one-line function removes any NA from the argument list to make a
potentially shorter list before calling the real paste() using it.

Variations can, of course, be made that allow functionality as above. 
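For example (the helper repeated so the snippet stands alone):

```r
# Drop NA arguments before the real paste() ever sees them.
paste.no.na <- function(...) do.call(paste, Filter(Negate(is.na), list(...)))

paste.no.na("a", NA, "b")   # "a b" -- the NA vanishes instead of becoming "NA"
```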

If R was a true object-oriented language in the same sense as others like
Python, operator overloading of "+" might be doable in more complex ways but
we can only work with what we have. I tend to agree with others that in some
places R is so lenient that all kinds of errors can happen because it makes
a guess on how to correct it. Generally, if you really want to mix numeric
and character, many languages require you to transform any arguments to make
all of compatible types. The paste() function is clearly stated to coerce
all arguments to be of type character for you. Whereas a+b makes no such
promises and also is not properly defined even if a and b are both of type
character. Sure, we can expand the language but it may still do things some
find not to be quite what they wanted as in "2"+"3" becoming "23" rather
than 5. Right now, I can use as.numeric("2")+as.numeric("3") and get the
intended result after making very clear to anyone reading the code that I
wanted strings converted to floating point before the addition.

As has been pointed out, the plus operator if used to concatenate does not
have a cognate for other operations like -*/ and R has used most other
special symbols for other purposes. So, sure, we can use something like "...."
(4 periods) if it is not already being used for something, but using + here
is a tad confusing. Having said that, the makers of Python did make that
choice.

-Original Message-
From: R-devel  On Behalf Of Gabriel Becker
Sent: Monday, December 6, 2021 7:21 PM
To: Bill Dunlap 
Cc: Radford Neal ; r-devel 
Subject: Re: [Rd] string concatenation operator (revisited)

As I recall, there was a large discussion related to that which resulted in
the recycle0 argument being added (but defaulting to FALSE) for
paste/paste0.

I think a lot of these things ultimately mean that if there were to be a
string concatenation operator, it probably shouldn't have behavior identical
to paste0. Was that what you were getting at as well, Bill?

~G

On Mon, Dec 6, 2021 at 4:11 PM Bill Dunlap  wrote:

> Should paste0(character(0), c("a","b")) give character(0)?
> There is a fair bit of code that assumes that paste("X",NULL) gives "X"
> but c(1,2)+NULL gives numeric(0).
>
> -Bill
>
> On Mon, Dec 6, 2021 at 1:32 PM Duncan Murdoch 
> 
> wrote:
>
>> On 06/12/2021 4:21 p.m., Avraham Adler wrote:
>> > Gabe, I agree that missingness is important to factor in. To 
>> > somewhat
>> abuse
>> > the terminology, NA is often used to represent missingness. Perhaps 
>> > concatenating character something with character something missing
>> should
>> > result in the original character?
>>
>> I think that's a bad idea.  If you wanted to represent an empty 
>> string, you should use "" or NULL, not NA.
>>
>> I'd agree with Gabe, paste0("abc", NA) shouldn't give "abcNA", it 
>> should give NA.
>>
>> Duncan Murdoch
>>
>> >
>> > Avi
>> >
>> > On Mon, Dec 6, 2021 at 3:35 PM Gabriel Becker 
>> > 
>> wrote:
>> >
>> >> Hi All,
>> >>
>> >> Seeing this and the other thread (and admittedly not having 
>> >> clicked
>> through
>> >> to the linked r-help thread), I wonder about NAs.
>> >>
>> >> Should NA  "hi there"  not result in NA_character_? This 
>> >> is not what any of the paste functions do, but in my opinoin, NA +
>> 
>> >> seems like it should be NA  (not "NA"), particularly if we are 
>> >> talking about `+` overloading, but potentially even in the case of 
>> >> a distinct concatenation operator?
>> >>
>> >> I guess what I'm saying is that in my head missingness propagation
>> rules
>> >> should take priority in such an operator (ie NA +  
>> >> should *always * be NA).
>> >>
>> >> Is that something others 
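[Editor's note: the recycling and missingness behaviours discussed in this
subthread can be checked directly, and an NA-propagating operator is easy to
sketch. The name %+% is purely illustrative, and recycle0 requires R >= 4.0.1.]

```r
# Zero-length quirks of paste()/paste0() that any operator would inherit:
paste("X", NULL)                                    # "X": NULL arguments are dropped
paste0(character(0), c("a", "b"))                   # c("a", "b")
paste0(character(0), c("a", "b"), recycle0 = TRUE)  # character(0)

# A sketch of a concatenation operator that propagates NA,
# as ordinary missingness rules would suggest:
`%+%` <- function(e1, e2) {
  out <- paste0(e1, e2)
  out[is.na(e1) | is.na(e2)] <- NA_character_
  out
}
NA %+% "hi there"   # NA, not "NAhi there"
```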

Re: [Rd] string concatenation operator (revisited)

2021-12-07 Thread Avi Gross via R-devel
on an older version.

The reality is that there are significant costs in the tradeoff between ease of 
use, with many choices, and the expense of running a bloated application that 
takes longer to load, uses more memory, and spends more time searching 
namespaces and so on. 

Does adding a properly designed "+" cause much bloat? Maybe not. But the 
guardians of the language get so many requests that, realistically, they can 
only approve a small number for each release, and often then have to spend more 
time fixing bugs after getting complaints about code that no longer works the 
same!


-Original Message-
From: R-devel  On Behalf Of Taras Zakharko
Sent: Tuesday, December 7, 2021 4:09 AM
To: r-devel 
Subject: Re: [Rd] string concatenation operator (revisited)

Great summary, Avi. 

String concatenation could be trivially added to R, but it probably should not 
be. You will notice that modern languages tend not to use “+” for string 
concatenation (they either have a custom operator or a special kind of pattern 
for it) because of the practical issues such an approach brings (implicit type 
casting, lack of commutativity, performance, etc.). These issues would be felt 
even more in R, with its weak typing, idiosyncratic casting behavior and 
NAs. 

As others have pointed out, any behavior one wants from string concatenation 
can be implemented with custom operators as needed. This is not something that 
needs to be in base R. I would rather see the effort directed at improving 
string formatting (such as glue-style built-in string 
interpolation).

— Taras


> On 7 Dec 2021, at 02:27, Avi Gross via R-devel  wrote:
> 
> After seeing what others are saying, it is clear that you need to 
> carefully think things out before designing any implementation of a 
> more native concatenation operator, whether it is called "+" or 
> anything else. There may not be any ONE right solution but unlike a 
> function version like paste() there is nowhere to place any options that 
> specify what you mean.
> 
> You can obviously expand paste() to accept arguments like 
> replace.NA="" or replace.NA="" and similar arguments on what to do 
> if you see a NaN, and Inf or -Inf, a NULL or even an NA.character_ and 
> so on. Heck, you might tell to make other substitutions as in 
> substitute=list(100=99, D=F) or any other nonsense you can come up with.
> 
> But you have nowhere to put options when saying:
> 
> c <- a + b
> 
> Sure, you could set various global options before the addition and 
> maybe reset them after, but that is not a way I like to go for 
> something this basic.
> 
> And enough such tinkering makes me wonder if it is easier to ask a 
> user to use a slightly different function like this:
> 
> paste.no.na <- function(...) do.call(paste, Filter(Negate(is.na),
> list(...)))
> 
> The above one-line function removes any NA from the argument list to 
> make a potentially shorter list before calling the real paste() using it.
> 
> Variations can, of course, be made that allow functionality as above. 
> 
> If R was a true object-oriented language in the same sense as others 
> like Python, operator overloading of "+" might be doable in more 
> complex ways but we can only work with what we have. I tend to agree 
> with others that in some places R is so lenient that all kinds of 
> errors can happen because it guesses at how to correct them.
> Generally, if you really want to mix numeric and character data, many 
> languages require you to convert the arguments yourself so that 
> everything is of a compatible type. The paste() function is documented 
> to coerce all arguments to type character for you, whereas a+b makes 
> no such promise and is not even properly defined when a and b are both 
> of type character. Sure, we can expand the language, but it may still 
> do things some find not to be quite what they wanted, as in "2"+"3"
> becoming "23" rather than 5. Right now, I can use
> as.numeric("2")+as.numeric("3") and get the intended result, after making it 
> very clear to anyone reading the code that I wanted the strings converted to 
> floating point before the addition.
> 
> As has been pointed out, the plus operator used for concatenation has 
> no cognate for the other arithmetic operations (-, *, /), and R has 
> claimed most other special symbols for other purposes. So, sure, we 
> could use something like "...." (4 periods) if it is not already being 
> used for something, but using + here is a tad confusing. Having said 
> that, the makers of Python did make that choice.
> 
> -Original Message-
> From: R-devel  On Behalf Of Gabriel 
> Becker
> Sent: Monday, December 6, 2021 7:21 PM
> To: Bill Dunlap

Re: [Rd] Documentation for floor, ceiling & trunc

2022-01-01 Thread Avi Gross via R-devel
Excellent reason, Duncan. R does not have an unlimited integer type as Python
does, so truncating or rounding can easily produce a result that is out of
bounds for R's integer type.

If someone really wants an array of integers, for efficiency or other reasons,
they could carefully check the output of something like floor() to see that
every number returned is no greater than .Machine$integer.max (and similarly
bounded below for negatives) and then make a vector with as.integer(whatever)
for later use. If any numbers were out of range, they could presumably do other
things, like making them NA or Inf, or switching to some larger integer format
they find or create. Of course, any such alterations may well not work if fed
to anything not expecting them.

Now consider the purpose of the R functions round(), floor(), ceiling() and
trunc(), and perhaps even signif(), taken as a group. Clearly some of them only
make sense in a floating point context, since rounding to three significant
digits beyond the decimal point will not usually produce an integer. Sure, some
of them are normally used in real life to mean "round to the nearest integer",
and in those cases it could be reasonable to have a function with a restricted
domain that maps into the restricted integer range. You can make your own such
function easily enough.
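
A do-it-yourself version of such a restricted-domain function might look like
the following; the name and the choice of NA for out-of-range values are just
one of the options described above:

```r
# Floor to R's integer type, with NA for values that do not fit.
# Name and out-of-range policy are illustrative, not a proposal.
floor_int <- function(x) {
  f <- floor(x)
  f[abs(f) > .Machine$integer.max] <- NA
  as.integer(f)
}

floor_int(c(2.7, -2.7))   # 2L, -3L
floor_int(3e9 + 0.1)      # NA: 3e9 exceeds .Machine$integer.max
```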

-Original Message-
From: R-devel  On Behalf Of Duncan Murdoch
Sent: Saturday, January 1, 2022 3:04 PM
To: Colin Gillespie ; r-devel@r-project.org
Subject: Re: [Rd] Documentation for floor, ceiling & trunc

On 01/01/2022 2:24 p.m., Colin Gillespie wrote:
> Hi,
> 
> The documentation for floor, ceiling and trunc is slightly ambiguous.
> 
> "floor takes ... and returns a numeric vector containing the largest 
> integers ..."
> 
> My initial thought was that floor() would return a vector of integers.

That would be described as "an integer vector".  I think the docs are pretty
consistent about this:  if an output is described as "a numeric vector",
that's the type you get.  ("numeric" and "double" refer to the same type in
R.  This naming inconsistency is discussed in the ?double help page.)

> Instead, it returns a vector of doubles, i.e. c(1, 2) rather than c(1L, 2L)
> 
>   * Could the docs be changed
>   * Would it be worth returning integers instead?

The range of inputs is much larger than the range of 32 bit integers, so
this would just make things more complicated, and would mean that code that
cares about the difference between numeric and integer would need extra
tests.

For example 3e9 + 0.1 is not an integer, and if you take the floor you get
3e9. That number can't be represented in the integer type, but can be
exactly represented as a mathematical integer in the numeric/double type.
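
That example is easy to verify at the console:

```r
x <- floor(3e9 + 0.1)
typeof(x)                  # "double": still the numeric type
x == 3e9                   # TRUE: an exact mathematical integer
x > .Machine$integer.max   # TRUE: too large for the 32-bit integer type
as.integer(x)              # NA, with a coercion warning
```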

Duncan Murdoch

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel



[Rd] A patchwork indeed

2022-01-03 Thread Avi Gross via R-devel
Let me be clear up front that I do not want to start any major discussions,
merely to share some observations.

 

We discussed at length what it would mean if R was extended to allow a plus
sign to concatenate text when the operands were both of the right types that
made sense for the purpose so that, as in a language like Python:

 

"Hello " + "World!"

 

would result in the obvious concatenation and not in an error. It might be a
way to invoke perhaps a limited version of the functionality of paste0(), for
example. 

 

So, I was studying an R package called patchwork, looking at how it slightly
extends the way ggplot uses the plus sign when applied to objects of certain
classes. Patchwork allows many types of graphic objects to be displayed left to
right (or in a grid) by just typing 

p1 + p2 + p3

 

BUT it goes a bit nuts and overloads lots of operators so that:

 

(p1 | p2) / p3

 

results in the first two each taking up half of a top row and the third filling
the next row at full width. You can of course make all kinds of adjustments,
but the point is that those symbols are in a sense overridden from their
default meanings. There is also a (somewhat obscure) meaning for a unary minus
sign, as in

- p1 

 

And, without explanation here, the symbols * and & are also used in new
ways. 

 

I note the obvious: the normal precedence rules in R for these
symbols/operators are NOT changed, so you often need extra levels of
parentheses to guarantee the order of evaluation.
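
The mechanism such packages rely on is ordinary S3 dispatch on the Ops group
generics, which any package or user can employ for its own classes. A toy
sketch follows; the "layout" class and its string representation are invented
here purely for illustration and have nothing to do with patchwork's internals:

```r
# A toy class whose objects describe a layout as a string.
p <- function(label) structure(label, class = "layout")

# S3 methods for the Ops group members "|" and "/":
# dispatch happens because the operands carry a class attribute.
"|.layout" <- function(e1, e2) {
  p(paste0("(", unclass(e1), " beside ", unclass(e2), ")"))
}
"/.layout" <- function(e1, e2) {
  p(paste0("(", unclass(e1), " over ", unclass(e2), ")"))
}

# Normal precedence still applies ("/" binds tighter than "|"),
# so parentheses are needed, just as with patchwork:
unclass((p("p1") | p("p2")) / p("p3"))
# "((p1 beside p2) over p3)"
```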

 

Clearly anyone reading your code who has not thoroughly read the package's
manual will be even more mystified than people are about ggplot and the plus
sign, or the pipe symbols used in the tidyverse, and even the new one now in
base R. 

 

But my point is that doing this is quite possible, and small isolated worlds
can benefit from the notational simplicity. Having said that, this package also
allows you to bypass all of this and use more standard functions that generally
get you the same results. Since manipulating graphs and subgraphs generally
does not require combining the above symbols alongside their other normal
usage, this may look harmless, and if you come from languages that routinely
allow operators to be overloaded or polymorphic, it looks fine.

 

I am providing this info not to make a case for doing anything, but to ask
whether it makes sense to document acceptable methods for others, perhaps using
their own created objects, to achieve such effects.

 

In case anyone is curious, start here for a sort of tutorial:

 

https://patchwork.data-imaginist.com/

 

Again, not advocating, just providing an example, no doubt among many
others, where R is used in an extended way that can be useful. But of course
moving R to be fully object-oriented in the same way as some other specific
language is not a valid goal.

 


[[alternative HTML version deleted]]



Re: [Rd] I've written a big review of R. Can I get some feedback?

2022-04-12 Thread Avi Gross via R-devel
JC,
Are you going to call this new abbreviated language by the name "Q", or keep 
calling it by the name "R", as "S" is taken?
As a goal, yes, it is easier to maintain a language that is sparse. It may sort 
of force programmers to do things in particular ways, and those ways could be 
very reliable.
But it will drive many programmers away from the language, as it will often not 
match their way of thinking about problems.
You can presumably build a brand new language with design goals. As you note, 
existing languages come with a millstone around their necks or an albatross.
R is an extensible language. You can look at many of the packages, or even 
packages of packages such as the tidyverse, as examples of adding on 
functionality to do things other ways that have caught on. Some even partially 
supplant the use of perfectly usable base R methods. Many end up largely 
rewritten as libraries in another language, such as a version of C, to speed 
them up. 
So I suspect barring R from doing things multiple ways would just encourage 
people to make more other ways and ignore the base language.
But different ways of doing things are not just a matter of command names but 
of techniques within programming. Anyone who wants to can do a matrix 
multiplication using a direct primitive, but also with a nested loop and in 
other ways. There is nothing wrong with allowing more ways.
Yes, there is a huge problem with teaching too much and with reading code 
others wrote. 
But I suggest that there have been languages that tried to make you use 
relatively pure functional programming methods to solve everything. Others try 
to make you use object-oriented techniques. Obviously some older ones only 
allow procedural methods and some remain in the GOTO stage. 
Modern languages often seem to feel obligated to support multiple modes but 
then sometimes skimp on other things. R had a focus and left some things out, 
while a language like Python had another focus and included many things R left 
out while totally ignoring many that R has. BOTH languages have later been 
extended through packages and modules because someone WANTED the darn features. 
People like having concepts they can use, like sets and dictionaries, not just 
lists and vectors. They like having the ability to delay evaluation but also to 
force evaluation, and so on. If you do not include some things in the language 
for purist reasons, you may find them supplied anyway, and probably less 
reliably, as various volunteers fill the need.
Python just added versions of a PIPE. That opens up all kinds of new ways to do 
almost anything. In the process, they already mucked with a new way to create 
an anonymous function, and are now planning to add a new use for a single 
underscore as a placeholder. But a significant number of R users already 
steadily use the various kinds of pipes written earlier using various methods, 
and those can break in many cases. Is it wiser to let a large user body rebel, 
or to consider a built-in and efficient way to give them that feature?
What I wonder is, now that we have a pipe in R, will any of the other ways 
wither away and use it internally, or is it already too late, so that we are 
stuck with even more incompatible ways to do about the same thing?
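
[Editor's note: the built-in pipe mentioned here has been available since
R 4.1.0; it simply rewrites the call so the left-hand side becomes the first
argument of the right-hand side.]

```r
# The native |> pipe (R >= 4.1.0): lhs becomes the first argument
# of the rhs call.
mtcars |> subset(cyl == 4) |> nrow()   # 11

# Equivalent nested call, for comparison:
nrow(subset(mtcars, cyl == 4))
```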



-Original Message-
From: J C Nash 
To: Reece Goding ; r-devel@r-project.org 

Sent: Tue, Apr 12, 2022 10:17 am
Subject: Re: [Rd] I've written a big review of R. Can I get some feedback?

Any large community-based project is going to be driven by the willing 
volunteers. Duncan Murdoch pointed this out for R a long time ago. Those who do 
are those who define what is done.

That said, I've felt for quite a long time that the multiplicity of ways in 
which R can do the same tasks leads to confusion and errors. Arguably, a much 
stricter language definition that could execute 95% of existing user R scripts 
and programs would be welcome and would provide a tool that is easier to 
maintain and, with a great deal of luck, lead to better maintainability of 
user code.

And, as others have pointed out, backward compatibility is a millstone.

Whether anything will happen depends on who steps up to participate in R.

In the meantime, I believe it is important for all R users to report and try to 
fix those things
that are egregious faults, and documentation fixes are a very good starting 
point.

John Nash


On 2022-04-09 15:52, Reece Goding wrote:
> Hello,
> 
> For a while, I've been working on writing a very big review of R. I've 
> finally finished my final proofread of it. Can I get some feedback? This 
> seems the most appropriate place to ask. It's linked below.
> 
> https://github.com/ReeceGoding/Frustration-One-Year-With-R
> 
> If you think you've seen it before, that will be because it found some 
> popularity on Hacker News before I was done proofreading it. The reception 
> seems largely positive so far.
> 
> Thanks,
> Reece Goding