Re: [Rd] Inquiry about the behaviour of subsetting and names in matrices

2023-05-03 Thread GILLIBERT, Andre via R-devel

Karolis K wrote:

> This is more an inconsistency between vectors and matrices.

> In vectors both numeric and character sub-setting works with NAs.

> In matrices only numeric and not character sub-setting works with NAs.

> Potentially this in itself can also be a source of bugs, or, at least 
> surprises.


Indeed.


Karolis K wrote:

> My original impression was that R was “clever” about the usage of NAs by 
> design. i.e. when you choose an unknown object

> from a set of objects the result is an object, but nobody knows which - hence 
> NA. Is it really accepted now that such a

> decision was a mistake and led to bugs in user code?


This makes sense but my personal opinion (I do not speak for R developers, as I 
am not an R developer at all) is that the R language is so "clever" that it 
often becomes unsafe.

Sometimes this cleverness is handy for fast programming, such as NA propagation 
in many places. Other times it causes more bugs than it prevents, such as 
partial matching for the '$' operator. Indexing a matrix by column names is 
probably not the place where NA propagation is most useful, although it has its 
use cases. Consistency may be the main reason to add that feature, but I am not 
sure that it is a major incentive.
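For instance (a small sketch, with made-up objects, of the two behaviors 
mentioned above):

l <- list(value = 1)
l$val              # 1: partial matching on '$' silently matches 'value'
x <- c(a = 1, b = 2)
x[c("a", NA)]      # an NA character index silently propagates to an NA element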


Of course, the opinion of R developers would be more useful than my own 
personal views.


--

Sincerely

André GILLIBERT


From: Karolis Koncevičius
Sent: Wednesday, 3 May 2023 11:08:28
To: GILLIBERT, Andre
Cc: r-devel@r-project.org
Subject: Re: [Rd] Inquiry about the behaviour of subsetting and names in matrices




Thank you for such a quick reply, here are some points that I think might have 
been missed:

I would state the question the other way: why are NA integer indices allowed?
In my experience, they are sometimes useful but they often delay the detection 
of bugs. However, due to backward compatibility, this feature cannot be 
removed. Adding this feature to character indices would worsen the problem.

But please also note that character indices with NA are allowed for vectors. 
This is more an inconsistency between vectors and matrices. In vectors both 
numeric and character sub-setting works with NAs. In matrices only numeric and 
not character sub-setting works with NAs. Potentially this in itself can also 
be a source of bugs, or at least surprises.

Setting names() on a matrix is a rarely used feature that has practically no 
positive and no negative consequences. I see no incentive to change the 
behavior and break existing code.

When writing this message I had the opposite opinion: that this 2nd point is 
one of the most bug-prone points of all 3, as I would assume most users setting 
names() on a matrix would only do it by accident.

In my opinion adding these features would improve the consistency of R but 
would add more sources of bugs in an already unsafe language.

I think this may be the crux of the matter.

My original impression was that R was “clever” about the usage of NAs by 
design. i.e. when you choose an unknown object from a set of objects the result 
is an object, but nobody knows which - hence NA. Is it really accepted now that 
such a decision was a mistake and led to bugs in user code?

Kind regards,
Karolis K.

Re: [Rd] Inquiry about the behaviour of subsetting and names in matrices

2023-05-03 Thread GILLIBERT, Andre via R-devel

Karolis wrote:
> Hello,

> I have stumbled upon a few cases where the behaviour of naming and subsetting 
> in matrices seems unintuitive.
> All those look related so wanted to put everything in one message.


> 1. Why is row/col selection by names with NAs not allowed?

>   x <- setNames(1:10, letters[1:10])
>   X <- matrix(x, nrow=2, dimnames = list(letters[1:2], LETTERS[1:5]))

>   x[c(1, NA, 3)]   # vector: works and adds "NA"
>   x[c("a", NA, "c")]   # vector: works and adds "NA"
>   X[,c(1, NA, 3)]  # works and selects "NA" column
>   X[,c("A", NA, "C")]  # 

I would state the question the other way: why are NA integer indices allowed?
In my experience, they are sometimes useful but they often delay the detection 
of bugs. However, due to backward compatibility, this feature cannot be 
removed. Adding this feature to character indices would worsen the problem.

I see another reason to keep the behavior as it currently is: character indices 
are most often used with column names in contexts where they are unlikely to be 
NAs except as a consequence of a bug. In other words, I fear that the 
valid-use-case/bug ratio would be quite poor with this feature.
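As a small sketch of the kind of bug meant here (reusing the X matrix from the 
example above; a failed name lookup turns into an NA character index):

X <- matrix(1:10, nrow=2, dimnames=list(letters[1:2], LETTERS[1:5]))
wanted <- c("A", "Z")                           # "Z" is a typo for an existing column
idx <- colnames(X)[match(wanted, colnames(X))]  # c("A", NA)
X[, idx]                                        # currently an error, so the bug is caught early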

> 2. Should setting names() for a matrix be allowed?
>
>   names(X) <- paste0("e", 1:length(X))
>   X["e4"]  # works
>
>   # but any operation on a matrix drops the names
>   X <- X[,-1]  # all names are gone
>   X["e4"]  # 
>
>   Maybe names() should not be allowed on a matrix?

Setting names() on a matrix is a rarely used feature that has practically no 
positive and no negative consequences. I see no incentive to change the 
behavior and break existing code.

> 3. Should selection of non-existent dimension names really be an error?
>
>   x[22]   # works on a vector - gives "NA"
>   X[,22]  # 

This is very often a bug on vectors and should not have been allowed on vectors 
in the first place... But for backwards compatibility, it is hard to remove. 
Adding this unsafe feature to matrices is a poor idea in my opinion.

>   A potential useful use-case is matching a smaller matrix to a larger one:

This is a valid use-case, but in my opinion, it adds more problems than it 
solves.

> These also don't seem to be documented in '[', 'names', 'rownames'.

Indeed, the documentation of '[' seems to be unclear on indices out of range. 
It can be improved.

> Interested if there are specific reasons for this behaviour, or could these 
> potentially be adjusted?

In my opinion adding these features would improve the consistency of R but 
would add more sources of bugs in an already unsafe language.

Sincerely
André GILLIBERT


[Rd] Documentation bug?

2023-02-14 Thread GILLIBERT, Andre via R-devel
Dear R developers,

In R-devel  2023-02-11 and older R versions, there is a note in the "lm 
{stats}" help page specifying that:
> Offsets specified by offset will not be included in predictions by 
> predict.lm, whereas 
> those specified by an offset term in the formula will be.

However, the source code as well as basic tests seem to show that both types of 
offset terms are always used in predictions.
a <- data.frame(off=1:4, outcome=4:1)
mod <- lm(data=a, outcome~1, offset=off)
coef(mod) # intercept is zero
predict(mod) # returns 1:4, which uses the offset
predict(mod, newdata=data.frame(off=c(3,2,5))) # returns c(3,2,5), which uses 
the new offset
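For completeness, the same check with an offset term in the formula (a quick 
sketch using the same data frame 'a'; 'mod2' is not part of the original test):

mod2 <- lm(data=a, outcome ~ offset(off))
predict(mod2, newdata=data.frame(off=c(3,2,5))) # also returns c(3,2,5): the formula offset is used too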

When looking at the history of R source code, this note has existed since R 
1.0.0, while the source code of predict.lm already called 
eval(object$call$offset, newdata)
https://github.com/SurajGupta/r-source/blob/1.0.0/src/library/base/R/lm.R
https://github.com/SurajGupta/r-source/blob/1.0.0/src/library/base/man/lm.Rd

Version 0.99.0 did not contain the note, but already had the call to 
eval(object$call$offset, newdata)
https://github.com/SurajGupta/r-source/blob/0.99.0/src/library/base/man/lm.Rd
https://github.com/SurajGupta/r-source/blob/0.99.0/src/library/base/R/lm.R

The actual behavior of R seems sane to me, but unless I am missing something, 
this looks like a documentation bug.
It seems to have bugged someone before:
https://stackoverflow.com/questions/71264495/why-is-predict-not-ignoring-my-offset-from-a-poisson-model-in-r-no-matter-how-i

Digging deeper in R history, it seems that this note was also found in "glm 
{stats}" in R 1.0.0 but was removed in R 1.4.1. Maybe somebody forgot to remove 
it in "lm {stats}" too.

--
Sincerely
André GILLIBERT



Re: [Rd] Recycling in arithmetic operations involving zero-length vectors

2023-01-16 Thread GILLIBERT, Andre
Duncan Murdoch  wrote:

> To even do that, we would have to first decide which "cases" should produce a 
> warning.

> Let's say `1 + x` should give a warning when x = numeric(0). Then should 
> `x^2` also produce a warning? Should `x^0.5`? Should `sqrt(x)`?
> Should `log(x)`?


The most probable errors would be in functions taking two arguments (e.g. `+`) 
and for which one argument has length >= 2 while the other has length 0.

In my experience, most code with accidental zero-length propagation (e.g. a typo 
in data_frame$field) quickly leads to errors that are easy to debug (except for 
beginners), and so does not need a warning.

The only cases where zero-length propagation is really dangerous, in my 
experience, are in code using an aggregating function like sum(), all() or any(), 
because it silently returns a valid value for a zero-length argument. Emitting 
warnings for sum(numeric(0)) would probably have too many false positives, but a 
(length >= 2) vs (length == 0) warning for common binary operators could 
sometimes catch the issue before it reaches the aggregating function.
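A minimal sketch of that dangerous pattern (hypothetical data frame and 
misspelled column):

d <- data.frame(weight = c(60, 80))
d$wieght > 70       # typo: d$wieght is NULL, so the comparison is logical(0)
sum(d$wieght > 70)  # silently returns 0 instead of failing
all(d$wieght > 70)  # silently returns TRUE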

-- 
Sincerely
André GILLIBERT


[Rd] Bug or feature?

2023-01-14 Thread GILLIBERT, Andre
Dear developers,


I found an inconsistency in the predict.lm() function between offset and 
non-offset terms of the formula, but I am not sure whether that is intentional 
or a bug.


The problem can be shown in a simple example:


mod <- local({
y <- rep(0,10)
x <- rep(c(0,1), each=5)
list(lm(y ~ x), lm(y ~ offset(x)))
})
# works fine, using the x variable of the local environment
predict(mod[[1]], newdata=data.frame(z=1:10))
# error 'x' not found, because it seeks x in the global environment
predict(mod[[2]], newdata=data.frame(z=1:10))

I would expect either both predict() to use the local x variable or the global 
x variable, but the current behavior is inconsistent.

In the worst case, both variables may exist but refer to different data, which 
seems to be very dangerous in my opinion.


The problem becomes obvious from the source code of model.frame.default() and 
predict.lm()

predict.lm() calls model.frame()

For a non-offset variable, the source code of model.frame.default shows:


variables <- eval(predvars, data, env)


Where env is the environment of the formula parameter.

Consequently, non-offset variables are evaluated in the context of the data 
frame, then in the environment of the formula/terms of the model.


For offset variables, the source code of predict.lm() contains:

eval(attr(tt, "variables")[[i + 1]], newdata)


It is not executed in the environment of the formula/terms of the model.


The inconsistency could easily be fixed by a patch to predict.lm(), replacing
eval(attr(tt, "variables")[[i + 1]], newdata)
with
eval(attr(tt, "variables")[[i + 1]], newdata, environment(Terms))

The same modification would have to be done two lines after:

offset <- offset + eval(object$call$offset, newdata, environment(Terms))
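As a rough illustration of why passing environment(Terms) helps, using the 'mod' 
list from the example above (a sketch; the exact indices used inside predict.lm 
may differ):

tt <- terms(mod[[2]])                    # the formula environment is the local() environment
off_expr <- attr(tt, "variables")[[3L]]  # the offset(x) term
eval(off_expr, data.frame(z=1:10), environment(tt))  # finds the local 'x', like non-offset variables do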


However, fixing this inconsistency could break code that relies on the old 
behavior.


What do you think of that?


--

Sincerely

André GILLIBERT



Re: [Rd] Bug in optim for specific orders of magnitude

2023-01-03 Thread GILLIBERT, Andre

J C Nash wrote:
> Extreme scaling quite often ruins optimization calculations. If you think 
> available methods are capable of doing this, there's a bridge I can sell you 
> in NYC.

Given that optim() seems to have problems with denormals but works correctly with 
values greater than 1e-308, I think that it can hardly be considered a real bug.

There is no miracle solution, but there are a few tricks to work with extreme 
values.
What I call "flogs" are alternative representation of a number x as log(abs(x)) 
and sign(x). It supports very large and very small numbers, but the precision 
of flogs is very poor for very large numbers. Functions for addition, 
substraction, multiplication, etc, can easily be written.
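A minimal sketch of the flog idea (hypothetical helper names, multiplication 
only):

as_flog  <- function(x) list(lx = log(abs(x)), s = sign(x))      # log-magnitude and sign
flog_mul <- function(a, b) list(lx = a$lx + b$lx, s = a$s * b$s)
f <- flog_mul(as_flog(1e-200), as_flog(1e-180))
f$lx / log(10)   # about -380: the magnitude is preserved
1e-200 * 1e-180  # 0: the plain double product underflows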

For values that are between 0 and 1 and can be extremely close to 0 or 
extremely close to 1, a cloglog transformation can be used.

Another trick is to detect "hot points". Values that can be extremely close to 
a hot point should be expressed as differences from the hot point, and the 
algorithm should be rewritten to work with that, which is not always easy to 
do. For instance, the function log1p can be seen as a modification of the log 
function when the hot point is 1.
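A tiny illustration of the hot point idea with log1p:

log(1 + 1e-17)   # 0: the small offset from the hot point 1 is lost in the addition
log1p(1e-17)     # 1e-17: working directly with the offset preserves it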

In my experience, the algorithm should be rewritten before having to deal with 
denormals, because denormals expand the range of floating point values by quite 
a small amount.

-- 
Sincerely
André GILLIBERT


Re: [Rd] Better 'undefined columns' error for data.frame

2022-09-26 Thread GILLIBERT, Andre



Duncan Murdoch  wrote:
>
>
> On 25/09/2022 2:48 p.m., Dirk Eddelbuettel wrote:
> >
> > Andre,
> >
> > On 25 September 2022 at 18:09, GILLIBERT, Andre wrote:
> > | Please, find the patch attached, based on the latest R SVN trunk code.
> >
> > Well the mailing list software tends to drop attachments.  There is a reason
> > all these emails suggest to use bugs.r-project.org.
> >
>
> I was named in the posting, so I saw the patch file, my copy didn't come
> via the list.  It's a good patch, I hope Andre does post there.
>
> Duncan Murdoch

Thank you.
I reported an "enhancement" #18409 at 
https://bugs.r-project.org/show_bug.cgi?id=18409
I took your suggestions into account to improve the error messages.

I chose to provide different messages for character and logical/numeric 
indices, but this increases the code size, the number of code paths and the 
number of translations to perform.
If you have suggestions to improve the patch, I am open to comments and ideas.

--
Sincerely
Andre GILLIBERT



Re: [Rd] Better 'undefined columns' error for data.frame

2022-09-25 Thread GILLIBERT, Andre


Duncan Murdoch  wrote:
> On 24/09/2022 9:56 a.m., GILLIBERT, Andre wrote:
> > Dear R developers,
> >
> >
> > One of the error messages that makes me lose the most time is the 
> > "undefined columns selected" of `[.data.frame`.
> >
> > It ought to specify the list of bad column names, but currently does not.
> >
> > Fortunately, this can easily be fixed by a small patch I can write.
> >
> >
> > Are you interested in that patch?
>
> I doubt if you'll get an answer to this without showing it to us.
>
> > Is there a standard way to transfer patches for feature requests?
>
> The standard instructions are here:  https://www.r-project.org/bugs.html .
> You're offering an enhancement rather than a bug fix, but the process is
> the same.  But since your change is probably quite small, you might not
> need to go through the whole process:  proposing it on this mailing list
> might be sufficient.
>
> Duncan Murdoch

Please, find the patch attached, based on the latest R SVN trunk code.
In order to avoid any accidental modification of the behavior of 
`[.data.frame`, I only inserted code at positions where the "undefined columns 
selected" error was generated.
In order to test all code paths, I wrote a small test script (code attached), 
but did not insert clean & standardized tests to the R build.

I added a new base::.undefined_columns_msg closure. I am not sure that adding 
such new internal symbols to the base package is smart because it pollutes the 
base environment, but looking at the base-internal documentation page, it seems 
to be standard practice.

For message translations, I tried to use tools::xgettext() to re-generate the 
R-base.pot file, but this file seems to be out of sync for a few messages. So, 
I manually added the new messages to R-base.pot.
As I am a native french speaker, I added appropriate translations to R-fr.po.
The make command did not regenerate .mo files, and I did not find any 
appropriate make target to rebuild them, so, for testing purposes, I manually 
generated them.  The R installation & administration manual did not help 
(https://cran.r-project.org/doc/manuals/r-release/R-admin.html#Localization-of-messages).

When compiling the source code on a x86_64 Ubuntu 16.04 system, I got a problem 
with position independent code in the internal R LAPACK module. The -fPIC 
option was not properly passed to the gfortran command line. I fixed that by 
replacing $(ALL_FCFLAGS) by $(ALL_FFLAGS) on line 16 of 
src/modules/lapack/Makefile.in
That looks like a bug to me (misspelling of FFLAGS), but I did not fix that in 
my patch, since it is a very different issue.

--
Sincerely
André GILLIBERT


[Rd] Better 'undefined columns' error for data.frame

2022-09-24 Thread GILLIBERT, Andre
Dear R developers,


One of the error messages that makes me lose the most time is the "undefined 
columns selected" of `[.data.frame`.

It ought to specify the list of bad column names, but currently does not.

Fortunately, this can easily be fixed by a small patch I can write.


Are you interested in that patch?

Is there a standard way to transfer patches for feature requests?


--

Sincerely

André GILLIBERT



Re: [Rd] Documentation of addmargins

2021-12-07 Thread GILLIBERT, Andre

Thomas SOEIRO wrote:
> Dear list,

> There is a minor typo in addmargins (section Details):

> - If the functions used to form margins are not commutative the result 
> depends on the order in which margins are computed. Annotation of margins is 
> done via naming the FUN list.
> + If the functions used to form margins are not commutative**add ':' or ', 
> i.e.' here** the result depends on the order in which margins are computed. 
> Annotation of margins is done via naming the FUN list.
>
>
> I'm not sure if such minor things really need to be reported when they are 
> noticed... Please let me know if not. Of course this is minor, but imho one 
> of the strengths of R is also its documentation!
>

The documentation looks correct to me.
If the function FUN is not commutative (i.e. the result depends on the order of 
the vector passed to it), then the result of addmargins() will depend on the 
order of the 'margin' argument to the addmargins() function.

For instance:
mat <- rbind(c(1,10),c(100,1000))
fun <- function(x) {x[1]-x[2]-x[1]*x[2]} # non-commutative function
a <- addmargins(mat, margin=c(1,2), FUN=fun)
b <- addmargins(mat, margin=c(2,1), FUN=fun)

a and b are different, because the fun function is not commutative.

-- 
Sincerely
André GILLIBERT


Re: [Rd] string concatenation operator (revisited)

2021-12-05 Thread GILLIBERT, Andre
Ivan Krylov  wrote:
> FWIW, Perl5 has a separate string concatenation operator (".") in order
> to avoid potential confusion with addition. So do Lua (".."), SQL
> ("||", only some of the dialects) and Raku ("~", former Perl6).


Indeed, using the same operator '+' for addition and string concatenation is 
not a great idea in my opinion.
Accidental character arguments to a '+' that was meant to be a numerical addition 
would go undetected. Bug tracking would be harder in that case.

R is already too permissive: it finds some interpretation of most probably 
buggy code, such as ifelse() on vectors of unequal length or '[' operator with 
only one argument to index a matrix. I would not want to add new permissive 
behaviors.

A new operator, dedicated to string concatenation, such as %+% or %.%, would be 
better, in my opinion.
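A minimal sketch of such a user-defined operator (any of the suggested names 
would do):

`%+%` <- function(a, b) paste0(a, b)
"Hello, " %+% "world"   # "Hello, world"
1 + "2"                 # error: accidental character arguments to '+' remain detectable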

--
Sincerely
André GILLIBERT



Re: [Rd] stats::fft produces inconsistent results

2021-10-21 Thread GILLIBERT, Andre
R does not automatically release non-memory resources (e.g. a socket you 
allocated) when a longjmp is triggered. Non-memory resources (e.g. a socket) 
returned to R should use the R_MakeExternalPtr() mechanism to make sure that, 
when the memory is freed by the garbage collector, the resource is also freed.


"Writing R extensions" contains more extensive documentation, but I hope that 
my quick description of the system will make it easier to understand the 
extensive documentation.


 --

Sincerely

André GILLIBERT



From: Dipterix Wang
Sent: Wednesday, 20 October 2021 22:01
To: Martin Maechler; GILLIBERT, Andre; bbol...@gmail.com
Cc: r-devel@r-project.org
Subject: Re: [Rd] stats::fft produces inconsistent results






Wow, you guys are amazing!



as part of its pipeline, ravetools::mvfftw computes the mean of the
input vector **and then centers it to a mean of zero** (intentionally or
accidentally?)



because variables are passed to compiled code by reference (someone
can feel free to correct my terminology), this means that the original
vector in R now has a mean of zero



I didn't center the input vector in my code. The data was fed "as-is" into 
FFTW3. My guess is that FFTW3 internally centers the data. It could be that the 
FFTW3 library behaves differently on different platforms, which could explain 
the reproducibility issue.



"Indeed, R vectors are passed "by reference" to C code, but the semantic must 
be "by value", i.e. the C function must NOT change the contents of the vector, 
except in very specific cases."



CRAN already has fftw and fftwtools; the issue is that the data I'm targeting 
are at the GB level, and copying the vectors can be memory-inefficient or even 
exhaust memory. The strategy of ravetools is to import signals from local files, 
FFT them, then write directly to disk. So only one reference will be used, and 
modifying in place is on purpose. In fact, the fft functions I created are not 
intended to be used directly by users.



However, I should've been very cautious when using these functions. This is my 
fault. I'll check the whole package to make sure only one reference is used, or 
otherwise the vectors will be copied.



This can be tested by the MAYBE_REFERENCED() macro in Rinternals.h.



Nice to learn! I'll add it to my code.



Properly using R vectors in C code is tricky. You have to understand.

1) When you are allowed or not to modify vectors

2) When to PROTECT()vectors

3) How the garbage collector works and when it can trigger (answer : basically, 
when you call any internal R function)

Chapter 5 of "Writing R Extensions" documentation is quite extensive:

 
https://cran.r-project.org/doc/manuals/r-release/R-exts.html#Handling-R-objects-in-C



Indeed, I found myself often confused about when to PROTECT and when not.



... but then ravetools  is not even a CRAN package, so why should you dare to 
use it for anything serious ?



Haha, thanks :) I guess I would probably be grouchy too after seeing so many 
people making the same mistakes again and again. It just happened to be me.

But it's good to be rigorous. Sooner or later I'll have to face these problems. 
It's better to make mistakes before having made many.



Thanks y'all!



Best,

- Dipterix Wang





On Oct 20, 2021, at 5:32 AM, Martin Maechler <maech...@stat.math.ethz.ch> wrote:



Martin Maechler
   on Wed, 20 Oct 2021 11:26:21 +0200 writes:

[]



Thank you, André, that's very good.



Just to state the obvious conclusion:



If Ben's suggestion is correct (and André has explained *how*
that could happen) this would mean a
SEVERE BUG in package ravetools's mvfftw() function.



and it would have been (yet another) case of gaining speed by
killing correctness...



... but then ravetools  is not even a CRAN package, so why
should you dare to use it for anything serious ?



... yes, being grouchy ..

which I should rather not be.

Dipterix Wang *did* say initially that he is currently
developing ravetools, so it's very reasonable that this is not yet a
CRAN package.

Best,
Martin





Re: [Rd] stats::fft produces inconsistent results

2021-10-20 Thread GILLIBERT, Andre
Hello,

That sounds like a good diagnosis!
Indeed, R vectors are passed "by reference" to C code, but the semantic must be 
"by value", i.e. the C function must NOT change the contents of the vector, 
except in very specific cases.

A good program that has to work on a vector must first duplicate the vector, 
unless the only reference to the vector is the reference inside the C function.
This can be tested by the MAYBE_REFERENCED() macro in Rinternals.h.

A good example can be found in the fft() function in 
src/library/stats/src/fourier.c in R source code:
switch (TYPEOF(z)) {
case INTSXP:
case LGLSXP:
case REALSXP:
z = coerceVector(z, CPLXSXP);
break;
case CPLXSXP:
if (MAYBE_REFERENCED(z)) z = duplicate(z);
break;
default:
error(_("non-numeric argument"));
}
PROTECT(z);

This code coerces non-complex vectors to complex. Since this makes a copy, 
there is no need to duplicate.
Complex vectors are duplicated, unless they are not referenced by anything but 
the fft() function.

Now, the z vector can be modified "in place" without inconsistency.

Properly using R vectors in C code is tricky. You have to understand:
1) When you are allowed or not to modify vectors
2) When to PROTECT() vectors
3) How the garbage collector works and when it can trigger (answer: basically, 
when you call any internal R function)

Chapter 5 of "Writing R Extensions" documentation is quite extensive:
https://cran.r-project.org/doc/manuals/r-release/R-exts.html#Handling-R-objects-in-C

-- 
Sincerely
André GILLIBERT

-----Original Message-----
From: R-devel on behalf of Ben Bolker
Sent: Wednesday, 20 October 2021 03:27
To: r-devel@r-project.org
Subject: Re: [Rd] stats::fft produces inconsistent results


   This is a long shot, but here's a plausible scenario:

   as part of its pipeline, ravetools::mvfftw computes the mean of the
input vector **and then centers it to a mean of zero** (intentionally or
accidentally?)

   because variables are passed to compiled code by reference (someone
can feel free to correct my terminology), this means that the original
vector in R now has a mean of zero

   the first element of fft() is mean(x)*length(x), so if mean(x) has
been forced to zero, that would explain your issue.

   I don't know about the non-reproducibility part.

On 10/19/21 7:06 PM, Dipterix Wang wrote:
> Dear R-devel Team,
>
> I'm developing a neuroscience signal pipeline package in R 
> (https://github.com/dipterix/ravetools) and I noticed a weird issue that 
> failed my unit test.
>
> Basically I was trying to use `fftw3` library to implement fast multivariate 
> fft function in C++. When I tried to compare my results with stats::fft, the 
> test result showed the first element of **expected** (which was produced by 
> stats::fft) was zero, which, I am pretty sure, is wrong, and I can confirm 
> that my function produces correct results.
>
> However, somehow I couldn’t reproduce this issue on my personal computer 
> (osx, M1, R4.1.1), the error simply went away.
>
> The catch is my function produced consistent and correct results but 
> stats::fft was not. This does not mean `stats::fft` has bugs. Instead, I 
> suspect there could be some weird interactions between my code and stats::fft 
> at C/C++ level, but I couldn’t figure it out why.
>
> +++ Details:
>
> Here’s the code I used for the test:
>
> https://github.com/dipterix/ravetools/blob/4dc35d64763304aff869d92dddad38a7f2b30637/tests/testthat/test-fftw.R#L33-L41
>
> Test code
> set.seed(1)
> x <- rnorm(1000)
> dim(x) <- c(100,10)
> a <- ravetools:::mvfftw_r2c(x, 0)
> c <- apply(x, 2, stats::fft)[1:51,]
> expect_equal(a, c)
> 
>
> Here are the tests that gave me the errors:
>
> The test logs on win-builder
> https://win-builder.r-project.org/07586ios8AbL/00check.log
>
> Test logs on GitHub
> https://github.com/dipterix/ravetools/runs/3944874310?check_suite_focus=true
>
>
> —— Failed tests ——
>   -- Failure (test-fftw.R:41:3): mvfftw_r2c 
> --
>   `a` (`actual`) not equal to `c` (`expected`).
>
>   actual vs expected
>   [,1][,2]
>   [,3]  [,4]...
>   - actual[1, ] 10.8887367+ 0.000i  -3.7808077+ 0.000i   
> 2.967354+ 0.00i   5.160186+ 0.00i ...
>   + expected[1, ]0.000+ 0.000i  -3.7808077+ 0.000i   
> 2.967354+ 0.00i   5.160186+ 0.00i...
>
> 
>
> The first columns are different, `actual` is the results I produced via 
> `ravetools:::mvfftw_r2c`, and `expected` was produced by `stats::fft`
>
>
> Any help or attention is very much appreciated.
> Thanks,
> - Zhengjia

--
Dr. Benjamin Bolker

Re: [Rd] [External] Re: Workaround very slow NAN/Infinities arithmetic?

2021-10-01 Thread GILLIBERT, Andre

> Mildly related (?) to this discussion, if you happen to be in a situation
> where you know something is a C NAN, but need to check if its a proper R
> NA, the R_IsNA function is surprisingly (to me, at least) expensive to do
> in a tight loop because it calls the (again, surprisingly expensive to me)
> isnan function.  

What is your platform? CPU, OS, compiler?
How much more expensive? 5-10 times slower than the improved code you wrote, or 
100-200 times slower?

I analyzed the C and assembly source code of R_IsNA on an x86_64 GNU/Linux 
computer (Celeron J1900) with GCC 5.4 and found that it was somewhat expensive, 
but the main problems did not seem to come from isnan.

isnan was only responsible for a ucomisd xmm0, xmm0 instruction followed by a 
conditional jump on x86_64. This instruction is slower on NaN than on normal 
FP, but it seems to have an acceptable speed.
On x86_32, isnan is responsible for a fld mem64, fst mem64, fucomip and a 
conditional jump: it is suboptimal, but things could be worse.

On x86_64, the first problem I noticed is that R_IsNA is not inlined, and the 
register-based x86_64 Linux calling convention is not necessarily good for that 
problem, with added loads/stores between memory and registers.
The second problem (the worst part): the write of a 64-bit double followed by the 
read of a 32-bit integer in the ieee_double union confuses the compiler, which 
generates very poor code, with unnecessary loads/stores.

The second problem can be solved by using a union with uint64_t and double 
fields, and using & 0xFFFFFFFF to extract the low 32 bits of the uint64_t. This 
works well on x86_64, but also on x86_32, where GCC avoids useless emulation 
of 64-bit integers by directly reading the low 32-bit word.

-- 
Sincerely
André GILLIBERT


Re: [Rd] Workaround very slow NAN/Infinities arithmetic?

2021-09-30 Thread GILLIBERT, Andre
Brodie Gaslam wrote:
> André,

> I'm not an R core member, but happen to have looked a little bit at this
> issue myself.  I've seen similar things on Skylake and Coffee Lake 2
> (9700, one generation past your latest) too.  I think it would make sense
> to have some handling of this, although I would want to show the trade-off
> with performance impacts on CPUs that are not affected by this, and on
> vectors that don't actually have NAs and similar.  I think the performance
> impact is likely to be small so long as branch prediction is active, but
> since branch prediction is involved you might need to check with different
> ratios of NAs (not for your NA bailout branch, but for e.g. interaction
> of what you add and the existing `na.rm=TRUE` logic).

For operators such as '+', randomly placed NAs could slow down AMD processors 
due to many branch mispredictions. However, the functions using long doubles 
are mainly "cumulative" functions such as sum, mean, cumsum, etc. They can stop 
at the first NA found.

> You'll also need to think of cases such as c(Inf, NA), c(NaN, NA), etc.,
> which might complicate the logic a fair bit.

When an infinity is found, the rest of the vector must be searched for 
infinities of the opposite sign or NAs.
Mixing NaN and NAs has always been platform-dependent in R, but, on x87, it 
seems that the first NA/NaN encountered wins. Being consistent with that 
behavior is easy.

I wrote a first patch for the sum() function, taking into account all those 
special +Inf/-Inf/NA/NaN cases. Without any NA, the sum() function was 1.6% 
slower with my first patch on a Celeron J1900. Not a big deal, in my opinion.

However, I could easily compensate for that loss (and much more) by improving the 
code.
By splitting the na.rm=TRUE and na.rm=FALSE code paths, I could save a useless 
FP comparison (for NaN), a useless CMOV (for the 'updated' variable) and 
useless SSE->memory->x87 moves (high latency).

My new code is 1.75 times faster for sum(big_vector_without_NAs, na.rm=FALSE).
It is 1.35 times faster for sum(big_vector_without_NAs, na.rm=TRUE).

Of course, it is much faster if there are any NA, because it stops at the first 
NA found. For infinities on AMD CPUs, it may not necessarily be faster.

-- 
Sincerely
André GILLIBERT


[Rd] Workaround very slow NAN/Infinities arithmetic?

2021-09-30 Thread GILLIBERT, Andre
Dear R developers,

By default, R uses the "long double" data type to get extra precision for 
intermediate computations, with a small performance tradeoff.

Unfortunately, on all Intel x86 computers I have ever seen, long doubles 
(implemented in the x87 FPU) are extremely slow whenever a special 
representation (NA, NaN or infinities) is used; probably because it triggers 
poorly optimized microcode in the CPU firmware. A function such as sum() 
becomes more than a hundred times slower!
Test code:
a=runif(1e7);system.time(for(i in 1:100)sum(a))
b=a;b[1]=NA;system.time(sum(b))

The slowdown factors are as follows on a few Intel CPUs:

1) Pentium Gold G5400 (Coffee Lake, 8th generation) with R 64-bit: 140 times 
slower with NA

2) Pentium G4400 (Skylake, 6th generation) with R 64-bit: 150 times slower with NA

3) Pentium G3220 (Haswell, 4th generation) with R 64-bit: 130 times slower with NA

4) Celeron J1900 (Atom Silvermont) with R 64-bit: 45 times slower with NA

I do not have access to more recent Intel CPUs, but I doubt that it has 
improved much.

Recent AMD CPUs have no significant slowdown.
There is no significant slowdown on Intel CPUs (more recent than Sandy Bridge) 
for 64-bit floating point calculations based on SSE2. Therefore, operators 
using doubles, such as '+', are unaffected.

I do not know whether recent ARM CPUs have slowdowns on FP64... Maybe somebody 
can test.

Since NAs are not rare in real life, I think that it would be worth adding an extra 
check in functions based on long doubles, such as sum(). The check for special 
representations does not necessarily have to be done at each iteration for 
cumulative functions.
If you are interested, I can write a bunch of patches to fix the main functions 
using long doubles: cumsum, cumprod, sum, prod, rowSums, colSums, matrix 
multiplication (matprod="internal").

What do you think of that?

--
Sincerely
André GILLIBERT



Re: [Rd] Segfault in setMask in R 4.1

2021-09-21 Thread GILLIBERT, Andre
Hello,

Which graphics device (backend) do you use?
The setMask() function calls the internal setMask function associated with the 
graphics device.
If the Web application uses one of the standard graphics backends, the bug may 
come from the R core. If it uses a custom graphics backend, the bug may be in 
the backend.

-- 
Sincerely
André GILLIBERT



I have inherited a build of R. We compile R from source and then use a custom
web application to allow users to enter R statements and render the output (it's
kind of like RStudio although the UX is quite different).

Things were going fine until I tried to upgrade to R 4.1.x. The build succeeds,
but I get the following segfault when I make qplot (ggplot2) calls:

 *** caught segfault ***
address (nil), cause 'memory not mapped'

Traceback:
 1: .setMask(NULL, NULL)
 2: resolveMask.NULL(NULL)
 3: (function (path) {UseMethod("resolveMask")})(NULL)
 4: grid.newpage()
 5: print.ggplot(x)
 6: (function (x, ...) UseMethod("print"))(x)

I am not an R developer, so I don't really know how to read this. I am wondering
if this is a known issue or if anyone has any suggestions. Here's some things
I've tested:

- I get the same segfault in R 4.1.0 and R 4.1.1. I do not get this error in R
  4.0.4 or R 4.0.5.

- The problem appears to be specific to the (graphics?) features of R that
  ggplot2 uses. I do not get a segfault if I do a generic `plot(c(1,2,3))`.

- I have tried versions of ggplot2 3.3.3 and 3.3.5.

- The traceback points to files that were introduced in this commit
  
https://github.com/wch/r-source/commit/16755bcddffe0cb4238d8a4979387d92b93a8324#diff-5c63f74229830cdde7886a50bf06dafcdf1d5b3d42cfa06b26814876e58c5ab0

I am not an R user or R developer so I'm a bit stuck here. I would be happy to
give the output of any commands. If anyone has any suggestions then please let
me know!

Thanks!
Mike Lee Williams



Re: [Rd] WISH: set.seed(seed) to produce error if length(seed) != 1 (now silent)

2021-09-17 Thread GILLIBERT, Andre
Hello,

Passing a vector of length >= 2 to set.seed() would probably be a bug. An error 
message would help the user fix their R code. The bug may be accidental or due 
to a misunderstanding of the set.seed() function. For instance, a user may think 
that the whole state of the PRNG can be passed to set.seed().
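A sketch of that misunderstanding (restoring the state requires assigning 
.Random.seed directly):

saved <- .Random.seed
set.seed(saved)         # silently uses only saved[1]; it does NOT restore the PRNG state
.Random.seed <- saved   # this, at top level, actually restores the state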

The "if" instruction, emits a warning when the condition has length >= 2, 
because it is often a bug. I would expect a warning or error with set.seed().

Validating inputs and emitting errors early is a good practice.

Just my 2 cents.

Sincerely.
Andre GILLIBERT

-----Original Message-----
From: R-devel [mailto:r-devel-boun...@r-project.org] On behalf of Avraham Adler
Sent: Friday, 17 September 2021 12:07
To: Henrik Bengtsson
Cc: R-devel
Subject: Re: [Rd] WISH: set.seed(seed) to produce error if length(seed) != 1 
(now silent)

Hi, Henrik.

I’m curious, other than proper programming practice, why?

Avi

On Fri, Sep 17, 2021 at 11:48 AM Henrik Bengtsson <
henrik.bengts...@gmail.com> wrote:

> Hi,
>
> according to help("set.seed"), argument 'seed' to set.seed() should be:
>
>   a single value, interpreted as an integer, or NULL (see ‘Details’).
>
> From code inspection (src/main/RNG.c) and testing, it turns out that
> if you pass a 'seed' with length greater than one, it silently uses
> seed[1], e.g.
>
> > set.seed(1); sum(.Random.seed)
> [1] 4070365163
> > set.seed(1:3); sum(.Random.seed)
> [1] 4070365163
> > set.seed(1:100); sum(.Random.seed)
> [1] 4070365163
>
> I'd like to suggest that set.seed() produces an error if length(seed)
> > 1.  As a reference, for length(seed) == 0, we get:
>
> > set.seed(integer(0))
> Error in set.seed(integer(0)) : supplied seed is not a valid integer
>
> /Henrik
>
> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>
--
Sent from Gmail Mobile




Re: [Rd] Question about quantile fuzz and GPL license

2021-09-15 Thread GILLIBERT, Andre
Martin Maechler wrote:
> OTOH,  type=7 is the default, and I guess used in 99.9% of
> all uses of quantile, *and* does never use any fuzz 

Indeed. This also implies that this default should be well thought out when 
creating a new implementation of the quantile() procedure for a new programming 
language or library.
Most of the time, users use the default procedure, and do not report which 
procedure was used in the statistical analysis reports and scientific or 
non-scientific articles they produce.
The differences between the quantile procedures are minor, unless they are 
used in crazy scenarios such as a sample size of 2, or with probs=0.001 for a 
sample of size 1000.
But, standardization of procedures is desirable for analysis reproducibility, 
as well as teaching (see https://doi.org/10.1080/10691898.2006.11910589 ).

Hyndman and Fan wanted software packages to standardize on their definition, but 
to no avail:
See https://robjhyndman.com/hyndsight/sample-quantiles-20-years-later/

In the absence of a standard, my personal advice would be to use the same default 
as a popular statistical package, such as R or SAS.

R, Julia and NumPy (Python) use type 7 as the default.
Microsoft Excel and LibreOffice Calc use type 7 as the default (although Excel 
versions >= 2010 have new procedures).
SAS uses type 3 as the default, unless prob=0.50.
Stata uses type 2 or type 6, depending on the procedure 
(https://data.princeton.edu/stata/markdown/quantiles.htm).
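A tiny example of how much the procedures can disagree on a small sample (a 
sketch using R's quantile() types):

x <- c(1, 2, 3, 4)
quantile(x, 0.25, type = 7)  # 1.75 (R default)
quantile(x, 0.25, type = 6)  # 1.25
quantile(x, 0.25, type = 2)  # 1.5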

-- 
Sincerely
André GILLIBERT



Re: [Rd] Question about quantile fuzz and GPL license

2021-09-14 Thread GILLIBERT, Andre


On 9/14/21 9:22 AM, Abel AOUN wrote:
> However I don't get why epsilon is multiplied by 4 instead of simply using 
> epsilon.
> Is there someone who can explain this 4 ?

.Machine$double.eps is the "precision" of floating point values for values 
close to 1.0 (between 0.5 and 2.0).

Using fuzz = .Machine$double.eps would have no effect if nppm is greater than 
or equal to 2.
Using fuzz = 4 * .Machine$double.eps can fix rounding errors for nppm < 8; for 
greater nppm, it has no effect.

Indeed:
2 + .Machine$double.eps == 2
8 + 4*.Machine$double.eps == 8

Since nppm is approximately equal to the quantile multiplied by the sample 
size, it can be much greater than 2 or 8.

Maybe the rounding errors are only problematic for small nppm; or only that 
case is taken into account.

Moreover, if rounding errors are cumulative, they can be much greater than the 
precision of the floating point value. I do not know how this constant was 
chosen and what the use-cases were.

--
Sincerely
Andre GILLIBERT




Re: [Rd] Spurious warnings in coercion from double/complex/character to raw

2021-09-10 Thread GILLIBERT, Andre
Hello,


Signed integer overflow is undefined behavior according to the C standard.


For instance, on my computer, with GCC 5.4.0 at optimization level 2, the 
following program never stops:


#include <stdio.h>


int main(void) {
    for (int i = 1; i != 0; i++) {
        if ((i & 0xFFF) == 0) {
            printf("%d\n", i);
        }
    }
}



This is due to a compiler optimization that assumes that the integer can never 
overflow, and so can never be equal to zero, and so the for loop should never 
stop. You should always be very cautious when adding two signed integers, to 
avoid any overflow. There is no problem with unsigned integers.


Similarly, double-to-integer conversions are only safe if the double is in the 
range [INT_MIN, INT_MAX].


The standard contains:

When a finite value of real floating type is converted to an integer type other 
than _Bool,
the fractional part is discarded (i.e., the value is truncated toward zero). If 
the value of
the integral part cannot be represented by the integer type, the behavior is 
undefined


The easiest way to avoid any risk when converting is to check that the double 
(e.g. vi) is in the range [0, 255] BEFORE converting to an integer.


--

Sincerely

André GILLIBERT


From: R-devel on behalf of Duncan Murdoch
Sent: Friday, 10 September 2021 18:12:02
To: Hervé Pagès; r-devel
Subject: Re: [Rd] Spurious warnings in coercion from double/complex/character to 
raw



On 10/09/2021 11:29 a.m., Hervé Pagès wrote:
> Hi,
>
> The first warning below is unexpected and confusing:
>
> > as.raw(c(3e9, 5.1))
> [1] 00 05
> Warning messages:
> 1: NAs introduced by coercion to integer range
> 2: out-of-range values treated as 0 in coercion to raw
>
> The reason we get it is that coercion from numeric to raw is currently
> implemented on top of coercion from numeric to int (file
> src/main/coerce.c, lines 700-710):
>
>   case REALSXP:
>   for (i = 0; i < n; i++) {
> //  if ((i+1) % NINTERRUPT == 0) R_CheckUserInterrupt();
>   tmp = IntegerFromReal(REAL_ELT(v, i), &warn);
>   if(tmp == NA_INTEGER || tmp < 0 || tmp > 255) {
>   tmp = 0;
>   warn |= WARN_RAW;
>   }
>   pa[i] = (Rbyte) tmp;
>   }
>   break;
>
> The first warning comes from the call to IntegerFromReal().
>
> The following code avoids the spurious warning and is also simpler and
> slightly faster:
>
>   case REALSXP:
>   for (i = 0; i < n; i++) {
> //  if ((i+1) % NINTERRUPT == 0) R_CheckUserInterrupt();
>   double vi = REAL_ELT(v, i);
>   if(ISNAN(vi) || (tmp = (int) vi) < 0 || tmp > 255) {
>   tmp = 0;
>   warn |= WARN_RAW;
>   }
>   pa[i] = (Rbyte) tmp;
>   }
>   break;

Doesn't that give different results in case vi is so large that "(int)
vi" overflows?  (I don't know what the C standard says, but some online
references say that behaviour is implementation dependent.)

For example, if

   vi = 1.0 +  INT_MAX;

wouldn't "(int) vi" be equal to a small integer?

Duncan Murdoch


>
> Coercion from complex to raw has the same problem:
>
> > as.raw(c(3e9+0i, 5.1))
> [1] 00 05
> Warning messages:
> 1: NAs introduced by coercion to integer range
> 2: out-of-range values treated as 0 in coercion to raw
>
> Current implementation (file src/main/coerce.c, lines 711-721):
>
>   case CPLXSXP:
>   for (i = 0; i < n; i++) {
> //  if ((i+1) % NINTERRUPT == 0) R_CheckUserInterrupt();
>   tmp = IntegerFromComplex(COMPLEX_ELT(v, i), &warn);
>   if(tmp == NA_INTEGER || tmp < 0 || tmp > 255) {
>   tmp = 0;
>   warn |= WARN_RAW;
>   }
>   pa[i] = (Rbyte) tmp;
>   }
>   break;
>
> This implementation has the following additional problem when the
> supplied complex has a nonzero imaginary part:
>
> > as.raw(300+4i)
> [1] 00
> Warning messages:
> 1: imaginary parts discarded in coercion
> 2: out-of-range values treated as 0 in coercion to raw
>
> > as.raw(3e9+4i)
> [1] 00
> Warning messages:
> 1: NAs introduced by coercion to integer range
> 2: out-of-range values treated as 0 in coercion to raw
>
> In one case we get a warning about the discarding of the imaginary part
> but not the other case, which is unexpected. We should see the exact
> same warning (or warnings) in both cases.
>
> With the following fix we only get the warning about the discarding of
> 

Re: [Rd] sum() and mean() for (ALTREP) integer sequences

2021-09-02 Thread GILLIBERT, Andre


Hello,


Please, find a long response below.


== difference between mean(x) and sum(x)/length(x) ==


At the core, mean(x) and sum(x)/length(x) work very differently for real 
numbers.

Mean is more accurate when applied to a vector with a small variance but a very 
high mean, especially on platforms without extended precision numbers (i.e. 
long double is 64 bits rather than 80 bits).


On a Windows 10 computer (with 80 bits long double):

k=1e6
base=2^51
realmean = mean(1:k)
sqa=(base+1):(base+k) # with ALTREP
sq=(base+1):(base+k)+0.0 # without ALTREP
mean(sq) - base - realmean  # correctly returns zero
sum(sq)/k - base - realmean # incorrectly returns -0.5
sum(sqa)/k - base - realmean # correctly returns zero

On a GNU/Linux x86_64 computer with extended precision floating point disabled, 
the difference is worse:
mean(sq) - base - realmean  # correctly returns zero
sum(sq)/k - base - realmean # incorrectly returns -1180
sum(sqa)/k - base - realmean # correctly returns zero

Therefore (without ALTREP) sum can be inaccurate, due to floating point 
rounding errors.

The good accuracy of mean() is due to a two-pass algorithm in the real_mean() 
function in src/main/summary.c.

The algorithm can be summarized by the following equivalent R code:
badmean=function(v) {sum(v)/length(v)}
goodmean=function(v) {center = badmean(v); center + badmean(v - center)}
goodmean(sq) - base - realmean # correctly returns zero


== should mean() call ALTINTEGER_SUM ? ==
As you noticed in the examples above, sum() is much more accurate with ALTREPs 
than with the naive algorithm, because there are no cumulative rounding errors.

Moreover, if we focus on INTSXP, the maximum possible value is lower: INT_MAX 
= 2^31-1.

The sum of a large sequence (e.g. 1:(2^31-1)) can still exceed the exact 
integer range (0 to 2^53) of an FP64, and the division does not always round 
back to the correct value.

bad = 1:189812531L
mean(bad) - sum(bad)/length(bad) # returns -1.5e-08 on a platform with FP80
mean(bad) == 94906266 # correct value (the actual result is an integer)

The implementation of mean() on INTSXP does not use the two-pass trick, but 
relies on a LDOUBLE (typically FP80 on the x86 platform) that is large enough 
to avoid rounding bugs even with huge integer sequences.

Unfortunately, the ALTINTEGER_SUM interface returns at best an FP64, and so 
would not return the FP64 closest to the actual mean for some sequences.

Adding a MEAN function to the ALTREP interface could solve the problem.


== can mean performance be improved easily ? ==


The mean() implementation for integers supports ALTREPs with poor iteration 
performance, using the slow INTEGER_ELT() macro.

Moreover, it converts each integer to LDOUBLE, which is slow.


It can be improved using ITERATE_BY_REGION0 and a 64-bit integer (if 
available) that cannot overflow on small batches (size = 512).


# before patching (on Ubuntu x86_64 Silvermont Celeron J1900)
x = 1:1e8
y = 1:1e8+0L
system.time(mean(x)) # user 1.33 second
system.time(mean(y)) # user 0.32 second

# after patching (on Ubuntu x86_64 Silvermont Celeron J1900)
x = 1:1e8
y = 1:1e8+0L
system.time(mean(x)) # user 0.29 second # more than x4 faster
system.time(mean(y)) # user 0.18 second # x 1.7 faster !

(patch attached to this message)

The patch is not optimal. It should ideally use isum() and risum(), but these 
functions are a mess and need refactoring. For instance, they take a 'call' 
argument and may display a warning message related to the sum() function.

--
Sincerely
Andre GILLIBERT

From: R-devel on behalf of Viechtbauer, Wolfgang (SP)
Sent: Thursday, 2 September 2021 12:55:03
To: r-devel@r-project.org
Subject: [Rd] sum() and mean() for (ALTREP) integer sequences



Hi all,

I am trying to understand the performance of functions applied to integer 
sequences. Consider the following:

### begin example ###

library(lobstr)
library(microbenchmark)

x <- sample(1e6)
obj_size(x)
# 4,000,048 B

y <- 1:1e6
obj_size(y)
# 680 B

# So we can see that 'y' uses ALTREP. These are, as expected, the same:

sum(x)
# [1] 5050
sum(y)
# [1] 5050

# For 'x', we have to go through the trouble of actually summing up 1e6 
integers.
# For 'y', knowing its form, we really just need to do:

1e6*(1e6+1)/2
# [1] 5050

# which should be a whole lot faster. And indeed, it is:

microbenchmark(sum(x),sum(y))

# Unit: nanoseconds
#   expr    min       lq      mean   median       uq    max neval cld
#  sum(x) 533452 595204.5 634266.90 613102.5 638271.5 978519   100   b
#  sum(y)183

Re: [Rd] order of operations

2021-08-28 Thread GILLIBERT, Andre
For the discussion:

> Yes, and were it not for 0 * NA == NA, you might skip evaluation of y if x 
> evaluates to zero.


With the same idea, NA * (alpha <- 6) could skip the assignment.

I do not think that R would do that in the short term, but who knows in the 
future!


As of R 4.1, typeof(`*`) actually returns "builtin" rather than "special", 
meaning that all its arguments are evaluated.


--

Sincerely

Andre GILLIBERT


From: R-devel on behalf of peter dalgaard
Sent: Saturday, 28 August 2021 09:26:03
To: Duncan Murdoch
Cc: r-devel@r-project.org
Subject: Re: [Rd] order of operations



Yes, and were it not for 0 * NA == NA, you might skip evaluation of y if x 
evaluates to zero.  In Andre Gillibert's example:

1 | (alpha<-6)

there really is no reason to evaluate the assignment since (1 | any) is always 
TRUE. Notwithstanding method dispatch, that is.

With general function calls, all bets are off. Even f(x <- 1) might decide not 
to evaluate its argument.

- pd

> On 27 Aug 2021, at 21:14 , Duncan Murdoch  wrote:
>
> On 27/08/2021 3:06 p.m., Enrico Schumann wrote:
>> On Fri, 27 Aug 2021, Gabor Grothendieck writes:
>>> Are there any guarantees of whether x will equal 1 or 2 after this is run?
>>>
>>> (x <- 1) * (x <- 2)
>>> ## [1] 2
>>> x
>>> ## [1] 2
>> At least the "R Language Definition" [1] says
>>   "The exponentiation operator '^' and the left
>>assignment plus minus operators '<- - = <<-'
>>group right to left, all other operators group
>>left to right.  That is  [...]  1 - 1 - 1 is -1"
>> which would imply 2.
>
> I think this is a different issue.  There's only one operator in question 
> (the "*").  The question is whether x*y evaluates x first or y first (and I 
> believe the answer is that there are no guarantees). I'm fairly sure both are 
> guaranteed to be evaluated, under the rules for group generics listed in 
> ?groupGeneric, but I'm not certain the guarantee is honoured in all cases.
>
> Duncan Murdoch
>
> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel

--
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: pd@cbs.dk  Priv: pda...@gmail.com

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


[[alternative HTML version deleted]]

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] order of operations

2021-08-27 Thread GILLIBERT, Andre
Due to lazy evaluation, the order of operations can be pretty unpredictable in R. 
Actually, some operations may not be performed at all.


The following program illustrates the issue:

test1=function(x,y) {}
test2=function(x,y) {x;y}
test3=function(x,y) {y;x}
alpha="hello"
test1(alpha <- 1, alpha <- 2)
print(alpha) # prints "hello"

test2(alpha <- 1, alpha <- 2)
print(alpha) # prints 2

test3(alpha <- 1, alpha <- 2)
print(alpha) # prints 1


Many internal functions unconditionally evaluate all their parameters before 
performing any operation, because they know that they will use all of them 
anyway. Theoretically, the order of evaluation could be well-specified for 
these internal functions, but I would recommend against doing that, because the 
functions could be changed in the future to skip some of their parameters or to 
evaluate them in a different order.

For instance, in R 4.0, the following code displays 6:
alpha<-42
1|(alpha<-6)
print(alpha)

But, I would not be shocked to see it display 42 in a future version of R.

Moreover, internal functions are usually wrapped in R code that may evaluate 
parameters in an unspecified order due to lazy evaluation.
See mean.default, for instance...

On R 4.0.3:
mean.default((alpha<-1),(alpha<-2),(alpha<-3))
print(alpha) # prints 2
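
For reference, a quick check of the formals that the three expressions above are 
matched to:

args(mean.default)
# function (x, trim = 0, na.rm = FALSE, ...)

Whichever promise happens to be forced last by the body of mean.default leaves 
its value in alpha.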

Things are probably less tricky with a simple addition or multiplication, but I 
would not rely on that.
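
For what it is worth, a small sketch (helper names are ad hoc) showing how the 
current, but unguaranteed, order can be observed:

trace_eval <- character(0)
note <- function(tag, value) { trace_eval <<- c(trace_eval, tag); value }
note("left", 2) * note("right", 3)
# [1] 6
trace_eval
# [1] "left"  "right"    (observed with current R, but not guaranteed by the language)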

--
Sincerely
André GILLIBERT

De : R-devel  de la part de Gabor Grothendieck 

Envoyé : vendredi 27 août 2021 19:57:33
À : Avi Gross
Cc : r-devel@r-project.org
Objet : Re: [Rd] order of operations

ATTENTION: Cet e-mail provient d’une adresse mail extérieure au CHU de Rouen. 
Ne cliquez pas sur les liens ou n'ouvrez pas les pièces jointes à moins de 
connaître l'expéditeur et de savoir que le contenu est sûr. En cas de doute, 
transférer le mail à « DSI, Sécurité » pour analyse. Merci de votre vigilance


It could be that the two sides of * are run in parallel in the future and maybe
not having a guarantee would simplify implementation?


On Fri, Aug 27, 2021 at 12:35 PM Avi Gross via R-devel
 wrote:
>
> Does anyone have a case where this construct has a valid use?
>
> Didn't Python  add a := operator recently that might be intended more for
> such uses as compared to using the standard assignment operators? I wonder
> if that has explicit guarantees of what happens in such cases, but that is
> outside what this forum cares about. Just for the heck of it, I tried the
> example there:
>
> >>> (x := 1) * (x := 2)
> 2
> >>> x
> 2
>
> Back to R, ...
>
> The constructs can get arbitrarily complex as in:
>
> (x <- (x <- 0) + 1) * (x <- (x <-2) + 1)
>
> My impression is that when evaluation is left to right and also innermost
> parentheses before outer ones, then something like the above goes in stages.
> The first of two parenthetical expressions is evaluated first.
>
> (x <- (x <- 0) + 1)
>
> The inner parenthesis set x to zero then the outer one increments x to 1.
> The full sub-expression evaluates to 1 and that value is set aside for a
> later multiplication.
>
> But then the second parenthesis evaluates similarly, from inside out:
>
> (x <- (x <-2) + 1)
>
> It clearly resets x to 2 then increments it by 1 to 3 and returns a value of
> 3. That is multiplied by the first sub-expression to result in 3.
>
> So for simple addition, even though it is commutative, is there any reason
> any compiler or interpreter should not follow rules like the above?
> Obviously with something like matrices, some operations are not abelian and
> require more strict interpretation in the right order.
>
> And note the expressions like the above can run into more complex quandaries
> such as when you have a conditional with OR or AND parts that may be
> short-circuited and in some cases, a variable you expected to be set, may
> remain unset or ...
>
> This reminds me a bit of languages that allow pre/post increment/decrement
> operators like ++ and -- and questions about what order things happen.
> Ideally, anything in which a deterministic order is not guaranteed should be
> flagged by the language at compile time (or when interpreted) and refuse to
> go on.
>
> All I can say with computer languages and adding ever more features,
> with greater power comes greater responsibility and often greater
> confusion.
>
>
> -Original Message-
> From: R-devel  On Behalf Of Gabor
> Grothendieck
> Sent: Friday, August 27, 2021 11:32 AM
> To: Thierry Onkelinx 
> Cc: r-devel@r-project.org
> Subject: Re: [Rd] order of operations
>
> I agree and personally never do this but I would still like to know if it is
> guaranteed behavior or not.
>
> On Fri, Aug 27, 2021 at 11:28 AM Thierry Onkelinx 
> wrote:
>
> > IMHO this is just bad practice. Whether the result is guaranteed or
> > not, doesn't matter.
> >
> > ir. Thierry Onkelinx
> > Statisticus / Statistician
> >
> > Vlaamse Overheid / 

[Rd] R_CheckUserInterrupt

2021-08-27 Thread GILLIBERT, Andre
Dear R developers,


R makes some functions interruptible, thanks to a call to R_CheckUserInterrupt. 
Simple arithmetic operations can be interrupted, avoiding freezes when using 
huge arrays (e.g. length > 1 billion).

But many operations, such as matrix multiplication, are not interruptible. I 
estimated that a multiplication of two 10000×10000 square matrices would freeze 
R for at least 7 days on my computer, unless I kill the process.


I found an old commit that deleted the calls to R_CheckUserInterrupt in many 
basic operations 
(https://github.com/wch/r-source/commit/b99cd362a65012335a853d954cbeb1c782e6ae37)


Why were they deleted?


First hypothesis: this slowed down the code too much, because it was done 
suboptimally, with an integer division (a high-latency CPU operation) at each 
iteration.

Second hypothesis: this introduced bugs, such as memory leaks, in code that 
did not handle interruptions gracefully.


If the first hypothesis is correct, I can write much better code, with almost 
zero penalty, using R_ITERATE_CHECK (in R_ext/Itermacros.h) and a new macro I 
wrote: ITERATE_BY_REGION_CHECK.


That would make more operations interruptible and would even provide 
performance improvements in loops that were not optimized for ALTREPs.


Are you interested in patches?



PS: the slow integer division is actually not very slow with recent GCC 
versions, because this compiler is smart enough to replace it by a 
multiplication and a shift because the divisor is known at compile time. Older 
compilers may not be that smart.


--

Sincerely

André GILLIBERT

[[alternative HTML version deleted]]

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Is it a good choice to increase the NCONNECTION value?

2021-08-26 Thread GILLIBERT, Andre

> as stated earlier, R already uses setrlimit() to raise the limit (see my 
> earlier reply).


Currently, setrlimit() is called by R_EnsureFDLimit() and the latter is called 
by initLoadedDLL() in Rdynload.c depending on the R_MAX_NUM_DLLS environment 
variable.

R_MAX_NUM_DLLS can be at most 1000 and the ulimit is raised to 
ceil(R_MAX_NUM_DLLS/0.6), which is at most 1667.

That seems pretty low to me.


I would not mind calling R_EnsureFDLimit(10000) unconditionally. Ten thousand 
file descriptors should be "cheap" enough in system resources on any modern 
system (probably less than a hundred megabytes even if kernel structures are 
large), and should be enough for moderate-to-intensive uses.

> As for "special" connections, that is not feasible (without some serious 
> re-write), since the connection doesn't know what it is used for and 
> connections are not the only

> way descriptors may be used.


I was thinking about something like /proc/$PID/fd on Linux to enumerate file 
descriptors of the current process, independently of who created them.

However, I did not find a fast and portable way of doing that. If thousands of 
file descriptors are open, we cannot afford to do thousands of system calls 
every time a new connection is created.


Anyway, I wrote and tested a patch with the following features:

1) Dynamic allocation of the Connections array with a MAX_NCONNECTIONS limit

2) MAX_NCONNECTIONS defaults to 8192

3) MAX_NCONNECTIONS can be set at startup by an environment variable 
R_MAX_NCONNECTIONS

4) MAX_NCONNECTIONS can be read and changed at run time via 
options("max.n.connections"), as sketched below

5) R_EnsureFDLimit(10000) is called unconditionally at startup
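
Under point 4, changing the limit at run time would presumably look like this 
(hypothetical, since the option only exists with the patch applied; getOption() 
and options() themselves are standard base R):

getOption("max.n.connections")      # 8192 with the patch defaults
options(max.n.connections = 1024)   # lower or raise the ceiling at run time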


--

Sincerely

André GILLIBERT


De : Simon Urbanek 
Envoyé : jeudi 26 août 2021 01:27:51
À : GILLIBERT, Andre
Cc : qwey...@mail.ustc.edu.cn; R-devel; Martin Maechler
Objet : Re: [Rd] Is it a good choice to increase the NCONNECTION value?

ATTENTION: Cet e-mail provient d’une adresse mail extérieure au CHU de Rouen. 
Ne cliquez pas sur les liens ou n'ouvrez pas les pièces jointes à moins de 
connaître l'expéditeur et de savoir que le contenu est sûr. En cas de doute, 
transférer le mail à « DSI, Sécurité » pour analyse. Merci de votre vigilance


Andre,

as stated earlier, R already uses setrlimit() to raise the limit (see my 
earlier reply).

As for "special" connections, that is not feasible (without some serious 
re-write), since the connection doesn't know what it is used for and 
connections are not the only way descriptors may be used.

Anyway, I think the take away was that likely the best way forward is to make 
it configurable at startup time with possible option to check that value 
against the feasibility of open connections.

Cheers,
Simon



> On Aug 26, 2021, at 9:00 AM, GILLIBERT, Andre  
> wrote:
>
> Hello,
>
>
> The soft limit to the number of file descriptors is 1024 on GNU/Linux but the 
> default hard limit is at 1048576 or 524288 on major modern distributions : 
> Ubuntu, Fedora, Debian.
>
> I do not have access to a Macintosh, but it looks like the soft limit is 256 
> and hard limit is "unlimited", though actually, the real hard limit has been 
> reported as 10240 
> (https://developer.r-project.org/Blog/public/2018/03/23/maximum-number-of-dlls/index.html).
>
>
> Therefore, R should easily be able to change the limit without superuser 
> privileges, with a call to setrlimit().
>
> This should make file descriptor exhaustion very unlikely, except for buggy 
> programs leaking file descriptors.
>
>
> The simplest approach would be to set the soft limit to the value of the hard 
> limit. Maybe to be nicer, R could set it to 10000 (or the hard limit if 
> lower), which should be enough for intensive uses but would not use too much 
> system resources in case of file descriptor leaks.
>
>
> To get R to work reliably in more esoteric operating systems or in poorly 
> configured systems (e.g. systems with a hard limit at 1024), a second 
> security could be added: a request of a new connection would be denied if the 
> actual number of open file descriptors (or connections if that is easier to 
> compute) is too close to the hard limit. A fixed amount (e.g. 128) or a 
> proportion (e.g. 25%) of file descriptors would be reserved for "other uses", 
> such as shared libraries.
>
>
> This discussion reminds me of the fixed number of file descriptors of MS-DOS, 
> defined at boot time in config.sys (e.g. files=20).
>
> It is incredible that 64-bit computers in 2021 with gigabytes of RAM still 
> have similar limits, and that R has a hard-coded limit at 128.
>
>
> --
>
> Sincerely
>
> André GILLIBERT
>
> 
> De : qwey...@mail.u

Re: [Rd] Is it a good choice to increase the NCONNECTION value?

2021-08-25 Thread GILLIBERT, Andre
Hello,


The soft limit to the number of file descriptors is 1024 on GNU/Linux, but the 
default hard limit is 1048576 or 524288 on major modern distributions: 
Ubuntu, Fedora, Debian.

I do not have access to a Macintosh, but it looks like the soft limit is 256 
and hard limit is "unlimited", though actually, the real hard limit has been 
reported as 10240 
(https://developer.r-project.org/Blog/public/2018/03/23/maximum-number-of-dlls/index.html).


Therefore, R should easily be able to change the limit without superuser 
privileges, with a call to setrlimit().

This should make file descriptor exhaustion very unlikely, except for buggy 
programs leaking file descriptors.


The simplest approach would be to set the soft limit to the value of the hard 
limit. Maybe to be nicer, R could set it to 10000 (or the hard limit if lower), 
which should be enough for intensive uses but would not use too much system 
resources in case of file descriptor leaks.


To get R to work reliably in more esoteric operating systems or in poorly 
configured systems (e.g. systems with a hard limit at 1024), a second security 
could be added: a request of a new connection would be denied if the actual 
number of open file descriptors (or connections if that is easier to compute) 
is too close to the hard limit. A fixed amount (e.g. 128) or a proportion (e.g. 
25%) of file descriptors would be reserved for "other uses", such as shared 
libraries.
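
As a rough sketch of that second safeguard (function and argument names are 
made up for illustration):

fd_headroom_ok <- function(n_open_fds, hard_limit, reserve = 128L, frac = 0.25) {
  n_open_fds < hard_limit - max(reserve, frac * hard_limit)
}
fd_headroom_ok(900, 1024)   # FALSE: too close to the hard limit, deny the new connection
fd_headroom_ok(900, 8192)   # TRUE: plenty of headroom left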


This discussion reminds me of the fixed number of file descriptors of MS-DOS, 
defined at boot time in config.sys (e.g. files=20).

It is incredible that 64-bit computers in 2021 with gigabytes of RAM still 
have similar limits, and that R has a hard-coded limit at 128.


--

Sincerely

André GILLIBERT


De : qwey...@mail.ustc.edu.cn 
Envoyé : mercredi 25 août 2021 06:15:59
À : Simon Urbanek
Cc : Martin Maechler; GILLIBERT, Andre; R-devel
Objet : 回复: [SPAM] Re: [Rd] Is it a good choice to increase the NCONNECTION 
value?

ATTENTION: Cet e-mail provient d’une adresse mail extérieure au CHU de Rouen. 
Ne cliquez pas sur les liens ou n'ouvrez pas les pièces jointes à moins de 
connaître l'expéditeur et de savoir que le contenu est sûr. En cas de doute, 
transférer le mail à « DSI, Sécurité » pour analyse. Merci de votre vigilance


Simon,

What about using a dynamically allocated connections and a modifiable 
MAX_NCONNECTIONS limit?
ulimit could be modified by root users; NCONNECTION, at least for now, could not.

I tried changing the program to use malloc and realloc to allocate memory; due 
to unfamiliarity with `.Internal` calls, I could not provide a function that 
modifies MAX_NCONNECTIONS (but it is possible).
Test and changes are shown below. I would appreciate it if you could tell me 
whether there could be a bug.

(a demo that may change MAX_NCONNECTIONS, not tested.)
static int SetMaxNconnections(int now){ // return current value of 
MAX_NCONNECTIONS
  if(now<3)error(_("Could not shrink the MAX_NCONNECTIONS less than 3"));
  if(now>65536)warning(_("Setting MAX_NCONNECTIONS=%d, larger than 65536, may 
be crazy. Use at your own risk."),now);
  // setting MAX_NCONNECTIONS to a really large value is safe, since the 
allocation is not done immediately. Thus this is a warning.
  if(now>=NCONNECTIONS)return MAX_NCONNECTIONS=now; // if now is larger than 
NCONNECTIONS<=now,MAX_NCONNECTIONS, thus it is safe.
  R_gc(); /* Try to reclaim unused connections */
  for(int i=NCONNECTIONS;i>=now;--i){// now >= 3 here, thus no underflow occurs.
// shrink the value of MAX_NCONNECTIONS and NCONNECTIONS
if(!Connections[i]){now=i+1;break;}
  }
  // here, we could call a realloc, since *Connections only capture several 
kilobytes, realloc seems meaningless.
  // a true realloc will trigger if NCONNECTIONS

> library(doParallel);cl=makeForkCluster(128);max(sapply(clusterCall(cl,function()runif(10)),"+"))
Loading required package: foreach
Loading required package: iterators
Loading required package: parallel
Warning messages:
1: In socketAccept(socket = socket, blocking = TRUE, open = "a+b",  :
  increase max connections from 16 to 32
2: In socketAccept(socket = socket, blocking = TRUE, open = "a+b",  :
  increase max connections from 32 to 64
3: In socketAccept(socket = socket, blocking = TRUE, open = "a+b",  :
  increase max connections from 64 to 128
4: In socketAccept(socket = socket, blocking = TRUE, open = "a+b",  :
  increase max connections from 128 to 256
[1] 0.9975836
>
>


tested changes:


~line 127

static int NCONNECTIONS=16; /* need one per cluster node, 16 is the
  initial value which grows dynamically */
static int MAX_NCONNECTIONS=8192; /* increase it only affect the speed of
  finding the correct connection, if you have a machine with more than
  4096 threads, you could submit an issue or modify this value manually */
#define NSINKS 21

Re: [Rd] Is it a good choice to increase the NCONNECTION value?

2021-08-24 Thread GILLIBERT, Andre
Rconnection is a pointer to an Rconn structure. The Rconn structure must be 
allocated independently (e.g. by malloc() in R_new_custom_connection).

Therefore, increasing NCONNECTION to 1024 should only use 8 kilobytes on 
64-bit platforms and 4 kilobytes on 32-bit platforms.

Ideally, it should be dynamically allocated: either as a linked list or as a 
dynamic array (malloc/realloc). However, a simple change of NCONNECTION to 1024 
should be enough for most uses.
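
For context, a small sketch of how the current hard-coded limit shows up at the 
R level (counts are approximate, since stdin, stdout and stderr occupy slots too):

cons <- list()
repeat {
  con <- tryCatch(file(tempfile(), open = "w"), error = identity)
  if (inherits(con, "error")) break          # "all connections are in use"
  cons[[length(cons) + 1L]] <- con
}
length(cons)                # about 125 with the current NCONNECTIONS of 128
for (con in cons) close(con)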


--

Sincerely

André GILLIBERT



De : R-devel  de la part de Martin Maechler 

Envoyé : mardi 24 août 2021 10:44
À : qwey...@mail.ustc.edu.cn
Cc : R-devel
Objet : Re: [Rd] Is it a good choice to increase the NCONNECTION value?


> qweytr1--- via R-devel
> on Tue, 24 Aug 2021 00:51:31 +0800 (GMT+08:00) writes:

> At least in 2015, a github user, tobigithub, submit an
> [issue](https://github.com/sneumann/xcms/issues/20) about
> the error "Error in file(con, "w") : all connections are
> in use" Nowadays, since AMD have really cool CPUs which
> increases the thread numbers to 128 or even 256 on a
> single server, we found that the NCONNECTIONS variable
> could prevent us from utilizing all the 128 threads.  It
> might be a good choice to increase its value.


> the variable is defined in
> `R-4.1.1/src/main/connections.c: 17` I have tested that,
> increase it to 1024 generates no error and all the
> clusters (I tried with 256 clusters on my 16 threads
> Laptop) works fine.

> Is it possible increase the size of NCONNECTION?

Yes, of course, it is possible.
The question is how much it costs  and to which number it should
be increased.

A quick look at the source connections.c --> src/R_ext/include/Connections.h
reveals that the Rconnection* <--> Rconn is a struct with about
200 chars and ca 30 int-like plus another 20 pointers .. which
would amount to a rough 400 bytes per connection.
Adding 1024-128 = 896 new ones  would then amount to increase
the R executable by about 360 kB .. all the above being rough.
So personally, I guess that's  "about ok" --
are there other things to consider?

Ideally, of course, the number of possible connections could be
increased dynamically only when needed

Martin

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


[[alternative HTML version deleted]]

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


[Rd] Problem in random number generation for Marsaglia-Multicarry + Ahrens-Dieter

2021-08-12 Thread GILLIBERT, Andre
Dear R developers,

At the same time as I discovered a flaw in Marsaglia-Multicarry + 
Kinderman-Ramage, I found another in Marsaglia-Multicarry + Ahrens-Dieter.
It is less obvious than for Kinderman-Ramage, so I created a new thread for 
this bug.

The following code shows the problem (tested on R 4.1.1 x86_64 for Windows 10):

== start of code sample ==
set.seed(1, "Marsaglia-Multicarry", normal.kind="Ahrens-Dieter")
v=rnorm(1e8)

q=qnorm(seq(0.01, 0.99, 0.01))
cv=cut(v, breaks=c(-Inf, q, +Inf))
observed=table(cv)
chisq.test(observed) # p < 2.2e-16
== end of code sample ==

The chisq.test returns a P-value < 2.2e-16 while it was expected to return a 
non-significant P-value.
The additional code below shows severe irregularities in the distribution of 
quantiles:

== continuation of code sample ==
expected = chisq.test(observed)$expected
z = (observed - expected)/sqrt(expected)
mean (abs(z) > 6) # 58% of z-scores are greater than 6 while none should be
== end of code sample ==

The bug is specific to the combination Marsaglia-Multicarry + Ahrens-Dieter.
There is no problem with Marsaglia-Multicarry + Inversion or Mersenne-Twister + 
Ahrens-Dieter.
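
For comparison, a quick sketch of the same check with a combination that behaves 
correctly (the p-value is itself random, but should typically be unremarkable):

set.seed(1, "Mersenne-Twister", normal.kind = "Inversion")
v <- rnorm(1e8)
cv <- cut(v, breaks = c(-Inf, qnorm(seq(0.01, 0.99, 0.01)), +Inf))
chisq.test(table(cv))$p.value   # no sign of the gross irregularities seen above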

I would expect at least a warning (or an error) from R for such a buggy 
combination.

--
Sincerely
André GILLIBERT


[[alternative HTML version deleted]]

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


[Rd] Problem in random number generation for Marsaglia-Multicarry + Kinderman-Ramage

2021-08-12 Thread GILLIBERT, Andre
Dear R developers,


In my opinion, I discovered a severe flaw that occurs when the 
Marsaglia-Multicarry pseudo-random number generator is combined with the 
Kinderman-Ramage algorithm to generate normally distributed numbers.


The sample program is very simple (tested on R-4.1.1 x86_64 on Windows 10):

set.seed(1, "Marsaglia-Multicarry", normal.kind="Kinderman-Ramage")
v=rnorm(1e7)
poisson.test(sum(v < (-4)))$conf.int # returns c(34.5, 62.5)
poisson.test(sum(v > (4)))$conf.int # returns c(334.2, 410.7)
pnorm(-4)*1e7 # returns 316.7


There should be approximately 316 values less than -4 and 316 values greater 
than +4, but there are far too few values less than -4.

Results are similar with other random seeds, and things are even more obvious 
with larger sample sizes.

The Kinderman-Ramage algorithm is fine when combined with Mersenne-Twister, and 
Marsaglia-Multicarry is fine when combined with the normal.kind="Inversion" 
algorithm, but the combination of Marsaglia-Multicarry and Kinderman-Ramage 
seems to have severe flaws.
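
A quick sketch of the same tail check with Mersenne-Twister, for comparison:

set.seed(1, "Mersenne-Twister", normal.kind = "Kinderman-Ramage")
v <- rnorm(1e7)
c(sum(v < -4), sum(v > 4))   # both counts should be in the vicinity of 317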

R should at least warn about that combination!

What do you think? Should I file a bug report?

--
Sincerely
André GILLIBERT

[[alternative HTML version deleted]]

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


[Rd] Redundant source code for random number generation

2021-08-07 Thread GILLIBERT, Andre
Dear R developers,


When trying to fix the poor performance of the runif() function (I can easily 
make it three times faster if you are interested in performance patches, maybe 
six times faster with a bit of refactoring of the R source code), I noticed some 
redundant code in the R sources (R-devel of 2021-08-05).

Indeed, the family of random number generation functions (runif, rnorm, rchisq, 
rbeta, rbinom, etc.) is implemented via Internal functions described in 
src/main/names.c and implemented as do_random1, do_random2 and do_random3 in 
src/main/random.c.


They are also reimplemented in src/library/stats/src/random.c in three main 
functions (random1, random2, random3) that will eventually be stored in a 
dynamic library (stats.so or stats.dll).


For instance, the stats::runif R function is implemented as:

function (n, min = 0, max = 1)
.Call(C_runif, n, min, max)


but could equivalently be implemented as:

function(n, min = 0, max = 1)

.Internal(runif(n, min, max))


The former calls the src/library/stats/src/random.c implementation (in stats.so 
or stats.dll) while the latter would call the src/main/random.c implementation 
(in the main R binary).
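
A rough sketch of how the two entry points can be compared (the .Internal 
wrapper below is only for testing; it is not a recommended way to call runif):

library(microbenchmark)
runif_via_internal <- function(n, min = 0, max = 1) .Internal(runif(n, min, max))
microbenchmark(stats::runif(1000), runif_via_internal(1000))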


The two implementations (src/main/random.c and src/library/stats/src/random.c) 
are similar but slightly different on small details. For instance, rbinom 
always return a vector of doubles (REAL) in src/main/random.c while it tries to 
return a vector of integers in src/library/stats/src/random.c, unless the 
integers are too large to fit in an INT.


I see no obvious reason to maintain both copies of the source. Actually, the 
src/main/random.c implementation seems to be unused in normal R programs. There 
could be some weird programs that use the .Internal call, but I do not think that there are 
many.


There are several strategies to merge both, but I want some feedback of people 
who know well the R source code before proposing patches.


--

Sincerely

André GILLIBERT

[[alternative HTML version deleted]]

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


[Rd] Poor family objects error messages

2020-04-13 Thread GILLIBERT, Andre
Hello,

The following code:

> binomial(identity)

Generates an error message:
Error in binomial(identity) :
  link "identity" not available for binomial family; available links are 
‘logit’, ‘probit’, ‘cloglog’, ‘cauchit’, ‘log’

While:
> binomial("identity")
Yields an identity-binomial object that works as expected with stats::glm
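
A small sketch with simulated data (starting values are supplied because the 
identity link can otherwise fail to find a valid starting point):

set.seed(1)
x <- runif(200)
y <- rbinom(200, 1, 0.2 + 0.5 * x)
fit <- glm(y ~ x, family = binomial("identity"), start = c(0.2, 0.5))
coef(fit)   # intercept and slope on the probability (risk difference) scale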

The error in the first example misled me for years. I thought 
identity-binomial models were unsupported by R.
The documentation is correct but misleading too.

> The gaussian family accepts the links (as names) identity, log and inverse; 
> the binomial family the
> links logit, probit, cauchit, (corresponding to logistic, normal and Cauchy 
> CDFs respectively) log and
> cloglog (complementary log-log);

Without changing the language, this could be fixed by changing the error 
messages to something more suggestive.

Suggestion:
Error in binomial(identity) :
  name identity not available for binomial family; please use a character 
string such as binomial("identity")

The documentation could be updated to insist on that.

The gaussian family accepts the links (as names) identity, log and inverse; the 
binomial family the
links logit, probit, cauchit, (corresponding to logistic, normal and Cauchy 
CDFs respectively) log and
cloglog (complementary log-log); [...] If the link function is given as a 
character string, all families accept all link functions.

What do you think of that?

--
Sincerely
André GILLIBERT

[[alternative HTML version deleted]]

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel