[Bioc-devel] Compilation flags, CHECK errors and BiocNeighbors

2018-12-15 Thread Aaron Lun
Sometime between 6 and 18 November, BiocNeighbors’ BioC-devel builds began 
failing on Windows 64-bit, and they have continued to fail since:

http://bioconductor.org/checkResults/devel/bioc-LATEST/BiocNeighbors/ 


The most interesting part is the nature of the failures. They are not 
segmentation faults but rather “incorrect” output in the unit tests:

- BiocNeighbors uses the Annoy algorithm for approximate nearest neighbor 
search, which is provided as a header-only C++ library in the RcppAnnoy package.

- I have compiled the BiocNeighbors C++ code with an “#include” for these 
libraries to use the Annoy routines. For testing, I compared the output of my 
C++ code to the output of the code in the RcppAnnoy package.

- It is these tests that are failing (i.e., the output does not match up) 
during CHECK on Windows 64-bit only, despite the fact that the same library is 
being “#include”d in both the BiocNeighbors and RcppAnnoy sources!

What makes this particularly intriguing is that the differences between 
BiocNeighbors and RcppAnnoy are very minor. Less than 1% of the neighbor 
identities differ, and only for some of the scenarios, so it’s not an obvious 
bug that would be changing the output en masse. Now, the package also 
uses/tests Annoy in BioC-release but builds fine on tokay1:

http://bioconductor.org/checkResults/release/bioc-LATEST/BiocNeighbors/ 


The major difference between the Bioc-release/devel builds is the compilation 
flags, which have changed from “-O2 -mtune=generic” to “-O3 -march=native 
-mtune=native” on tokay2. I am told (thanks Val) that the timing of this change 
is consistent with the start of the BiocNeighbors build failures on tokay2. I 
would guess that RcppAnnoy is also compiled with “-O2 -mtune=generic” on the 
CRAN build systems, introducing differences in optimization levels between the 
BiocNeighbors and RcppAnnoy binaries. These could be responsible for the 
discrepancies in the search results.

I was able to reproduce this on my Unix cluster (gcc 6.5.0) where setting 
“-march=native” with either “-O3” or “-O2” caused a difference in the 
calculations. After much trial and error, I eventually narrowed this down to 
the “-mfma” flag, which allows fused multiply-add instructions that round once 
rather than after both the multiplication and the addition, and thus changes 
the search results. This occurs even when AVX support is turned off; I guess 
the compiler tries to be smart if it detects a multiplication followed by an 
addition, which is a pretty common thing to do when computing Euclidean 
distances.

In summary: can we not use “-march=native” on tokay2? (Val, I know we discussed 
this, but whatever changes you made to the compilation flags don’t seem to have 
propagated to the build machines.) As the case study with BiocNeighbors shows, 
this leads to inconsistencies between the CRAN and BioC-devel binaries for the 
same code, which unnecessarily complicates downstream usage and unit tests. I 
also wonder how binaries specialized for tokay2’s architecture would behave on 
other CPUs with different instruction sets, if they would run at all.

Cheers,

Aaron

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Rd] Documentation examples for lm and glm

2018-12-15 Thread frederik

I agree with Steve and Achim that we should keep some examples with no
data frame. That's Objectively Simpler, whether or not it leads to
clutter in the wrong hands. As Steve points out, we have attach()
which is an excellent language feature - not to mention with().

I would go even further and say that the examples that are in lm() now
should stay at the top, both because people may be used to referring to
them and because Historical Order is generally a good order in
which to learn things. However, if there is an important function
argument ("data=") not in the examples, then we should add examples
which use it. Likewise if there is a popular programming style
(putting things in a data frame). So let's do something along the
lines of what Thomas is requesting, but put it after the existing
documentation? Please?

On a bit of a tangent, I would like to see an example in lm() which
plots my data with a fitted line through it. I'm probably betraying my
ignorance here, but I was asked how to do this when showing R to a
friend and I thought it should be in lm(), after all it seems a bit
more basic than displaying a Normal Q-Q plot (whatever that is!
gasp...). Similarly for glm(). Perhaps all this can be accomplished
by merely doubling the size of the existing examples.

Thanks.

Frederick

On Sat, Dec 15, 2018 at 02:15:52PM +0100, Achim Zeileis wrote:
A pragmatic solution could be to create a simple linear regression 
example with variables in the global environment and then another 
example with a data.frame.


The latter might be somewhat more complex, e.g., with several 
regressors and/or mixed categorical and numeric covariates to 
illustrate how regression and analysis of (co-)variance can be 
combined. I like to use MASS's whiteside data for this:


data("whiteside", package = "MASS")
m1 <- lm(Gas ~ Temp, data = whiteside)
m2 <- lm(Gas ~ Insul + Temp, data = whiteside)
m3 <- lm(Gas ~ Insul * Temp, data = whiteside)
anova(m1, m2, m3)

Moreover, some binary response data.frame with a few covariates might 
be a useful addition to "datasets". For example, a more granular 
version of the "Titanic" data (in addition to the 4-way table 
?Titanic). Another relatively straightforward data set, popular in 
econometrics and the social sciences, is the "Mroz" data; see e.g. 
help("PSID1976", package = "AER").


I would be happy to help with these if such additions were considered 
for datasets/stats.



On Sat, 15 Dec 2018, David Hugh-Jones wrote:


I would argue examples should encourage good practice. Beginners ought to
learn to keep data in data frames and not to overuse attach(). Experts can
do otherwise at their own risk, but they have less need of explicit
examples.

On Fri, 14 Dec 2018 at 14:51, S Ellison  wrote:


FWIW, before all the examples are changed to data frame variants, I think
there's fairly good reason to have at least _one_ example that does _not_
place variables in a data frame.

The data argument in lm() is optional. And there is more than one way to
manage data in a project. I personally don't much like lots of stray
variables lurking about, but if those are the only variables out there and
we can be sure they aren't affected by other code, it's hardly essential to
create a data frame to hold something you already have.
Also, attach() is still part of R, for those folk who have a data frame
but want to reference the contents across a wider range of functions
without using with() a lot. lm() can reasonably omit the data argument
there, too.

So while there are good reasons to use data frames, there are also good
reasons to provide examples that don't.

Steve Ellison



-----Original Message-----
From: R-devel [mailto:r-devel-boun...@r-project.org] On Behalf Of Ben
Bolker
Sent: 13 December 2018 20:36
To: r-devel@r-project.org
Subject: Re: [Rd] Documentation examples for lm and glm


 Agree.  Or just create the data frame with those variables in it
directly ...

On 2018-12-13 3:26 p.m., Thomas Yee wrote:

Hello,

something that has been on my mind for a decade or two has
been the examples for lm() and glm(). They encourage poor style
because of mismanagement of data frames. Also, having the
variables in a data frame means that predict()
is more likely to work properly.

For lm(), the variables should be put into a data frame.
As 2 vectors are assigned first in the general workspace they
should be deleted afterwards.

For glm(), the data frame d.AD is constructed but not used. Also,
its 3 components were assigned first in the general workspace, so they
float around dangerously afterwards like in the lm() example.

Rather than attaching improved .Rd files here, they are put at
www.stat.auckland.ac.nz/~yee/Rdfiles
You are welcome to use them!

Best,

Thomas

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel



Re: [Rd] Documentation examples for lm and glm

2018-12-15 Thread Achim Zeileis
A pragmatic solution could be to create a simple linear regression example 
with variables in the global environment and then another example with a 
data.frame.


The latter might be somewhat more complex, e.g., with several regressors 
and/or mixed categorical and numeric covariates to illustrate how 
regression and analysis of (co-)variance can be combined. I like to use 
MASS's whiteside data for this:


data("whiteside", package = "MASS")
m1 <- lm(Gas ~ Temp, data = whiteside)
m2 <- lm(Gas ~ Insul + Temp, data = whiteside)
m3 <- lm(Gas ~ Insul * Temp, data = whiteside)
anova(m1, m2, m3)

Moreover, some binary response data.frame with a few covariates might be a 
useful addition to "datasets". For example, a more granular version of the 
"Titanic" data (in addition to the 4-way table ?Titanic). Another 
relatively straightforward data set, popular in econometrics and the social 
sciences, is the "Mroz" data; see e.g. help("PSID1976", package = "AER").


I would be happy to help with these if such additions were considered for 
datasets/stats.


