Re: [Rd] [RFC] A case for freezing CRAN
Hervé Pagès hpa...@fhcrc.org on Thu, 20 Mar 2014 15:23:57 -0700 writes:

On 03/20/2014 01:28 PM, Ted Byers wrote:
On Thu, Mar 20, 2014 at 3:14 PM, Hervé Pagès hpa...@fhcrc.org wrote:
On 03/20/2014 03:52 AM, Duncan Murdoch wrote:
On 14-03-20 2:15 AM, Dan Tenenbaum wrote:
----- Original Message -----
From: David Winsemius dwinsem...@comcast.net
To: Jeroen Ooms jeroen.o...@stat.ucla.edu
Cc: r-devel r-devel@r-project.org
Sent: Wednesday, March 19, 2014 11:03:32 PM
Subject: Re: [Rd] [RFC] A case for freezing CRAN

On Mar 19, 2014, at 7:45 PM, Jeroen Ooms wrote:
On Wed, Mar 19, 2014 at 6:55 PM, Michael Weylandt michael.weyla...@gmail.com wrote:

Reading this thread again, is it a fair summary of your position to say "reproducibility by default is more important than giving users access to the newest bug fixes and features by default"? It's certainly arguable, but I'm not sure I'm convinced: I'd imagine that the ratio of new work being done vs. reproductions is rather high, and the current setup optimizes for that already.

I think that separating development from released branches can give us both reliability/reproducibility (stable branch) and new features (unstable branch). The user gets to pick (and you can pick both!). The same is true for r-base: when using a 'released' version you get 'stable' base packages that are up to 12 months old. If you want the latest stuff, you download a nightly build of r-devel. For regular users and reproducible research it is recommended to use the stable branch. However, if you are a developer (e.g. a package author) you might want to develop/test/check your work with the latest r-devel. I think that extending the R release cycle to CRAN would result in both more stable released versions of R and more freedom for package authors to implement rigorous change in the unstable branch.
When writing a script that is part of a production pipeline, or a Sweave paper that should be reproducible 10 years from now, or a book on using R, you use a stable version of R, which is guaranteed to behave the same over time. However, when developing packages that should be compatible with the upcoming release of R, you use r-devel, which has the latest versions of other CRAN and base packages.

As I remember ... the example demonstrating the need for this was an XML package that caused an extract from a website where the headers were misinterpreted as data in one version of pkg:XML and not in another. That seems fairly unconvincing. Data cleaning and validation is a basic task of data analysis. It also seems excessive to assert that it is the responsibility of CRAN to maintain a synced binary archive that will be available in ten years.

CRAN already does this: the bin/windows/contrib directory has subdirectories going back to 1.7, with packages dated October 2004. I don't see why it is burdensome to continue to archive these. It would be nice if source versions had a similar archive.

The bin/windows/contrib directories are updated every day for active R versions. It's only when Uwe decides that a version is no longer worth active support that he stops doing updates, and it freezes. A consequence of this is that the snapshots preserved in those older directories are unlikely to match what someone who keeps up to date with R releases is using. Their purpose is to make sure that those older versions aren't completely useless, but they aren't what Jeroen was asking for.

But it is almost completely useless from a reproducibility point of view to get random package versions. For example, if some people try to use R-2.13.2 today to reproduce an analysis that was published 2 years ago, they'll get Matrix 1.0-4 on Windows, Matrix 1.0-3 on Mac, and Matrix 1.1-2-2 on Unix.
And none of them, of course, is what was used by the authors of the paper (they used Matrix 1.0-1, which is what was current when they ran their analysis).

Initially this discussion brought back nightmares of DLL hell on Windows. Those as ancient as I will remember that well. But now the focus seems to be on reproducibility, but with what strikes me as a seriously flawed notion of what reproducibility means. Herve Pages mentions the risk of irreproducibility across three minor revisions of version 1.0 of Matrix.
Re: [Rd] [RFC] A case for freezing CRAN
FWIW, I am mirroring CRAN at github now, here: https://github.com/cran

One can install specific package versions using the devtools package:

library(devtools)
install_github("cran/package@version")

In addition, one can also install versions based on the R version, e.g.:

install_github("cran/package@R-2.15.3")

installs the version that was on CRAN when R-2.15.3 was released. This is not very convenient yet, because the dependencies should be installed based on the R version as well. This is in the works. This is an experiment, and I am not yet committed to maintaining it in the long run. We'll see how it works and if it has the potential to be useful.

Plans for features:
- convenient install of packages from CRAN snapshots, with all dependencies coming from the same snapshot.
- web page with package search, summaries, etc.
- binaries

Help is welcome, especially advice and feedback: https://github.com/metacran/tools/issues

Best, Gabor

_______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
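To make Gabor's instructions concrete, here is a hedged sketch. The package name "plyr" and the versions are purely illustrative, and the single-string `"user/repo@ref"` form follows the syntax in the post; the devtools API has changed over the years, so current versions may differ.

```r
# install.packages("devtools")  # if not already installed
library(devtools)

# A specific archived version of an (illustrative) package, from the
# read-only github.com/cran mirror:
install_github("cran/plyr@1.8")

# The version that was current on CRAN when R 2.15.3 was released:
install_github("cran/plyr@R-2.15.3")
```

Note the caveat from the post: dependencies are still resolved against current CRAN rather than the snapshot, so this alone does not pin a full dependency tree.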
Re: [Rd] [RFC] A case for freezing CRAN
Freezing CRAN solves no problem of reproducibility. If you know the sessionInfo() or the version of R, the packages used and their versions, you can reproduce that set-up. If you do not know, then you cannot. You can try to guess: the source code of old release versions of R and old packages is in the CRAN archive, and these files have dates. So you can collect a snapshot of R and packages for a given date. This is not an ideal solution, but it is the same level of reproducibility that you get with a strictly frozen CRAN.

CRAN is not the sole source of packages, and even with a strictly frozen CRAN, users may have used packages from other sources. I am sure that if CRAN were frozen (but I assume that happens the same day hell freezes), people would increasingly often use package sources other than CRAN. The choice is easy if the alternatives are to wait until next year for the bug fix release, or to do the analysis now using package versions from R-Forge or GitHub. Then you could not assume that frozen CRAN packages were used.

CRAN policy is not made in this mailing list, and the CRAN maintainers are so silent that it hurts the ears. However, I hope they won't freeze CRAN.

Strict reproduction seems to be harder than I first imagined: "./configure; make" really failed for R 2.14.1 and older on my office desktop. To reproduce an older analysis, I would also need to install older tool sets (I suspect gfortran and the cairo libraries).

CRAN is one source of R packages, and certainly its policy does not suit all developers. There is no policy that suits all. A frozen CRAN would suit some, but would certainly deter others. There seems to be a common sentiment here that the only reason anybody would use an R older than 3.0.3 is to reproduce old results. My experience from Real Life(™) is that many of us use computers that we do not own; they are the property of our employer.
This may mean that we are not allowed to install any software there, or we (or the Department or project) have to pay the computer administration for installing new versions of software (our case). This is often called "security". Personally I avoid this by using a Mac laptop and a Linux desktop: these are not supported by the University computer administration and I can do what I please with them, but poor Windows users are stuck.

Computer classes are also maintained by the centralized computer administration. This January they had a new R, but last year it was still two years old. However, users can install packages in their personal folders, so they can use current packages even with an older R. Therefore I want to take care that the packages I maintain also run in older R. Therefore I also applaud the current CRAN policy where new versions of packages are backported to the previous R release: even if you are stuck with a stale R, you need not be stuck with stale packages. I currently cannot test with R older than 2.14.2, though, but I do that regularly and certainly before CRAN releases. If somebody wants to prevent this, they can set their package to unnecessarily depend on the current version of R. I would regard this as antisocial, but nobody would ask what I think about it, so it does not matter.

The development branch of my package is on R-Forge, and only bug fixes and (hopefully) non-breaking enhancements (isolated so that they do not influence other functions, safe so that the API does not change and the format of the output does not change) are merged to the CRAN release branch. This policy was adopted because it fits the current CRAN policy, and would probably need to change if CRAN policy changes.

Cheers, Jari Oksanen
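Jari's "collect a snapshot for a given date" suggestion can be sketched roughly as below. This scrapes the CRAN Archive directory listing, so the URL layout and the crude regular expressions are assumptions about the listing format, and "Matrix" plus the date are only examples; treat it as a sketch, not a tool.

```r
# Hedged sketch: find the newest archived source tarball of a package
# whose archive date is on or before a target date.
cran_snapshot_url <- function(pkg, date) {
  archive <- sprintf("https://cran.r-project.org/src/contrib/Archive/%s/", pkg)
  listing <- readLines(archive, warn = FALSE)
  # Keep only listing lines that mention a tarball of this package.
  hits <- grep(sprintf("%s_[^\"]*\\.tar\\.gz", pkg), listing, value = TRUE)
  # Pull out the tarball name and the date shown on the same line
  # (assumes an Apache-style directory index).
  tarball <- sub(sprintf(".*(%s_[0-9][0-9.-]*\\.tar\\.gz).*", pkg), "\\1", hits)
  stamp <- as.Date(sub(".*([0-9]{4}-[0-9]{2}-[0-9]{2}).*", "\\1", hits))
  ok <- which(stamp <= as.Date(date))
  if (length(ok) == 0L) stop("no archived version on or before ", date)
  paste0(archive, tarball[ok[which.max(stamp[ok])]])
}

# Illustrative use: the Matrix source as of 2012-03-01, installable with
# install.packages(url, repos = NULL, type = "source").
# url <- cran_snapshot_url("Matrix", "2012-03-01")
```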
Re: [Rd] [RFC] A case for freezing CRAN
This is a long and (mainly) interesting discussion, which is fanning out in many different directions, and I think many of them are not that relevant to the OP's suggestion.

I see the advantages of having such a dynamic CRAN, but also of having a more stable CRAN. I prefer CRAN as it is now, but in many cases a more stable CRAN might be an advantage. So having releases of CRAN might make sense. But then there is the archiving issue of CRAN.

The suggestion was made to move the responsibility away from CRAN and the R infrastructure to the user / researcher to guarantee that the results can be re-run years later. It would be nice to have this built into CRAN, but let's stick with the scenario that the user should care for reproducibility.

Leaving the issue of compilation aside, a package which creates a custom installation of the R version used, including its source and the sources of the packages, in a format compilable on Linux (given that the relevant dependencies are installed), would be a huge step forward. I know compilation on Windows (and sometimes Mac) is a serious problem, but to archive *all* binaries and to re-compile all older versions of R and all packages would be an impossible task.

Apart from that, doing your analysis in a Virtual Machine and then simply archiving this Virtual Machine would also be an option, but only for the more tech-savvy users.

In a nutshell: I think a package would be able to provide the solution for local archiving, to make it possible to re-run the simulation with the same tools at a later stage, although guarantees would not be possible.

Cheers, Rainer

--
Rainer M. Krug
email: Raineratkrugsdotde
PGP: 0x0F52F982
Re: [Rd] [RFC] A case for freezing CRAN
Jari Oksanen jari.oksa...@oulu.fi writes:

Freezing CRAN solves no problem of reproducibility. If you know the sessionInfo() or the version of R, the packages used and their versions, you can reproduce that set-up. If you do not know, then you cannot. You can try to guess: the source code of old release versions of R and old packages is in the CRAN archive, and these files have dates. So you can collect a snapshot of R and packages for a given date. This is not an ideal solution, but it is the same level of reproducibility that you get with a strictly frozen CRAN. CRAN is not the sole source of packages, and even with a strictly frozen CRAN, users may have used packages from other sources. I am sure that if CRAN were frozen (but I assume that happens the same day hell freezes), people would increasingly often use package sources other than CRAN. The choice is easy if the alternatives are to wait until next year for the bug fix release, or to do the analysis now using package versions from R-Forge or GitHub. Then you could not assume that frozen CRAN packages were used.

Agree completely here - the solution would be a package which packages the source (or even binaries?) of your local R setup, including R and the packages used. The solution is local - not on a server.

CRAN policy is not made in this mailing list, and the CRAN maintainers are so silent that it hurts the ears.

+1

However, I hope they won't freeze CRAN.

Yes and no - if they do, we need a devel branch which acts like the current CRAN.

Strict reproduction seems to be harder than I first imagined: "./configure; make" really failed for R 2.14.1 and older on my office desktop. To reproduce an older analysis, I would also need to install older tool sets (I suspect gfortran and the cairo libraries).

Absolutely - let's not go there. And then there is also the hardware issue.

CRAN is one source of R packages, and certainly its policy does not suit all developers. There is no policy that suits all.
A frozen CRAN would suit some, but would certainly deter others. There seems to be a common sentiment here that the only reason anybody would use an R older than 3.0.3 is to reproduce old results. My experience from Real Life(™) is that many of us use computers that we do not own; they are the property of our employer. This may mean that we are not allowed to install any software there, or we (or the Department or project) have to pay the computer administration for installing new versions of software (our case). This is often called "security". Personally I avoid this by using a Mac laptop and a Linux desktop: these are not supported by the University computer administration and I can do what I please with them, but poor Windows users are stuck.

Nicely put.

Computer classes are also maintained by the centralized computer administration. This January they had a new R, but last year it was still two years old. However, users can install packages in their personal folders, so they can use current packages even with an older R. Therefore I want to take care that the packages I maintain also run in older R. Therefore I also applaud the current CRAN policy where new versions of packages are backported to the previous R release: even if you are stuck with a stale R, you need not be stuck with stale packages. I currently cannot test with R older than 2.14.2, though, but I do that regularly and certainly before CRAN releases. If somebody wants to prevent this, they can set their package to unnecessarily depend on the current version of R. I would regard this as antisocial, but nobody would ask what I think about it, so it does not matter. The development branch of my package is on R-Forge, and only bug fixes and (hopefully) non-breaking enhancements (isolated so that they do not influence other functions, safe so that the API does not change and the format of the output does not change) are merged to the CRAN release branch.
This policy was adopted because it fits the current CRAN policy, and would probably need to change if CRAN policy changes.

Cheers, Jari Oksanen

--
Rainer M. Krug
email: Raineratkrugsdotde
PGP: 0x0F52F982
Re: [Rd] [RFC] A case for freezing CRAN
This is becoming an extremely long thread, and it is going in too many directions. However, I would like to mention here our ongoing five-year ECOS project for the study of Open Source Ecosystems, among which CRAN. You can find info here: http://informatique.umons.ac.be/genlog/projects/ecos/. We are in the second year now. We are currently working on CRAN maintainability questions. See:

- Claes, Maelick; Mens, Tom; Grosjean, Philippe. "On the maintainability of CRAN packages." IEEE CSMR-WCRE 2014 Software Evolution Week, Antwerp, Belgium, 2014.
- Mens, Tom; Claes, Maelick; Grosjean, Philippe; Serebrenik, Alexander. "Studying Evolving Software Ecosystems based on Ecological Models." In: Mens, Tom; Serebrenik, Alexander; Cleve, Anthony (eds.), Evolving Software Systems, Springer, ISBN 978-3-642-45397-7, 2014.

Currently we are building an open-source system based on VirtualBox and Vagrant to recreate a virtual machine under Linux (Debian and Ubuntu considered for the moment) that would be as close as possible to the CRAN environment as it was at any given date. Our plan is to replay CRAN back in time and to instrument that platform to measure what we need for our ecological studies of CRAN.

The connection with this thread is the possibility to reuse this system for proposing something useful for reproducible research, that is, a reproducible platform, in the sense of the reproducibility-vs-replicability distinction Jeroen Ooms mentions. It would then be enough to record the date some R code was run on that platform (and perhaps whether it is a 32- or 64-bit system) to be able to rebuild a similar software environment with all corresponding CRAN packages of the right version easily installable. In case something specific is required in addition to the software proposed by default, Vagrant allows provisioning the virtual machine in an easy way too… but then the provisioning script must be provided too (not much of a problem).
The info required to rebuild the platform shrinks down to an ASCII text file of a few kilobytes. This is something easy to put together with your R code in, say, the additional material of a publication. Please keep in mind that many platform-specific features in R (graphic devices, string encoding, and many more) may also be a problem for reproducing published results. Hence the idea to use a virtual box running only one OS, Linux, no matter whether you work on Windows, Mac OS X, or… Solaris (anyone there?). PhG

On 20 Mar 2014, at 21:53, Jeroen Ooms jeroen.o...@stat.ucla.edu wrote:
On Thu, Mar 20, 2014 at 1:28 PM, Ted Byers r.ted.by...@gmail.com wrote:

Herve Pages mentions the risk of irreproducibility across three minor revisions of version 1.0 of Matrix. My gut reaction would be that if the results are not reproducible across such minor revisions of one library, they are probably just so much BS.

Perhaps this is just terminology, but what you refer to I would generally call 'replication'. Of course being able to replicate results with other data or other software is important to validate claims. But being able to reproduce how the original results were obtained is an important part of this process. If someone is publishing results that I think are questionable and I cannot replicate them, I want to know exactly how those outcomes were obtained in the first place, so that I can 'debug' the problem. It's quite important to be able to trace back whether incorrect results were the result of a bug, incompetence or fraud. Take the example of the Reinhart and Rogoff case. The results obviously were not replicable, but without more information it was just the word of a grad student vs. two Harvard professors. Only after reproducing the original analysis was it possible to point out the errors and prove that the original results were incorrect.
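The Vagrant-based platform described above would presumably boil down to a small text file of roughly this shape. Everything here (the box name, the provisioning script name, the date variable) is a hypothetical illustration of the approach, not the actual ECOS tooling:

```ruby
# Hypothetical Vagrantfile sketch: pin a Debian base box of the right era
# and provision it with a script that installs the R version and CRAN
# package versions that were current at SNAPSHOT_DATE.
SNAPSHOT_DATE = "2012-03-01"  # the one fact a paper would need to record

Vagrant.configure("2") do |config|
  config.vm.box = "debian-squeeze-64"     # illustrative frozen OS image
  config.vm.provision "shell",
    path: "install-r-snapshot.sh",        # hypothetical helper script
    args: [SNAPSHOT_DATE]
end
```

The point of the design is that the few-kilobyte Vagrantfile, not the multi-gigabyte virtual machine, is what gets archived with the paper.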
Re: [Rd] [RFC] A case for freezing CRAN
On 21/03/2014, at 10:40 AM, Rainer M Krug wrote: This is a long and (mainly) interesting discussion, which is fanning out in many different directions, and I think many of them are not that relevant to the OP's suggestion. I see the advantages of having such a dynamic CRAN, but also of having a more stable CRAN. I prefer CRAN as it is now, but in many cases a more stable CRAN might be an advantage. So having releases of CRAN might make sense. But then there is the archiving issue of CRAN. The suggestion was made to move the responsibility away from CRAN and the R infrastructure to the user / researcher to guarantee that the results can be re-run years later. It would be nice to have this built into CRAN, but let's stick with the scenario that the user should care for reproducibility.

There are two different problems that alternate in the discussion: reproducibility and breakage of CRAN dependencies. A frozen CRAN could make *approximate* reproducibility easier to achieve, but real reproducibility needs stricter solutions. The actual sessionInfo() is minimal information, but re-building a spitting image of an old environment may still be demanding (though in many cases this does not matter).

Another problem is that CRAN is so volatile that new versions of packages break other packages or old scripts. Here the main problem is how package developers work. Freezing CRAN would not change that: if package maintainers release breaking code, that breaking code would be frozen. I think most packages do not make a distinction between development and release branches, and CRAN policy won't change that. I can sympathize with package maintainers having 150 reverse dependencies. My main package only has ~50, and it is certain that I won't test them all with a new release. I sometimes tried, but I could not even get them all built, because they had other dependencies on packages that failed.
Even those that I could test failed to detect problems (in one case all examples were \dontrun and "passed" the tests nicely). I only wish that people who *really* depend on my package would test it against the R-Forge version and alert me before CRAN releases, but that is not very likely (I guess many dependencies are not *really* necessary, but only concern marginal features of the package; CRAN forces one to declare those, though).

Still a few words about the reproducibility of scripts: this can hardly be achieved with good coverage, because many scripts are so very ad hoc. When I edit and review manuscripts for journals, I very often get Sweave or knitr scripts that "just work", where "just" means "just so and so". Often they do not work at all, because they had some undeclared private functionalities or stray files in the author's workspace that did not travel with the Sweave document. I think these -- published scientific papers -- are the main field where the code really should be reproducible, but they are often the hardest to reproduce. Nothing the CRAN people do can help with the sloppy code scientists write for publications. You know, they are scientists -- not engineers.

Cheers, Jari Oksanen

Leaving the issue of compilation aside, a package which creates a custom installation of the R version used, including its source and the sources of the packages, in a format compilable on Linux (given that the relevant dependencies are installed), would be a huge step forward. I know compilation on Windows (and sometimes Mac) is a serious problem, but to archive *all* binaries and to re-compile all older versions of R and all packages would be an impossible task. Apart from that, doing your analysis in a Virtual Machine and then simply archiving this Virtual Machine would also be an option, but only for the more tech-savvy users. In a nutshell: I think a package would be able to provide the solution for local archiving, to make it possible to re-run the simulation with the same tools at a later stage, although guarantees would not be possible.

Cheers, Rainer

--
Rainer M. Krug
email: Raineratkrugsdotde
PGP: 0x0F52F982
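The recurring point that sessionInfo() is the minimal information for reproduction suggests a cheap habit: write it out next to every set of results. A minimal sketch (the file names are illustrative):

```r
# Record the exact R version, platform, and loaded package versions
# alongside the analysis output, so a reader can later reconstruct the setup.
writeLines(capture.output(sessionInfo()), "results-sessionInfo.txt")

# A more machine-readable variant: every installed package and its version.
write.csv(installed.packages()[, c("Package", "Version")],
          "results-packages.csv", row.names = FALSE)
```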
Re: [Rd] [RFC] A case for freezing CRAN
Jari Oksanen jari.oksa...@oulu.fi writes: On 21/03/2014, at 10:40 AM, Rainer M Krug wrote: This is a long and (mainly) interesting discussion, which is fanning out in many different directions, and I think many of them are not that relevant to the OP's suggestion. I see the advantages of having such a dynamic CRAN, but also of having a more stable CRAN. I prefer CRAN as it is now, but in many cases a more stable CRAN might be an advantage. So having releases of CRAN might make sense. But then there is the archiving issue of CRAN. The suggestion was made to move the responsibility away from CRAN and the R infrastructure to the user / researcher to guarantee that the results can be re-run years later. It would be nice to have this built into CRAN, but let's stick with the scenario that the user should care for reproducibility.

There are two different problems that alternate in the discussion: reproducibility and breakage of CRAN dependencies. A frozen CRAN could make *approximate* reproducibility easier to achieve, but real reproducibility needs stricter solutions. The actual sessionInfo() is minimal information, but re-building a spitting image of an old environment may still be demanding (though in many cases this does not matter). Another problem is that CRAN is so volatile that new versions of packages break other packages or old scripts. Here the main problem is how package developers work. Freezing CRAN would not change that: if package maintainers release breaking code, that breaking code would be frozen. I think most packages do not make a distinction between development and release branches, and CRAN policy won't change that. I can sympathize with package maintainers having 150 reverse dependencies. My main package only has ~50, and it is certain that I won't test them all with a new release. I sometimes tried, but I could not even get them all built, because they had other dependencies on packages that failed.
Even those that I could test failed to detect problems (in one case all examples were \dontrun and "passed" the tests nicely). I only wish that people who *really* depend on my package would test it against the R-Forge version and alert me before CRAN releases, but that is not very likely (I guess many dependencies are not *really* necessary, but only concern marginal features of the package; CRAN forces one to declare those, though).

Breakage of CRAN packages is a problem on which I cannot comment much. I have no idea how this could be solved unless one introduces more checks, which nobody wants. CRAN is a (more or less) open repository for packages written by engineers / programmers but also by scientists from other fields - and that is the strength of CRAN: a central repository to find packages which conform to a minimal standard and format.

Still a few words about the reproducibility of scripts: this can hardly be achieved with good coverage, because many scripts are so very ad hoc. When I edit and review manuscripts for journals, I very often get Sweave or knitr scripts that "just work", where "just" means "just so and so". Often they do not work at all, because they had some undeclared private functionalities or stray files in the author's workspace that did not travel with the Sweave document.

One reason why I *always* start my R sessions with --vanilla and have a local initialization script which I call manually.

I think these -- published scientific papers -- are the main field where the code really should be reproducible, but they are often the hardest to reproduce.

And this is completely out of the hands of R / CRAN / ... and in the hands of journals and authors. But R could provide a framework to make this easier, in the form of a package which provides functions to make this a one-command approach.

Nothing the CRAN people do can help with the sloppy code scientists write for publications. You know, they are scientists -- not engineers.
Absolutely - and I am also a sloppy scientist - I put my code online, but hope that not many people ask me about it later.

Cheers, Rainer

Cheers, Jari Oksanen

Leaving the issue of compilation aside, a package which creates a custom installation of the R version used, including its source and the sources of the packages, in a format compilable on Linux (given that the relevant dependencies are installed), would be a huge step forward. I know compilation on Windows (and sometimes Mac) is a serious problem, but to archive *all* binaries and to re-compile all older versions of R and all packages would be an impossible task. Apart from that, doing your analysis in a Virtual Machine and then simply archiving this Virtual Machine would also be an option, but only for the more tech-savvy users. In a nutshell: I think a package would be able to provide the solution for local archiving, to make it possible to re-run the simulation with the same tools at a later stage, although guarantees would not be possible.
Re: [Rd] [RFC] A case for freezing CRAN
On 21 Mar 2014, at 11:08, Rainer M Krug rai...@krugs.de wrote: Jari Oksanen jari.oksa...@oulu.fi writes: On 21/03/2014, at 10:40 AM, Rainer M Krug wrote: This is a long and (mainly) interesting discussion, which is fanning out in many different directions, and I think many of them are not that relevant to the OP's suggestion. I see the advantages of having such a dynamic CRAN, but also of having a more stable CRAN. I prefer CRAN as it is now, but in many cases a more stable CRAN might be an advantage. So having releases of CRAN might make sense. But then there is the archiving issue of CRAN. The suggestion was made to move the responsibility away from CRAN and the R infrastructure to the user / researcher to guarantee that the results can be re-run years later. It would be nice to have this built into CRAN, but let's stick with the scenario that the user should care for reproducibility.

There are two different problems that alternate in the discussion: reproducibility and breakage of CRAN dependencies. A frozen CRAN could make *approximate* reproducibility easier to achieve, but real reproducibility needs stricter solutions. The actual sessionInfo() is minimal information, but re-building a spitting image of an old environment may still be demanding (though in many cases this does not matter). Another problem is that CRAN is so volatile that new versions of packages break other packages or old scripts. Here the main problem is how package developers work. Freezing CRAN would not change that: if package maintainers release breaking code, that breaking code would be frozen. I think most packages do not make a distinction between development and release branches, and CRAN policy won't change that. I can sympathize with package maintainers having 150 reverse dependencies. My main package only has ~50, and it is certain that I won't test them all with a new release. I sometimes tried, but I could not even get them all built, because they had other dependencies on packages that failed.
Even those that I could test failed to detect problems (in one case all examples were \dontrun and "passed" the tests nicely). I only wish that people who *really* depend on my package would test it against the R-Forge version and alert me before CRAN releases, but that is not very likely (I guess many dependencies are not *really* necessary, but only concern marginal features of the package; CRAN forces one to declare those, though).

We work on these too. So far, for the latest CRAN versions, we have successfully installed 4999 of the 5321 CRAN packages on our platform. Regarding conflicts in terms of function names, around 2000 packages are clean, but the rest produce more than 11,000 pairs of conflicts (i.e., the same function name in different packages). For dependency errors, look at the references cited earlier. It is strange that a large portion of R CMD check errors on CRAN appear and disappear *without any version update* of the package or of any of its direct or indirect dependencies! That is, a fraction of errors or warnings seem to appear and disappear without any code update. We have traced some of these back to interaction with the net (e.g., an example or vignette downloads data from a server, and the server may sometimes be unavailable). So, yes, a complex and difficult topic.

Breakage of CRAN packages is a problem on which I cannot comment much. I have no idea how this could be solved unless one introduces more checks, which nobody wants. CRAN is a (more or less) open repository for packages written by engineers / programmers but also by scientists from other fields - and that is the strength of CRAN: a central repository to find packages which conform to a minimal standard and format.

Still a few words about the reproducibility of scripts: this can hardly be achieved with good coverage, because many scripts are so very ad hoc. When I edit and review manuscripts for journals, I very often get Sweave or knitr scripts that "just work", where "just" means "just so and so".
Often they do not work at all, because they had some undeclared private functionalities or stray files in the author's workspace that did not travel with the Sweave document. This is one reason why I *always* start my R sessions with --vanilla and have a local initialization script which I call manually. I think these -- published scientific papers -- are the main field where the code really should be reproducible, but they are often the hardest to reproduce. And this is completely out of the hands of R / CRAN / ... and in the hands of journals and authors. But R could provide a framework to make this easier, in the form of a package which provides functions to make this a one-command approach. Nothing CRAN people do can help with sloppy code scientists write for publications. You know, they are scientists -- not engineers. This would be a first step. Then, people would have to learn how to use, say, Sweave, in order to
Re: [Rd] [RFC] A case for freezing CRAN
For me, the most important aspect is being able to reproduce my own work. Some other tools offer interesting approaches to managing packages: * NPM -- The Node Package Manager for Node.js loads a local copy of all packages and dependencies. This helps ensure reproducibility and avoids dependency issues. Different projects in different directories can then use different package versions. * Julia -- Julia's package manager is based on git, so users should have a local copy of all package versions they've used. Theoretically, you could use separate git repos for different projects, and merge as desired. I've thought about putting my local R library into a git repository. Then, I could clone that into a project directory and use .libPaths(.Rlibrary) in a .Rprofile file to set the library directory to the clone. In addition to handling package versions, this might be nice for installing packages that are rarely used (my library directory tends to get cluttered if I start trying out packages). Another addition could be a local script that starts a specific version of R. For now, I don't have much incentive to do this. For the packages that I use, R has been pretty good to me with backwards compatibility. I do like the idea of a CRAN mirror that's under version control. On Tue, Mar 18, 2014 at 4:24 PM, Jeroen Ooms jeroen.o...@stat.ucla.edu wrote: This came up again recently with an irreproducible paper. Below is an attempt to make a case for extending the r-devel/r-release cycle to CRAN packages. These suggestions are not in any way intended as criticism of anyone or the status quo. The proposal described in [1] is to freeze a snapshot of CRAN along with every release of R. In this design, updates for contributed packages are treated the same as updates for base packages, in the sense that they are only published to the r-devel branch of CRAN and do not affect users of released versions of R.
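The .libPaths(.Rlibrary) scheme mentioned above takes only a few lines of base R. A hedged sketch, keeping the poster's .Rlibrary directory name (the exact layout is an assumption):

```r
# Contents of a project's .Rprofile: prepend a project-local library
# (for example a git clone of a versioned library) to the search path.
local({
  lib <- file.path(getwd(), ".Rlibrary")
  dir.create(lib, showWarnings = FALSE)  # no-op if it already exists
  .libPaths(c(lib, .libPaths()))         # project library is searched first
})
```

With this in place, install.packages() installs into the project-local library, so each project (or each git clone of a library) can carry its own package versions.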
Thereby all users, stacks and applications using a particular version of R will by default be using the identical version of each CRAN package. The Bioconductor project uses similar policies. This system has several important advantages: ## Reproducibility Currently R/Sweave/knitr scripts are unstable because of ambiguity introduced by constantly changing CRAN packages. This causes scripts to break or change behavior when upstream packages are updated, which makes reproducing old results extremely difficult. A common counter-argument is that script authors should document the package versions used in the script using sessionInfo(). However, even if authors did this manually, reconstructing the author's environment from this information is cumbersome and often nearly impossible: binary packages might no longer be available, there may be dependency conflicts, etc. See [1] for a worked example. In practice, the current system causes many results or documents generated with R not to be reproducible, sometimes already after a few months. In a system where contributed packages inherit the r-base release cycle, scripts will behave the same across users/systems/time within a given version of R. This greatly reduces the ambiguity of R behavior, and has the potential to make reproducibility a natural part of the language, rather than a tedious exercise. ## Repository Management Just like scripts suffer from upstream changes, so do packages depending on other packages. A particular package that has been developed and tested against the current version of a particular dependency is not guaranteed to work against *any future version* of that dependency. Therefore, packages inevitably break over time as their dependencies are updated. One recent example is the Rcpp 0.11 release, which required all reverse dependencies to be rebuilt or modified. This update caused some serious disruption on our production servers.
Initially we refrained from updating Rcpp on these servers, to prevent currently installed packages depending on Rcpp from breaking. However, soon after the Rcpp 0.11 release, many other CRAN packages started to require Rcpp >= 0.11, and our users started complaining about not being able to install those packages. This resulted in the impossible situation where currently installed packages would not work with the new Rcpp, but newly installed packages would not work with the old Rcpp. Current CRAN policies blame this problem on package authors. However, as is explained in [1], this policy does not solve anything, is unsustainable with growing repository size, and sets completely the wrong incentives for contributing code. Progress comes with breaking changes, and the system should be able to accommodate this. Much of the trouble could have been prevented by a system that does not push bleeding-edge updates straight to end-users, but has a devel branch where conflicts are resolved before publishing them in the next r-release. ## Reliability Another example, this time on
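One low-tech defence against the Rcpp-style mismatch described above is for scripts and servers to assert the dependency versions they were developed against. A minimal sketch; the helper name need_version is invented, and the version numbers are illustrative:

```r
# Fail fast when an installed dependency is missing or older than the
# version the script was developed against.
need_version <- function(pkg, min) {
  if (!requireNamespace(pkg, quietly = TRUE) || packageVersion(pkg) < min)
    stop(sprintf("this script was developed against %s >= %s", pkg, min))
  invisible(TRUE)
}
# e.g. need_version("Rcpp", "0.11.0") before sourcing Rcpp-dependent code;
# here a base package is checked so the example runs anywhere:
need_version("stats", "1.0.0")
```

This does not fix the repository, but it turns a silent behavioral change into an explicit, early error.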
Re: [Rd] [RFC] A case for freezing CRAN
On Mar 19, 2014, at 7:45 PM, Jeroen Ooms wrote: On Wed, Mar 19, 2014 at 6:55 PM, Michael Weylandt michael.weyla...@gmail.com wrote: Reading this thread again, is it a fair summary of your position to say reproducibility by default is more important than giving users access to the newest bug fixes and features by default? It's certainly arguable, but I'm not sure I'm convinced: I'd imagine that the ratio of new work being done vs reproductions is rather high and the current setup optimizes for that already. I think that separating development from released branches can give us both reliability/reproducibility (stable branch) as well as new features (unstable branch). The user gets to pick (and you can pick both!). The same is true for r-base: when using a 'released' version you get 'stable' base packages that are up to 12 months old. If you want to have the latest stuff you download a nightly build of r-devel. For regular users and reproducible research it is recommended to use the stable branch. However if you are a developer (e.g. package author) you might want to develop/test/check your work with the latest r-devel. I think that extending the R release cycle to CRAN would result both in more stable released versions of R, as well as more freedom for package authors to implement rigorous change in the unstable branch. When writing a script that is part of a production pipeline, or a Sweave paper that should be reproducible 10 years from now, or a book on using R, you use a stable version of R, which is guaranteed to behave the same over time. However, when developing packages that should be compatible with the upcoming release of R, you use r-devel, which has the latest versions of other CRAN and base packages. As I remember ... The example demonstrating the need for this was an XML package that caused an extract from a website where the headers were misinterpreted as data in one version of pkg:XML and not in another. That seems fairly unconvincing.
Data cleaning and validation is a basic task of data analysis. It also seems excessive to assert that it is the responsibility of CRAN to maintain a synced binary archive that will be available in ten years. Bug fixes would be inhibited for years, not unlike SAS and Excel. What next? Perhaps all bugs should be labeled as features? Surely this CRAN-of-the-future would be offering something that no other statistical package currently offers, nicht wahr? Why not leave it to the authors to specify which package versions were used in their publications. The authors of the packages would get recognition and the dependencies would be recorded. -- David. What I'm trying to figure out is why the standard 'install the following list of package versions' isn't good enough in your eyes? Almost nobody does this because it is cumbersome and impractical. We can do so much better than this. Note that in order to install old packages you also need to investigate which versions of the dependencies of those packages were used. On win/osx, users need to manually build those packages, which can be a pain. All in all it makes reproducible research difficult, expensive and error-prone. At the end of the day most published results obtained with R just won't be reproducible. Also, I believe that keeping it simple is essential for solutions to be practical. If every script has to be run inside an environment with custom libraries, it takes away much of its power. Running a bash or python script in Linux is so easy and reliable that entire distributions are based on it. I don't understand why we make our lives so difficult in R. In my estimation, a system where stable versions of R pull packages from a stable branch of CRAN will naturally resolve the majority of the reproducibility and reliability problems with R. And in contrast to what some people here are suggesting, it does not introduce any limitations.
If you want to get the latest stuff, you either grab a copy of r-devel, or just enable the testing branch and off you go. Debian 'testing' works in a similar way, see http://www.debian.org/devel/testing. __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel David Winsemius Alameda, CA, USA
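Opting into such a stable branch could look much like pinning a Debian release. A sketch only: the snapshot URL below is entirely hypothetical, and no such frozen branch exists on CRAN today:

```r
# Point the session at a (hypothetical) frozen CRAN branch tied to
# the R 3.0 series; subsequent installs would resolve against the freeze.
options(repos = c(CRAN = "https://cran.example.org/stable/3.0"))
install.packages("Matrix")  # would come from the frozen branch
```

Switching to the testing branch would then be a one-line change of the repos option, with no other tooling required.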
Re: [Rd] [RFC] A case for freezing CRAN
- Original Message - From: David Winsemius dwinsem...@comcast.net To: Jeroen Ooms jeroen.o...@stat.ucla.edu Cc: r-devel r-devel@r-project.org Sent: Wednesday, March 19, 2014 11:03:32 PM Subject: Re: [Rd] [RFC] A case for freezing CRAN [quoted text elided; see David Winsemius's message above] It also seems excessive to assert that it is the responsibility of CRAN to maintain a synced binary archive that will be available in ten years. CRAN already does this; the bin/windows/contrib directory has subdirectories going back to 1.7, with packages dated October 2004. I don't see why it is burdensome to continue to archive these. It would be nice if source versions had a similar archive. Dan
Re: [Rd] [RFC] A case for freezing CRAN
Michael Weylandt michael.weyla...@gmail.com writes: On Mar 19, 2014, at 22:17, Gavin Simpson ucfa...@gmail.com wrote: Michael, I think the issue is that Jeroen wants to take that responsibility out of the hands of the person trying to reproduce a work. If it used R 3.0.x and packages A, B and C then it would be trivial to install that version of R and then pull down the stable versions of A, B and C for that version of R. At the moment, one might note the packages used and even their versions, but what about the versions of the packages that the used packages rely upon, and so on? What if developers don't state known working versions of dependencies? Doesn't sessionInfo() give all of this? If you want to be very worried about every last bit, I suppose it should also include options(), compiler flags, compiler version, BLAS details, etc. (Good talk on the dregs of a floating point number and how hard it is to reproduce them across processors: http://www.youtube.com/watch?v=GIlp4rubv8U) In principle yes - but this calls specifically for a package which extracts the info and stores it in a human-readable format, which can then be used to re-install (automatically) all the versions for (hopefully) reproducibility - because if there are external libraries included, you HAVE problems. The problem is how the heck do you know which versions of packages are needed if developers don't record these dependencies in sufficient detail? The suggested solution is to freeze CRAN at intervals alongside R releases. Then you'd know what the stable versions were. Only if you knew which R release was used. Well - that would be easier to specify in a paper than the version info of all packages needed - and which of the installed ones are actually needed? OK - the ones specified in library() calls. But wait - there are dependencies, imports, ...
That is a lot of digging - I would not know how to do this off the top of my head, except by digging through the DESCRIPTION files of the packages... Or we could just get package developers to be more thorough in documenting dependencies. Or R CMD check could refuse to pass if a package is listed as a dependency but with no version qualifiers. Or R CMD build could add an upper bound (from the current, at-build-time version of dependencies on CRAN) if the package developer didn't include an upper bound. Or... The first is unlikely to happen consistently, and no-one wants *more* checks and hoops to jump through :-) To my mind it is incumbent upon those wanting reproducibility to build the tools to enable users to reproduce works. But the tools already allow it with minimal effort. If the author can't even include session info, how can we be sure the version of R is known? If we can't know which version of R, can we ever change R at all? Etc. to absurdity. My (serious) point is that the tools are in place, but ramming them down folks' throats by intentionally keeping them on older versions by default is too much. When you write a paper or release a tool, you will have tested it with a specific set of packages. It is relatively easy to work out what those versions are (there are tools in R for this). What is required is an automated way to record that info in an agreed-upon way in an approved file/location, and have a tool that facilitates setting up a package library sufficient with which to reproduce a work. That approval doesn't need to come from CRAN or R Core - we can store anything in ./inst. I think the package version and published paper cases are different. For the latter, the recipe is simple: if you want the same results, use the same software (as noted by sessionInfoPlus() or equivalent). Dependencies, imports, package versions, ... not that straightforward, I would say.
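The "agreed upon, parsable record" asked for here can be approximated with base R alone. A sketch, where the function name record_versions and the file layout are invented:

```r
# Write a Package/Version stanza for every loaded namespace to a DCF
# file (the same format DESCRIPTION files use), and read it back.
record_versions <- function(file = file.path(tempdir(), "pkg-versions.dcf")) {
  pkgs <- sort(loadedNamespaces())
  m <- do.call(rbind, lapply(pkgs, function(p)
    c(Package = p, Version = as.character(utils::packageVersion(p)))))
  write.dcf(m, file = file)
  invisible(file)
}
f <- record_versions()
head(read.dcf(f))  # a character matrix with Package and Version columns
```

Because DCF round-trips losslessly via read.dcf(), a tool could later walk the recorded pairs and reinstall each pinned version.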
For the former, I think you start straying into this NP-complete problem: http://people.debian.org/~dburrows/model.pdf Yes, a good config can (and should) be recorded - but isn't that exactly what sessionInfo() gives? Reproducibility is a very important part of doing science, but not everyone using CRAN is doing that. Why force everyone to march to the reproducibility drum? I would place the onus elsewhere to make this work. Agreed: reproducibility is the onus of the author, not the reader. Exactly - but also of the authors of the software which is aimed at being used in the context of reproducibility - the tools should be there to make it easy! My points are: 1) I think the snapshot idea for CRAN is a good idea which should be followed. 2) The snapshots should be incorporated at CRAN, as I assume that CRAN will be there longer than any third-party repository. 3) The default for the user should *not* change, i.e. normal users will always get the newest packages, as now. 4) If this can / will not be done because of workload, storage space, ... commands should be incorporated in a package (preferably which becomes part of
Re: [Rd] [RFC] A case for freezing CRAN
Hadley Wickham h.wick...@gmail.com writes: What would be more useful in terms of reproducibility is the capability of installing a specific version of a package from a repository using install.packages(), which would require archiving older versions in a coordinated fashion. I know CRAN archives old versions, but I am not aware whether we can programmatically query the repository about this. See devtools::install_version(). The main caveat is that you also need to be able to build the package, and ensure you have dependencies that work with that version. The compiling will always be the problem when using older source packages, whatever is done. But for the dependencies: automatic parsing of the dependency fields (Depends, Imports, ...) would help a lot. Together with a command which scans the packages installed in the session and stores them in a parsable, human-readable format - so that all required packages (with the specified versions) can be installed with one command - I think the problem would be much closer to being solved. Rainer Hadley -- Rainer M. Krug email: Raineratkrugsdotde PGP: 0x0F52F982
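devtools::install_version(), mentioned above, fetches a pinned version from the CRAN source archive and builds it. A sketch (requires network access and a build toolchain; the package name and version here are illustrative):

```r
# install.packages("devtools")  # if not already installed
# Restore the exact version an analysis was run with, then verify it:
devtools::install_version("Matrix", version = "1.0-1")
stopifnot(packageVersion("Matrix") == "1.0-1")
```

The caveat from the message stands: this installs one pinned package, but its dependencies are still resolved against current CRAN unless they are pinned the same way.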
Re: [Rd] [RFC] A case for freezing CRAN
On 14-03-20 2:15 AM, Dan Tenenbaum wrote: [quoted text elided; see the messages above] CRAN already does this; the bin/windows/contrib directory has subdirectories going back to 1.7, with packages dated October 2004. I don't see why it is burdensome to continue to archive these. It would be nice if source versions had a similar archive. The bin/windows/contrib directories are updated every day for active R versions. It's only when Uwe decides that a version is no longer worth active support that he stops doing updates, and it freezes. A consequence of this is that the snapshots preserved in those older directories are unlikely to match what someone who keeps up to date with R releases is using. Their purpose is to make sure that those older versions aren't completely useless, but they aren't what Jeroen was asking for. Karl Millar's suggestion seems like an ideal solution to this problem. Any CRAN mirror could implement it. If someone sets this up and commits to maintaining it, I'd be happy to work on the necessary changes to the install.packages/update.packages code to allow people to use it from within R. Duncan Murdoch
Re: [Rd] [RFC] A case for freezing CRAN
Gavin Simpson ucfagls at gmail.com writes: ... To my mind it is incumbent upon those wanting reproducibility to build the tools to enable users to reproduce works. When you write a paper or release a tool, you will have tested it with a specific set of packages. It is relatively easy to work out what those versions are (there are tools in R for this). What is required is an automated way to record that info in an agreed-upon way in an approved file/location, and have a tool that facilitates setting up a package library sufficient with which to reproduce a work. That approval doesn't need to come from CRAN or R Core - we can store anything in ./inst. Gavin, Thanks for contributing useful insights. With reference to Jeroen's proposal and the discussion so far, I can see where the problem lies, but the proposed solutions are very invasive. What might offer a less invasive resolution is a robust and predictable schema for sessionInfo() content, permitting ready parsing, so that (using Hadley's interjection) the reproducer could reconstruct the original execution environment, at least as far as R and package versions are concerned. In fact, I'd argue that the responsibility for securing reproducibility lies with the originating author or organisation, so that work where reproducibility is desired should include such a standardised record. There is an additional problem, not addressed directly in this thread but mentioned in some contributions: what lies upstream of R. That further problem actually lies in the external dependencies and compilers, and beyond that in the hardware. So raising consciousness about the importance of being able to query version information to enable reproducibility is important.
Next, encapsulating the information to permit its parsing would perhaps enable the original execution environment to be reconstructed locally by installing external dependencies, then R, then packages from source, using the same versions of build-train components if possible (and noting mismatches if not). Maybe resurrect StatDataML in addition to RData serialization of the version dependencies? Of course, current R and package versions may provide reproducibility, but if they don't, one would use the parseable record of the original development environment. Reproducibility is a very important part of doing science, but not everyone using CRAN is doing that. Why force everyone to march to the reproducibility drum? I would place the onus elsewhere to make this work. Exactly. Roger Gavin A scientist, very much interested in reproducibility of my work and others. ...
Re: [Rd] [RFC] A case for freezing CRAN
If we could all agree on a particular set of CRAN packages to be used with a certain release of R, then it doesn't matter how the 'snapshotting' gets implemented. This is pretty much the sticking point, though. I see no practical way of reaching that agreement without the kind of decision authority (and effort) that Linux distro maintainers put into the internal consistency of each distribution. CRAN doesn't try to do that; it's just a place to access packages offered by maintainers. As a package maintainer, I think support for critical version dependencies in the imports or dependency lists is a good idea that individual package maintainers could relatively easily manage, but I think freezing CRAN as a whole or adopting single release cycles for CRAN would be thoroughly impractical. S Ellison
Re: [Rd] [RFC] A case for freezing CRAN
On 20/03/2014, at 14:14, S Ellison wrote: [quoted text elided; see the message above] I have a feeling that this discussion has floated between two different arguments in favour of freezing: discontent with package authors who break their packages within the R release cycle, and the ability to reproduce old results. In the beginning the first argument was more prominent, but now the discussion has drifted to reproducing old results. I cannot see how freezing CRAN would help with package authors who do not separate development and CRAN release branches but introduce broken code, or code that breaks other packages. Freezing a broken snapshot would only mean that the situation cannot be cured before the next R release, and then new breakage could be introduced. The result would be a dysfunctional CRAN. I think that quite a few of the package updates are bug fixes and minor enhancements. Further, I do think that these should be backported to previous versions of R: users of a previous version of R should also benefit from bug fixes. This also is the current CRAN policy, and I think it is a good policy.
Personally, I try to keep my packages in such a condition that they will also work in previous versions of R, so that people do not need to upgrade R to get bug fixes in packages. The policy is the same with Linux maintainers: they do not just build a consistent release, but maintain the release by providing bug fixes. In Linux distributions, end of life equals freezing, i.e. no longer providing new versions of software. Another issue is reproducing old analyses. This is a valuable thing, and sessionInfo and the ability to get specific versions of packages are certainly steps forward. It looks like guaranteed reproduction is a hard task, though. For instance, R 2.14.2 is the oldest version of R that I can build out of the box on my Linux desktop. I have earlier built older, even much older, R versions, but something has happened in my OS that crashes the build process. To reproduce an old analysis, I would also have to install an older version of my OS, then build the old R, and then get the old versions of packages. It is nice if that last step is made easier. Cheers, Jari Oksanen
Re: [Rd] [RFC] A case for freezing CRAN
On 03/20/2014 03:52 AM, Duncan Murdoch wrote: On 14-03-20 2:15 AM, Dan Tenenbaum wrote: - Original Message - From: David Winsemius dwinsem...@comcast.net To: Jeroen Ooms jeroen.o...@stat.ucla.edu Cc: r-devel r-devel@r-project.org Sent: Wednesday, March 19, 2014 11:03:32 PM Subject: Re: [Rd] [RFC] A case for freezing CRAN On Mar 19, 2014, at 7:45 PM, Jeroen Ooms wrote: On Wed, Mar 19, 2014 at 6:55 PM, Michael Weylandt michael.weyla...@gmail.com wrote: Reading this thread again, is it a fair summary of your position to say reproducibility by default is more important than giving users access to the newest bug fixes and features by default? It's certainly arguable, but I'm not sure I'm convinced: I'd imagine that the ratio of new work being done vs reproductions is rather high and the current setup optimizes for that already. I think that separating development from released branches can give us both reliability/reproducibility (stable branch) as well as new features (unstable branch). The user gets to pick (and you can pick both!). The same is true for r-base: when using a 'released' version you get 'stable' base packages that are up to 12 months old. If you want to have the latest stuff you download a nightly build of r-devel. For regular users and reproducible research it is recommended to use the stable branch. However if you are a developer (e.g. package author) you might want to develop/test/check your work with the latest r-devel. I think that extending the R release cycle to CRAN would result both in more stable released versions of R, as well as more freedom for package authors to implement rigorous change in the unstable branch. When writing a script that is part of a production pipeline, or sweave paper that should be reproducible 10 years from now, or a book on using R, you use stable version of R, which is guaranteed to behave the same over time. 
However, when developing packages that should be compatible with the upcoming release of R, you use r-devel, which has the latest versions of other CRAN and base packages.

As I remember ... the example demonstrating the need for this was an XML-package case in which an extract from a website had its headers misinterpreted as data in one version of pkg:XML and not in another. That seems fairly unconvincing. Data cleaning and validation is a basic task of data analysis. It also seems excessive to assert that it is the responsibility of CRAN to maintain a synced binary archive that will be available in ten years.

CRAN already does this: the bin/windows/contrib directory has subdirectories going back to 1.7, with packages dated October 2004. I don't see why it is burdensome to continue to archive these. It would be nice if source versions had a similar archive.

The bin/windows/contrib directories are updated every day for active R versions. It's only when Uwe decides that a version is no longer worth active support that he stops doing updates, and it freezes. A consequence of this is that the snapshots preserved in those older directories are unlikely to match what someone who keeps up to date with R releases is using. Their purpose is to make sure that those older versions aren't completely useless, but they aren't what Jeroen was asking for.

But it is almost completely useless from a reproducibility point of view to get random package versions. For example, if someone tries to use R-2.13.2 today to reproduce an analysis that was published 2 years ago, they'll get Matrix 1.0-4 on Windows, Matrix 1.0-3 on Mac, and Matrix 1.1-2-2 on Unix. And none of these, of course, is what was used by the authors of the paper (they used Matrix 1.0-1, which is what was current when they ran their analysis).
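Getting back the exact version the authors used is at least mechanically possible for source packages, since CRAN keeps superseded source tarballs under src/contrib/Archive. A minimal sketch (Matrix 1.0-1 is the version from Hervé's example; whether an old tarball still builds against a current R is not guaranteed):

```r
# Install a specific archived version of Matrix from the CRAN source
# archive. Requires a build toolchain; passing a tarball URL with
# repos = NULL installs directly from that file.
url <- "https://cran.r-project.org/src/contrib/Archive/Matrix/Matrix_1.0-1.tar.gz"
install.packages(url, repos = NULL, type = "source")

# Confirm which version ended up installed
packageVersion("Matrix")
```

This only helps for source installs, which is exactly why the binary side (bin/windows/contrib, bin/macosx/contrib) is the sticking point in the discussion.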
A big improvement from a reproducibility point of view would be to (a) have a clear cut for the freezes, (b) freeze the source packages as well as the binary packages, and (c) freeze the same versions of source and binaries. For example, the freeze of bin/windows/contrib/x.y, bin/macosx/contrib/x.y and contrib/x.y could happen when the R-x.y series itself freezes (i.e. no more minor versions planned for this series). Cheers, H.

Karl Millar's suggestion seems like an ideal solution to this problem. Any CRAN mirror could implement it. If someone sets this up and commits to maintaining it, I'd be happy to work on the necessary changes to the install.packages/update.packages code to allow people to use it from within R. Duncan Murdoch

-- Hervé Pagès, Program in Computational Biology, Division of Public Health Sciences, Fred Hutchinson Cancer Research Center, 1100 Fairview Ave. N, M1-B514, P.O. Box 19024, Seattle, WA 98109-1024. E-mail: hpa...@fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
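Duncan's offer hints at how little machinery is missing on the client side: install.packages() already accepts an arbitrary repository URL, so a frozen mirror would work today. A sketch, with a purely hypothetical snapshot host (the URL below is invented for illustration; no such server is implied):

```r
# Point this session at a hypothetical frozen snapshot of CRAN
# as it stood at the R-2.15 freeze.
frozen <- "https://cran-frozen.example.org/snapshot/2.15"
options(repos = c(CRAN = frozen))

# All installs now resolve against the frozen package tree
install.packages("Matrix")

# Or, without touching session options, for a single call:
install.packages("Matrix", repos = frozen)
```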
Re: [Rd] [RFC] A case for freezing CRAN
On Thu, Mar 20, 2014 at 3:14 PM, Hervé Pagès hpa...@fhcrc.org wrote:
Initially this discussion brought back nightmares of DLL hell on Windows. Those as ancient as I am will remember that well.
But now the focus seems to be on reproducibility, with what strikes me as a seriously flawed notion of what reproducibility means. Hervé Pagès mentions the risk of irreproducibility across three minor revisions of version 1.0 of Matrix. My gut reaction would be that if the results are not reproducible across such minor revisions of one library, they are probably just so much BS.

I am trained in mathematical ecology, with more than a couple of decades of post-doc experience working with risk assessment in the private sector. When I need to do an analysis, I will repeat it myself in multiple products, as well as in C++ or FORTRAN code I have hand-crafted myself (and when I wrote number-crunching code myself, I would do so in multiple programming languages - C++, Java, FORTRAN - applying rigorous QA procedures to each program/library I developed). Back when I was a grad student, I would not even show the results to my supervisor, let alone try to publish them, unless
Re: [Rd] [RFC] A case for freezing CRAN
On Thu, Mar 20, 2014 at 1:28 PM, Ted Byers r.ted.by...@gmail.com wrote:

> Herve Pages mentions the risk of irreproducibility across three minor revisions of version 1.0 of Matrix. My gut reaction would be that if the results are not reproducible across such minor revisions of one library, they are probably just so much BS.

Perhaps this is just terminology, but what you refer to I would generally call 'replication'. Of course being able to replicate results with other data or other software is important for validating claims. But being able to reproduce how the original results were obtained is an important part of this process. If someone is publishing results that I think are questionable and I cannot replicate them, I want to know exactly how those outcomes were obtained in the first place, so that I can 'debug' the problem. It's quite important to be able to trace back whether incorrect results were the result of a bug, incompetence or fraud.

Let's take the example of the Reinhart and Rogoff case. The results obviously were not replicable, but without more information it was just the word of a grad student vs two Harvard professors. Only after reproducing the original analysis was it possible to point out the errors and prove that the original results were incorrect.
Re: [Rd] [RFC] A case for freezing CRAN
That doesn't make sense. If an API changes (e.g. in Matrix) and a program written against the old API can no longer run, that is a very different issue than the same numbers (data) giving different results. The latter is what I am guessing you address; the former is what I believe most people are concerned about here. Or at least I hope that's so. It's more an issue of usability than reproducibility in such a case, as far as I can tell (see e.g. http://liorpachter.wordpress.com/2014/03/18/reproducibility-vs-usability/).

If the same data produces substantially different results (not attributable to e.g. better handling of machine precision and so forth, although that could certainly be a bugaboo in many cases... anyone who has programmed numerical routines in FORTRAN already knows this), then yes, that's a different type of bug. But in order to uncover the latter type of bug, the code has to run in the first place. After a while it becomes rather impenetrable if no thought is given to these changes. So the Bioconductor solution, as Hervé noted, is to have freezes and releases.

There can be old bugs enshrined in people's code due to using old versions, and those can be traced even after many releases have come and gone, because there is a point-in-time snapshot of roughly when these things occurred. As with (say) ANSI C++, deprecation notices stay in place for a year before anything is actually done to remove a function or break an API. It's not impossible; it just requires more discipline than declaring that the same program should be written multiple times on multiple platforms every time. The latter isn't an efficient use of anyone's time.

Most of these analyses are not about putting a man on the moon or making sure a dam does not break. They're relatively low-consequence exploratory sorties. If something comes of them, it would be nice to have a point-in-time reference to check and see whether the original results were hooey.
That's a lot quicker and more efficient than rewriting everything from scratch (which, in some fields, simply ensures things won't get checked). My $0.02, since we do still have those to bedevil cashiers.

"Statistics is the grammar of science." Karl Pearson http://en.wikipedia.org/wiki/The_Grammar_of_Science
Re: [Rd] [RFC] A case for freezing CRAN
On Thu, Mar 20, 2014 at 4:53 PM, Jeroen Ooms jeroen.o...@stat.ucla.edu wrote:

> Perhaps this is just terminology, but what you refer to I would generally call 'replication'. Of course being able to replicate results with other data or other software is important to validate claims. But being able to reproduce how the original results were obtained is an important part of this process.

Fair enough.

> If someone is publishing results that I think are questionable and I cannot replicate them, I want to know exactly how those outcomes were obtained in the first place, so that I can 'debug' the problem. It's quite important to be able to trace back whether incorrect results were the result of a bug, incompetence or fraud.

OK. That is where archives come in. When I had to deal with that sort of thing, I provided copies of both data and code to whoever asked. It ought not be hard for authors to make an archive, e.g. to an optical disk, that includes the software used along with the data, and store it like any other backup, so it can be provided to anyone upon request.

> Let's take the example of the Reinhart and Rogoff case. The results obviously were not replicable, but without more information it was just the word of a grad student vs two Harvard professors. Only after reproducing the original analysis was it possible to point out the errors and prove that the original results were incorrect.

Ok, but if the practice I used were followed, a copy of the optical disk to which everything relevant was stored would solve that problem (and it would be extremely easy for the researcher or his/her supervisor to make).
I once had a reviewer complain that he couldn't reproduce my results, so I sent him my code which, translated into any of the Algol family of languages, would allow him, or anyone else, to replicate my results regardless of their programming language of choice. Once he had my code, he found his error and reported back that he had finally replicated my results. Several of my colleagues used the same practice, with the same consequences: whenever questioned, they just provided their code and related software, and then their results were reproduced. There is nothing like backups with due attention to detail.

Cheers, Ted

-- R.E. (Ted) Byers, Ph.D., Ed.D.
Re: [Rd] [RFC] A case for freezing CRAN
On Thu, Mar 20, 2014 at 5:11 PM, Tim Triche, Jr. tim.tri...@gmail.com wrote:

> That doesn't make sense. If an API changes (e.g. in Matrix) and a program written against the old API can no longer run, that is a very different issue than if the same numbers (data) give different results. The latter is what I am guessing you address. The former is what I believe most people are concerned about here. Or at least I hope that's so.

The problem you describe is the classic case of a failure of backward compatibility. That is completely different from the question of reproducibility or replicability, and since I, among others, noticed the question of reproducibility had arisen, I felt a need to address that first. I do not have a quibble with anything else you wrote (or with anything in this thread related to backward compatibility), and I have enough experience to know both that it is a hard problem and that there are a number of solutions people have used: appropriate management of feature deprecation is one, code freezes are another, and version control is a third. Each option carries its own advantages and disadvantages.
Cheers Ted
Re: [Rd] [RFC] A case for freezing CRAN
> There is nothing like backups with due attention to detail.

Agreed, although given the complexity of dependencies among packages, this might entail several GB of snapshots per paper (if not several TB for some). Anyone who is reasonably prolific then gets the exciting prospect of managing these backups. At least if I grind out a vignette with a bunch of Bioconductor packages and call sessionInfo() at the end, I can find out later on (if, say, things stop working) what the state of the tree was when it last worked, and what might have changed since then.

If a self-contained C++ or FORTRAN program is sufficient to perform an entire analysis, that's awesome, and it ought to be stuffed into revision control (doesn't everyone already do this?). But once you start using tools that depend on other tools, it becomes substantially more difficult to ensure that 1) a comprehensive snapshot is taken, 2) reviewers, possibly on different platforms and/or major versions, can run using that snapshot, and 3) some means of a quick sanity check (does this analysis even return sensible results?) can be run.

Hopefully this is better articulated than my previous missive. I believe we fundamentally agree; some of the particulars may be an issue of notation or typical workflow.

"Statistics is the grammar of science." Karl Pearson http://en.wikipedia.org/wiki/The_Grammar_of_Science
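The sessionInfo() habit Tim describes can be made routine by writing the session state out alongside the results, so the record of package versions is created at the moment the analysis runs. A minimal sketch:

```r
# At the end of an analysis script or vignette, save the exact R version,
# platform, and attached/loaded package versions next to the results.
writeLines(capture.output(sessionInfo()), "sessionInfo.txt")
```

That file pins down what was used; whether those versions can still be obtained later is the archiving problem the rest of the thread is about.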
Re: [Rd] [RFC] A case for freezing CRAN
On Thu, Mar 20, 2014 at 5:27 PM, Tim Triche, Jr. tim.tri...@gmail.com wrote:

> > There is nothing like backups with due attention to detail.
>
> Agreed, although given the complexity of dependencies among packages, this might entail several GB of snapshots per paper (if not several TB for some papers). Anyone who is reasonably prolific then gets the exciting prospect of managing these backups.

Isn't that what support staff is for? ;-) But storage space is cheap, and as tedious as managing backups can be (definitely not fun), it is manageable.

> At least if I grind out a vignette with a bunch of Bioconductor packages and call sessionInfo() at the end, I can find out later on (if, say, things stop working) what the state of the tree was when it last worked, and what might have changed since then. If a self-contained C++ or FORTRAN program is sufficient to perform an entire analysis, that's awesome, and it ought to be stuffed into revision control (doesn't everyone already do this?). But once you start using tools that depend on other tools, it becomes substantially more difficult to ensure that 1) a comprehensive snapshot is taken, 2) reviewers, possibly on different platforms and/or major versions, can run using that snapshot, and 3) some means of a quick sanity check (does this analysis even return sensible results?) can be run. Hopefully this is better articulated than my previous missive.

Tell me about it. Oh, wait, you already did. ;-) I understand this, as I routinely work with complex distributed systems involving multiple programming languages and other diverse tools. But such is part of the overhead of doing quality work.

> I believe we fundamentally agree; some of the particulars may be an issue of notation or typical workflow.
I agree that we fundamentally agree ;-) From my experience, the issues addressed in this thread are probably best handled by the package developers and the authors who use their packages, rather than by imposing additional work on those responsible for CRAN, especially when the means for doing things a little differently than CRAN does are readily available.

Cheers, Ted

R.E. (Ted) Byers, Ph.D., Ed.D.
Re: [Rd] [RFC] A case for freezing CRAN
On 03/20/2014 01:28 PM, Ted Byers wrote:
Re: [Rd] [RFC] A case for freezing CRAN
On 20.03.2014 23:23, Hervé Pagès wrote: [...]
Re: [Rd] [RFC] A case for freezing CRAN
On 03/20/2014 03:29 PM, Uwe Ligges wrote: [...]
Re: [Rd] [RFC] A case for freezing CRAN
Much of the discussion so far has been about reproducibility. Let me emphasize another point from Jeroen's proposal. This is hard to measure of course, but I think I can say that the existence and the quality of CRAN and its packages contributed immensely to the success of R and the success of people using R. Having one central, well controlled and tested package repository is a huge advantage for the users. (I know that there are other repositories, but they are either similarly well controlled and specialized (BioC), or less used.) It would be great to keep it like this. I also think that the current CRAN policy is not ideal for further growth. In particular, updating a package with many reverse dependencies is a frustrating process, for everybody. As a maintainer with ~150 reverse dependencies, I think not twice, but ten times before I really want to publish a new version on CRAN. I cannot speak for other maintainers of course, but I have a feeling that I am not alone. Tying CRAN packages to R releases would help, because then I would not have to worry about breaking packages in the stable version of CRAN, only in CRAN-devel. Somebody mentioned that it is good not to do this because then users get bug fixes and new features earlier. Well, in my case, the opposite is true: because I am not updating, they actually get them (much) later. If it weren't such a hassle, I would definitely update more often, about once a month; now my goal is more like once a year. Again, I cannot speak for others, but I believe the current policy does not help progress, and is not sustainable in the long run. It penalizes the maintainers of the more important packages (i.e. those with many reverse dependencies, which probably also means many users), and I fear they will slowly move away from CRAN. I don't think this is what anybody in the R community would want. Best, Gabor __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
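The reverse-dependency burden described here can be enumerated with tools::package_dependencies(); a minimal sketch using a toy two-package index (the package names are invented; a real index would come from available.packages()):

```r
# Find the reverse dependencies of a package: the packages a maintainer
# would risk breaking by uploading a new version. A real index comes
# from available.packages(); this toy two-package index keeps the
# sketch self-contained.
db <- rbind(
  c(Package = "pkgA", Depends = "R (>= 3.0.0), pkgB", Imports = NA, LinkingTo = NA),
  c(Package = "pkgB", Depends = NA, Imports = NA, LinkingTo = NA)
)
revdeps <- tools::package_dependencies(
  "pkgB",
  db = db,
  which = c("Depends", "Imports", "LinkingTo"),
  reverse = TRUE
)
revdeps[["pkgB"]]  # the packages in the index that depend on pkgB
```

On a real CRAN index, the length of this result is the number of packages that would need re-checking before an upload.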
Re: [Rd] [RFC] A case for freezing CRAN
In particular, updating a package with many reverse dependencies is a frustrating process, for everybody. As a maintainer with ~150 reverse dependencies, I think not twice, but ten times if I really want to publish a new version on CRAN. It might be easier if more of those packages came with good test suites. Bill Dunlap TIBCO Software wdunlap tibco.com -Original Message- From: r-devel-boun...@r-project.org [mailto:r-devel-boun...@r-project.org] On Behalf Of Gábor Csárdi Sent: Thursday, March 20, 2014 6:24 PM To: r-devel Subject: Re: [Rd] [RFC] A case for freezing CRAN [...]
Re: [Rd] [RFC] A case for freezing CRAN
On Thu, Mar 20, 2014 at 9:45 PM, William Dunlap wdun...@tibco.com wrote: It might be easier if more of those packages came with good test suites. Test suites are great, but I don't think this would make my job easier. More tests mean more potential breakage. The extreme of not having any examples and tests in these 150 packages would be the easiest for _me_, actually. Not for the users, though. What would really help is either fully versioned package dependencies (daydreaming here), or having a CRAN-devel repository that changes and might break often, alongside a CRAN-stable that does not change (much). Gabor [...]
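The stable/devel split being daydreamed about here can at least be approximated with R's existing multi-repository support; a hypothetical sketch (both repository URLs are invented):

```r
# Hypothetical: resolve packages against a frozen "stable" snapshot
# first, falling back to a fast-moving "devel" repository. Both URLs
# are invented for illustration; install.packages() consults the
# repositories in the order given.
options(repos = c(
  CRANstable = "https://cran-stable.example.org/3.0",
  CRANdevel  = "https://cran-devel.example.org"
))
getOption("repos")
```

This only relocates the problem, of course: someone still has to maintain and host the frozen snapshot.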
Re: [Rd] [RFC] A case for freezing CRAN
Heh, you just described BioC --t On Mar 20, 2014, at 7:15 PM, Gábor Csárdi csardi.ga...@gmail.com wrote: [...]
Re: [Rd] [RFC] A case for freezing CRAN
Except that tests (as vignettes) are mandatory for BioC. So if something blows up you hear about it right quick :-) --t On Mar 20, 2014, at 7:15 PM, Gábor Csárdi csardi.ga...@gmail.com wrote: [...]
Re: [Rd] [RFC] A case for freezing CRAN
- Original Message - From: Gábor Csárdi csardi.ga...@gmail.com To: r-devel r-devel@r-project.org Sent: Thursday, March 20, 2014 6:23:33 PM Subject: Re: [Rd] [RFC] A case for freezing CRAN [...] These are good points. Not only do maintainers think twice (or more) before updating packages, but there also seem to be CRAN policies that discourage frequent updates. Bioconductor, by contrast, welcomes frequent updates because they usually fix problems and help us understand interoperability/dependency issues. Probably the main reason for this difference is the existence of a devel branch, where breakage can happen and it's not the end of the world.
Re: [Rd] [RFC] A case for freezing CRAN
To me it boils down to one simple question: is an update to a package on CRAN more likely to (1) fix a bug, (2) introduce a bug or downward incompatibility, or (3) add a new feature or fix a compatibility problem without introducing a bug? I think the probability of (1) | (3) is much greater than the probability of (2), hence the current approach maximizes user benefit. Frank -- Frank E Harrell Jr Professor and Chairman School of Medicine Department of Biostatistics Vanderbilt University __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] [RFC] A case for freezing CRAN
On Tue, Mar 18, 2014 at 3:24 PM, Jeroen Ooms jeroen.o...@stat.ucla.edu wrote: snip ## Summary Extending the r-release cycle to CRAN seems like a solution that would be easy to implement. Package updates simply only get pushed to the r-devel branches of cran, rather than r-release and r-release-old. This separates development from production/use in a way that is common sense in most open source communities. Benefits for R include: Nothing is ever as simple as it seems (especially from the perspective of one who won't be doing the work). There is nothing preventing you (or anyone else) from creating repositories that do what you suggest. Create a CRAN mirror (or more than one) that includes only the package versions you think it should. Then have your production servers use it (them) instead of CRAN. Better yet, make those repositories public. If many people like your idea, they will use your new repositories instead of CRAN. There is no reason to impose this change on all world-wide CRAN users. Best, -- Joshua Ulrich | about.me/joshuaulrich FOSS Trading | www.fosstrading.com __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
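For production servers, the switch described here is a one-line configuration; a hypothetical Rprofile.site fragment (the mirror URL is invented):

```r
# Hypothetical site-wide default for a production server: every R
# session installs from a frozen in-house mirror instead of CRAN.
# The URL is invented for illustration.
local({
  r <- getOption("repos")
  r["CRAN"] <- "https://cran-frozen.example.org/2014-03-01"
  options(repos = r)
})
getOption("repos")[["CRAN"]]
```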
Re: [Rd] [RFC] A case for freezing CRAN
I don't see why CRAN needs to be involved in this effort at all. A third party could take snapshots of CRAN at R release dates, and make those available to package users in a separate repository. It is not hard to set a repository other than CRAN as the default location from which to obtain packages. The only objection I can see to this is that it requires extra work by the third party, rather than extra work by the CRAN team. I don't think the total amount of work required is much different. I'm very unsympathetic to proposals to dump work on others. Duncan Murdoch On 18/03/2014 4:24 PM, Jeroen Ooms wrote: This came up again recently with an irreproducible paper. Below is an attempt to make a case for extending the r-devel/r-release cycle to CRAN packages. These suggestions are not in any way intended as criticism of anyone or of the status quo. The proposal described in [1] is to freeze a snapshot of CRAN along with every release of R. In this design, updates for contributed packages are treated the same as updates for base packages, in the sense that they are only published to the r-devel branch of CRAN and do not affect users of released versions of R. Thereby all users, stacks and applications using a particular version of R will by default be using the identical version of each CRAN package. The bioconductor project uses similar policies. This system has several important advantages: ## Reproducibility Currently r/sweave/knitr scripts are unstable because of ambiguity introduced by constantly changing CRAN packages. This causes scripts to break or change behavior when upstream packages are updated, which makes reproducing old results extremely difficult. A common counter-argument is that script authors should document the package versions used in the script using sessionInfo().
However, even if authors would manually do this, reconstructing the author's environment from this information is cumbersome and often nearly impossible, because binary packages might no longer be available, dependencies might conflict, etc. See [1] for a worked example. In practice, the current system causes many results or documents generated with R not to be reproducible, sometimes already after a few months. In a system where contributed packages inherit the r-base release cycle, scripts will behave the same across users/systems/time within a given version of R. This severely reduces ambiguity of R behavior, and has the potential of making reproducibility a natural part of the language, rather than a tedious exercise. ## Repository Management Just like scripts suffer from upstream changes, so do packages depending on other packages. A particular package that has been developed and tested against the current version of a particular dependency is not guaranteed to work against *any future version* of that dependency. Therefore, packages inevitably break over time as their dependencies are updated. One recent example is the Rcpp 0.11 release, which required all reverse dependencies to be rebuilt/modified. This update caused some serious disruption on our production servers. Initially we refrained from updating Rcpp on these servers, to prevent currently installed packages depending on Rcpp from breaking. However, soon after the Rcpp 0.11 release, many other CRAN packages started to require Rcpp (>= 0.11), and our users started complaining about not being able to install those packages. This resulted in the impossible situation where currently installed packages would not work with the new Rcpp, but newly installed packages would not work with the old Rcpp. Current CRAN policies blame this problem on package authors.
However, as is explained in [1], this policy does not solve anything, is unsustainable with growing repository size, and sets completely the wrong incentives for contributing code. Progress comes with breaking changes, and the system should be able to accommodate this. Much of the trouble could have been prevented by a system that does not push bleeding-edge updates straight to end users, but has a devel branch where conflicts are resolved before publishing them in the next r-release. ## Reliability Another example, this time on a very small scale. We recently discovered that R code plotting medal counts from the Sochi Olympics generated different results for users on OSX than it did on Linux/Windows. After some debugging, we narrowed it down to the XML package. The application used the following code to scrape results from the Sochi website: XML::readHTMLTable("http://www.sochi2014.com/en/speed-skating", which=2, skip=1). This code was developed and tested on mac, but results in a different winner on windows/linux. This happens because the current version of the XML package on CRAN is 3.98, but the latest mac binary is 3.95. Apparently this new version of XML introduces a tiny change that causes html-table-headers to become colnames, rather than a row in the matrix, resulting in different medal counts.
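The sessionInfo() bookkeeping that this proposal argues is insufficient can at least be automated at the end of every script; a minimal sketch (the output filename is arbitrary):

```r
# Record the exact R version and package versions a script ran
# against, giving later readers at least a starting point for
# reconstructing the environment.
path <- file.path(tempdir(), "sessionInfo.txt")
writeLines(capture.output(print(sessionInfo())), path)
readLines(path, n = 1)  # first line names the R version used
```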
Re: [Rd] [RFC] A case for freezing CRAN
Our experience in Bioconductor is that this is a pretty hard problem. What the OP presumably wants is some guarantee that all packages on CRAN work well together. A good example is when Rcpp was updated: it broke other packages (quick note: the Rcpp developers do an incredible amount of work to deal with this; it is almost impossible not to have a few days of chaos). Ensuring this is not a trivial task, and it requires some buy-in both from the repository and from the developers. For Bioconductor it is even harder, as the dependency graph of Bioconductor is much more involved than the one for CRAN, where most packages depend only on a few other packages. This is why we need to do this for Bioc. Based on my experience with CRAN I am not sure I see a need for a coordinated release (or rather, I can sympathize with the need, but I don't think the effort is worth it). What would be more useful in terms of reproducibility is the capability of installing a specific version of a package from a repository using install.packages(), which would require archiving older versions in a coordinated fashion. I know CRAN archives old versions, but I am not aware that we can programmatically query the repository about this. Best, Kasper On Wed, Mar 19, 2014 at 8:52 AM, Joshua Ulrich josh.m.ulr...@gmail.com wrote: [...]
Re: [Rd] [RFC] A case for freezing CRAN
Piling on: On 19 March 2014 at 07:52, Joshua Ulrich wrote: | There is nothing preventing you (or anyone else) from creating | repositories that do what you suggest. Create a CRAN mirror (or more | than one) that only include the package versions you think they | should. Then have your production servers use it (them) instead of | CRAN. | | Better yet, make those repositories public. If many people like your | idea, they will use your new repositories instead of CRAN. There is | no reason to impose this change on all world-wide CRAN users. On 19 March 2014 at 08:52, Duncan Murdoch wrote: | I don't see why CRAN needs to be involved in this effort at all. A | third party could take snapshots of CRAN at R release dates, and make | those available to package users in a separate repository. It is not | hard to set a different repository than CRAN as the default location | from which to obtain packages. | | The only objection I can see to this is that it requires extra work by | the third party, rather than extra work by the CRAN team. I don't think | the total amount of work required is much different. I'm very | unsympathetic to proposals to dump work on others. And to a first approximation some of those efforts already exist: -- 200+ r-cran-* packages in Debian proper -- 2000+ r-cran-* packages in Michael's c2d4u (via launchpad) -- 5000+ r-cran-* packages in Don's debian-r.debian.net The only difference here is that Jeroen wants to organize source packages. But that is just a matter of stacking them in directory trees and calling setwd("/path/to/root/of/your/repo/version"); tools::write_PACKAGES(".", type = "source") to create PACKAGES and PACKAGES.gz. Dirk -- Dirk Eddelbuettel | e...@debian.org | http://dirk.eddelbuettel.com __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
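This recipe can be spelled out a little further; a sketch assuming the frozen source tarballs have already been collected (the paths are illustrative):

```r
# Lay out a versioned source repository and index it so that
# install.packages() can treat the tree as a repository. A temporary
# directory stands in for the real tree here.
repo_root <- file.path(tempdir(), "snapshot-3.0")
contrib   <- file.path(repo_root, "src", "contrib")
dir.create(contrib, recursive = TRUE, showWarnings = FALSE)
# ... copy the frozen *.tar.gz source packages into `contrib` here ...
tools::write_PACKAGES(contrib, type = "source")  # writes the PACKAGES index
# A client would then point install.packages() at the snapshot:
# install.packages("XML", repos = paste0("file://", repo_root))
```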
Re: [Rd] [RFC] A case for freezing CRAN
What would be more useful in terms of reproducibility is the capability of installing a specific version of a package from a repository using install.packages(), which would require archiving older versions in a coordinated fashion. I know CRAN archives old versions, but I am not aware if we can programmatically query the repository about this. See devtools::install_version(). The main caveat is that you also need to be able to build the package, and ensure you have dependencies that work with that version. Hadley -- http://had.co.nz/ __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
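Usage is a one-liner, subject to the caveats noted above; the package name and version are illustrative, and the call needs network access plus a build toolchain, since the archived version is compiled from source:

```r
# Install a specific archived version of a package from CRAN's
# archive area via devtools. Package name and version are
# illustrative; dependencies are not pinned to matching versions.
devtools::install_version("XML", version = "3.95-0.2",
                          repos = "http://cran.r-project.org")
```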
Re: [Rd] [RFC] A case for freezing CRAN
using the identical version of each CRAN package. The bioconductor project uses similar policies. While I agree that this can be an issue, I don't think it is fair to compare CRAN to BioC. Unless things have changed, the latter has a more rigorous barrier to entry, which includes buy-in to various ideals (e.g. interoperability with other BioC packages, making use of BioC constructs, the official release cycle). All of that requires extra management overhead (read: human effort), which, considering that CRAN isn't exactly swimming in spare cycles, seems unlikely to happen. It seems like one could set up a curated CRAN-a-like quite easily, advertise the heck out of it, and let the market decide. That is, IMO, the beauty of open source. -J __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] [RFC] A case for freezing CRAN
What about having this purpose met with something like an expansion of R-Forge? We could have packages submitted to R-Forge rather than CRAN, and people who wanted the latest could get it from R-Forge. If changes I make on R-Forge break a reverse dependency, emails explaining the problem are sent to both me and the maintainer of the package I broke. The budget for R-Forge would almost certainly need to be increased: they currently disable many of the tests they once ran. Regarding budget, the R Project would get more donations if they asked for them and made it easier to contribute. I've tried multiple times without success to find a way to donate. I didn't try hard, but it shouldn't be hard ;-) (And donations should be accepted in US dollars and Euros -- and maybe other currencies.) There should be a procedure whereby anyone could receive a pro forma invoice, which they can pay or ignore as they choose. I mention this because many grants could cover a reasonable fee provided they have an invoice. Spencer Graves On 3/19/2014 10:59 AM, Jeroen Ooms wrote: On Wed, Mar 19, 2014 at 5:52 AM, Duncan Murdoch murdoch.dun...@gmail.com wrote: I don't see why CRAN needs to be involved in this effort at all. [...] I am happy to see many people giving this some thought and engaging in the discussion. Several have suggested that staging and freezing can simply be done by a third party. This solution and its limitations are also described in the paper [1], in the section titled "R: downstream staging and repackaging". If this solved the problem without affecting CRAN, we would obviously be done.
In fact, as described in the paper and pointed out by some people, initiatives such as Debian or Revolution Enterprise already include a frozen library of R packages. Also, companies like Google maintain their own internal repository with packages that are used throughout the company. The problem with this approach is that when you are using some 3rd party package snapshot, your r/sweave scripts will still only be reliable/reproducible for other users of that specific snapshot. E.g. for the examples above, a script that is written in R 3.0 by a Debian user is not guaranteed to work on R 3.0 in Google, or R 3.0 on some other 3rd party cran snapshot. Hence this solution merely redefines the problem from this script depends on pkgA 1.1 and pkgB 0.2.3 to this script depends on repository foo 2.0. And given that most users would still be pulling packages straight from CRAN, it would still be terribly difficult to reproduce a 5-year-old sweave script from e.g. JSS. For this reason I believe the only effective place to organize this staging is all the way upstream, on CRAN. Imagine a world where your r/sweave script would be reliable/reproducible, out of the box, on any system, any platform, in any company using R 3.0. No need to investigate which specific packages or cran snapshot the author was using at the time of writing the script, and no need to try to reconstruct such libraries for each script you want to reproduce. No ambiguity about which package versions are used by R 3.0. However, for better or worse, I think this could only be accomplished with a cran release cycle (i.e. universal snapshots) accompanying the already existing r releases. The only objection I can see to this is that it requires extra work by the third party, rather than extra work by the CRAN team. I don't think the total amount of work required is much different. I'm very unsympathetic to proposals to dump work on others.
I am merely trying to discuss a technical issue in an attempt to improve reliability of our software and reproducibility of papers created with R.
Re: [Rd] [RFC] A case for freezing CRAN
On Wed, Mar 19, 2014 at 12:59 PM, Jeroen Ooms jeroen.o...@stat.ucla.edu wrote: On Wed, Mar 19, 2014 at 5:52 AM, Duncan Murdoch murdoch.dun...@gmail.com wrote: I don't see why CRAN needs to be involved in this effort at all. A third party could take snapshots of CRAN at R release dates, and make those available to package users in a separate repository. It is not hard to set a different repository than CRAN as the default location from which to obtain packages. I am happy to see many people giving this some thought and engaging in the discussion. Several have suggested that staging/freezing can simply be done by a third party. This solution and its limitations are also described in the paper [1] in the section titled R: downstream staging and repackaging. If this would solve the problem without affecting CRAN, we would obviously have done this already. In fact, as described in the paper and pointed out by some people, initiatives such as Debian or Revolution Enterprise already include a frozen library of R packages. Also, companies like Google maintain their own internal repository with packages that are used throughout the company. The suggested solution is not described in the referenced article. It was not suggested that it be the operating system's responsibility to distribute snapshots, nor was it suggested to create binary repositories for specific operating systems, nor was it suggested to freeze only a subset of CRAN packages. The problem with this approach is that when you are using some 3rd party package snapshot, your r/sweave scripts will still only be reliable/reproducible for other users of that specific snapshot. E.g. for the examples above, a script that is written in R 3.0 by a Debian user is not guaranteed to work on R 3.0 in Google, or R 3.0 on some other 3rd party cran snapshot. Hence this solution merely redefines the problem from this script depends on pkgA 1.1 and pkgB 0.2.3 to this script depends on repository foo 2.0.
And given that most users would still be pulling packages straight from CRAN, it would still be terribly difficult to reproduce a 5-year-old sweave script from e.g. JSS. This can be solved by the third party making the repository public. For this reason I believe the only effective place to organize this staging is all the way upstream, on CRAN. Imagine a world where your r/sweave script would be reliable/reproducible, out of the box, on any system, any platform, in any company using R 3.0. No need to investigate which specific packages or cran snapshot the author was using at the time of writing the script, and no need to try to reconstruct such libraries for each script you want to reproduce. No ambiguity about which package versions are used by R 3.0. However, for better or worse, I think this could only be accomplished with a cran release cycle (i.e. universal snapshots) accompanying the already existing r releases. This could be done by a public third-party repository, independent of CRAN. However, you would need to find a way to actively _prevent_ people from installing newer versions of packages with the stable R releases. -- Joshua Ulrich | about.me/joshuaulrich FOSS Trading | www.fosstrading.com
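The repository switch that Duncan and Joshua refer to is indeed a one-liner in R. A minimal sketch of pointing a session at a frozen third-party CRAN-a-like (the snapshot URL is hypothetical, made up for illustration):

```r
## Point this session at a hypothetical frozen snapshot repository
## instead of the live CRAN. Any CRAN-a-like serving a fixed
## src/contrib tree would work the same way.
options(repos = c(SNAPSHOT = "https://example.org/cran-snapshot/3.0"))

## Every subsequent install now resolves against the frozen snapshot,
## so all users of this repository get identical package versions.
install.packages("ggplot2")
```

As Jeroen notes above, the catch is not the mechanism but the convention: scripts are only reproducible among users of that particular snapshot.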
Re: [Rd] [RFC] A case for freezing CRAN
Dear list, I'm curious what people would think of a more modest proposal at this time: State the version of the dependencies used by the package authors when the package was built. Eventually CRAN could enforce such a statement be present in the DESCRIPTION. We encourage users to declare the version of the packages they use in publications, so why not have the same expectation of developers? This would help address the problem of archived packages that Jeroen raises, as it is currently impossible to reliably install archived packages because their dependencies have since been updated and are no longer compatible. (Even if it passes checks and installs, we have no way of knowing if the upstream changes have introduced a bug). This information would be relatively straightforward to capture, shouldn't change the way anyone currently uses CRAN, and should address a major pain point anyone trying to install archived versions from CRAN has probably encountered. What am I overlooking? Carl On Wed, Mar 19, 2014 at 11:36 AM, Spencer Graves spencer.gra...@structuremonitoring.com wrote: What about having this purpose met with something like an expansion of R-Forge? We could have packages submitted to R-Forge rather than CRAN, and people who wanted the latest could get it from R-Forge. If changes I make on R-Forge break a reverse dependency, emails explaining the problem are sent to both me and the maintainer for the package I broke. The budget for R-Forge would almost certainly need to be increased: They currently disable many of the tests they once ran. Regarding budget, the R Project would get more donations if they asked for them and made it easier to contribute. I've tried multiple times without success to find a way to donate. I didn't try hard, but it shouldn't be hard ;-) (And donations should be accepted in US dollars and Euros -- and maybe other currencies.)
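Carl's proposal, recording exact dependency versions at build time, can be approximated with base R tooling alone. A sketch (the output format is illustrative, not an existing CRAN convention):

```r
## Record the versions of all packages loaded in the current session,
## e.g. at package build or Sweave time, in a Depends-like notation.
## The "pkg (== version)" format here is illustrative only.
deps <- setdiff(loadedNamespaces(), "base")
versions <- vapply(deps,
                   function(p) as.character(packageVersion(p)),
                   character(1))
writeLines(sprintf("%s (== %s)", deps, versions))
```

Captured once per release, such a statement would give future users a concrete target library to reconstruct, even when the live dependencies have moved on.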
There should be a procedure whereby anyone could receive a pro forma invoice, which they can pay or ignore as they choose. I mention this because many grants could cover a reasonable fee provided they have an invoice. Spencer Graves On 3/19/2014 10:59 AM, Jeroen Ooms wrote: On Wed, Mar 19, 2014 at 5:52 AM, Duncan Murdoch murdoch.dun...@gmail.com wrote: I don't see why CRAN needs to be involved in this effort at all. A third party could take snapshots of CRAN at R release dates, and make those available to package users in a separate repository. It is not hard to set a different repository than CRAN as the default location from which to obtain packages. I am happy to see many people giving this some thought and engaging in the discussion. Several have suggested that staging/freezing can simply be done by a third party. This solution and its limitations are also described in the paper [1] in the section titled R: downstream staging and repackaging. If this would solve the problem without affecting CRAN, we would obviously have done this already. In fact, as described in the paper and pointed out by some people, initiatives such as Debian or Revolution Enterprise already include a frozen library of R packages. Also, companies like Google maintain their own internal repository with packages that are used throughout the company. The problem with this approach is that when you are using some 3rd party package snapshot, your r/sweave scripts will still only be reliable/reproducible for other users of that specific snapshot. E.g. for the examples above, a script that is written in R 3.0 by a Debian user is not guaranteed to work on R 3.0 in Google, or R 3.0 on some other 3rd party cran snapshot. Hence this solution merely redefines the problem from this script depends on pkgA 1.1 and pkgB 0.2.3 to this script depends on repository foo 2.0.
And given that most users would still be pulling packages straight from CRAN, it would still be terribly difficult to reproduce a 5-year-old sweave script from e.g. JSS. For this reason I believe the only effective place to organize this staging is all the way upstream, on CRAN. Imagine a world where your r/sweave script would be reliable/reproducible, out of the box, on any system, any platform, in any company using R 3.0. No need to investigate which specific packages or cran snapshot the author was using at the time of writing the script, and no need to try to reconstruct such libraries for each script you want to reproduce. No ambiguity about which package versions are used by R 3.0. However, for better or worse, I think this could only be accomplished with a cran release cycle (i.e. universal snapshots) accompanying the already existing r releases. The only objection I can see to this is that it requires extra work by the third party, rather than extra work by the CRAN team. I don't think the total amount of work required is much different. I'm very unsympathetic to proposals to dump work on others.
Re: [Rd] [RFC] A case for freezing CRAN
On Wed, Mar 19, 2014 at 7:00 AM, Kasper Daniel Hansen kasperdanielhan...@gmail.com wrote: Our experience in Bioconductor is that this is a pretty hard problem. What the OP presumably wants is some guarantee that all packages on CRAN work well together. Obviously we cannot guarantee that all packages on CRAN work together. But what we can do is prevent problems that are introduced by version ambiguity. If an author develops and tests a script/package with dependency Rcpp 0.10.6, the best chance of making that script or package work for other users is using Rcpp 0.10.6. This especially holds if there is a big time difference between the author creating the pkg/script and someone using it. In practice, most Sweave/knitr scripts used for generating papers and articles cannot be reproduced after a while because the dependency packages have changed in the meantime. These problems can largely be mitigated with a release cycle. I am not arguing that anyone should put manual effort into testing that packages work together. On the contrary: a system that separates development from released branches prevents you from having to continuously test all reverse dependencies for every package update. My argument is simply that many problems introduced by version ambiguity can be prevented if we can unite the entire R community around using a single version of each CRAN package for every specific release of R. Similar to how Linux distributions use a single version of each software package in a particular release of the distribution.
Re: [Rd] [RFC] A case for freezing CRAN
Hi, On 03/19/2014 07:00 AM, Kasper Daniel Hansen wrote: Our experience in Bioconductor is that this is a pretty hard problem. What's hard and requires a substantial amount of human resources is to run our build system (set up the build machines, keep up with changes in R, babysit the builds, assist developers with build issues, etc.) But *freezing* the CRAN packages for each version of R is *very* easy to do. The CRAN maintainers already do it for the binary packages. What could be the reason for not doing it for source packages too? Maybe in prehistoric times there was this belief that a source package was aimed to remain compatible with all versions of R, present and future, but that dream is dead and gone... Right now the layout of the CRAN package repo is:

├── src
│   └── contrib
└── bin
    ├── windows
    │   └── contrib
    │       ├── ...
    │       ├── 3.0
    │       ├── 3.1
    │       └── ...
    └── macosx
        └── contrib
            ├── ...
            ├── 3.0
            ├── 3.1
            └── ...

when it could be:

├── 3.0
│   ├── src
│   │   └── contrib
│   └── bin
│       ├── windows
│       │   └── contrib
│       └── macosx
│           └── contrib
├── 3.1
│   ├── src
│   │   └── contrib
│   └── bin
│       ├── windows
│       │   └── contrib
│       └── macosx
│           └── contrib
├── ...

That is: the split by version is done at the top, not at the bottom. It doesn't use more disk space than the current layout (you can just throw the src/contrib/Archive/ folder away, there is no more need for it). install.packages() and family would need to be modified a little bit to work with this new layout. And that's all! The never-ending changes in Mac OS X binary formats can be handled in a cleaner way, i.e. no more symlinks under bin/macosx to keep backward compatibility with different binary formats and with old versions of install.packages(). Then 10 years from now, you can reproduce an analysis that you did today with R-3.0. Because when you install R-3.0 and the packages required for this analysis, you'll end up with exactly the same packages as today. Cheers, H.
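The install.packages() change Hervé alludes to mostly amounts to building the contrib URL with the R version at the top of the path instead of the bottom. A hypothetical helper, sketched under Hervé's proposed layout (this is not an existing API; today's utils::contrib.url() works the other way around):

```r
## Hypothetical: resolve a repository URL under the proposed
## version-at-the-top layout, e.g. <repo>/3.0/src/contrib.
versioned_contrib_url <- function(repo, type = "source") {
  ## "3.0.3" -> "3.0": packages are frozen per minor R release
  rver <- paste(R.version$major,
                sub("[.].*", "", R.version$minor), sep = ".")
  path <- switch(type,
                 source     = "src/contrib",
                 win.binary = "bin/windows/contrib",
                 mac.binary = "bin/macosx/contrib")
  paste(repo, rver, path, sep = "/")
}

versioned_contrib_url("https://cran.r-project.org")
## under an R 3.0.x session this yields ".../3.0/src/contrib"
```

Since every session derives the same path from its own R version, no user-visible option changes would be needed to get the frozen tree.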
What the OP presumably wants is some guarantee that all packages on CRAN work well together. A good example is when Rcpp was updated, it broke other packages (quick note: the Rcpp developers do an incredible amount of work to deal with this; it is almost impossible to not have a few days of chaos). Ensuring this is not a trivial task, and it requires some buy-in both from the repository and from the developers. For Bioconductor it is even harder, as the dependency graph of Bioconductor is much more involved than the one for CRAN, where most packages depend only on a few other packages. This is why we need to do this for Bioc. Based on my experience with CRAN I am not sure I see a need for a coordinated release (or rather, I can sympathize with the need, but I don't think the effort is worth it). What would be more useful in terms of reproducibility is the capability of installing a specific version of a package from a repository using install.packages(), which would require archiving older versions in a coordinated fashion. I know CRAN archives old versions, but I am not aware whether we can programmatically query the repository about this. Best, Kasper On Wed, Mar 19, 2014 at 8:52 AM, Joshua Ulrich josh.m.ulr...@gmail.com wrote: On Tue, Mar 18, 2014 at 3:24 PM, Jeroen Ooms jeroen.o...@stat.ucla.edu wrote: snip ## Summary Extending the r-release cycle to CRAN seems like a solution that would be easy to implement. Package updates simply only get pushed to the r-devel branches of cran, rather than r-release and r-release-old. This separates development from production/use in a way that is common sense in most open source communities. Benefits for R include: Nothing is ever as simple as it seems (especially from the perspective of one who won't be doing the work). There is nothing preventing you (or anyone else) from creating repositories that do what you suggest. Create a CRAN mirror (or more than one) that only include the package versions you think they should.
Then have your production servers use it (them) instead of CRAN. Better yet, make those repositories public. If many people like your idea, they will use your new repositories instead of CRAN. There is no reason to impose this change on all world-wide CRAN users. Best, -- Joshua Ulrich | about.me/joshuaulrich FOSS Trading | www.fosstrading.com -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100
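The version-specific installation Kasper wishes for is available today outside base R, shown here as an illustration rather than as something base install.packages() supports:

```r
## The 'remotes' package (successor to this functionality in
## 'devtools') can install a pinned version, falling back to the
## CRAN Archive when that version is no longer current.
## The package/version pair is chosen purely for illustration.
# install.packages("remotes")
remotes::install_version("Rcpp", version = "0.10.6",
                         repos = "https://cran.r-project.org")
```

This answers the programmatic-query part of Kasper's question as well: the function works precisely because CRAN's Archive area has a predictable layout.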
Re: [Rd] [RFC] A case for freezing CRAN
On Wed, Mar 19, 2014 at 11:50 AM, Joshua Ulrich josh.m.ulr...@gmail.com wrote: The suggested solution is not described in the referenced article. It was not suggested that it be the operating system's responsibility to distribute snapshots, nor was it suggested to create binary repositories for specific operating systems, nor was it suggested to freeze only a subset of CRAN packages. IMO this is an implementation detail. If we could all agree on a particular set of cran packages to be used with a certain release of R, then it doesn't matter how the 'snapshotting' gets implemented. It could be a separate repository, or a directory on cran with symbolic links, or a page somewhere with hyperlinks to the respective source packages. Or you can put all packages in a big zip file, or include it in your OS distribution. You can even distribute your entire repo on cdroms (debian style!) or do all of the above. The hard problem is not implementation. The hard part is that for reproducibility to work, we need community-wide conventions on which versions of cran packages are used by a particular release of R. Local downstream solutions are impractical, because this results in scripts/packages that only work within your niche using this particular snapshot. I expect that requiring every script to be executed in the context of dependencies from some particular third-party repository will make reproducibility even less common. Therefore I am trying to make a case for a solution that would naturally improve reliability/reproducibility of R code without any effort by the end-user.
Re: [Rd] [RFC] A case for freezing CRAN
On Wed, Mar 19, 2014 at 4:28 PM, Jeroen Ooms jeroen.o...@stat.ucla.edu wrote: On Wed, Mar 19, 2014 at 11:50 AM, Joshua Ulrich josh.m.ulr...@gmail.com wrote: The suggested solution is not described in the referenced article. It was not suggested that it be the operating system's responsibility to distribute snapshots, nor was it suggested to create binary repositories for specific operating systems, nor was it suggested to freeze only a subset of CRAN packages. IMO this is an implementation detail. If we could all agree on a particular set of cran packages to be used with a certain release of R, then it doesn't matter how the 'snapshotting' gets implemented. It could be a separate repository, or a directory on cran with symbolic links, or a page somewhere with hyperlinks to the respective source packages. Or you can put all packages in a big zip file, or include it in your OS distribution. You can even distribute your entire repo on cdroms (debian style!) or do all of the above. The hard problem is not implementation. The hard part is that for reproducibility to work, we need community wide conventions on which versions of cran packages are used by a particular release of R. Local downstream solutions are impractical, because this results in scripts/packages that only work within your niche using this particular snapshot. I expect that requiring every script be executed in the context of dependencies from some particular third party repository will make reproducibility even less common. Therefore I am trying to make a case for a solution that would naturally improve reliability/reproducibility of R code without any effort by the end-user. So implementation isn't a problem. The problem is that you need a way to force people not to be able to use different package versions than what existed at the time of each R release. 
I said this in my previous email, but you removed and did not address it: However, you would need to find a way to actively _prevent_ people from installing newer versions of packages with the stable R releases. Frankly, I would stop using CRAN if this policy were adopted. I suggest you go build this yourself. You have all the code available on CRAN, and the dates at which each package was published. If others who care about reproducible research find what you've built useful, you will create the very community you want. And you won't have to force one single person to change their workflow. Best, -- Joshua Ulrich | about.me/joshuaulrich FOSS Trading | www.fosstrading.com
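The raw material Joshua mentions is indeed public: every superseded release of a package lives under CRAN's src/contrib/Archive area as <pkg>/<pkg>_<version>.tar.gz. A do-it-yourself install of a pinned version, with the package and version chosen purely for illustration:

```r
## Fetch and install one specific archived release straight from
## CRAN's Archive area (layout: Archive/<pkg>/<pkg>_<version>.tar.gz).
pkg <- "Rcpp"; ver <- "0.10.6"   # illustrative pin
url <- sprintf(
  "https://cran.r-project.org/src/contrib/Archive/%s/%s_%s.tar.gz",
  pkg, pkg, ver)

download.file(url, destfile = basename(url))
install.packages(basename(url), repos = NULL, type = "source")
```

Note this resolves only the one package; its dependencies would still install at their current versions unless pinned the same way, which is exactly the coordination problem being debated.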
Re: [Rd] [RFC] A case for freezing CRAN
- Original Message - From: Joshua Ulrich josh.m.ulr...@gmail.com To: Jeroen Ooms jeroen.o...@stat.ucla.edu Cc: r-devel r-devel@r-project.org Sent: Wednesday, March 19, 2014 2:59:53 PM Subject: Re: [Rd] [RFC] A case for freezing CRAN On Wed, Mar 19, 2014 at 4:28 PM, Jeroen Ooms jeroen.o...@stat.ucla.edu wrote: On Wed, Mar 19, 2014 at 11:50 AM, Joshua Ulrich josh.m.ulr...@gmail.com wrote: The suggested solution is not described in the referenced article. It was not suggested that it be the operating system's responsibility to distribute snapshots, nor was it suggested to create binary repositories for specific operating systems, nor was it suggested to freeze only a subset of CRAN packages. IMO this is an implementation detail. If we could all agree on a particular set of cran packages to be used with a certain release of R, then it doesn't matter how the 'snapshotting' gets implemented. It could be a separate repository, or a directory on cran with symbolic links, or a page somewhere with hyperlinks to the respective source packages. Or you can put all packages in a big zip file, or include it in your OS distribution. You can even distribute your entire repo on cdroms (debian style!) or do all of the above. The hard problem is not implementation. The hard part is that for reproducibility to work, we need community wide conventions on which versions of cran packages are used by a particular release of R. Local downstream solutions are impractical, because this results in scripts/packages that only work within your niche using this particular snapshot. I expect that requiring every script be executed in the context of dependencies from some particular third party repository will make reproducibility even less common. Therefore I am trying to make a case for a solution that would naturally improve reliability/reproducibility of R code without any effort by the end-user. So implementation isn't a problem. 
The problem is that you need a way to force people not to be able to use different package versions than what existed at the time of each R release. I said this in my previous email, but you removed and did not address it: However, you would need to find a way to actively _prevent_ people from installing newer versions of packages with the stable R releases. Frankly, I would stop using CRAN if this policy were adopted. I don't see how the proposal forces anyone to do anything. If you have an old version of R and you still want to install newer versions of packages, you can download them from their CRAN landing page. As I understand it, the proposal only addresses what packages would be installed **by default** for a given version of R. People would be free to override those default settings (by downloading newer packages as described above), but they should then not expect to be able to reproduce an earlier analysis since they'll have the wrong package versions. If they don't care, that's fine (provided that no other problems arise, such as the newer package depending on a feature of R that doesn't exist in the version you're running). Dan I suggest you go build this yourself. You have all the code available on CRAN, and the dates at which each package was published. If others who care about reproducible research find what you've built useful, you will create the very community you want. And you won't have to force one single person to change their workflow. Best, -- Joshua Ulrich | about.me/joshuaulrich FOSS Trading | www.fosstrading.com
Re: [Rd] [RFC] A case for freezing CRAN
On Wed, Mar 19, 2014 at 2:59 PM, Joshua Ulrich josh.m.ulr...@gmail.com wrote: So implementation isn't a problem. The problem is that you need a way to force people not to be able to use different package versions than what existed at the time of each R release. I said this in my previous email, but you removed and did not address it: However, you would need to find a way to actively _prevent_ people from installing newer versions of packages with the stable R releases. Frankly, I would stop using CRAN if this policy were adopted. I am not proposing to force anything on anyone; those are your words. Please read the proposal more carefully before derailing the discussion. Below, *verbatim*, a section from the paper: To fully make the transition to a staged CRAN, the default behavior of the package manager must be modified to download packages from the stable branch of the current version of R, rather than the latest development release. As such, all users on a given version of R will be using the same version of each CRAN package, regardless of when it was installed. The user could still be given an option to try and install the development version from the unstable branch, for example by adding an additional parameter to install.packages named devel=TRUE. However when installing an unstable package, it must be flagged, and the user must be warned that this version is not properly tested and might not work as expected. Furthermore, when loading this package a warning could be shown with the version number so that it is also obvious from the output that results were produced using a non-standard version of the contributed package. Finally, users that would always like to use the very latest versions of all packages, e.g. developers, could install the r-devel release of R. This version contains the latest commits by R Core and downloads packages from the devel branch on CRAN, but should not be used in production or reproducible research settings.
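The devel=TRUE behavior the quoted paper describes could be prototyped as a thin wrapper around install.packages(). Everything below is hypothetical, including the branch layout of the repository URL:

```r
## Hypothetical sketch of the paper's proposal: default to a stable
## branch matching the running R release, warn when devel is requested.
## Neither the devel parameter nor the URL layout exists on CRAN today.
install_staged <- function(pkgs, devel = FALSE, ...) {
  rver <- paste(R.version$major,
                sub("[.].*", "", R.version$minor), sep = ".")
  branch <- if (devel) "devel" else rver
  if (devel)
    warning("installing untested packages from the devel branch")
  repo <- sprintf("https://cran.r-project.org/%s", branch)
  install.packages(pkgs, repos = repo, ...)
}

## default: frozen packages for this R release
# install_staged("ggplot2")
## opt in to the bleeding edge, with a warning
# install_staged("ggplot2", devel = TRUE)
```

The point of the sketch is that nothing is forced: the stable branch is merely the default, and the override is one extra argument.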
Re: [Rd] [RFC] A case for freezing CRAN
On Wed, Mar 19, 2014 at 5:16 PM, Jeroen Ooms jeroen.o...@stat.ucla.edu wrote: On Wed, Mar 19, 2014 at 2:59 PM, Joshua Ulrich josh.m.ulr...@gmail.com wrote: So implementation isn't a problem. The problem is that you need a way to force people not to be able to use different package versions than what existed at the time of each R release. I said this in my previous email, but you removed and did not address it: However, you would need to find a way to actively _prevent_ people from installing newer versions of packages with the stable R releases. Frankly, I would stop using CRAN if this policy were adopted. I am not proposing to force anything on anyone; those are your words. Please read the proposal more carefully before derailing the discussion. Below *verbatim* a section from the paper: snip Yes, force is too strong a word. You want a barrier (however small) to prevent people from installing newer (or older) versions of packages than those that correspond to a given R release. I still think you're going to have a very hard time convincing CRAN maintainers to take up your cause, even if you were to build support for it. Especially because there's nothing stopping anyone else from doing it. -- Joshua Ulrich | about.me/joshuaulrich FOSS Trading | www.fosstrading.com
Re: [Rd] [RFC] A case for freezing CRAN
On 03/19/2014 02:59 PM, Joshua Ulrich wrote: On Wed, Mar 19, 2014 at 4:28 PM, Jeroen Ooms jeroen.o...@stat.ucla.edu wrote: On Wed, Mar 19, 2014 at 11:50 AM, Joshua Ulrich josh.m.ulr...@gmail.com wrote: The suggested solution is not described in the referenced article. It was not suggested that it be the operating system's responsibility to distribute snapshots, nor was it suggested to create binary repositories for specific operating systems, nor was it suggested to freeze only a subset of CRAN packages. IMO this is an implementation detail. If we could all agree on a particular set of cran packages to be used with a certain release of R, then it doesn't matter how the 'snapshotting' gets implemented. It could be a separate repository, or a directory on cran with symbolic links, or a page somewhere with hyperlinks to the respective source packages. Or you can put all packages in a big zip file, or include it in your OS distribution. You can even distribute your entire repo on cdroms (debian style!) or do all of the above. The hard problem is not implementation. The hard part is that for reproducibility to work, we need community wide conventions on which versions of cran packages are used by a particular release of R. Local downstream solutions are impractical, because this results in scripts/packages that only work within your niche using this particular snapshot. I expect that requiring every script be executed in the context of dependencies from some particular third party repository will make reproducibility even less common. Therefore I am trying to make a case for a solution that would naturally improve reliability/reproducibility of R code without any effort by the end-user. So implementation isn't a problem. The problem is that you need a way to force people not to be able to use different package versions than what existed at the time of each R release. 
I said this in my previous email, but you removed and did not address it: However, you would need to find a way to actively _prevent_ people from installing newer versions of packages with the stable R releases. Frankly, I would stop using CRAN if this policy were adopted. I suggest you go build this yourself. You have all the code available on CRAN, and the dates at which each package was published. If others who care about reproducible research find what you've built useful, you will create the very community you want. And you won't have to force one single person to change their workflow. Yeah, we've already heard this "do it yourself" kind of answer. Not a very productive one, honestly. Well, actually that's what we've done for the Bioconductor repositories: we freeze the BioC packages for each version of Bioconductor. But since this freezing doesn't happen at the CRAN level, and many BioC packages depend on CRAN packages, the freezing is only at the surface. It would be much better if the freezing went all the way down to the bottom of the sea. (Note that it already is if you install binary packages only.) Yes, it's technically possible to work around this by also hosting frozen versions of CRAN, one per version of Bioconductor, and have biocLite() (the tool BioC users use for installing packages) point to these frozen versions of CRAN in order to get the correct dependencies for any given version of BioC. However, we don't do that because it would mean extra costs for us in terms of storage space and bandwidth. And also because we believe that it would be more effective, and would ultimately benefit the entire R community (and not just the BioC community), if this problem was addressed upstream. H.
Best, -- Joshua Ulrich | about.me/joshuaulrich FOSS Trading | www.fosstrading.com __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpa...@fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
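The arrangement Hervé describes, with biocLite() resolving packages against per-release repositories plus a hypothetical matching frozen CRAN, would amount to something like the following (the Bioconductor URL follows its real versioned layout; the frozen CRAN URL is invented for illustration):

```r
## Hypothetical sketch: point a session at Bioconductor 2.13's frozen
## repositories plus a matching frozen CRAN snapshot, so dependency
## resolution is frozen "all the way down". The CRAN URL is invented.
options(repos = c(
  BioCsoft = "http://bioconductor.org/packages/2.13/bioc",
  CRAN     = "http://cran.example.org/snapshots/bioc-2.13"
))
```

With such a setting, install.packages() and biocLite() would both draw from the frozen state, which is the "freezing at the bottom of the sea" Hervé argues for.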
Re: [Rd] [RFC] A case for freezing CRAN
Weighing in. FWIW, I find the proposal conceptually quite interesting. For package developers, it does not have to be a frustration to have to wait for a new version of R to release their code. Anticipated frustration was my initial reaction; thinking about this more, I think this could be turned into an opportunity. Since the pattern here is to use Rcpp as an example of something causing compatibility headaches, and I have some responsibility there, maybe I can comment on this. I would find it extremely valuable if there was only one unique version of Rcpp for a given released version of R. Users would have to wait longer to get the new stuff, but one can argue that at least they get something that is more tested. Would it be helpful for authors of packages that have lots of dependencies to start having stricter Depends declarations in their DESCRIPTION files, e.g.: Depends: R (== 3.1.0)? For example, personally I'm waiting for 3.1.0 to release Rcpp11 because I want to leverage some C++11 support that has been included in R. It has been frustrating to have to wait, but it does change the way I make changes to the codebase. Perhaps it is a good habit to adopt. And it does not require "more work" from others, just more discipline and self-control from people implementing this pattern. Also, declaring a strict dependency requirement against a released version of R could perhaps put an end to the drama of "you were asked to test this against a very recent version of R-devel, and guess what, a few hours ago I've just added a new test that makes your package non R CMD check worthy". So less work for CRAN maintainers then. Romain 
On 19 March 2014 at 23:57, Hervé Pagès hpa...@fhcrc.org wrote:
Re: [Rd] [RFC] A case for freezing CRAN
What am I overlooking? That this is already available and possible in R today, but perhaps not widely used. Developers do tend to only include a lower bound, if they include any bounds at all on package dependencies. As I mentioned elsewhere, R packages often aren't built against other R packages, and developers may have a range of versions being tested against, some of which may not be on CRAN yet. Technically, all packages on CRAN would need to have a dependency cap on R-devel, but as that is a moving target until it is released, I don't see in practice how enforcing an upper limit on the R dependency would work. The way CRAN works, you can't just set a dependency on R (== 3.0.x), say. (As far as I understand CRAN's policies.) For packages it is quite trivial for the developers to manually add the required info for the upper bound, less so the lower bound, but you could just pick a known working version. An upper range on the dependencies could be stated as whatever version is current on CRAN. But then what happens? Unbeknownst to you, a few days after you release to CRAN your package foo with stated dependency on bar (>= 1.2, <= 1.8), the developer of bar releases bar v2.0 and your package no longer passes checks, CRAN gets in touch and you have to resubmit another version. This could be desirable in terms of helping contribute to reproducibility exercises, but it imposes more effort on the CRAN maintainers and package maintainers. Now, this might be an issue because of the desire on CRAN's behalf to have some elements of human intervention in the submission process, but you either work with CRAN or do your own thing. As Bioconductor has shown (for example), it is possible, if people want to put in time and effort and have a community buy into an ethos, to achieve staged releases etc. 
G 
On 19 March 2014 12:58, Carl Boettiger cboet...@gmail.com wrote: Dear list, I'm curious what people would think of a more modest proposal at this time: state the version of the dependencies used by the package authors when the package was built. Eventually CRAN could enforce that such a statement be present in the DESCRIPTION. We encourage users to declare the versions of the packages they use in publications, so why not have the same expectation of developers? This would help address the problem of archived packages that Jeroen raises, as it is currently impossible to reliably install archived packages because their dependencies have since been updated and are no longer compatible. (Even if it passes checks and installs, we have no way of knowing if the upstream changes have introduced a bug.) This information would be relatively straightforward to capture, shouldn't change the way anyone currently uses CRAN, and should address a major pain point anyone trying to install archived versions from CRAN has probably encountered. What am I overlooking? Carl On Wed, Mar 19, 2014 at 11:36 AM, Spencer Graves spencer.gra...@structuremonitoring.com wrote: What about having this purpose met with something like an expansion of R-Forge? We could have packages submitted to R-Forge rather than CRAN, and people who wanted the latest could get it from R-Forge. If changes I make on R-Forge break a reverse dependency, emails explaining the problem are sent to both me and the maintainer of the package I broke. The budget for R-Forge would almost certainly need to be increased: they currently disable many of the tests they once ran. Regarding budget, the R Project would get more donations if they asked for them and made it easier to contribute. I've tried multiple times without success to find a way to donate. I didn't try hard, but it shouldn't be hard ;-) (And donations should be accepted in US dollars and Euros -- and maybe other currencies.) 
There should be a procedure whereby anyone could receive a pro forma invoice, which they can pay or ignore as they choose. I mention this because many grants could cover a reasonable fee provided they have an invoice. Spencer Graves On 3/19/2014 10:59 AM, Jeroen Ooms wrote: On Wed, Mar 19, 2014 at 5:52 AM, Duncan Murdoch murdoch.dun...@gmail.com wrote: I don't see why CRAN needs to be involved in this effort at all. A third party could take snapshots of CRAN at R release dates, and make those available to package users in a separate repository. It is not hard to set a different repository than CRAN as the default location from which to obtain packages. I am happy to see many people giving this some thought and engaging in the discussion. Several have suggested that staging/freezing can simply be done by a third party. This solution and its limitations are also described in the paper [1] in the section titled "R: downstream staging and repackaging". If this could solve the problem without affecting CRAN, we would have done this already, obviously. In fact, as described in the paper and
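Carl's proposal above, recording the versions of a package's dependencies as they existed when the package was built, can be sketched with existing base R tools (a hypothetical illustration, not an existing CRAN mechanism; "ggplot2" is just an example package):

```r
## Sketch: record the versions of a package's direct dependencies as
## installed on the build machine, in a form that could be stored in
## DESCRIPTION or under inst/ at build time.
ip   <- installed.packages()
deps <- tools::package_dependencies("ggplot2", db = ip,
                                    which = c("Depends", "Imports"))[[1]]
vers <- vapply(deps, function(p) as.character(packageVersion(p)), character(1))
writeLines(paste0(deps, " (== ", vers, ")"))
```

Capturing this at build time, rather than asking authors to maintain it by hand, is what would make the statement Carl wants enforceable.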
Re: [Rd] [RFC] A case for freezing CRAN
Given that R has moved to a 12-month release cycle, I don't want to either i) wait a year to get new packages (or allow users to use new versions of my packages), or ii) have to run R-devel just to use new packages (or be on R-testing, for that matter). People then will start finding ways around these limitations, and then we're back to square one of having people use a set of R packages and R versions that could potentially be all over the place. As a package developer, it is pretty easy to say "I've tested that my package works with these other packages and their versions", and set DESCRIPTION to reflect only those versions as allowed (or a range, as a package matures and the maintainer has tested against more versions of the dependencies). CRAN may well not like this if your package no longer builds/checks on their system, but then you have a choice to make: stick to your reproducibility guns and forsake CRAN in favour of something else (GitHub, one's own repo), or relent and meet CRAN's requirements. 
On 19 March 2014 16:57, Hervé Pagès hpa...@fhcrc.org wrote:
Re: [Rd] [RFC] A case for freezing CRAN
On Mar 19, 2014, at 18:42, Joshua Ulrich josh.m.ulr...@gmail.com wrote: On Wed, Mar 19, 2014 at 5:16 PM, Jeroen Ooms jeroen.o...@stat.ucla.edu wrote: On Wed, Mar 19, 2014 at 2:59 PM, Joshua Ulrich josh.m.ulr...@gmail.com wrote: So implementation isn't a problem. The problem is that you need a way to force people not to be able to use different package versions than what existed at the time of each R release. I said this in my previous email, but you removed and did not address it: However, you would need to find a way to actively _prevent_ people from installing newer versions of packages with the stable R releases. Frankly, I would stop using CRAN if this policy were adopted. I am not proposing to force anything on anyone; those are your words. Please read the proposal more carefully before derailing the discussion. Below *verbatim* a section from the paper: snip Yes, "force" is too strong a word. You want a barrier (however small) to prevent people from installing newer (or older) versions of packages than those that correspond to a given R release. Jeroen, Reading this thread again, is it a fair summary of your position to say "reproducibility by default" is more important than giving users access to the newest bug fixes and features by default? It's certainly arguable, but I'm not sure I'm convinced: I'd imagine that the ratio of new work being done vs reproductions is rather high, and the current setup optimizes for that already. What I'm trying to figure out is why the standard "install the following list of package versions" isn't good enough in your eyes? Is it the lack of CRAN-provided binaries, or the fact that the user has to proactively set up their environment to replicate that of published results? In your XML example, it seems the problem was that the reproducer didn't check that they had the same package versions as the reproducee and instead assumed that 'latest' would be the same. Annoying, yes, but easy to solve. 
Michael
Re: [Rd] [RFC] A case for freezing CRAN
Michael, I think the issue is that Jeroen wants to take that responsibility out of the hands of the person trying to reproduce a work. If it used R 3.0.x and packages A, B and C, then it would be trivial to install that version of R and then pull down the stable versions of A, B and C for that version of R. At the moment, one might note the packages used and even their versions, but what about the versions of the packages that the used packages rely upon, and so on? What if developers don't state known working versions of dependencies? The problem is: how the heck do you know which versions of packages are needed if developers don't record these dependencies in sufficient detail? The suggested solution is to freeze CRAN at intervals alongside R releases. Then you'd know what the stable versions were. Or we could just get package developers to be more thorough in documenting dependencies. Or R CMD check could refuse to pass if a package is listed as a dependency but with no version qualifiers. Or have R CMD build add an upper bound (from the current, at-build-time version of dependencies on CRAN) if the package developer didn't include an upper bound. Or... The first is unlikely to happen consistently, and no-one wants *more* checks and hoops to jump through :-) To my mind it is incumbent upon those wanting reproducibility to build the tools to enable users to reproduce works. When you write a paper or release a tool, you will have tested it with a specific set of packages. It is relatively easy to work out what those versions are (there are tools in R for this). What is required is an automated way to record that info in an agreed-upon way in an approved file/location, and have a tool that facilitates setting up a package library sufficient with which to reproduce a work. That approval doesn't need to come from CRAN or R Core - we can store anything in ./inst. Reproducibility is a very important part of doing science, but not everyone using CRAN is doing that. 
Why force everyone to march to the reproducibility drum? I would place the onus elsewhere to make this work. Gavin A scientist, very much interested in reproducibility of my work and that of others. 
On 19 March 2014 19:55, Michael Weylandt michael.weyla...@gmail.com wrote: 
-- Gavin Simpson, PhD
Re: [Rd] [RFC] A case for freezing CRAN
On Mar 19, 2014, at 22:17, Gavin Simpson ucfa...@gmail.com wrote: Michael, I think the issue is that Jeroen wants to take that responsibility out of the hands of the person trying to reproduce a work. If it used R 3.0.x and packages A, B and C, then it would be trivial to install that version of R and then pull down the stable versions of A, B and C for that version of R. At the moment, one might note the packages used and even their versions, but what about the versions of the packages that the used packages rely upon, and so on? What if developers don't state known working versions of dependencies? Doesn't sessionInfo() give all of this? If you want to be very worried about every last bit, I suppose it should also include options(), compiler flags, compiler version, BLAS details, etc. (Good talk on the dregs of a floating-point number and how hard it is to reproduce them across processors: http://www.youtube.com/watch?v=GIlp4rubv8U) The problem is: how the heck do you know which versions of packages are needed if developers don't record these dependencies in sufficient detail? The suggested solution is to freeze CRAN at intervals alongside R releases. Then you'd know what the stable versions were. Only if you knew which R release was used. Or we could just get package developers to be more thorough in documenting dependencies. Or R CMD check could refuse to pass if a package is listed as a dependency but with no version qualifiers. Or have R CMD build add an upper bound (from the current, at-build-time version of dependencies on CRAN) if the package developer didn't include an upper bound. Or... The first is unlikely to happen consistently, and no-one wants *more* checks and hoops to jump through :-) To my mind it is incumbent upon those wanting reproducibility to build the tools to enable users to reproduce works. But the tools already allow it with minimal effort. If the author can't even include session info, how can we be sure the version of R is known? 
If we can't know which version of R, can we ever change R at all? Etc. to absurdity. My (serious) point is that the tools are in place, but ramming them down folks' throats by intentionally keeping them on older versions by default is too much. When you write a paper or release a tool, you will have tested it with a specific set of packages. It is relatively easy to work out what those versions are (there are tools in R for this). What is required is an automated way to record that info in an agreed-upon way in an approved file/location, and have a tool that facilitates setting up a package library sufficient with which to reproduce a work. That approval doesn't need to come from CRAN or R Core - we can store anything in ./inst. I think the package version and published paper cases are different. For the latter, the recipe is simple: if you want the same results, use the same software (as noted by sessionInfoPlus() or equiv). For the former, I think you start straying into this NP-complete problem: http://people.debian.org/~dburrows/model.pdf Yes, a good config can (and should) be recorded, but isn't that exactly what sessionInfo() gives? Reproducibility is a very important part of doing science, but not everyone using CRAN is doing that. Why force everyone to march to the reproducibility drum? I would place the onus elsewhere to make this work. Agreed: reproducibility is the onus of the author, not the reader. Gavin A scientist, very much interested in reproducibility of my work and others. Michael In finance, where we call it Auditability and care very much as well :-)
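The sessionInfo() approach the two are discussing can be made concrete in a single line of base R. (sessionInfoPlus() in the message above is Michael's hypothetical name, not a real function; sessionInfo() and capture.output() are real and cover most of what he lists short of compiler flags and BLAS details.)

```r
## Save the session details (R version, platform, locale, and attached/
## loaded package versions) alongside the results of an analysis, so a
## reader can later reconstruct a matching environment.
writeLines(capture.output(sessionInfo()), "session-info.txt")
```

Dropping this file into a paper's supplementary materials, or a package's inst/ directory, is the "agreed-upon file/location" Gavin asks for, minus the community convention about where it lives.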
Re: [Rd] [RFC] A case for freezing CRAN
On Wed, Mar 19, 2014 at 6:55 PM, Michael Weylandt michael.weyla...@gmail.com wrote: Reading this thread again, is it a fair summary of your position to say "reproducibility by default" is more important than giving users access to the newest bug fixes and features by default? It's certainly arguable, but I'm not sure I'm convinced: I'd imagine that the ratio of new work being done vs reproductions is rather high, and the current setup optimizes for that already. I think that separating development from released branches can give us both reliability/reproducibility (stable branch) as well as new features (unstable branch). The user gets to pick (and you can pick both!). The same is true for r-base: when using a 'released' version you get 'stable' base packages that are up to 12 months old. If you want to have the latest stuff you download a nightly build of r-devel. For regular users and reproducible research it is recommended to use the stable branch. However, if you are a developer (e.g. package author) you might want to develop/test/check your work with the latest r-devel. I think that extending the R release cycle to CRAN would result both in more stable released versions of R, as well as more freedom for package authors to implement rigorous changes in the unstable branch. When writing a script that is part of a production pipeline, or a Sweave paper that should be reproducible 10 years from now, or a book on using R, you use a stable version of R, which is guaranteed to behave the same over time. However, when developing packages that should be compatible with the upcoming release of R, you use r-devel, which has the latest versions of other CRAN and base packages. What I'm trying to figure out is why the standard "install the following list of package versions" isn't good enough in your eyes? Almost nobody does this because it is cumbersome and impractical. We can do so much better than this. 
Note that in order to install old packages you also need to investigate which versions of the dependencies of those packages were used. On win/osx, users need to manually build those packages, which can be a pain. All in all it makes reproducible research difficult, expensive and error-prone. At the end of the day most published results obtained with R just won't be reproducible. Also, I believe that keeping it simple is essential for solutions to be practical. If every script has to be run inside an environment with custom libraries, it takes away much of its power. Running a bash or python script in Linux is so easy and reliable that entire distributions are based on it. I don't understand why we make our lives so difficult in R. In my estimation, a system where stable versions of R pull packages from a stable branch of CRAN will naturally resolve the majority of the reproducibility and reliability problems with R. And in contrast to what some people here are suggesting, it does not introduce any limitations. If you want to get the latest stuff, you either grab a copy of r-devel, or just enable the testing branch and off you go. Debian 'testing' works in a similar way, see http://www.debian.org/devel/testing. 
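The "install the following list of package versions" workflow that Jeroen calls cumbersome looks roughly like this today (the URL follows CRAN's real Archive layout, but the package and version shown are only illustrative, and each dependency would need the same treatment, in the right order):

```r
## Manually installing one specific archived version from CRAN's source
## archive; on Windows/OS X this also requires a compiler toolchain,
## which is part of the pain Jeroen describes.
install.packages(
  "https://cran.r-project.org/src/contrib/Archive/XML/XML_3.98-1.1.tar.gz",
  repos = NULL, type = "source"
)
```

Repeating this by hand for a full dependency tree, in topological order, is exactly the per-script effort the snapshot proposal is meant to eliminate.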
Re: [Rd] [RFC] A case for freezing CRAN
On Mar 19, 2014, at 22:45, Jeroen Ooms jeroen.o...@stat.ucla.edu wrote: On Wed, Mar 19, 2014 at 6:55 PM, Michael Weylandt michael.weyla...@gmail.com wrote: Reading this thread again, is it a fair summary of your position to say reproducibility by default is more important than giving users access to the newest bug fixes and features by default? It's certainly arguable, but I'm not sure I'm convinced: I'd imagine that the ratio of new work being done vs reproductions is rather high and the current setup optimizes for that already. I think that separating development from released branches can give us both reliability/reproducibility (stable branch) as well as new features (unstable branch). The user gets to pick (and you can pick both!). The same is true for r-base: when using a 'released' version you get 'stable' base packages that are up to 12 months old. If you want to have the latest stuff you download a nightly build of r-devel. For regular users and reproducible research it is recommended to use the stable branch. However if you are a developer (e.g. package author) you might want to develop/test/check your work with the latest r-devel. I think where you are getting push back (e.g., Frank Harrell and Josh Ulrich) is from saying that 'stable' is the right branch for 'regular users.' And I tend to agree: I think most folks need features and bug fixes more than they need to reproduce a particular paper with no effort on their end. I think that extending the R release cycle to CRAN would result both in more stable released versions of R, as well as more freedom for package authors to implement rigorous change in the unstable branch. Not sure what exactly you mean by this sentence. When writing a script that is part of a production pipeline, or sweave paper that should be reproducible 10 years from now, or a book on using R, you use stable version of R, which is guaranteed to behave the same over time. Only if you never upgrade anything... 
But that's the case already, isn't it? However, when developing packages that should be compatible with the upcoming release of R, you use r-devel, which has the latest versions of other CRAN and base packages. What I'm trying to figure out is why the standard "install the following list of package versions" isn't good enough in your eyes? Almost nobody does this because it is cumbersome and impractical. We can do so much better than this. Note that in order to install old packages you also need to investigate which versions of the dependencies of those packages were used. On win/osx, users need to manually build those packages, which can be a pain. All in all it makes reproducible research difficult, expensive and error-prone. At the end of the day most published results obtained with R just won't be reproducible. So you want CRAN to host old binaries ad infinitum? I think that's entirely reasonable/doable if (big if) storage and network are free. Also, I believe that keeping it simple is essential for solutions to be practical. If every script has to be run inside an environment with custom libraries, it takes away much of its power. Running a bash or python script in Linux is so easy and reliable that entire distributions are based on it. I don't understand why we make our lives so difficult in R. Because for a Debian-style (stop-the-world-on-release) distro, there are no upgrades within a release. And that's only halfway reasonable because of Debian's shockingly good QA. It's certainly not true for, e.g., Arch. I've been looking at python incompatibilities across different RHEL versions lately. There's simply no way to get around explicit version pinning (either by release number or date, but when you have many moving pieces, picking a set of release numbers is much easier than finding a single day when they all happened to work together) if it has to work exactly as it used to. 
In my estimation, a system where stable versions of R pull packages from a stable branch of CRAN will naturally resolve the majority of the reproducibility and reliability problems with R.

And what everyone else is saying is: if you want to reproduce results made with old software, download and use the old software. Both can be made to work -- it's just a matter of the pros and cons of different defaults.

And in contrast to what some people here are suggesting, it does not introduce any limitations. If you want the latest stuff, you either grab a copy of r-devel, or just enable the testing branch and off you go. Debian 'testing' works in a similar way; see http://www.debian.org/devel/testing.

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
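From a user's session, the stable/testing scheme proposed above might look like a single repository setting. This is purely hypothetical: the branch URLs below do not exist on CRAN today, and `branch_repo` is an invented name used only to illustrate the proposal.

```r
# Hypothetical sketch of the proposal: a released R resolves packages
# from a CRAN branch frozen for that release; getting newer packages
# is one opt-in setting. These branch URLs do not exist on CRAN.
branch_repo <- function(branch, r_version) {
  sprintf("https://cran.r-project.org/%s/%s/", branch, r_version)
}

# Regular users / reproducible research: the frozen branch for this R.
options(repos = c(CRAN = branch_repo("stable", getRversion())))

# Developers who want the latest packages: enable the testing branch.
# options(repos = c(CRAN = branch_repo("testing", getRversion())))
```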
Re: [Rd] [RFC] A case for freezing CRAN
I think what you really want here is the ability to easily identify and sync to CRAN snapshots. The easy way to do this is to set up a CRAN mirror, but back it with version control, so that it's easy to reproduce the exact state of CRAN at any given point in time. CRAN is not particularly large and doesn't churn a whole lot, so most version control systems should be able to handle that without difficulty.

Using svn, mod_dav_svn, and (maybe) mod_rewrite, you could set up the server so that e.g. http://my.cran.mirror/repos/2013-01-01/ is a mirror of how CRAN looked at midnight on 2013-01-01. Users can then set their repository to that URL and will have a stable snapshot to work with, and can have all their packages built with that snapshot if they like. For reproducibility purposes, all users need to do is agree on the same date to use. For publication purposes, the date of the snapshot should be sufficient. We'd need a version of update.packages() that force-syncs all the packages to the version in the repository, even if they're downgrades, but otherwise it ought to be fairly straightforward.

FWIW, we do something similar internally at Google. All the packages that a user has installed come from the same source-control revision, where we know that all the package versions are mutually compatible. It saves a lot of headaches, and users can roll back to any previous point in time easily if they run into problems.

On Wed, Mar 19, 2014 at 7:45 PM, Jeroen Ooms jeroen.o...@stat.ucla.edu wrote:

On Wed, Mar 19, 2014 at 6:55 PM, Michael Weylandt michael.weyla...@gmail.com wrote:

Reading this thread again, is it a fair summary of your position to say "reproducibility by default is more important than giving users access to the newest bug fixes and features by default"? It's certainly arguable, but I'm not sure I'm convinced: I'd imagine that the ratio of new work being done vs. reproductions is rather high and the current setup optimizes for that already.
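The dated-snapshot workflow described in this message can be sketched from the client side. The mirror URL is the hypothetical one from the message above, and `snapshot_repo`/`sync_to_snapshot` are invented names: the force-sync loop is only an assumption about how the proposed variant of update.packages() might work.

```r
# Sketch of the dated-snapshot workflow. "my.cran.mirror" is the
# hypothetical mirror from the message; function names are invented.
snapshot_repo <- function(date, mirror = "http://my.cran.mirror") {
  sprintf("%s/repos/%s/", mirror, date)
}

# Everyone who agrees on the same date gets the same package versions.
options(repos = c(CRAN = snapshot_repo("2013-01-01")))

# Force-sync: reinstall anything whose installed version differs from
# the snapshot's, downgrades included (update.packages() only upgrades).
sync_to_snapshot <- function() {
  inst   <- installed.packages()
  avail  <- available.packages()
  common <- intersect(rownames(inst), rownames(avail))
  stale  <- common[inst[common, "Version"] != avail[common, "Version"]]
  if (length(stale)) install.packages(stale)
}
```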