Re: [Rd] [RFC] A case for freezing CRAN

2014-03-24 Thread Gábor Csárdi
FWIW, I am mirroring CRAN at github now, here:
https://github.com/cran

One can install specific package versions using the devtools package:
library(devtools)
install_github("cran/@")

In addition, one can also install versions based on the R version, e.g.:
install_github("cran/@R-2.15.3")
installs the version that was on CRAN when R-2.15.3 was released.
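
For example, to install the igraph package as it was when R 2.15.3 was
released (the package name is purely illustrative; any CRAN package works
the same way):

    library(devtools)
    install_github("cran/igraph@R-2.15.3")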

This is not very convenient yet, because the dependencies should be
installed based on the R versions as well. This is in the works.

This is an experiment, and I am not yet committed to maintaining it in the
long run. We'll see how it works and if it has the potential to be useful.

Plans for features:
- convenient install of packages from CRAN "snapshots", with all
dependencies coming from the same snapshot (see the sketch below).
- web page with package search, summaries, etc.
- binaries
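
One possible shape for the snapshot install, purely hypothetical at this
point (the snapshot URL below is a placeholder, not a running service):

    # point R at a dated snapshot as if it were a regular CRAN mirror,
    # so install.packages() resolves every dependency within the snapshot
    options(repos = c(CRAN = "http://cran-snapshot.example.org/2014-03-24"))
    install.packages("ggplot2")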

Help is welcome, especially advice and feedback:
https://github.com/metacran/tools/issues

Best,
Gabor



Re: [Rd] [RFC] A case for freezing CRAN

2014-03-24 Thread Martin Maechler
>>>>> Hervé Pagès 
>>>>> on Thu, 20 Mar 2014 15:23:57 -0700 writes:

> On 03/20/2014 01:28 PM, Ted Byers wrote:
>> On Thu, Mar 20, 2014 at 3:14 PM, Hervé Pagès <hpa...@fhcrc.org> wrote:
>> 
>> On 03/20/2014 03:52 AM, Duncan Murdoch wrote:
>> 
>> On 14-03-20 2:15 AM, Dan Tenenbaum wrote:
>> 
>> 
>> 
>> - Original Message -
>> 
>> From: "David Winsemius" <dwinsem...@comcast.net>
>> To: "Jeroen Ooms" <jeroen.o...@stat.ucla.edu>
>> Cc: "r-devel" <r-devel@r-project.org>
>> Sent: Wednesday, March 19, 2014 11:03:32 PM
>> Subject: Re: [Rd] [RFC] A case for freezing CRAN
>> 
>> 
>> On Mar 19, 2014, at 7:45 PM, Jeroen Ooms wrote:
>> 
>> On Wed, Mar 19, 2014 at 6:55 PM, Michael Weylandt
>> <michael.weyla...@gmail.com> wrote:
>> 
>> Reading this thread again, is it a fair summary of your
>> position to say "reproducibility by default is more
>> important than giving users access to the newest bug
>> fixes and features by default?"  It's certainly arguable,
>> but I'm not sure I'm convinced: I'd imagine that the
>> ratio of new work being done vs reproductions is rather
>> high and the current setup optimizes for that already.
>> 
>> 
>> I think that separating development from released
>> branches can give us both reliability/reproducibility
>> (stable branch) as well as new features (unstable
>> branch). The user gets to pick (and you can pick
>> both!). The same is true for r-base: when using a
>> 'released' version you get 'stable' base packages that
>> are up to 12 months old. If you want to have the latest
>> stuff you download a nightly build of r-devel.  For
>> regular users and reproducible research it is recommended
>> to use the stable branch. However if you are a developer
>> (e.g. package author) you might want to
>> develop/test/check your work with the latest r-devel.
>> 
>> I think that extending the R release cycle to CRAN would
>> result both in more stable released versions of R, as
>> well as more freedom for package authors to implement
>> rigorous change in the unstable branch.  When writing a
>> script that is part of a production pipeline, or sweave
>> paper that should be reproducible 10 years from now, or a
>> book on using R, you use stable version of R, which is
>> guaranteed to behave the same over time. However when
>> developing packages that should be compatible with the
>> upcoming release of R, you use r-devel which has the
>> latest versions of other CRAN and base packages.
>> 
>> 
>> 
>> As I remember ... The example demonstrating the need for
>> this was an XML package that caused an extract from a
>> website where the headers were misinterpreted as data in
>> one version of pkg:XML and not in another. That seems
>> fairly unconvincing. Data cleaning and validation is a
>> basic task of data analysis. It also seems excessive to
>> assert that it is the responsibility of CRAN to maintain
>> a synced binary archive that will be available in ten
>> years.
>> 
>> 
>> 
>> CRAN already does this, the bin/windows/contrib directory
>> has subdirectories going back to 1.7, with packages dated
>> October 2004. I don't see why it is burdensome to
>> continue to archive these.  It would be nice if source
>> versions had a similar archive.
>> 
>> 
>> The bin/windows/contrib directories are updated every day
>> for active R versions.  It's only when Uwe decides that a
>> version is no longer worth active support that he stops
>> doing updates, and it "freezes".  A consequence of this
>> is that the snapshots preserved in those older
>> directories are unlikely to match what someone who keeps
>> up to date with R releases is using.  Their purpose is to
>> make sure that those older versions aren't completely
>> useless, but they aren't what Jeroen was asking for.
>> 
>> 
>> But it is almost completely useless from a
>> reproducibility point of

Re: [Rd] [RFC] A case for freezing CRAN

2014-03-21 Thread Tom Short
For me, the most important aspect is being able to reproduce my own
work. Some other tools offer interesting approaches to managing
packages:

* NPM -- The Node Package Manager for Node.js loads a local copy of
all packages and dependencies. This helps ensure reproducibility and
avoids dependency issues. Different projects in different directories
can then use different package versions.

* Julia -- Julia's package manager is based on git, so users should
have a local copy of all package versions they've used. Theoretically,
you could use separate git repos for different projects, and merge as
desired.

I've thought about putting my local R library into a git repository.
Then, I could clone that into a project directory and use
.libPaths(".Rlibrary")  in a .Rprofile file to set the library
directory to the clone. In addition to handling package versions, this
might be nice for installing packages that are rarely used (my library
directory tends to get cluttered if I start trying out packages).
Another addition could be a local script that starts a specific
version of R.
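
A minimal sketch of that idea, assuming the clone of the library repo
lives in .Rlibrary inside the project directory:

    # .Rprofile in the project directory
    if (file.exists(".Rlibrary")) {
      # put the project-local library first so its versions win
      .libPaths(c(normalizePath(".Rlibrary"), .libPaths()))
    }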

For now, I don't have much incentive to do this. For the packages that
I use, R's been pretty good to me with backwards compatibility.

I do like the idea of a CRAN mirror that's under version control.




On Tue, Mar 18, 2014 at 4:24 PM, Jeroen Ooms  wrote:
> This came up again recently with an irreproducible paper. Below an
> attempt to make a case for extending the r-devel/r-release cycle to
> CRAN packages. These suggestions are not in any way intended as
> criticism of anyone or the status quo.
>
> The proposal described in [1] is to freeze a snapshot of CRAN along
> with every release of R. In this design, updates for contributed
> packages are treated the same as updates for base packages in the sense
> that they are only published to the r-devel branch of CRAN and do not
> affect users of "released" versions of R. Thereby all users, stacks
> and applications using a particular version of R will by default be
> using the identical version of each CRAN package. The bioconductor
> project uses similar policies.
>
> This system has several important advantages:
>
> ## Reproducibility
>
> Currently r/sweave/knitr scripts are unstable because of ambiguity
> introduced by constantly changing cran packages. This causes scripts
> to break or change behavior when upstream packages are updated, which
> makes reproducing old results extremely difficult.
>
> A common counter-argument is that script authors should document
> package versions used in the script using sessionInfo(). However even
> if authors would manually do this, reconstructing the author's
> environment from this information is cumbersome and often nearly
> impossible, because binary packages might no longer be available,
> dependency conflicts, etc. See [1] for a worked example. In practice,
> the current system causes many results or documents generated with R
> not to be reproducible, sometimes already after a few months.
>
> In a system where contributed packages inherit the r-base release
> cycle, scripts will behave the same across users/systems/time within a
> given version of R. This severely reduces ambiguity of R behavior, and
> has the potential of making reproducibility a natural part of the
> language, rather than a tedious exercise.
>
> ## Repository Management
>
> Just like scripts suffer from upstream changes, so do packages
> depending on other packages. A particular package that has been
> developed and tested against the current version of a particular
> dependency is not guaranteed to work against *any future version* of
> that dependency. Therefore, packages inevitably break over time as
> their dependencies are updated.
>
> One recent example is the Rcpp 0.11 release, which required all
> reverse dependencies to be rebuilt/modified. This update caused some
> serious disruption on our production servers. Initially we refrained
> from updating Rcpp on these servers to prevent currently installed
> packages depending on Rcpp from breaking. However soon after the
> Rcpp 0.11 release, many other cran packages started to require Rcpp >=
> 0.11, and our users started complaining about not being able to
> install those packages. This resulted in the impossible situation
> where currently installed packages would not work with the new Rcpp,
> but newly installed packages would not work with the old Rcpp.
>
> Current CRAN policies blame this problem on package authors. However
> as is explained in [1], this policy does not solve anything, is
> unsustainable with growing repository size, and sets completely the
> wrong incentives for contributing code. Progress comes with breaking
> changes, and the system should be able to accommodate this. Much of
> the trouble could have been prevented by a system that does not push
> bleeding edge updates straight to end-users, but has a devel branch
> where conflicts are resolved before publishing them in the next
> r-release.
>
>

Re: [Rd] [RFC] A case for freezing CRAN

2014-03-21 Thread Philippe GROSJEAN

On 21 Mar 2014, at 11:08, Rainer M Krug  wrote:

> Jari Oksanen  writes:
> 
>> On 21/03/2014, at 10:40 AM, Rainer M Krug wrote:
>> 
>>> 
>>> 
>>> This is a long and (mainly) interesting discussion, which is fanning out
>>> in many different directions, and I think many are not that relevant to
>>> the OP's suggestion. 
>>> 
>>> I see the advantages of having such a dynamic CRAN, but also of having a
>>> more stable CRAN. I prefer CRAN as it is now, but in many cases a more
>>> stable CRAN might be an advantage. So having releases of CRAN might make
>>> sense. But then there is the archiving issue of CRAN.
>>> 
>>> The suggestion was made to move the responsibility away from CRAN and
>>> the R infrastructure to the user / researcher to guarantee that the
>>> results can be re-run years later. It would be nice to have this built
>>> into CRAN, but let's stick with the scenario that the user should care for
>>> reproducibility.
>> 
>> There are two different problems that alternate in the discussion:
>> reproducibility and breakage of CRAN dependencies. Frozen CRAN could
>> make *approximate* reproducibility easier to achieve, but real
>> reproducibility needs stricter solutions. Actual sessionInfo() is
>> minimal information, but re-building a spitting image of old
>> environment may still be demanding (but in many cases this does not
>> matter).
>> 
>> Another problem is that CRAN is so volatile that new versions of
>> packages break other packages or old scripts. Here the main problem is
>> how package developers work. Freezing CRAN would not change that: if
>> package maintainers release breaking code, that would be frozen. I
>> think that most packages do not make distinction between development
>> and release branches, and CRAN policy won't change that.
>> 
>> I can sympathize with package maintainers having 150 reverse
>> dependencies. My main package only has ~50, and it is certain that I
>> won't test them all with a new release. I sometimes tried, but I could
>> not even get all those built because they had other dependencies on
>> packages that failed. Even those that I could test failed to detect
>> problems (in one case all examples were \dontrun and the tests passed
>> nicely). I only wish that if people *really* depend on my package, they
>> test it against R-Forge version and alert me before CRAN releases, but
>> that is not very likely (I guess many dependencies are not *really*
>> necessary, but only concern marginal features of the package, but CRAN
>> forces one to declare those).
> 
We work on these too. So far, for the latest CRAN version, we have successfully 
installed 4999 packages among the 5321 CRAN packages on our platform. Regarding 
conflicts in terms of function names, around 2000 packages are clean, but the 
rest produce more than 11,000 pairs of conflicts (i.e., the same function name 
in different packages). For dependency errors, look at the references cited 
earlier. It is strange that a large portion of R CMD check errors on CRAN 
appear and disappear *without any version update* of a package or any of its 
direct or indirect dependencies! That is, a fraction of errors or warnings seem 
to appear and disappear without any code update. We have traced some of these 
back to interaction with the net (e.g., an example or vignette downloads data 
from a server and the server may be sometimes unavailable). So, yes, a complex 
and difficult topic.
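
For a single session, base R can already list such clashes among whatever
packages are currently attached:

    # masked objects, grouped by where they appear on the search path
    conflicts(detail = TRUE)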


> Breakage of CRAN packages is a problem, on which I cannot comment
> much. I have no idea how this could be solved unless one introduces more
> checks, which nobody wants. CRAN is a (more or less) open repository for
> packages written by engineers / programmers but also scientists of other
> fields - and that is the strength of CRAN - a central repository to find
> packages which conform to a minimal standard and format. 
> 
>> 
>> Still a few words about reproducibility of scripts: this can hardly be
>> achieved with good coverage, because many scripts are so very ad
>> hoc. When I edit and review manuscripts for journals, I very often get
>> Sweave or knitr scripts that "just work", where "just" means "just so
>> and so". Often they do not work at all, because they had some
>> undeclared private functionalities or stray files in the author
>> workspace that did not travel with the Sweave document. 
> 
> One reason why I *always* start my R sessions with --vanilla and have a local
> initialization script which I call manually. 
> 
>> I think these
>> -- published scientific papers -- are the main field where the code
>> really should be reproducible, but they often are the hardest to
>> reproduce. 
> 
> And this is completely out of the hands of R / CRAN / ... and in the
> hands of Journals and Authors. But R could provide a framework to make
> this easier in the form of a package which provides functions to make
> this a one-command approach.
> 
>> Nothing CRAN people do can help with sloppy code scientists
>> write for publications. You know, they are scientists -

Re: [Rd] [RFC] A case for freezing CRAN

2014-03-21 Thread Rainer M Krug
Jari Oksanen  writes:

> On 21/03/2014, at 10:40 AM, Rainer M Krug wrote:
>
>> 
>> 
>> This is a long and (mainly) interesting discussion, which is fanning out
>> in many different directions, and I think many are not that relevant to
>> the OP's suggestion. 
>> 
>> I see the advantages of having such a dynamic CRAN, but also of having a
>> more stable CRAN. I prefer CRAN as it is now, but in many cases a more
>> stable CRAN might be an advantage. So having releases of CRAN might make
>> sense. But then there is the archiving issue of CRAN.
>> 
>> The suggestion was made to move the responsibility away from CRAN and
>> the R infrastructure to the user / researcher to guarantee that the
>> results can be re-run years later. It would be nice to have this built
>> into CRAN, but let's stick with the scenario that the user should care for
>> reproducibility.
>
> There are two different problems that alternate in the discussion:
> reproducibility and breakage of CRAN dependencies. Frozen CRAN could
> make *approximate* reproducibility easier to achieve, but real
> reproducibility needs stricter solutions. Actual sessionInfo() is
> minimal information, but re-building a spitting image of old
> environment may still be demanding (but in many cases this does not
> matter).
>
> Another problem is that CRAN is so volatile that new versions of
> packages break other packages or old scripts. Here the main problem is
> how package developers work. Freezing CRAN would not change that: if
> package maintainers release breaking code, that would be frozen. I
> think that most packages do not make distinction between development
> and release branches, and CRAN policy won't change that.
>
> I can sympathize with package maintainers having 150 reverse
> dependencies. My main package only has ~50, and it is certain that I
> won't test them all with a new release. I sometimes tried, but I could
> not even get all those built because they had other dependencies on
> packages that failed. Even those that I could test failed to detect
> problems (in one case all examples were \dontrun and the tests passed
> nicely). I only wish that if people *really* depend on my package, they
> test it against R-Forge version and alert me before CRAN releases, but
> that is not very likely (I guess many dependencies are not *really*
> necessary, but only concern marginal features of the package, but CRAN
> forces one to declare those).

Breakage of CRAN packages is a problem, on which I cannot comment
much. I have no idea how this could be solved unless one introduces more
checks, which nobody wants. CRAN is a (more or less) open repository for
packages written by engineers / programmers but also scientists of other
fields - and that is the strength of CRAN - a central repository to find
packages which conform to a minimal standard and format. 

>
> Still a few words about reproducibility of scripts: this can hardly be
> achieved with good coverage, because many scripts are so very ad
> hoc. When I edit and review manuscripts for journals, I very often get
> Sweave or knitr scripts that "just work", where "just" means "just so
> and so". Often they do not work at all, because they had some
> undeclared private functionalities or stray files in the author
> workspace that did not travel with the Sweave document. 

One reason why I *always* start my R sessions with --vanilla and have a local
initialization script which I call manually. 

> I think these
> -- published scientific papers -- are the main field where the code
> really should be reproducible, but they often are the hardest to
> reproduce. 

And this is completely out of the hands of R / CRAN / ... and in the
hands of Journals and Authors. But R could provide a framework to make
this easier in the form of a package which provides functions to make
this a one-command approach.

> Nothing CRAN people do can help with sloppy code scientists
> write for publications. You know, they are scientists -- not
> engineers.

Absolutely - and I am also a sloppy scientist - I put my code online,
but hope that not many people ask me later about it.

Cheers,

Rainer

>
> Cheers, Jari Oksanen
>> 
>> Leaving the issue of compilation out, a package which creates a
>> custom installation of R which includes the source of the R
>> version used and the sources of the packages in a format compilable
>> on Linux, given that the relevant dependencies are installed, would be a
>> huge step forward. 
>> 
>> I know - compilation on Windows (and sometimes Mac) is a serious
>> problem - but to archive *all* binaries and to re-compile all older
>> versions of R and all packages would be an impossible task.
>> 
>> Apart from that - doing your analysis in a Virtual Machine and then
>> simply archiving this Virtual Machine, would also be an option, but only
>> for the more tech-savvy users.
>> 
>> In a nutshell: I think a package would be able to provide the solution
>> for a local archiving to make it possible to re-run the s

Re: [Rd] [RFC] A case for freezing CRAN

2014-03-21 Thread Jari Oksanen

On 21/03/2014, at 10:40 AM, Rainer M Krug wrote:

> 
> 
> This is a long and (mainly) interesting discussion, which is fanning out
> in many different directions, and I think many are not that relevant to
> the OP's suggestion. 
> 
> I see the advantages of having such a dynamic CRAN, but also of having a
> more stable CRAN. I prefer CRAN as it is now, but in many cases a more
> stable CRAN might be an advantage. So having releases of CRAN might make
> sense. But then there is the archiving issue of CRAN.
> 
> The suggestion was made to move the responsibility away from CRAN and
> the R infrastructure to the user / researcher to guarantee that the
> results can be re-run years later. It would be nice to have this built
> into CRAN, but let's stick with the scenario that the user should care for
> reproducibility.

There are two different problems that alternate in the discussion: 
reproducibility and breakage of CRAN dependencies. Frozen CRAN could make 
*approximate* reproducibility easier to achieve, but real reproducibility needs 
stricter solutions. Actual sessionInfo() is minimal information, but 
re-building a spitting image of old environment may still be demanding (but in 
many cases this does not matter). 
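
Capturing at least that minimal information is a one-liner worth putting at 
the end of any analysis script (the file name is arbitrary):

    writeLines(capture.output(sessionInfo()), "sessionInfo.txt")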

Another problem is that CRAN is so volatile that new versions of packages break 
other packages or old scripts. Here the main problem is how package developers 
work. Freezing CRAN would not change that: if package maintainers release 
breaking code, that would be frozen. I think that most packages do not make 
distinction between development and release branches, and CRAN policy won't 
change that. 

I can sympathize with package maintainers having 150 reverse dependencies. My 
main package only has ~50, and it is certain that I won't test them all with a 
new release. I sometimes tried, but I could not even get all those built because 
they had other dependencies on packages that failed. Even those that I could 
test failed to detect problems (in one case all examples were \dontrun and 
the tests passed nicely). I only wish that if people *really* depend on my package, 
they test it against R-Forge version and alert me before CRAN releases, but 
that is not very likely (I guess many dependencies are not *really* necessary, 
but only concern marginal features of the package, but CRAN forces one to declare 
those). 

Still a few words about reproducibility of scripts: this can hardly be achieved 
with good coverage, because many scripts are so very ad hoc. When I edit and 
review manuscripts for journals, I very often get Sweave or knitr scripts that 
"just work", where "just" means "just so and so". Often they do not work at 
all, because they had some undeclared private functionalities or stray files in 
the author workspace that did not travel with the Sweave document. I think 
these -- published scientific papers -- are the main field where the code 
really should be reproducible, but they often are the hardest to reproduce. 
Nothing CRAN people do can help with sloppy code scientists write for 
publications. You know, they are scientists -- not engineers. 

Cheers, Jari Oksanen
> 
> Leaving the issue of compilation out, a package which creates a
> custom installation of R which includes the source of the R
> version used and the sources of the packages in a format compilable
> on Linux, given that the relevant dependencies are installed, would be a
> huge step forward. 
> 
> I know - compilation on Windows (and sometimes Mac) is a serious
> problem - but to archive *all* binaries and to re-compile all older
> versions of R and all packages would be an impossible task.
> 
> Apart from that - doing your analysis in a Virtual Machine and then
> simply archiving this Virtual Machine, would also be an option, but only
> for the more tech-savvy users.
> 
> In a nutshell: I think a package would be able to provide the solution
> for local archiving, making it possible to re-run the simulation with
> the same tools at a later stage - although guarantees would not be
> possible.
> 
> Cheers,
> 
> Rainer
> -- 
> Rainer M. Krug
> email: Rainerkrugsde
> PGP: 0x0F52F982
> 


Re: [Rd] [RFC] A case for freezing CRAN

2014-03-21 Thread Philippe Grosjean
This is becoming an extremely long thread, and it is going in too many 
directions. However, I would like to mention here our ongoing five-year ECOS 
project for the study of Open Source Ecosystems, among which CRAN. You can find 
info here: http://informatique.umons.ac.be/genlog/projects/ecos/. We are in the 
second year now.

We are currently working on CRAN maintainability questions. See:

- Claes Maelick, Mens Tom, Grosjean Philippe, "On the maintainability of CRAN 
packages", in IEEE CSMR-WCRE 2014 Software Evolution Week, Antwerp, Belgium, 
2014.

- Mens Tom, Claes Maelick, Grosjean Philippe, Serebrenik Alexander, "Studying 
Evolving Software Ecosystems based on Ecological Models", in Mens Tom, 
Serebrenik Alexander, Cleve Anthony (eds.), "Evolving Software Systems", 
Springer, ISBN 978-3-642-45397-7, 2014.

Currently, we are building an Open Source system based on VirtualBox and 
Vagrant to recreate a virtual machine under Linux (Debian and Ubuntu considered 
for the moment) that would be as close as possible to a "simulated CRAN 
environment as it was at any given date". Our plans are to replay CRAN back in 
time and to instrument that platform to measure what we need for our 
ecological studies of CRAN.

The connection with this thread is the possibility of reusing this system to 
propose something useful for reproducible research, that is, a reproducible 
platform, in the sense of the reproducibility vs replicability distinction 
Jeroen Ooms mentions. It would then be enough to record the date some R code 
was run on that platform (and perhaps whether it is a 32- or 64-bit system) to 
be able to rebuild a similar software environment with all corresponding CRAN 
packages of the right version easily installable. In case something specific is 
required in addition to the software proposed by default, Vagrant allows 
provisioning the virtual machine in an easy way too… but then, the provisioning 
script must be provided as well (not much of a problem). The info required to 
rebuild the platform shrinks down to an ASCII text file of a few KB. This is 
something easy to put together with your R code in, say, the additional 
material of a publication. 
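
The session-side record could then be as small as this (a sketch; the
platform itself would fix the exact format):

    # the date the code was run, the R version, and 32- vs 64-bit
    cat(format(Sys.Date()), "\n")
    cat(R.version.string, "\n")
    cat(8 * .Machine$sizeof.pointer, "-bit\n", sep = "")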

Please keep in mind that many platform-specific features in R (graphic 
devices, string encoding, and many more) may also be a problem for reproducing 
published results. Hence the idea to use a virtual machine running only one OS, 
Linux, no matter if you work on Windows, or Mac OS X, or… Solaris (anyone 
there?).

PhG


On 20 Mar 2014, at 21:53, Jeroen Ooms  wrote:

> On Thu, Mar 20, 2014 at 1:28 PM, Ted Byers  wrote:
>> 
>> Herve Pages mentions the risk of irreproducibility across three minor
>> revisions of version 1.0 of Matrix.  My gut reaction would be that if the
>> results are not reproducible across such minor revisions of one library,
>> they are probably just so much BS.
>> 
> 
> Perhaps this is just terminology, but what you refer to I would generally
> call 'replication'. Of course being able to replicate results with other
> data or other software is important to validate claims. But being able to
> reproduce how the original results were obtained is an important part of
> this process.
> 
> If someone is publishing results that I think are questionable and I cannot
> replicate them, I want to know exactly how those outcomes were obtained in
> the first place, so that I can 'debug' the problem. It's quite important to
> be able to trace back whether incorrect results were the result of a bug,
> incompetence or fraud.
> 
> Let's take the example of the Reinhart and Rogoff case. The results
> obviously were not replicable, but without more information it was just the
> word of a grad student vs two Harvard professors. Only after reproducing
> the original analysis was it possible to point out the errors and prove
> that the original results were incorrect.
> 


Re: [Rd] [RFC] A case for freezing CRAN

2014-03-21 Thread Rainer M Krug
Jari Oksanen  writes:

> Freezing CRAN solves no problem of reproducibility. If you know the
> sessionInfo() or the version of R, the packages used and their
> versions, you can reproduce that set up. If you do not know, then you
> cannot. You can try to guess: source code of old release versions of R
> and old packages are in the CRAN archive, and these files have dates. So
> you can collect a snapshot of R and packages for a given date. This is
> not an ideal solution, but it is the same level of reproducibility
> that you get with a strictly frozen CRAN. CRAN is not the sole source of
> packages, and even with a strictly frozen CRAN the users may have used
> packages from other sources. I am sure that if CRAN were frozen
> (but I assume it happens the same day hell freezes), people would
> increasingly often use other package sources than CRAN. The choice is
> easy if the alternatives are to wait for the next year for the bug fix
> release, or do the analysis now and use package versions in R-Forge or
> github. Then you could not assume that frozen CRAN packages were used.

Agree completely here - the solution would be a package which packages the
source (or even binaries?) of your local R setup, including R and the
packages used. The solution is local - not on a server.
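
A rough sketch of what such a package could do with base R alone -- note it
fetches the *current* CRAN sources, so it must run at analysis time, before
the packages move on:

    # archive the source tarballs of all loaded non-base packages
    base_pkgs <- rownames(installed.packages(priority = "base"))
    pkgs <- setdiff(loadedNamespaces(), base_pkgs)
    dir.create("pkg-archive", showWarnings = FALSE)
    download.packages(pkgs, destdir = "pkg-archive", type = "source")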

>
> CRAN policy is not made in this mailing list, and CRAN maintainers are
> so silent that it hurts ears. 

+1

> However, I hope they won't freeze CRAN.

Yes and no - if they do, we need a devel branch which acts like the
current CRAN.

>
> Strict reproduction seems to be harder than I first imagined:
> ./configure && make really failed for R 2.14.1 and older on my office
> desktop. To reproduce an older analysis, I would also need to install
> older tool sets (I suspect gfortran and cairo libraries).

Absolutely - let's not go there. And then there is also the hardware
issue.

>
> CRAN is one source of R packages, and certainly its policy does not
> suit all developers. There is no policy that suits all.  Frozen CRAN
> would suit some, but certainly would deter some others.
>
> There seems to be a common sentiment here that the only reason anybody
> would use R older than 3.0.3 is to reproduce old results. My
> experience from Real Life(™) is that many of us use computers that
> we do not own, but they are the property of our employer. This may
> mean that we are not allowed to install any software there, or we have
> to pay, or the Department or project has to pay, the computer
> administration for installing new versions of software (our
> case).  

> This is often called security. Personally I avoid this by using
> Mac laptop and Linux desktop: these are not supported by the
> University computer administration and I can do what I please with
> these, but poor Windows users are stuck. 

Nicely put.

> Computer classes are also
> maintained by centralized computer administration. This January they
> had new R, but last year it was still two years old. However, users
> can install packages in their personal "folders" so that they can use
> current packages even with older R. Therefore I want to take care that
> the packages I maintain also run in older R. Therefore I also applaud
> the current CRAN policy where new versions of packages are
> "backported" to previous R release: Even if you are stuck with stale
> R, you need not be stuck with stale packages. Currently I cannot test
> with R older than 2.14.2, though I do test regularly and
> certainly before CRAN releases.  If somebody wants to prevent this,
> they can set their package to unnecessarily depend on the current
> version of R. I would regard this as antisocial, but nobody would ask
> what I think about this so it does not matter.
>
> The development branch of my package is in R-Forge, and only bug fixes
> and (hopefully) non-breaking enhancements (isolated so that they do
> not influence other functions, safe so that API does not change or
> format of the output does not change) are merged to the CRAN release
> branch. This policy was adopted because it fits the current CRAN
> policy, and probably would need to change if CRAN policy changes.
>
> Cheers, Jari Oksanen

-- 
Rainer M. Krug
email: Rainerkrugsde
PGP: 0x0F52F982




Re: [Rd] [RFC] A case for freezing CRAN

2014-03-21 Thread Rainer M Krug


This is a long and (mainly) interesting discussion, which is fanning out
in many different directions, and I think many are not that relevant to
the OP's suggestion. 

I see the advantages of having such a dynamic CRAN, but also of having a
more stable CRAN. I prefer CRAN as it is now, but in many cases a more
stable CRAN might be an advantage. So having releases of CRAN might make
sense. But then there is the archiving issue of CRAN.

The suggestion was made to move the responsibility away from CRAN and
the R infrastructure to the user / researcher to guarantee that the
results can be re-run years later. It would be nice to have this built
into CRAN, but let's stick with the scenario that the user should care for
reproducibility.

Leaving the issue of compilation out, a package which creates a
custom installation of R which includes the source of the R
version used and the sources of the packages in a format compilable
on Linux, given that the relevant dependencies are installed, would be a
huge step forward. 

I know - compilation on Windows (and sometimes Mac) is a serious
problem - but to archive *all* binaries and to re-compile all older
versions of R and all packages would be an impossible task.

Apart from that - doing your analysis in a Virtual Machine and then
simply archiving this Virtual Machine, would also be an option, but only
for the more tech-savvy users.

In a nutshell: I think a package would be able to provide the solution
for local archiving, making it possible to re-run the simulation with
the same tools at a later stage - although guarantees would not be
possible.

Cheers,

Rainer
-- 
Rainer M. Krug
email: Rainerkrugsde
PGP: 0x0F52F982



Re: [Rd] [RFC] A case for freezing CRAN

2014-03-21 Thread Jari Oksanen
Freezing CRAN solves no problem of reproducibility. If you know the 
sessionInfo() or the version of R, the packages used and their versions, you 
can reproduce that set up. If you do not know, then you cannot. You can try to 
guess: source code of old release versions of R and old packages are in the CRAN 
archive, and these files have dates. So you can collect a snapshot of R and 
packages for a given date. This is not an ideal solution, but it is the same 
level of reproducibility that you get with a strictly frozen CRAN. CRAN is not the 
sole source of packages, and even with a strictly frozen CRAN the users may have 
used packages from other sources. I am sure that if CRAN were frozen (but I 
assume it happens the same day hell freezes), people would increasingly often 
use other package sources than CRAN. The choice is easy if the alternatives are 
to wait for the next year for the bug fix release, or do the analysis now and 
use package versions in R-Forge or github. Then you could not assume that 
frozen CRAN packages were used.
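
Installing one specific archived version by hand is already possible today,
since the Archive area keeps every source tarball (the package and version
here are just an example):

    url <- "http://cran.r-project.org/src/contrib/Archive/Matrix/Matrix_1.0-10.tar.gz"
    download.file(url, basename(url))
    install.packages(basename(url), repos = NULL, type = "source")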

CRAN policy is not made in this mailing list, and CRAN maintainers are so 
silent that it hurts ears. However, I hope they won't freeze CRAN. 

Strict reproduction seems to be harder than I first imagined: ./configure && 
make really failed for R 2.14.1 and older on my office desktop. To reproduce 
an older analysis, I would also need to install older tool sets (I suspect 
gfortran and cairo libraries).

CRAN is one source of R packages, and certainly its policy does not suit all 
developers. There is no policy that suits all.  Frozen CRAN would suit some, 
but certainly would deter some others. 

There seems to be a common sentiment here that the only reason anybody would use R 
older than 3.0.3 is to reproduce old results. My experience from Real 
Life(™) is that many of us use computers that we do not own, but they are the 
property of our employer. This may mean that we are not allowed to install 
any software there, or we have to pay, or the Department or project has to pay, 
the computer administration for installing new versions of software (our 
case). This is often called security. Personally I avoid this by using a Mac 
laptop and Linux desktop: these are not supported by the University computer 
administration and I can do what I please with these, but poor Windows users 
are stuck. Computer classes are also maintained by centralized computer 
administration. This January they had new R, but last year it was still two 
years old. However, users can install packages in their personal "folders" so 
that they can use current packages even with older R. Therefore I want to take 
care that the packages I maintain also run in older R. Therefore I also applaud 
the current CRAN policy where new versions of packages are "backported" to 
previous R release: Even if you are stuck with stale R, you need not be stuck 
with stale packages. Currently I cannot test with R older than 2.14.2, though 
I do test regularly and certainly before CRAN releases.  If somebody wants 
to prevent this, they can set their package to unnecessarily depend on the 
current version of R. I would regard this as antisocial, but nobody would ask 
what I think about this so it does not matter.
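
The personal "folders" mechanism is plain base R, for what it's worth (the
path is just an example):

    dir.create("~/Rlibs", showWarnings = FALSE)
    install.packages("vegan", lib = "~/Rlibs")
    library(vegan, lib.loc = "~/Rlibs")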

The development branch of my package is in R-Forge, and only bug fixes and 
(hopefully) non-breaking enhancements (isolated so that they do not influence 
other functions, safe so that API does not change or  format of the output does 
not change) are merged to the CRAN release branch. This policy was adopted 
because it fits the current CRAN policy, and probably would need to change if 
CRAN policy changes.

Cheers, Jari Oksanen


Re: [Rd] [RFC] A case for freezing CRAN

2014-03-20 Thread Dan Tenenbaum


- Original Message -
> From: "Gábor Csárdi" 
> To: "r-devel" 
> Sent: Thursday, March 20, 2014 6:23:33 PM
> Subject: Re: [Rd] [RFC] A case for freezing CRAN
> 
> Much of the discussion was about reproducibility so far. Let me
> emphasize
> another point from Jeroen's proposal.
> 
> This is hard to measure of course, but I think I can say that the
> existence
> and the quality of CRAN and its packages contributed immensely to the
> success of R and the success of people using R. Having one central,
> well
> controlled and tested package repository is a huge advantage for the
> users.
> (I know that there are other repositories, but they are either
> similarly
> well controlled and specialized (BioC), or less used.) It would be
> great to
> keep it like this.
> 
> I also think that the current CRAN policy is not ideal for further
> growth.
> In particular, updating a package with many reverse dependencies is a
> frustrating process, for everybody. As a maintainer with ~150 reverse
> dependencies, I think not twice, but ten times if I really want to
> publish
> a new version on CRAN. I cannot speak for other maintainers of
> course, but
> I have a feeling that I am not alone.
> 
> Tying CRAN packages to R releases would help, because then I would
> not have
> to worry about breaking packages in the stable version of CRAN, only
> in
> CRAN-devel.
> 
> Somebody mentioned that it is good not to do this because then users
> get
> bug fixes and new features earlier. Well, in my case, the opposite is
> true.
> As I am not updating, they actually get it (much) later. If it wasn't
> such
> a hassle, I would definitely update more often, about once a month.
> Now my
> goal is more like once a year.
> 

These are good points. Not only do maintainers think twice (or more) before 
updating packages but it also seems that there are CRAN policies that 
discourage frequent updates. Whereas Bioconductor welcomes frequent updates 
because they usually fix problems and help us understand 
interoperability/dependency issues. Probably the main reason for this 
difference is the existence of a devel branch where breakage can happen and 
it's not the end of the world.





> Again, I cannot speak for others, but I believe the current policy
> does not
> help progress, and is not sustainable in the long run. It penalizes
> the
> maintainers of "more important" (= many rev. dependencies, that is,
> which
> probably also means many users) packages, and I fear they will slowly
> move
> away from CRAN. I don't think this is what anybody in the R community
> would
> want.
> 
> Best,
> Gabor
> 


Re: [Rd] [RFC] A case for freezing CRAN

2014-03-20 Thread Tim Triche, Jr.
Except that tests (as vignettes) are mandatory for BioC. So if something blows 
up you hear about it right quick :-)

--t

> On Mar 20, 2014, at 7:15 PM, Gábor Csárdi  wrote:
> 
> On Thu, Mar 20, 2014 at 9:45 PM, William Dunlap  wrote:
> 
>>> In particular, updating a package with many reverse dependencies is a
>>> frustrating process, for everybody. As a maintainer with ~150 reverse
>>> dependencies, I think not twice, but ten times if I really want to
>> publish
>>> a new version on CRAN.
>> 
>> It might be easier if more of those packages came with good test suites.
> 
> Test suites are great, but I don't think this would make my job easier.
> More tests means more potential breakage. The extreme of not having any
> examples and tests in these 150 packages would be the easiest for _me_,
> actually. Not for the users, though.
> 
> What would really help is either fully versioned package dependencies
> (daydreaming here), or having a CRAN-devel repository, that changes and
> might break often, and a CRAN-stable that does not change (much).
> 
> Gabor
> 
> [...]
> 


Re: [Rd] [RFC] A case for freezing CRAN

2014-03-20 Thread Tim Triche, Jr.
Heh, you just described BioC

--t

> On Mar 20, 2014, at 7:15 PM, Gábor Csárdi  wrote:
> 
> [...]


Re: [Rd] [RFC] A case for freezing CRAN

2014-03-20 Thread Gábor Csárdi
On Thu, Mar 20, 2014 at 9:45 PM, William Dunlap  wrote:

> > In particular, updating a package with many reverse dependencies is a
> > frustrating process, for everybody. As a maintainer with ~150 reverse
> > dependencies, I think not twice, but ten times if I really want to
> publish
> > a new version on CRAN.
>
> It might be easier if more of those packages came with good test suites.
>

Test suites are great, but I don't think this would make my job easier.
More tests means more potential breakage. The extreme of not having any
examples and tests in these 150 packages would be the easiest for _me_,
actually. Not for the users, though.

What would really help is either fully versioned package dependencies
(daydreaming here), or having a CRAN-devel repository, that changes and
might break often, and a CRAN-stable that does not change (much).

Gabor

[...]



Re: [Rd] [RFC] A case for freezing CRAN

2014-03-20 Thread William Dunlap
> In particular, updating a package with many reverse dependencies is a
> frustrating process, for everybody. As a maintainer with ~150 reverse
> dependencies, I think not twice, but ten times if I really want to publish
> a new version on CRAN.

It might be easier if more of those packages came with good test suites.
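
For what it's worth, even the simplest base-R test convention goes a long
way -- a file under tests/ that stops on error (names hypothetical):

    ## tests/test-basic.R: R CMD check runs every .R file in tests/
    ## and fails the check if the file signals an error
    library(somePkg)
    stopifnot(identical(someFun(1:3), 6L))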

Bill Dunlap
TIBCO Software
wdunlap tibco.com


> -Original Message-
> From: r-devel-boun...@r-project.org [mailto:r-devel-boun...@r-project.org]
> On Behalf Of Gábor Csárdi
> Sent: Thursday, March 20, 2014 6:24 PM
> To: r-devel
> Subject: Re: [Rd] [RFC] A case for freezing CRAN
> 
> [...]


Re: [Rd] [RFC] A case for freezing CRAN

2014-03-20 Thread Gábor Csárdi
Much of the discussion was about reproducibility so far. Let me emphasize
another point from Jeroen's proposal.

This is hard to measure of course, but I think I can say that the existence
and the quality of CRAN and its packages contributed immensely to the
success of R and the success of people using R. Having one central, well
controlled and tested package repository is a huge advantage for the users.
(I know that there are other repositories, but they are either similarly
well controlled and specialized (BioC), or less used.) It would be great to
keep it like this.

I also think that the current CRAN policy is not ideal for further growth.
In particular, updating a package with many reverse dependencies is a
frustrating process, for everybody. As a maintainer with ~150 reverse
dependencies, I think not twice, but ten times if I really want to publish
a new version on CRAN. I cannot speak for other maintainers of course, but
I have a feeling that I am not alone.

Tying CRAN packages to R releases would help, because then I would not have
to worry about breaking packages in the stable version of CRAN, only in
CRAN-devel.

Somebody mentioned that it is good not to do this because then users get
bug fixes and new features earlier. Well, in my case, the opposite is true.
As I am not updating, they actually get it (much) later. If it wasn't such
a hassle, I would definitely update more often, about once a month. Now my
goal is more like once a year.

Again, I cannot speak for others, but I believe the current policy does not
help progress, and is not sustainable in the long run. It penalizes the
maintainers of "more important" (= many rev. dependencies, that is, which
probably also means many users) packages, and I fear they will slowly move
away from CRAN. I don't think this is what anybody in the R community would
want.

Best,
Gabor



Re: [Rd] [RFC] A case for freezing CRAN

2014-03-20 Thread Hervé Pagès

On 03/20/2014 03:29 PM, Uwe Ligges wrote:



On 20.03.2014 23:23, Hervé Pagès wrote:



On 03/20/2014 01:28 PM, Ted Byers wrote:

[...]

Re: [Rd] [RFC] A case for freezing CRAN

2014-03-20 Thread Uwe Ligges



On 20.03.2014 23:23, Hervé Pagès wrote:



On 03/20/2014 01:28 PM, Ted Byers wrote:

[...]


Re: [Rd] [RFC] A case for freezing CRAN

2014-03-20 Thread Ted Byers
On Thu, Mar 20, 2014 at 5:27 PM, Tim Triche, Jr. wrote:

> > There is nothing like backups with due attention to detail.
>
> Agreed, although given the complexity of dependencies among packages, this
> might entail several GB of snapshots per paper (if not several TB for some
> papers) in various cases.  Anyone who is reasonably prolific then gets the
> exciting prospect of managing these backups.
>
Isn't that what support staff is for?  ;-)  But, storage space is cheap,
and as tedious as managing backups can be (definitely not fun), it is
manageable.


> At least if I grind out a vignette with a bunch of Bioconductor packages
> and call sessionInfo() at the end, I can find out later on (if, say, things
> stop working) what was the state of the tree when it last worked, and what
> might have changed since then.  If a self-contained C++ or FORTRAN program
> is sufficient to perform an entire analysis, that's awesome, and it ought
> to be stuffed into revision control (doesn't everyone already do this?).
>  But once you start using tools that depend on other tools, it becomes
> substantially more difficult to ensure that
>
> 1) a comprehensive snapshot is taken
> 2) reviewers, possibly on different platforms and/or major versions, can
> run using that snapshot
> 3) some means of a quick sanity check ("does this analysis even return
> sensible results?") can be run
>
> Hopefully this is better articulated than my previous missive.
>
Tell me about it.  Oh, wait, you already did.  ;-)

I understand this, as I routinely work with complex distributed systems
involving multiple programming languages and other diverse tools.  But such
is part of the overhead of doing quality work.


> I believe we fundamentally agree; some of the particulars may be an issue
> of notation or typical workflow.
>
>
I agree that we fundamentally agree  ;-)

From my experience, the issues addressed in this thread are probably best
handled by the package developers and those authors that use their
packages, rather than by imposing additional work on those responsible for
CRAN, especially when the means for doing things a little differently than
how CRAN does it are readily available.

Cheers

Ted
R.E.(Ted) Byers, Ph.D.,Ed.D.



Re: [Rd] [RFC] A case for freezing CRAN

2014-03-20 Thread Tim Triche, Jr.
> There is nothing like backups with due attention to detail.

Agreed, although given the complexity of dependencies among packages, this
might entail several GB of snapshots per paper (if not several TB for some
papers) in various cases.  Anyone who is reasonably prolific then gets the
exciting prospect of managing these backups.

At least if I grind out a vignette with a bunch of Bioconductor packages
and call sessionInfo() at the end, I can find out later on (if, say, things
stop working) what was the state of the tree when it last worked, and what
might have changed since then.  If a self-contained C++ or FORTRAN program
is sufficient to perform an entire analysis, that's awesome, and it ought
to be stuffed into revision control (doesn't everyone already do this?).
 But once you start using tools that depend on other tools, it becomes
substantially more difficult to ensure that

1) a comprehensive snapshot is taken
2) reviewers, possibly on different platforms and/or major versions, can
run using that snapshot
3) some means of a quick sanity check ("does this analysis even return
sensible results?") can be run

Hopefully this is better articulated than my previous missive.

I believe we fundamentally agree; some of the particulars may be an issue
of notation or typical workflow.



Statistics is the grammar of science.
Karl Pearson 


[[alternative HTML version deleted]]

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] [RFC] A case for freezing CRAN

2014-03-20 Thread Ted Byers
On Thu, Mar 20, 2014 at 5:11 PM, Tim Triche, Jr. wrote:

> That doesn't make sense.
>
> If an API changes (e.g. in Matrix) and a program written against the old
> API can no longer run, that is a very different issue than if the same
> numbers (data) give different results.  The latter is what I am guessing
> you address.  The former is what I believe most people are concerned about
> here.  Or at least I hope that's so.
>
The problem you describe is the classic case of a failure of backward
compatibility.  That is completely different from the question of
reproducibility or replicability.  And since I, among others, noticed that
the question of reproducibility had arisen, I felt a need to address
that first.

I do not have a quibble with anything else you wrote (or with anything in
this thread related to the issue of backward compatibility), and I have
enough experience to know both that it is a hard problem and that there are
a number of different solutions people have used.  Appropriate management
of deprecation of features is one, and the use of code freezes is another.
Version control is a third.  Each option carries its own advantages and
disadvantages.


Cheers

Ted

[[alternative HTML version deleted]]

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] [RFC] A case for freezing CRAN

2014-03-20 Thread Ted Byers
On Thu, Mar 20, 2014 at 4:53 PM, Jeroen Ooms wrote:

> On Thu, Mar 20, 2014 at 1:28 PM, Ted Byers  wrote:
>>
>> Herve Pages mentions the risk of irreproducibility across three minor
>> revisions of version 1.0 of Matrix.  My gut reaction would be that if the
>> results are not reproducible across such minor revisions of one library,
>> they are probably just so much BS.
>>
>
> Perhaps this is just terminology, but what you refer to I would generally
> call 'replication'. Of course being able to replicate results with other
> data or other software is important to validate claims. But being able to
> reproduce how the original results were obtained is an important part of
> this process.
>
Fair enough.


> If someone is publishing results that I think are questionable and I
> cannot replicate them, I want to know exactly how those outcomes were
> obtained in the first place, so that I can 'debug' the problem. It's quite
> important to be able to trace back if incorrect results were a result of a
> bug, incompetence or fraud.

OK.  That is where archives come in.  When I had to deal with that sort of
thing, I provided copies of both data and code to whoever asked.  It ought
not be hard for authors to make an archive, to e.g. an optical disk, that
includes the software used along with the data, and store it like any other
backup, so it can be provided to anyone upon request.


> Let's take the example of the Reinhart and Rogoff case. The results
> obviously were not replicable, but without more information it was just the
> word of a grad student vs two Harvard professors. Only after reproducing
> the original analysis was it possible to point out the errors and prove
> that the original results were incorrect.

Ok, but, if the practice I used were used, then a copy of the optical disk
to which everything relevant was stored would solve that problem (and it
would be extremely easy for the researcher or his/her supervisor to do).  I
once had a reviewer complain he couldn't reproduce my results, so I sent
him my code, which, translated into any of the Algol family of languages,
would allow him, or anyone else, to replicate my results regardless of
their programming language of choice.  Once he had my code, he found his
error and reported back that he had finally replicated my results.  Several
of my colleagues used the same practice, with the same consequences
(whenever questioned, they just provide their code, and related software,
and then their results were reproduced).  There is nothing like backups
with due attention to detail.

Cheers

Ted

-- 
R.E.(Ted) Byers, Ph.D.,Ed.D.

[[alternative HTML version deleted]]

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] [RFC] A case for freezing CRAN

2014-03-20 Thread Tim Triche, Jr.
That doesn't make sense.

If an API changes (e.g. in Matrix) and a program written against the old
API can no longer run, that is a very different issue than if the same
numbers (data) give different results.  The latter is what I am guessing
you address.  The former is what I believe most people are concerned about
here.  Or at least I hope that's so.

It's more an issue of usability than reproducibility in such a case, far as
I can tell (see e.g.
http://liorpachter.wordpress.com/2014/03/18/reproducibility-vs-usability/).
 If the same data produces substantially different results (not
attributable to e.g. better handling of machine precision and so forth,
although that could certainly be a bugaboo in many cases... anyone who has
programmed numerical routines in FORTRAN already knows this) then yes,
that's a different type of bug.  But in order to uncover the latter type of
bug, the code has to run in the first place.  After a while it becomes
rather impenetrable if no thought is given to these changes.

So the Bioconductor solution, as Herve noted, is to have freezes and
releases.  There can be old bugs enshrined in people's code due to using
old versions, and those can be traced even after many releases have come
and gone, because there is a point-in-time snapshot of about when these
things occurred.  As with (say) ANSI C++, deprecation notices stay in place
for a year before anything is actually done to remove a function or break
an API.  It's not impossible, it just requires more discipline than
declaring that the same program should be written multiple times on
multiple platforms every time.  The latter isn't an efficient use of
anyone's time.
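
Viewed from the user side, that release discipline is roughly (a sketch;
biocLite() was the installer of the day, and the package name is only an
example):

source("http://bioconductor.org/biocLite.R")  # BiocInstaller matched to your R
BiocInstaller::biocVersion()  # the point-in-time release you are on
biocLite("limma")             # installed from that frozen release branch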

Most of these analyses are not about putting a man on the moon or making
sure a dam does not break.  They're relatively low-consequence exploratory
sorties.  If something comes of them, it would be nice to have a
point-in-time reference to check and see whether the original results were
hooey.  That's a lot quicker and more efficient than rewriting everything
from scratch (which, in some fields, simply ensures things won't get
checked).

My $0.02, since we do still have those to bedevil cashiers.



Statistics is the grammar of science.
Karl Pearson <http://en.wikipedia.org/wiki/The_Grammar_of_Science>



Re: [Rd] [RFC] A case for freezing CRAN

2014-03-20 Thread Jeroen Ooms
On Thu, Mar 20, 2014 at 1:28 PM, Ted Byers  wrote:
>
> Herve Pages mentions the risk of irreproducibility across three minor
> revisions of version 1.0 of Matrix.  My gut reaction would be that if the
> results are not reproducible across such minor revisions of one library,
> they are probably just so much BS.
>

Perhaps this is just terminology, but what you refer to I would generally
call 'replication'. Of course being able to replicate results with other
data or other software is important to validate claims. But being able to
reproduce how the original results were obtained is an important part of
this process.

If someone is publishing results that I think are questionable and I cannot
replicate them, I want to know exactly how those outcomes were obtained in
the first place, so that I can 'debug' the problem. It's quite important to
be able to trace back if incorrect results were a result of a bug,
incompetence or fraud.

Let's take the example of the Reinhart and Rogoff case. The results
obviously were not replicable, but without more information it was just the
word of a grad student vs two Harvard professors. Only after reproducing
the original analysis was it possible to point out the errors and prove
that the original results were incorrect.

[[alternative HTML version deleted]]

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] [RFC] A case for freezing CRAN

2014-03-20 Thread Hervé Pagès

On 03/20/2014 03:52 AM, Duncan Murdoch wrote:

The bin/windows/contrib directories are updated every day for active R
versions.  It's only when Uwe decides that a version is no longer worth
active support that he stops doing updates, and it "freezes".  A
consequence of this is that the snapshots preserved in those older
directories are unlikely to match what someone who keeps up to date with
R releases is using.  Their purpose is to make sure that those older
versions aren't completely useless, but they aren't what Jeroen was
asking for.


But it is almost completely useless from a reproducibility point of
view to get random package versions. For example if some people try
to use R-2.13.2 today to reproduce an analysis that was published
2 years ago, they'll get Matrix 1.0-4 on Windows, Matrix 1.0-3 on Mac,
and Matrix 1.1-2-2 on Unix. And none of them of course is what was used
by the authors of the paper (they used Matrix 1.0-1, which is what was
current when they ran their analysis).

A big improvement from a reproducibility point of view would be to
(a) have a clear cut for the freezes, (b) freeze the source
packages as well as the binary packages, and (c) freeze the same
versions of source and binaries. For example the freeze of
bin/windows/contrib/x.y, bin/macosx/contrib/x.y and contrib/x.y
could happen when the R-x.y series itself freezes (i.e. no more
minor versions planned for this series).
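
On the user side, (a)-(c) could then be as simple as pointing the repos
option at the frozen tree for the running R series. A sketch, with the
URL layout invented purely for illustration:

## hypothetical layout: snapshots published under <mirror>/snapshot/<x.y>/
series <- paste(R.version$major,
                strsplit(R.version$minor, ".", fixed = TRUE)[[1]][1],
                sep = ".")
options(repos = c(CRAN = paste0("http://cran.example.org/snapshot/", series)))
install.packages("Matrix")  # resolves to the version frozen with this series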

Cheers,
H.




--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpa...@fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] [RFC] A case for freezing CRAN

2014-03-20 Thread Jari Oksanen

On 20/03/2014, at 14:14 PM, S Ellison wrote:

>> If we could all agree on a particular set
>> of cran packages to be used with a certain release of R, then it doesn't 
>> matter
>> how the 'snapshotting' gets implemented.
> 
> This is pretty much the sticking point, though. I see no practical way of 
> reaching that agreement without the kind of decision authority (and effort) 
> that Linux distro maintainers put into the internal consistency of each 
> distribution.
> 
> CRAN doesn't try to do that; it's just a place to access packages offered by 
> maintainers. 
> 
> As a package maintainer, I think support for critical version dependencies in 
> the imports or dependency lists is a good idea that individual package 
> maintainers could relatively easily manage, but I think freezing CRAN as a 
> whole or adopting single release cycles for CRAN would be thoroughly 
> impractical.
> 

I have a feeling that this discussion has floated between two different 
arguments in favour of freezing: discontent with package authors who break 
their packages within R release cycle, and ability to reproduce old results. In 
the beginning the first argument was more prominent, but now the discussion has 
drifted to reproducing old results. 

I cannot see how freezing CRAN would help with package authors who do not 
separate development and CRAN release branches but introduce broken code, or 
code that breaks other packages. Freezing a broken snapshot would only mean 
that the situation cannot be cured before the next R release, and then new 
breakage could be introduced. The result would be a dysfunctional CRAN. I think that quite a 
few of the package updates are bug fixes and minor enhancements. Further, I do 
think that these should be "backported" to previous versions of R: users of 
previous versions of R should also benefit from bug fixes. This also is the 
current CRAN policy and I think this is a good policy. Personally, I try to 
keep my packages in such a condition that they will also work in previous 
versions of R so that people do not need to upgrade R to have bug fixes in 
packages. 

The policy is the same with Linux maintainers: they do not just build a 
consistent release, but maintain the release by providing bug fixes. In Linux 
distributions, end of life equals freezing, or not providing new versions of 
software.

Another issue is reproducing old analyses. This is a valuable thing, and 
sessionInfo and the ability to get certain versions of packages certainly are 
steps forward. It looks like guaranteed reproduction is a hard task, though. For 
instance, R 2.14.2 is the oldest version of R that I can build out of the box 
in my Linux desktop. I have earlier built older, even much older, R versions, 
but something has happened in my OS that crashes the build process. To 
reproduce an old analysis, I also should install an older version of my OS,  
then build old R and then get the old versions of packages. It is nice if the 
last step is made easier.

Cheers, Jari Oksanen

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] [RFC] A case for freezing CRAN

2014-03-20 Thread S Ellison
>  If we could all agree on a particular set
> of cran packages to be used with a certain release of R, then it doesn't 
> matter
> how the 'snapshotting' gets implemented.

This is pretty much the sticking point, though. I see no practical way of 
reaching that agreement without the kind of decision authority (and effort) 
that Linux distro maintainers put into the internal consistency of each 
distribution.

CRAN doesn't try to do that; it's just a place to access packages offered by 
maintainers. 

As a package maintainer, I think support for critical version dependencies in 
the imports or dependency lists is a good idea that individual package 
maintainers could relatively easily manage, but I think freezing CRAN as a 
whole or adopting single release cycles for CRAN would be thoroughly 
impractical.

S Ellison






__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] [RFC] A case for freezing CRAN

2014-03-20 Thread Roger Bivand
Gavin Simpson writes:

> 
...
> 
> 
> To my mind it is incumbent upon those wanting reproducibility to build
> the tools to enable users to reproduce works. When you write a paper
> or release a tool, you will have tested it with a specific set of
> packages. It is relatively easy to work out what those versions are
> (there are tools in R for this). What is required is an automated way
> to record that info in an agreed upon way in an approved
> file/location, and have a tool that facilitates setting up a package
> library sufficient with which to reproduce a work. That approval
> doesn't need to come from CRAN or R Core - we can store anything in
> ./inst.

Gavin,

Thanks for contributing useful insights. With reference to Jeroen's proposal
and the discussion so far, I can see where the problem lies, but the
proposed solutions are very invasive. What might offer a less invasive
resolution is a robust and predictable schema for sessionInfo()
content, permitting ready parsing, so that (using Hadley's interjection) the
reproducer could reconstruct the original execution environment at least as
far as R and package versions are concerned.

In fact, I'd argue that the responsibility for securing reproducibility lies
with the originating author or organisation, so that work where
reproducibility is desired should include such a standardised record. 
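
As a sketch of what such a standardised record could look like (the field
names are those of the sessionInfo object in base R; the DCF output format
is only one candidate):

si <- sessionInfo()
pkgs <- c(si$otherPkgs, si$loadedOnly)
record <- data.frame(
    Package = c("R", names(pkgs)),
    Version = c(paste(si$R.version$major, si$R.version$minor, sep = "."),
                vapply(pkgs, function(p) p$Version, character(1))))
write.dcf(record, file = "versions.dcf")  # DCF: the format DESCRIPTION uses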

There is an additional problem not addressed directly in this thread but
mentioned in some contributions, upstream of R. The further problem upstream
is actually in the external dependencies and compilers, beyond that in
hardware. So raising consciousness about the importance of being able to
query version information to enable reproducibility is important.

Next, encapsulating the information permitting its parsing would perhaps
enable the original execution environment to be reconstructed locally by
installing external dependencies, then R, then packages from source, using
the same versions of build train components if possible (and noting
mismatches if not). Maybe resurrect StatDataML in addition to RData
serialization of the version dependencies? Of course, current R and package
versions may provide reproducibility, but if they don't, one would use the
parseable record of the original development environment.

> 
> Reproducibility is a very important part of doing "science", but not
> everyone using CRAN is doing that. Why force everyone to march to the
> reproducibility drum? I would place the onus elsewhere to make this
> work.

Exactly.

Roger

> 
> Gavin
> A scientist, very much interested in reproducibility of my work and others.
> 
...
> >
> > __
> > R-devel  r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-devel
>

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] [RFC] A case for freezing CRAN

2014-03-20 Thread Duncan Murdoch

On 14-03-20 2:15 AM, Dan Tenenbaum wrote:



- Original Message -

From: "David Winsemius" 
To: "Jeroen Ooms" 
Cc: "r-devel" 
Sent: Wednesday, March 19, 2014 11:03:32 PM
Subject: Re: [Rd] [RFC] A case for freezing CRAN


As I remember ... The example demonstrating the need for this was an
XML package that caused an extract from a website to have its headers
misinterpreted as data in one version of pkg:XML and not in
another. That seems fairly unconvincing. Data cleaning and
validation is a basic task of data analysis. It also seems excessive
to assert that it is the responsibility of CRAN to maintain a synced
binary archive that will be available in ten years.



CRAN already does this, the bin/windows/contrib directory has subdirectories 
going back to 1.7, with packages dated October 2004. I don't see why it is 
burdensome to continue to archive these. It would be nice if source versions 
had a similar archive.


The bin/windows/contrib directories are updated every day for active R 
versions.  It's only when Uwe decides that a version is no longer worth 
active support that he stops doing updates, and it "freezes".  A 
consequence of this is that the snapshots preserved in those older 
directories are unlikely to match what someone who keeps up to date with 
R releases is using.  Their purpose is to make sure that those older 
versions aren't completely useless, but they aren't what Jeroen was 
asking for.


Karl Millar's suggestion seems like an ideal solution to this problem. 
Any CRAN mirror could implement it.  If someone sets this up and commits 
to maintaining it, I'd be happy to work on the necessary changes to the 
install.packages/update.packages code to allow people to use it from 
within R.


Duncan Murdoch

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] [RFC] A case for freezing CRAN

2014-03-20 Thread Rainer M Krug
Hadley Wickham  writes:

>> What would be more useful in terms of reproducibility is the capability of
>> installing a specific version of a package from a repository using
>> install.packages(), which would require archiving older versions in a
>> coordinated fashion. I know CRAN archives old versions, but I am not aware
>> if we can programmatically query the repository about this.
>
> See devtools::install_version().
>
> The main caveat is that you also need to be able to build the package,
> and ensure you have dependencies that work with that version.

The compiling will always be the problem when using older source
packages, whatever is done.

But for the dependencies: an automatic parsing of the dependencies
(DEPENDS, IMPORTS, ...) would help a lot. 

Together with a command which scans the packages installed in the session
and stores them in a parsable, human-readable format, so that all required
packages (at the specified versions) can be installed with one command,
the problem would be much closer to being solved.
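
For the install step of a single package, devtools already gets partway
there; a sketch, with package and version picked only as an example:

library(devtools)
install_version("Matrix", version = "1.0-1",
                repos = "http://cran.r-project.org")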

Rainer

>
> Hadley

-- 
Rainer M. Krug
email: Rainerkrugsde
PGP: 0x0F52F982


__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] [RFC] A case for freezing CRAN

2014-03-20 Thread Rainer M Krug
Michael Weylandt  writes:

> On Mar 19, 2014, at 22:17, Gavin Simpson  wrote:
>
>> Michael,
>> 
>> I think the issue is that Jeroen wants to take that responsibility out
>> of the hands of the person trying to reproduce a work. If it used R
>> 3.0.x and packages A, B and C then it would be trivial to to install
>> that version of R and then pull down the stable versions of A B and C
>> for that version of R. At the moment, one might note the packages used
>> and even their versions, but what about the versions of the packages
>> that the used packages rely upon & so on? What if developers don't
>> state known working versions of dependencies?
>
> Doesn't sessionInfo() give all of this?
>
> If you want to be very worried about every last bit, I suppose it
> should also include options(), compiler flags, compiler version, BLAS
> details, etc.  (Good talk on the dregs of a floating point number and
> how hard it is to reproduce them across processors
> http://www.youtube.com/watch?v=GIlp4rubv8U)

In principle yes - but this calls specifically for a package which
extracts the info and stores it in a human-readable format, which
can then be used to re-install (automatically) all the versions for
(hopefully) reproducibility - because if there are external libraries
included, you HAVE problems.

>
>> 
>> The problem is how the heck do you know which versions of packages are
>> needed if developers don't record these dependencies in sufficient
>> detail? The suggested solution is to freeze CRAN at intervals
>> alongside R releases. Then you'd know what the stable versions were.
>
> Only if you knew which R release was used. 

Well - that would be easier to specify in a paper than the version infos
of all packages needed - and which ones of the installed ones are
actually needed? OK - the ones specified in library() calls. But wait -
there are dependencies, imports, ... That is a lot of digging - I would
not know how to do this off the top of my head, except by digging through
the DESCRIPTION files of the packages...
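
Some of that digging can be automated, though; a sketch using base R
tools, with vegan only as an example package:

db <- available.packages()  # package index of the current repos
tools::package_dependencies("vegan", db = db,
                            which = c("Depends", "Imports", "LinkingTo"),
                            recursive = TRUE)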

>
>> 
>> Or we could just get package developers to be more thorough in
>> documenting dependencies. Or R CMD check could refuse to pass if a
>> package is listed as a dependency but with no version qualifiers. Or
>> have R CMD build add an upper bound (from the current, at build-time
>> version of dependencies on CRAN) if the package developer didn't
>> include and upper bound. Or... The first is unliekly to happen
>> consistently, and no-one wants *more* checks and hoops to jump through
>> :-)
>> 
>> To my mind it is incumbent upon those wanting reproducibility to build
>> the tools to enable users to reproduce works.
>
> But the tools already allow it with minimal effort. If the author
> can't even include session info, how can we be sure the version of R
> is known. If we can't know which version of R, can we ever change R at
> all? Etc to absurdity.
>
> My (serious) point is that the tools are in place, but ramming them
> down folks' throats by intentionally keeping them on older versions by
> default is too much.
>
>> When you write a paper
>> or release a tool, you will have tested it with a specific set of
>> packages. It is relatively easy to work out what those versions are
>> (there are tools in R for this). What is required is an automated way
>> to record that info in an agreed upon way in an approved
>> file/location, and have a tool that facilitates setting up a package
>> library sufficient with which to reproduce a work. That approval
>> doesn't need to come from CRAN or R Core - we can store anything in
>> ./inst.
>
> I think the package version and published paper cases are different. 
>
> For the latter, the recipe is simple: if you want the same results,
> use the same software (as noted by sessionInfoPlus() or equiv)

Dependencies, imports, package versions, ... not that straightforward, I
would say.

>
> For the former, I think you start straying into this NP complete problem: 
> http://people.debian.org/~dburrows/model.pdf 
>
> Yes, a good config can (and should be recorded) but isn't that exactly what 
> sessionInfo() gives?
>
>> 
>> Reproducibility is a very important part of doing "science", but not
>> everyone using CRAN is doing that. Why force everyone to march to the
>> reproducibility drum? I would place the onus elsewhere to make this
>> work.
>> 
>
> Agreed: reproducibility is the onus of the author, not the reader

Exactly - but also the authors of the software which is aimed at being
used in the context of reproducibility - the tools should be there to
make it easy!

My points are:

1) I think the snapshot idea of CRAN is a good idea which should be
followed
2) The snapshots should be incorporated at CRAN as I assume that CRAN
will be there longer then any third party repository.
3) the default for the user should *not* change, i.e. normal users will
always get the newest packages as it is now
4) If this can / will not be done because of workload, storage space,
... 

Re: [Rd] [RFC] A case for freezing CRAN

2014-03-19 Thread Dan Tenenbaum


- Original Message -
> From: "David Winsemius" 
> To: "Jeroen Ooms" 
> Cc: "r-devel" 
> Sent: Wednesday, March 19, 2014 11:03:32 PM
> Subject: Re: [Rd] [RFC] A case for freezing CRAN
> 
> As I remember ... The example demonstrating the need for this was an
> XML package that caused an extract from a website to have its headers
> misinterpreted as data in one version of pkg:XML and not in
> another. That seems fairly unconvincing. Data cleaning and
> validation is a basic task of data analysis. It also seems excessive
> to assert that it is the responsibility of CRAN to maintain a synced
> binary archive that will be available in ten years. 


CRAN already does this, the bin/windows/contrib directory has subdirectories 
going back to 1.7, with packages dated October 2004. I don't see why it is 
burdensome to continue to archive these. It would be nice if source versions 
had a similar archive.

Dan





Re: [Rd] [RFC] A case for freezing CRAN

2014-03-19 Thread David Winsemius

On Mar 19, 2014, at 7:45 PM, Jeroen Ooms wrote:

> On Wed, Mar 19, 2014 at 6:55 PM, Michael Weylandt
>  wrote:
>> Reading this thread again, is it a fair summary of your position to say 
>> "reproducibility by default is more important than giving users access to 
>> the newest bug fixes and features by default?" It's certainly arguable, but 
>> I'm not sure I'm convinced: I'd imagine that the ratio of new work being 
>> done vs reproductions is rather high and the current setup optimizes for 
>> that already.
> 
> I think that separating development from released branches can give us
> both reliability/reproducibility (stable branch) as well as new
> features (unstable branch). The user gets to pick (and you can pick
> both!). The same is true for r-base: when using a 'released' version
> you get 'stable' base packages that are up to 12 months old. If you
> want to have the latest stuff you download a nightly build of r-devel.
> For regular users and reproducible research it is recommended to use
> the stable branch. However if you are a developer (e.g. package
> author) you might want to develop/test/check your work with the latest
> r-devel.
> 
> I think that extending the R release cycle to CRAN would result both
> in more stable released versions of R, as well as more freedom for
> package authors to implement rigorous change in the unstable branch.
> When writing a script that is part of a production pipeline, or a sweave
> paper that should be reproducible 10 years from now, or a book on
> using R, you use a stable version of R, which is guaranteed to behave
> the same over time. However when developing packages that should be
> compatible with the upcoming release of R, you use r-devel which has
> the latest versions of other CRAN and base packages.


As I remember ... The example demonstrating the need for this was an XML 
package that caused an extract from a website in which the headers were 
misinterpreted as data in one version of pkg:XML and not in another. That seems 
fairly unconvincing. Data cleaning and validation is a basic task of data 
analysis. It also seems excessive to assert that it is the responsibility of 
CRAN to maintain a synced binary archive that will be available in ten years. 
Bug fixes would be inhibited for years, not unlike SAS and Excel. What next? 
Perhaps all bugs should be labeled as features?  Surely this CRAN-of-the-future 
would be offering something that no other statistical package currently offers, 
nicht wahr?

Why not leave it to the authors to specify which package versions 
were used in their publications. The authors of the packages would get 
recognition and the dependencies would be recorded.

-- 
David.
> 
> 
>> What I'm trying to figure out is why the standard "install the following 
>> list of package versions" isn't good enough in your eyes?
> 
> Almost nobody does this because it is cumbersome and impractical. We
> can do so much better than this. Note that in order to install old
> packages you also need to investigate which versions of dependencies
> of those packages were used. On win/osx, users need to manually build
> those packages which can be a pain. All in all it makes reproducible
> research difficult and expensive and error prone. At the end of the
> day most published results obtained with R just won't be reproducible.
> 
> Also I believe that keeping it simple is essential for solutions to be
> practical. If every script has to be run inside an environment with
> custom libraries, it takes away much of its power. Running a bash or
> python script in Linux is so easy and reliable that entire
> distributions are based on it. I don't understand why we make our
> lives so difficult in R.
> 
> In my estimation, a system where stable versions of R pull packages
> from a stable branch of CRAN will naturally resolve the majority of
> the reproducibility and reliability problems with R. And in contrast
> to what some people here are suggesting it does not introduce any
> limitations. If you want to get the latest stuff, you either grab a
> copy of r-devel, or just enable the testing branch and off you go.
> Debian 'testing' works in a similar way, see
> http://www.debian.org/devel/testing.
> 
> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel

David Winsemius
Alameda, CA, USA

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] [RFC] A case for freezing CRAN

2014-03-19 Thread Karl Millar
I think what you really want here is the ability to easily identify
and sync to CRAN snapshots.

The easy way to do this is setup a CRAN mirror, but back it up with
version control, so that it's easy to reproduce the exact state of
CRAN at any given point in time.  CRAN's not particularly large and
doesn't churn a whole lot, so most version control systems should be
able to handle that without difficulty.

Using svn, mod_dav_svn and (maybe) mod_rewrite, you could set up the
server so that e.g.:
   http://my.cran.mirror/repos/2013-01-01/
is a mirror of how CRAN looked at midnight 2013-01-01.

Users can then set their repository to that URL, and will have a
stable snapshot to work with, and can have all their packages built
with that snapshot if they like.  For reproducibility purposes, all
users need to do is to agree on the same date to use.  For publication
purposes, the date of the snapshot should be sufficient.

We'd need a version of update.packages() that force-syncs all the
packages to the version in the repository, even if they're downgrades,
but otherwise it ought to be fairly straightforward.
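
For instance, something like this on the client side (the dated URL follows
the hypothetical scheme above; sync.packages is an illustrative helper, not
an existing function):

    ## Point R at a dated snapshot mirror (hypothetical URL scheme).
    options(repos = c(SNAPSHOT = "http://my.cran.mirror/repos/2013-01-01"))

    ## Force-sync installed packages to the snapshot's versions,
    ## reinstalling on any version mismatch -- including downgrades.
    sync.packages <- function(repos = getOption("repos")) {
      avail  <- available.packages(contriburl = contrib.url(repos))
      inst   <- installed.packages()
      common <- intersect(rownames(avail), rownames(inst))
      stale  <- common[avail[common, "Version"] != inst[common, "Version"]]
      if (length(stale)) install.packages(stale, repos = repos)
    }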

FWIW, we do something similar internally at Google.  All the packages
that a user has installed come from the same source control revision,
where we know that all the package versions are mutually compatible.
It saves a lot of headaches, and users can rollback to any previous
point in time easily if they run into problems.


On Wed, Mar 19, 2014 at 7:45 PM, Jeroen Ooms  wrote:
> On Wed, Mar 19, 2014 at 6:55 PM, Michael Weylandt
>  wrote:
>> Reading this thread again, is it a fair summary of your position to say 
>> "reproducibility by default is more important than giving users access to 
>> the newest bug fixes and features by default?" It's certainly arguable, but 
>> I'm not sure I'm convinced: I'd imagine that the ratio of new work being 
>> done vs reproductions is rather high and the current setup optimizes for 
>> that already.
>
> I think that separating development from released branches can give us
> both reliability/reproducibility (stable branch) as well as new
> features (unstable branch). The user gets to pick (and you can pick
> both!). The same is true for r-base: when using a 'released' version
> you get 'stable' base packages that are up to 12 months old. If you
> want to have the latest stuff you download a nightly build of r-devel.
> For regular users and reproducible research it is recommended to use
> the stable branch. However if you are a developer (e.g. package
> author) you might want to develop/test/check your work with the latest
> r-devel.
>
> I think that extending the R release cycle to CRAN would result both
> in more stable released versions of R, as well as more freedom for
> package authors to implement rigorous change in the unstable branch.
> When writing a script that is part of a production pipeline, or a sweave
> paper that should be reproducible 10 years from now, or a book on
> using R, you use a stable version of R, which is guaranteed to behave
> the same over time. However when developing packages that should be
> compatible with the upcoming release of R, you use r-devel which has
> the latest versions of other CRAN and base packages.
>
>
>> What I'm trying to figure out is why the standard "install the following 
>> list of package versions" isn't good enough in your eyes?
>
> Almost nobody does this because it is cumbersome and impractical. We
> can do so much better than this. Note that in order to install old
> packages you also need to investigate which versions of dependencies
> of those packages were used. On win/osx, users need to manually build
> those packages which can be a pain. All in all it makes reproducible
> research difficult and expensive and error prone. At the end of the
> day most published results obtained with R just won't be reproducible.
>
> Also I believe that keeping it simple is essential for solutions to be
> practical. If every script has to be run inside an environment with
> custom libraries, it takes away much of its power. Running a bash or
> python script in Linux is so easy and reliable that entire
> distributions are based on it. I don't understand why we make our
> lives so difficult in R.
>
> In my estimation, a system where stable versions of R pull packages
> from a stable branch of CRAN will naturally resolve the majority of
> the reproducibility and reliability problems with R. And in contrast
> to what some people here are suggesting it does not introduce any
> limitations. If you want to get the latest stuff, you either grab a
> copy of r-devel, or just enable the testing branch and off you go.
> Debian 'testing' works in a similar way, see
> http://www.debian.org/devel/testing.
>
> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Re: [Rd] [RFC] A case for freezing CRAN

2014-03-19 Thread Michael Weylandt
On Mar 19, 2014, at 22:45, Jeroen Ooms  wrote:

> On Wed, Mar 19, 2014 at 6:55 PM, Michael Weylandt
>  wrote:
>> Reading this thread again, is it a fair summary of your position to say 
>> "reproducibility by default is more important than giving users access to 
>> the newest bug fixes and features by default?" It's certainly arguable, but 
>> I'm not sure I'm convinced: I'd imagine that the ratio of new work being 
>> done vs reproductions is rather high and the current setup optimizes for 
>> that already.
> 
> I think that separating development from released branches can give us
> both reliability/reproducibility (stable branch) as well as new
> features (unstable branch). The user gets to pick (and you can pick
> both!). The same is true for r-base: when using a 'released' version
> you get 'stable' base packages that are up to 12 months old. If you
> want to have the latest stuff you download a nightly build of r-devel.
> For regular users and reproducible research it is recommended to use
> the stable branch. However if you are a developer (e.g. package
> author) you might want to develop/test/check your work with the latest
> r-devel.

I think where you are getting push back (e.g., Frank Harrell and Josh Ulrich) 
is from saying that 'stable' is the right branch for 'regular users.' And I 
tend to agree: I think most folks need features and bug fixes more than they 
need to reproduce a particular paper with no effort on their end. 

> 
> I think that extending the R release cycle to CRAN would result both
> in more stable released versions of R, as well as more freedom for
> package authors to implement rigorous change in the unstable branch.

Not sure what exactly you mean by this sentence. 

> When writing a script that is part of a production pipeline, or a sweave
> paper that should be reproducible 10 years from now, or a book on
> using R, you use a stable version of R, which is guaranteed to behave
> the same over time.

Only if you never upgrade anything... But that's the case already, isn't it?


> However when developing packages that should be
> compatible with the upcoming release of R, you use r-devel which has
> the latest versions of other CRAN and base packages.
> 
> 
>> What I'm trying to figure out is why the standard "install the following 
>> list of package versions" isn't good enough in your eyes?
> 
> Almost nobody does this because it is cumbersome and impractical. We
> can do so much better than this. Note that in order to install old
> packages you also need to investigate which versions of dependencies
> of those packages were used. On win/osx, users need to manually build
> those packages which can be a pain. All in all it makes reproducible
> research difficult and expensive and error prone. At the end of the
> day most published results obtained with R just won't be reproducible.

So you want CRAN to host old binaries ad infinitum? I think that's entirely 
reasonable/doable if (big if) storage and network are free. 

> 
> Also I believe that keeping it simple is essential for solutions to be
> practical. If every script has to be run inside an environment with
> custom libraries, it takes away much of its power. Running a bash or
> python script in Linux is so easy and reliable that entire
> distributions are based on it. I don't understand why we make our
> lives so difficult in R.

Because for a Debian-style (stop-the-world-on-release) distro, there are no 
upgrades within a release. And that's only halfway reasonable because of 
Debian's shockingly good QA. 

It's certainly not true for, e.g., Arch. 

I've been looking at python incompatibilities across different RHEL versions 
lately. There's simply no way to get around explicit version pinning (either by 
release number or date, but when you have many moving pieces, picking a set of 
release numbers is much easier than finding a single day when they all happened 
to work together) if it has to work exactly as it used to. 

> 
> In my estimation, a system where stable versions of R pull packages
> from a stable branch of CRAN will naturally resolve the majority of
> the reproducibility and reliability problems with R.

And what everyone else is saying is "if you want to reproduce results made with 
old software, download and use the old software." Both can be made to work -- 
it's just a matter of pros and cons of different defaults. 


> And in contrast
> to what some people here are suggesting it does not introduce any
> limitations. If you want to get the latest stuff, you either grab a
> copy of r-devel, or just enable the testing branch and off you go.
> Debian 'testing' works in a similar way, see
> http://www.debian.org/devel/testing.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] [RFC] A case for freezing CRAN

2014-03-19 Thread Jeroen Ooms
On Wed, Mar 19, 2014 at 6:55 PM, Michael Weylandt
 wrote:
> Reading this thread again, is it a fair summary of your position to say 
> "reproducibility by default is more important than giving users access to the 
> newest bug fixes and features by default?" It's certainly arguable, but I'm 
> not sure I'm convinced: I'd imagine that the ratio of new work being done vs 
> reproductions is rather high and the current setup optimizes for that already.

I think that separating development from released branches can give us
both reliability/reproducibility (stable branch) as well as new
features (unstable branch). The user gets to pick (and you can pick
both!). The same is true for r-base: when using a 'released' version
you get 'stable' base packages that are up to 12 months old. If you
want to have the latest stuff you download a nightly build of r-devel.
For regular users and reproducible research it is recommended to use
the stable branch. However if you are a developer (e.g. package
author) you might want to develop/test/check your work with the latest
r-devel.

I think that extending the R release cycle to CRAN would result both
in more stable released versions of R, as well as more freedom for
package authors to implement rigorous change in the unstable branch.
When writing a script that is part of a production pipeline, or a sweave
paper that should be reproducible 10 years from now, or a book on
using R, you use a stable version of R, which is guaranteed to behave
the same over time. However when developing packages that should be
compatible with the upcoming release of R, you use r-devel which has
the latest versions of other CRAN and base packages.


> What I'm trying to figure out is why the standard "install the following list 
> of package versions" isn't good enough in your eyes?

Almost nobody does this because it is cumbersome and impractical. We
can do so much better than this. Note that in order to install old
packages you also need to investigate which versions of dependencies
of those packages were used. On win/osx, users need to manually build
those packages which can be a pain. All in all it makes reproducible
research difficult and expensive and error prone. At the end of the
day most published results obtained with R just won't be reproducible.

Also I believe that keeping it simple is essential for solutions to be
practical. If every script has to be run inside an environment with
custom libraries, it takes away much of its power. Running a bash or
python script in Linux is so easy and reliable that entire
distributions are based on it. I don't understand why we make our
lives so difficult in R.

In my estimation, a system where stable versions of R pull packages
from a stable branch of CRAN will naturally resolve the majority of
the reproducibility and reliability problems with R. And in contrast
to what some people here are suggesting it does not introduce any
limitations. If you want to get the latest stuff, you either grab a
copy of r-devel, or just enable the testing branch and off you go.
Debian 'testing' works in a similar way, see
http://www.debian.org/devel/testing.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] [RFC] A case for freezing CRAN

2014-03-19 Thread Michael Weylandt


On Mar 19, 2014, at 22:17, Gavin Simpson  wrote:

> Michael,
> 
> I think the issue is that Jeroen wants to take that responsibility out
> of the hands of the person trying to reproduce a work. If it used R
> 3.0.x and packages A, B and C then it would be trivial to install
> that version of R and then pull down the stable versions of A B and C
> for that version of R. At the moment, one might note the packages used
> and even their versions, but what about the versions of the packages
> that the used packages rely upon & so on? What if developers don't
> state known working versions of dependencies?

Doesn't sessionInfo() give all of this?

If you want to be very worried about every last bit, I suppose it should also 
include options(), compiler flags, compiler version, BLAS details, etc.  (Good 
talk on the dregs of a floating point number and how hard it is to reproduce 
them across processors http://www.youtube.com/watch?v=GIlp4rubv8U)
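
E.g. a minimal sketch of recording it alongside one's results (the file
name is illustrative):

    si <- sessionInfo()
    capture.output(si, file = "sessionInfo.txt")  # full record of R + packages
    sapply(si$otherPkgs, function(p) p$Version)   # just the attached versions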

> 
> The problem is how the heck do you know which versions of packages are
> needed if developers don't record these dependencies in sufficient
> detail? The suggested solution is to freeze CRAN at intervals
> alongside R releases. Then you'd know what the stable versions were.

Only if you knew which R release was used. 

> 
> Or we could just get package developers to be more thorough in
> documenting dependencies. Or R CMD check could refuse to pass if a
> package is listed as a dependency but with no version qualifiers. Or
> have R CMD build add an upper bound (from the current, at build-time
> version of dependencies on CRAN) if the package developer didn't
> include an upper bound. Or... The first is unlikely to happen
> consistently, and no-one wants *more* checks and hoops to jump through
> :-)
> 
> To my mind it is incumbent upon those wanting reproducibility to build
> the tools to enable users to reproduce works.

But the tools already allow it with minimal effort. If the author can't even 
include session info, how can we be sure the version of R is known? If we can't 
know which version of R, can we ever change R at all? And so on, to absurdity. 

My (serious) point is that the tools are in place, but ramming them down folks' 
throats by intentionally keeping them on older versions by default is too much. 

> When you write a paper
> or release a tool, you will have tested it with a specific set of
> packages. It is relatively easy to work out what those versions are
> (there are tools in R for this). What is required is an automated way
> to record that info in an agreed upon way in an approved
> file/location, and have a tool that facilitates setting up a package
> library sufficient with which to reproduce a work. That approval
> doesn't need to come from CRAN or R Core - we can store anything in
> ./inst.

I think the package version and published paper cases are different. 

For the latter, the recipe is simple: if you want the same results, use the 
same software (as noted by sessionInfoPlus() or equivalent)

For the former, I think you start straying into this NP-complete problem: 
http://people.debian.org/~dburrows/model.pdf 

Yes, a good config can (and should) be recorded, but isn't that exactly what 
sessionInfo() gives?

> 
> Reproducibility is a very important part of doing "science", but not
> everyone using CRAN is doing that. Why force everyone to march to the
> reproducibility drum? I would place the onus elsewhere to make this
> work.
> 

Agreed: reproducibility is the onus of the author, not the reader


> Gavin
> A scientist, very much interested in reproducibility of my work and others.

Michael
In finance, we call it "auditability" and care very much as well :-)


[[alternative HTML version deleted]]

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] [RFC] A case for freezing CRAN

2014-03-19 Thread Gavin Simpson
Michael,

I think the issue is that Jeroen wants to take that responsibility out
of the hands of the person trying to reproduce a work. If it used R
3.0.x and packages A, B and C then it would be trivial to install
that version of R and then pull down the stable versions of A B and C
for that version of R. At the moment, one might note the packages used
and even their versions, but what about the versions of the packages
that the used packages rely upon & so on? What if developers don't
state known working versions of dependencies?

The problem is how the heck do you know which versions of packages are
needed if developers don't record these dependencies in sufficient
detail? The suggested solution is to freeze CRAN at intervals
alongside R releases. Then you'd know what the stable versions were.

Or we could just get package developers to be more thorough in
documenting dependencies. Or R CMD check could refuse to pass if a
package is listed as a dependency but with no version qualifiers. Or
have R CMD build add an upper bound (from the current, at build-time
version of dependencies on CRAN) if the package developer didn't
include an upper bound. Or... The first is unlikely to happen
consistently, and no-one wants *more* checks and hoops to jump through
:-)

To my mind it is incumbent upon those wanting reproducibility to build
the tools to enable users to reproduce works. When you write a paper
or release a tool, you will have tested it with a specific set of
packages. It is relatively easy to work out what those versions are
(there are tools in R for this). What is required is an automated way
to record that info in an agreed upon way in an approved
file/location, and have a tool that facilitates setting up a package
library sufficient with which to reproduce a work. That approval
doesn't need to come from CRAN or R Core - we can store anything in
./inst.
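
A minimal sketch of such a recording step (the file path and the
"pkg (== version)" format are illustrative, not an agreed convention):

    ## Snapshot the installed library state into a file shipped
    ## with the package or paper sources.
    v <- installed.packages()[, "Version"]
    writeLines(paste0(names(v), " (== ", v, ")"),
               "inst/package-versions.txt")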

Reproducibility is a very important part of doing "science", but not
everyone using CRAN is doing that. Why force everyone to march to the
reproducibility drum? I would place the onus elsewhere to make this
work.

Gavin
A scientist, very much interested in reproducibility of my work and others.

On 19 March 2014 19:55, Michael Weylandt  wrote:
>
>
> On Mar 19, 2014, at 18:42, Joshua Ulrich  wrote:
>
>> On Wed, Mar 19, 2014 at 5:16 PM, Jeroen Ooms  
>> wrote:
>>> On Wed, Mar 19, 2014 at 2:59 PM, Joshua Ulrich  
>>> wrote:

 So implementation isn't a problem.  The problem is that you need a way
 to force people not to be able to use different package versions than
 what existed at the time of each R release.  I said this in my
 previous email, but you removed and did not address it: "However, you
 would need to find a way to actively _prevent_ people from installing
 newer versions of packages with the stable R releases."  Frankly, I
 would stop using CRAN if this policy were adopted.
>>>
>>> I am not proposing to "force" anything to anyone, those are your
>>> words. Please read the proposal more carefully before derailing the
>>> discussion. Below *verbatim* a section from the paper:
>> 
>>
>> Yes "force" is too strong a word.  You want a barrier (however small)
>> to prevent people from installing newer (or older) versions of
>> packages than those that correspond to a given R release.
>
>
> Jeroen,
>
> Reading this thread again, is it a fair summary of your position to say 
> "reproducibility by default is more important than giving users access to the 
> newest bug fixes and features by default?" It's certainly arguable, but I'm 
> not sure I'm convinced: I'd imagine that the ratio of new work being done vs 
> reproductions is rather high and the current setup optimizes for that already.
>
> What I'm trying to figure out is why the standard "install the following list 
> of package versions" isn't good enough in your eyes? Is it the lack of CRAN 
> provided binaries or the fact that the user has to proactively set up their 
> environment to replicate that of published results?
>
> In your XML example, it seems the problem was that the reproducer didn't 
> check that they had the same package versions as the reproducee, and instead 
> assumed that 'latest' would be the same. Annoying, yes, but easy to solve.
>
> Michael
>
> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel



-- 
Gavin Simpson, PhD

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] [RFC] A case for freezing CRAN

2014-03-19 Thread Michael Weylandt


On Mar 19, 2014, at 18:42, Joshua Ulrich  wrote:

> On Wed, Mar 19, 2014 at 5:16 PM, Jeroen Ooms  
> wrote:
>> On Wed, Mar 19, 2014 at 2:59 PM, Joshua Ulrich  
>> wrote:
>>> 
>>> So implementation isn't a problem.  The problem is that you need a way
>>> to force people not to be able to use different package versions than
>>> what existed at the time of each R release.  I said this in my
>>> previous email, but you removed and did not address it: "However, you
>>> would need to find a way to actively _prevent_ people from installing
>>> newer versions of packages with the stable R releases."  Frankly, I
>>> would stop using CRAN if this policy were adopted.
>> 
>> I am not proposing to "force" anything to anyone, those are your
>> words. Please read the proposal more carefully before derailing the
>> discussion. Below *verbatim* a section from the paper:
> 
> 
> Yes "force" is too strong a word.  You want a barrier (however small)
> to prevent people from installing newer (or older) versions of
> packages than those that correspond to a given R release.


Jeroen,

Reading this thread again, is it a fair summary of your position to say 
"reproducibility by default is more important than giving users access to the 
newest bug fixes and features by default?" It's certainly arguable, but I'm not 
sure I'm convinced: I'd imagine that the ratio of new work being done vs 
reproductions is rather high and the current setup optimizes for that already. 

What I'm trying to figure out is why the standard "install the following list 
of package versions" isn't good enough in your eyes? Is it the lack of CRAN 
provided binaries or the fact that the user has to proactively set up their 
environment to replicate that of published results?

In your XML example, it seems the problem was that the reproducer didn't check 
that they had the same package versions as the reproducee, and instead assumed 
that 'latest' would be the same. Annoying, yes, but easy to solve. 

Michael

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] [RFC] A case for freezing CRAN

2014-03-19 Thread Gavin Simpson
Given that R has moved to a 12-month release cycle, I don't want
to either i) wait a year to get new packages (or allow users to use
new versions of my packages), or ii) have to run R-devel just to use
new packages. (or be on R-testing for that matter).

People then will start finding ways around these limitations and then
we're back to square one of having people use a set of R packages and
R versions that could potentially be all over the place.

As a package developer, it is pretty easy to say that I've tested that my
package works with these other packages and their versions, and to set
DESCRIPTION to reflect only those versions as allowed (or a range, as a
package matures and the maintainer has tested against more versions of
the dependencies). CRAN may well not like this if your package no
longer builds/checks on their system, but then you have a choice to
make: stick to your reproducibility guns & forsake CRAN in favour of
something else (github, one's own repo), or relent and meet CRAN's
requirements.
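
For example (a sketch; "bar" and the version numbers are illustrative, and
Writing R Extensions allows a package to be listed twice to give both a
lower and an upper bound):

    Depends: R (>= 3.0.0)
    Imports: bar (>= 1.2), bar (<= 1.8)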

On 19 March 2014 16:57, Hervé Pagès  wrote:
>
>
> On 03/19/2014 02:59 PM, Joshua Ulrich wrote:
>>
>> On Wed, Mar 19, 2014 at 4:28 PM, Jeroen Ooms 
>> wrote:
>>>
>>> On Wed, Mar 19, 2014 at 11:50 AM, Joshua Ulrich 
>>> wrote:


 The suggested solution is not described in the referenced article.  It
 was not suggested that it be the operating system's responsibility to
 distribute snapshots, nor was it suggested to create binary
 repositories for specific operating systems, nor was it suggested to
 freeze only a subset of CRAN packages.
>>>
>>>
>>>
>>> IMO this is an implementation detail. If we could all agree on a
>>> particular
>>> set of cran packages to be used with a certain release of R, then it
>>> doesn't
>>> matter how the 'snapshotting' gets implemented. It could be a separate
>>> repository, or a directory on cran with symbolic links, or a page
>>> somewhere
>>> with hyperlinks to the respective source packages. Or you can put all
>>> packages in a big zip file, or include it in your OS distribution. You
>>> can
>>> even distribute your entire repo on cdroms (debian style!) or do all of
>>> the
>>> above.
>>>
>>> The hard problem is not implementation. The hard part is that for
>>> reproducibility to work, we need community wide conventions on which
>>> versions of cran packages are used by a particular release of R. Local
>>> downstream solutions are impractical, because this results in
>>> scripts/packages that only work within your niche using this particular
>>> snapshot. I expect that requiring every script be executed in the context
>>> of
>>> dependencies from some particular third party repository will make
>>> reproducibility even less common. Therefore I am trying to make a case
>>> for a
>>> solution that would naturally improve reliability/reproducibility of R
>>> code
>>> without any effort by the end-user.
>>>
>> So implementation isn't a problem.  The problem is that you need a way
>> to force people not to be able to use different package versions than
>> what existed at the time of each R release.  I said this in my
>> previous email, but you removed and did not address it: "However, you
>> would need to find a way to actively _prevent_ people from installing
>> newer versions of packages with the stable R releases."  Frankly, I
>> would stop using CRAN if this policy were adopted.
>>
>> I suggest you go build this yourself.  You have all the code available
>> on CRAN, and the dates at which each package was published.  If others
>> who care about reproducible research find what you've built useful,
>> you will create the very community you want.  And you won't have to
>> force one single person to change their workflow.
>
>
> Yeah we've already heard this "do it yourself" kind of answer. Not a
> very productive one honestly.
>
> Well actually that's what we've done for the Bioconductor repositories:
> we freeze the BioC packages for each version of Bioconductor. But since
> this freezing doesn't happen at the CRAN level, and many BioC packages
> depend on CRAN packages, the freezing is only at the surface. Would be
> much better if the freezing was all the way down to the bottom of the
> sea. (Note that it is already if you install binary packages only.)
>
> Yes it's technically possible to work around this by also hosting
> frozen versions of CRAN, one per version of Bioconductor, and have
> biocLite() (the tool BioC users use for installing packages) point to
> these frozen versions of CRAN in order to get the correct dependencies
> for any given version of BioC. However we don't do that because that
> would mean extra costs for us in terms of storage space and bandwidth.
> And also because we believe that it would be more effective and would
> ultimately benefit the entire R community (and not just the BioC
> community) if this problem was addressed upstream.
>
>
> H.
>
>>
>> Best,
>> --
>> Joshua Ulrich  |  about.me/joshuaulrich
>> FOSS Trading  |  www

Re: [Rd] [RFC] A case for freezing CRAN

2014-03-19 Thread Gavin Simpson
"What am I overlooking?"

That this is already available and possible in R today, but perhaps
not widely used. Developers do tend to only include a lower bound if
they include any bounds at all on package dependencies.

As I mentioned elsewhere, R packages often aren't "built" against
other R packages and often developers may have a range of versions
being tested against, some of which may not be on CRAN yet.

Technically, all packages on CRAN would need to have a dependency cap
on R-devel, but as that is a moving target until it is released I
don't see in practice how enforcing an upper limit on the R dependency
would work. The way CRAN works, you can't just set a dependency on R
== 3.0.x say. (As far as I understand CRAN's policies.)

For packages it is quite trivial for the developers to manually add
the required info for the upper bound, less so the lower bound, but you
could just pick a known working version. An upper range on the
dependencies could be stated as whatever version is current on CRAN.
But then what happens? Unbeknownst to you, a few days after you
release to CRAN your package foo with stated dependency on bar >= 1.2,
bar <= 1.8, the developer of bar releases bar v 2.0 and your package
no longer passes checks, CRAN gets in touch and you have to resubmit
another version. This could be desirable in terms of helping
contribute to reproducibility exercises, but incurs more effort on the
CRAN maintainers and package maintainers. Now, this might be an issue
because of the desire on CRAN's behalf to have some elements of human
intervention in the submission process, but you either work with CRAN
or do your own thing.
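
Working out the versions one is currently building against is indeed
trivial; a sketch ("foo" is a placeholder package name, and its
dependencies must be installed locally):

    db   <- available.packages()
    deps <- tools::package_dependencies("foo", db = db,
                which = c("Depends", "Imports"))[["foo"]]
    sapply(deps, function(p) as.character(packageVersion(p)))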

As Bioconductor have shown (for example) it is possible, if people
want to put in time and effort and have a community buy into an ethos,
to achieve staged releases etc.

G

On 19 March 2014 12:58, Carl Boettiger  wrote:
> Dear list,
>
> I'm curious what people would think of a more modest proposal at this time:
>
> State the version of the dependencies used by the package authors when the
> package was built.
>
> Eventually CRAN could enforce such a statement be present in the
> description. We encourage users to declare the version of the packages they
> use in publications, so why not have the same expectation of developers?
>  This would help address the problem of archived packages that Jeroen
> raises, as it is currently it is impossible to reliably install archived
> packages because their dependencies have since been updated and are no
> longer compatible.  (Even if it passes checks and installs, we have no way
> of knowing if the upstream changes have introduced a bug).  This
> information would be relatively straightforward to capture, shouldn't
> change the way anyone currently uses CRAN, and should address a major pain
> point anyone trying to install archived versions from CRAN has probably
> encountered.  What am I overlooking?
>
> Carl
>
>
> On Wed, Mar 19, 2014 at 11:36 AM, Spencer Graves <
> spencer.gra...@structuremonitoring.com> wrote:
>
>>   What about having this purpose met with something like an expansion
>> of R-Forge?  We could have packages submitted to R-Forge rather than CRAN,
>> and people who wanted the latest could get it from R-Forge.  If changes I
>> make on R-Forge break a reverse dependency, emails explaining the problem
>> are sent to both me and the maintainer for the package I broke.
>>
>>
>>   The budget for R-Forge would almost certainly need to be increased:
>>  They currently disable many of the tests they once ran.
>>
>>
>>   Regarding budget, the R Project would get more donations if they
>> asked for them and made it easier to contribute.  I've tried multiple times
>> without success to find a way to donate.  I didn't try hard, but it
>> shouldn't be hard ;-)  (And donations should be accepted in US dollars and
>> Euros -- and maybe other currencies.) There should be a procedure whereby
>> anyone could receive a pro forma invoice, which they can pay or ignore as
>> they choose.  I mention this, because many grants could cover a reasonable
>> fee provided they have an invoice.
>>
>>
>>   Spencer Graves
>>
>>
>> On 3/19/2014 10:59 AM, Jeroen Ooms wrote:
>>
>>> On Wed, Mar 19, 2014 at 5:52 AM, Duncan Murdoch wrote:
>>>
>>>> I don't see why CRAN needs to be involved in this effort at all.  A third
>>>> party could take snapshots of CRAN at R release dates, and make those
>>>> available to package users in a separate repository.  It is not hard to set
>>>> a different repository than CRAN as the default location from which to
>>>> obtain packages.
>>>
>>> I am happy to see many people giving this some thought and engaging in the
>>> discussion.
>>>
>>> Several have suggested that staging & freezing can be simply done by a
>>> third party. This solution and its limitations are also described in the
>>> paper [1] in the section titled "R: downstream staging and repackaging".
>>>
>>> If this would solve the problem with

Re: [Rd] [RFC] A case for freezing CRAN

2014-03-19 Thread Romain Francois
Weighing in. FWIW, I find the proposal conceptually quite interesting. 

For package developers, it does not have to be a frustration to have to wait for a 
new version of R to release their code. Anticipated frustration was my initial 
reaction. Thinking about this more, I think this could be turned into an 
opportunity. 

Since the pattern here is to use Rcpp as an example of something causing 
compatibility headaches, and I have some responsibility there, maybe I can 
comment on this. I would find it extremely valuable if there were only one 
version of Rcpp for a given released version of R. 

Users would have to wait longer to have the new stuff, but one can argue that 
at least they get something that is more tested. 

Would it be helpful for authors of packages that have lots of dependencies to 
start having stricter Depends declarations in their DESCRIPTION files, e.g.: 

Depends: R (== 3.1.0)

?

Romain


For example, personally I’m waiting for 3.1.0 to release Rcpp11 because I 
want to leverage some C++11 support that has been included in R. It has been 
frustrating to have to wait, but it does change the way I make changes to the 
codebase. Perhaps it is a good habit to adopt. And it does not require « more work 
» from others, just more discipline and self-control from people implementing 
this pattern. 

Also, declaring a strict dependency requirement against a released version of R 
could perhaps resolve the drama of « you were asked to test this against a very 
recent version of R-devel, and guess what, a few hours ago I’ve just added a new 
test that makes your package non R CMD check worthy ». So less work for CRAN 
maintainers then. 

On 19 March 2014 at 23:57, Hervé Pagès wrote:

> 
> 
> On 03/19/2014 02:59 PM, Joshua Ulrich wrote:
>> On Wed, Mar 19, 2014 at 4:28 PM, Jeroen Ooms  
>> wrote:
>>> On Wed, Mar 19, 2014 at 11:50 AM, Joshua Ulrich 
>>> wrote:
 
 The suggested solution is not described in the referenced article.  It
 was not suggested that it be the operating system's responsibility to
 distribute snapshots, nor was it suggested to create binary
 repositories for specific operating systems, nor was it suggested to
 freeze only a subset of CRAN packages.
>>> 
>>> 
>>> IMO this is an implementation detail. If we could all agree on a particular
>>> set of cran packages to be used with a certain release of R, then it doesn't
>>> matter how the 'snapshotting' gets implemented. It could be a separate
>>> repository, or a directory on cran with symbolic links, or a page somewhere
>>> with hyperlinks to the respective source packages. Or you can put all
>>> packages in a big zip file, or include it in your OS distribution. You can
>>> even distribute your entire repo on cdroms (debian style!) or do all of the
>>> above.
>>> 
>>> The hard problem is not implementation. The hard part is that for
>>> reproducibility to work, we need community wide conventions on which
>>> versions of cran packages are used by a particular release of R. Local
>>> downstream solutions are impractical, because this results in
>>> scripts/packages that only work within your niche using this particular
>>> snapshot. I expect that requiring every script be executed in the context of
>>> dependencies from some particular third party repository will make
>>> reproducibility even less common. Therefore I am trying to make a case for a
>>> solution that would naturally improve reliability/reproducibility of R code
>>> without any effort by the end-user.
>>> 
>> So implementation isn't a problem.  The problem is that you need a way
>> to force people not to be able to use different package versions than
>> what existed at the time of each R release.  I said this in my
>> previous email, but you removed and did not address it: "However, you
>> would need to find a way to actively _prevent_ people from installing
>> newer versions of packages with the stable R releases."  Frankly, I
>> would stop using CRAN if this policy were adopted.
>> 
>> I suggest you go build this yourself.  You have all the code available
>> on CRAN, and the dates at which each package was published.  If others
>> who care about reproducible research find what you've built useful,
>> you will create the very community you want.  And you won't have to
>> force one single person to change their workflow.
> 
> Yeah we've already heard this "do it yourself" kind of answer. Not a
> very productive one honestly.
> 
> Well actually that's what we've done for the Bioconductor repositories:
> we freeze the BioC packages for each version of Bioconductor. But since
> this freezing doesn't happen at the CRAN level, and many BioC packages
> depend on CRAN packages, the freezing is only at the surface. Would be
> much better if the freezing was all the way down to the bottom of the
> sea. (Note that it is already if you install binary packages only.)
> 
> Yes it's technically possible to work around this by also hosting
> frozen versions of CRAN, one per version of Bioconductor.

Re: [Rd] [RFC] A case for freezing CRAN

2014-03-19 Thread Hervé Pagès



On 03/19/2014 02:59 PM, Joshua Ulrich wrote:

On Wed, Mar 19, 2014 at 4:28 PM, Jeroen Ooms  wrote:

On Wed, Mar 19, 2014 at 11:50 AM, Joshua Ulrich 
wrote:


The suggested solution is not described in the referenced article.  It
was not suggested that it be the operating system's responsibility to
distribute snapshots, nor was it suggested to create binary
repositories for specific operating systems, nor was it suggested to
freeze only a subset of CRAN packages.



IMO this is an implementation detail. If we could all agree on a particular
set of cran packages to be used with a certain release of R, then it doesn't
matter how the 'snapshotting' gets implemented. It could be a separate
repository, or a directory on cran with symbolic links, or a page somewhere
with hyperlinks to the respective source packages. Or you can put all
packages in a big zip file, or include it in your OS distribution. You can
even distribute your entire repo on cdroms (debian style!) or do all of the
above.

The hard problem is not implementation. The hard part is that for
reproducibility to work, we need community wide conventions on which
versions of cran packages are used by a particular release of R. Local
downstream solutions are impractical, because this results in
scripts/packages that only work within your niche using this particular
snapshot. I expect that requiring every script be executed in the context of
dependencies from some particular third party repository will make
reproducibility even less common. Therefore I am trying to make a case for a
solution that would naturally improve reliability/reproducibility of R code
without any effort by the end-user.


So implementation isn't a problem.  The problem is that you need a way
to force people not to be able to use different package versions than
what existed at the time of each R release.  I said this in my
previous email, but you removed and did not address it: "However, you
would need to find a way to actively _prevent_ people from installing
newer versions of packages with the stable R releases."  Frankly, I
would stop using CRAN if this policy were adopted.

I suggest you go build this yourself.  You have all the code available
on CRAN, and the dates at which each package was published.  If others
who care about reproducible research find what you've built useful,
you will create the very community you want.  And you won't have to
force one single person to change their workflow.


Yeah we've already heard this "do it yourself" kind of answer. Not a
very productive one honestly.

Well actually that's what we've done for the Bioconductor repositories:
we freeze the BioC packages for each version of Bioconductor. But since
this freezing doesn't happen at the CRAN level, and many BioC packages
depend on CRAN packages, the freezing is only at the surface. Would be
much better if the freezing was all the way down to the bottom of the
sea. (Note that it is already if you install binary packages only.)

Yes it's technically possible to work around this by also hosting
frozen versions of CRAN, one per version of Bioconductor, and have
biocLite() (the tool BioC users use for installing packages) point to
these frozen versions of CRAN in order to get the correct dependencies
for any given version of BioC. However we don't do that because that
would mean extra costs for us in terms of storage space and bandwidth.
And also because we believe that it would be more effective and would
ultimately benefit the entire R community (and not just the BioC
community) if this problem was addressed upstream.

H.



Best,
--
Joshua Ulrich  |  about.me/joshuaulrich
FOSS Trading  |  www.fosstrading.com

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel



--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpa...@fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] [RFC] A case for freezing CRAN

2014-03-19 Thread Joshua Ulrich
On Wed, Mar 19, 2014 at 5:16 PM, Jeroen Ooms  wrote:
> On Wed, Mar 19, 2014 at 2:59 PM, Joshua Ulrich  
> wrote:
>>
>> So implementation isn't a problem.  The problem is that you need a way
>> to force people not to be able to use different package versions than
>> what existed at the time of each R release.  I said this in my
>> previous email, but you removed and did not address it: "However, you
>> would need to find a way to actively _prevent_ people from installing
>> newer versions of packages with the stable R releases."  Frankly, I
>> would stop using CRAN if this policy were adopted.
>
> I am not proposing to "force" anything to anyone, those are your
> words. Please read the proposal more carefully before derailing the
> discussion. Below *verbatim* a section from the paper:
>


Yes "force" is too strong a word.  You want a barrier (however small)
to prevent people from installing newer (or older) versions of
packages than those that correspond to a given R release.

I still think you're going to have a very hard time convincing CRAN
maintainers to take up your cause, even if you were to build support
for it.  Especially because there's nothing stopping anyone else from
doing it.

--
Joshua Ulrich  |  about.me/joshuaulrich
FOSS Trading  |  www.fosstrading.com

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] [RFC] A case for freezing CRAN

2014-03-19 Thread Jeroen Ooms
On Wed, Mar 19, 2014 at 2:59 PM, Joshua Ulrich  wrote:
>
> So implementation isn't a problem.  The problem is that you need a way
> to force people not to be able to use different package versions than
> what existed at the time of each R release.  I said this in my
> previous email, but you removed and did not address it: "However, you
> would need to find a way to actively _prevent_ people from installing
> newer versions of packages with the stable R releases."  Frankly, I
> would stop using CRAN if this policy were adopted.

I am not proposing to "force" anything to anyone, those are your
words. Please read the proposal more carefully before derailing the
discussion. Below *verbatim* a section from the paper:

To fully make the transition to a staged CRAN, the default behavior of
the package manager must be modified to download packages from the
stable branch of the current version of R, rather than the latest
development release. As such, all users on a given version of R will
be using the same version of each CRAN package, regardless of when it
was installed. The user could still be given an option to try and
install the development version from the unstable branch, for example
by adding an additional parameter to install.packages named
devel=TRUE. However when installing an unstable package, it must be
flagged, and the user must be warned that this version is not properly
tested and might not be working as expected. Furthermore, when loading
this package a warning could be shown with the version number so that
it is also obvious from the output that results were produced using a
non-standard version of the contributed package. Finally, users that
would always like to use the very latest versions of all packages,
e.g. developers, could install the r-devel release of R. This version
contains the latest commits by R Core and downloads packages from the
devel branch on CRAN, but should not be used in production or
reproducible research settings.
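
A user-level sketch of that interface (not from the paper; the branch URLs
are hypothetical, and install.packages() itself has no 'devel' argument
today):

    ## Wrapper emulating the proposed default: stable branch per R
    ## release, unstable branch only on explicit request.
    install.packages2 <- function(pkgs, devel = FALSE, ...) {
      rver   <- sub("\\.[0-9]+$", "", as.character(getRversion()))  # e.g. "3.1"
      branch <- if (devel) "devel" else "stable"
      repo   <- sprintf("http://cran.r-project.org/%s/%s", branch, rver)
      if (devel)
        warning("installing untested packages from the unstable branch")
      install.packages(pkgs, repos = repo, ...)
    }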

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] [RFC] A case for freezing CRAN

2014-03-19 Thread Dan Tenenbaum


- Original Message -
> From: "Joshua Ulrich" 
> To: "Jeroen Ooms" 
> Cc: "r-devel" 
> Sent: Wednesday, March 19, 2014 2:59:53 PM
> Subject: Re: [Rd] [RFC] A case for freezing CRAN
> 
> On Wed, Mar 19, 2014 at 4:28 PM, Jeroen Ooms
>  wrote:
> > On Wed, Mar 19, 2014 at 11:50 AM, Joshua Ulrich
> > 
> > wrote:
> >>
> >> The suggested solution is not described in the referenced article.
> >>  It
> >> was not suggested that it be the operating system's responsibility
> >> to
> >> distribute snapshots, nor was it suggested to create binary
> >> repositories for specific operating systems, nor was it suggested
> >> to
> >> freeze only a subset of CRAN packages.
> >
> >
> > IMO this is an implementation detail. If we could all agree on a
> > particular
> > set of cran packages to be used with a certain release of R, then
> > it doesn't
> > matter how the 'snapshotting' gets implemented. It could be a
> > separate
> > repository, or a directory on cran with symbolic links, or a page
> > somewhere
> > with hyperlinks to the respective source packages. Or you can put
> > all
> > packages in a big zip file, or include it in your OS distribution.
> > You can
> > even distribute your entire repo on cdroms (debian style!) or do
> > all of the
> > above.
> >
> > The hard problem is not implementation. The hard part is that for
> > reproducibility to work, we need community wide conventions on
> > which
> > versions of cran packages are used by a particular release of R.
> > Local
> > downstream solutions are impractical, because this results in
> > scripts/packages that only work within your niche using this
> > particular
> > snapshot. I expect that requiring every script be executed in the
> > context of
> > dependencies from some particular third party repository will make
> > reproducibility even less common. Therefore I am trying to make a
> > case for a
> > solution that would naturally improve reliability/reproducibility
> > of R code
> > without any effort by the end-user.
> >
> So implementation isn't a problem.  The problem is that you need a
> way
> to force people not to be able to use different package versions than
> what existed at the time of each R release.  I said this in my
> previous email, but you removed and did not address it: "However, you
> would need to find a way to actively _prevent_ people from installing
> newer versions of packages with the stable R releases."  Frankly, I
> would stop using CRAN if this policy were adopted.
> 

I don't see how the proposal forces anyone to do anything. If you have an old 
version of R and you still want to install newer versions of packages, you can 
download them from their CRAN landing page. As I understand it, the proposal 
only addresses what packages would be installed **by default** for a given 
version of R.

People would be free to override those default settings (by downloading newer 
packages as described above) but they should then not expect to be able to 
reproduce an earlier analysis since they'll have the wrong package versions. If 
they don't care, that's fine (provided that no other problems arise, such as 
the newer package depending on a feature of R that doesn't exist in the version 
you're running).
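
For instance, assuming the devtools package's install_version(), which
pulls a given version from the archive (package name and version number
are illustrative):

    library(devtools)
    install_version("XML", version = "3.95-0.2",
                    repos = "http://cran.r-project.org")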

Dan

> I suggest you go build this yourself.  You have all the code
> available
> on CRAN, and the dates at which each package was published.  If
> others
> who care about reproducible research find what you've built useful,
> you will create the very community you want.  And you won't have to
> force one single person to change their workflow.
> 
> Best,
> --
> Joshua Ulrich  |  about.me/joshuaulrich
> FOSS Trading  |  www.fosstrading.com
> 
> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] [RFC] A case for freezing CRAN

2014-03-19 Thread Joshua Ulrich
On Wed, Mar 19, 2014 at 4:28 PM, Jeroen Ooms  wrote:
> On Wed, Mar 19, 2014 at 11:50 AM, Joshua Ulrich 
> wrote:
>>
>> The suggested solution is not described in the referenced article.  It
>> was not suggested that it be the operating system's responsibility to
>> distribute snapshots, nor was it suggested to create binary
>> repositories for specific operating systems, nor was it suggested to
>> freeze only a subset of CRAN packages.
>
>
> IMO this is an implementation detail. If we could all agree on a particular
> set of cran packages to be used with a certain release of R, then it doesn't
> matter how the 'snapshotting' gets implemented. It could be a separate
> repository, or a directory on cran with symbolic links, or a page somewhere
> with hyperlinks to the respective source packages. Or you can put all
> packages in a big zip file, or include it in your OS distribution. You can
> even distribute your entire repo on cdroms (debian style!) or do all of the
> above.
>
> The hard problem is not implementation. The hard part is that for
> reproducibility to work, we need community wide conventions on which
> versions of cran packages are used by a particular release of R. Local
> downstream solutions are impractical, because this results in
> scripts/packages that only work within your niche using this particular
> snapshot. I expect that requiring every script be executed in the context of
> dependencies from some particular third party repository will make
> reproducibility even less common. Therefore I am trying to make a case for a
> solution that would naturally improve reliability/reproducibility of R code
> without any effort by the end-user.
>
So implementation isn't a problem.  The problem is that you need a way
to force people not to be able to use different package versions than
what existed at the time of each R release.  I said this in my
previous email, but you removed and did not address it: "However, you
would need to find a way to actively _prevent_ people from installing
newer versions of packages with the stable R releases."  Frankly, I
would stop using CRAN if this policy were adopted.

I suggest you go build this yourself.  You have all the code available
on CRAN, and the dates at which each package was published.  If others
who care about reproducible research find what you've built useful,
you will create the very community you want.  And you won't have to
force one single person to change their workflow.
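
Indeed, the existing source archive already supports this by hand; a
sketch (package and version are illustrative):

    url <- "http://cran.r-project.org/src/contrib/Archive/XML/XML_3.95-0.2.tar.gz"
    tf  <- tempfile(fileext = ".tar.gz")
    download.file(url, tf, mode = "wb")
    install.packages(tf, repos = NULL, type = "source")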

Best,
--
Joshua Ulrich  |  about.me/joshuaulrich
FOSS Trading  |  www.fosstrading.com

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] [RFC] A case for freezing CRAN

2014-03-19 Thread Jeroen Ooms
On Wed, Mar 19, 2014 at 11:50 AM, Joshua Ulrich 
wrote:
>
> The suggested solution is not described in the referenced article.  It
> was not suggested that it be the operating system's responsibility to
> distribute snapshots, nor was it suggested to create binary
> repositories for specific operating systems, nor was it suggested to
> freeze only a subset of CRAN packages.


IMO this is an implementation detail. If we could all agree on a particular
set of cran packages to be used with a certain release of R, then it
doesn't matter how the 'snapshotting' gets implemented. It could be a
separate repository, or a directory on cran with symbolic links, or a page
somewhere with hyperlinks to the respective source packages. Or you can put
all packages in a big zip file, or include it in your OS distribution. You
can even distribute your entire repo on cdroms (debian style!) or do all of
the above.

The hard problem is not implementation. The hard part is that for
reproducibility to work, we need community wide conventions on which
versions of cran packages are used by a particular release of R. Local
downstream solutions are impractical, because this results in
scripts/packages that only work within your niche using this particular
snapshot. I expect that requiring every script be executed in the context
of dependencies from some particular third party repository will make
reproducibility even less common. Therefore I am trying to make a case for
a solution that would naturally improve reliability/reproducibility of R
code without any effort by the end-user.

[[alternative HTML version deleted]]

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] [RFC] A case for freezing CRAN

2014-03-19 Thread Hervé Pagès

Hi,

On 03/19/2014 07:00 AM, Kasper Daniel Hansen wrote:

Our experience in Bioconductor is that this is a pretty hard problem.


What's hard and requires a substantial amount of human resources is to
run our build system (set up the build machines, keep up with changes
in R, babysit the builds, assist developers with build issues, etc...)

But *freezing* the CRAN packages for each version of R is *very* easy
to do. The CRAN maintainers already do it for the binary packages.
What could be the reason for not doing it for source packages too?
Maybe in prehistoric times there was this belief that a source package
was meant to remain compatible with all versions of R, present and
future, but that dream is dead and gone...

Right now the layout of the CRAN package repo is:

  ├── src
  │   └── contrib
  └── bin
      ├── windows
      │   └── contrib
      │       ├── ...
      │       ├── 3.0
      │       ├── 3.1
      │       └── ...
      └── macosx
          └── contrib
              ├── ...
              ├── 3.0
              ├── 3.1
              └── ...

when it could be:

  ├── 3.0
  │   ├── src
  │   │   └── contrib
  │   └── bin
  │       ├── windows
  │       │   └── contrib
  │       └── macosx
  │           └── contrib
  ├── 3.1
  │   ├── src
  │   │   └── contrib
  │   └── bin
  │       ├── windows
  │       │   └── contrib
  │       └── macosx
  │           └── contrib
  ├── ...

That is: the split by version is done at the top, not at the bottom.

It doesn't use more disk space than the current layout (you can just
throw the src/contrib/Archive/ folder away, there is no more need
for it).

install.packages() and family would need to be modified a little bit
to work with this new layout. And that's all!
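
For instance, the lookup of the contrib directory could key on the
running R version (a sketch only, not the actual install.packages()
internals; the URL scheme simply follows the proposed layout above):

version_contrib_url <- function(repo = "http://cran.r-project.org",
                                type = "source") {
  ## "3.0.2" -> "3.0": packages would be frozen per minor R release
  rver <- paste(R.version$major,
                sub("\\..*$", "", R.version$minor), sep = ".")
  if (type == "source")
    paste(repo, rver, "src/contrib", sep = "/")
  else
    paste(repo, rver, "bin", type, "contrib", sep = "/")
}
## version_contrib_url() would give "http://cran.r-project.org/3.0/src/contrib"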

The never-ending changes in Mac OS X binary formats can be handled
in a cleaner way, i.e. no more symlinks under bin/macosx to keep
backward compatibility with different binary formats and with old
versions of install.packages().

Then, 10 years from now, you can reproduce an analysis that you
did today with R-3.0, because when you install R-3.0 and the
packages required for this analysis, you'll end up with exactly
the same packages as today.

Cheers,
H.



What the OP presumably wants is some guarantee that all packages on CRAN
work well together.  A good example is when Rcpp was updated and broke
other packages (quick note: the Rcpp developers do an incredible amount of
work to deal with this; it is almost impossible not to have a few days of
chaos).  Ensuring this is not a trivial task, and it requires some buy-in
both from the "repository" and from the developers.

For Bioconductor it is even harder as the dependency graph of Bioconductor
is much more involved than the one for CRAN, where most packages depend
only on a few other packages.  This is why we need to do this for Bioc.

Based on my experience with CRAN I am not sure I see a need for a
coordinated release (or rather, I can sympathize with the need, but I don't
think the effort is worth it).

What would be more useful in terms of reproducibility is the capability of
installing a specific version of a package from a repository using
install.packages(), which would require archiving older versions in a
coordinated fashion. I know CRAN archives old versions, but I am not aware
if we can programmatically query the repository about this.

Best,
Kasper


On Wed, Mar 19, 2014 at 8:52 AM, Joshua Ulrich wrote:


On Tue, Mar 18, 2014 at 3:24 PM, Jeroen Ooms 
wrote:


## Summary

Extending the r-release cycle to CRAN seems like a solution that would
be easy to implement. Package updates simply only get pushed to the
r-devel branches of cran, rather than r-release and r-release-old.
This separates development from production/use in a way that is common
sense in most open source communities. Benefits for R include:


Nothing is ever as simple as it seems (especially from the perspective
of one who won't be doing the work).

There is nothing preventing you (or anyone else) from creating
repositories that do what you suggest.  Create a CRAN mirror (or more
than one) that only include the package versions you think they
should.  Then have your production servers use it (them) instead of
CRAN.

Better yet, make those repositories public.  If many people like your
idea, they will use your new repositories instead of CRAN.  There is
no reason to impose this change on all world-wide CRAN users.

Best,
--
Joshua Ulrich  |  about.me/joshuaulrich
FOSS Trading  |  www.fosstrading.com

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel



[[alternative HTML version deleted]]

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel



--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98

Re: [Rd] [RFC] A case for freezing CRAN

2014-03-19 Thread Jeroen Ooms
On Wed, Mar 19, 2014 at 7:00 AM, Kasper Daniel Hansen
 wrote:
> Our experience in Bioconductor is that this is a pretty hard problem.
>
> What the OP presumably wants is some guarantee that all packages on CRAN work 
> well together.

Obviously we cannot guarantee that all packages on CRAN work
together. But what we can do is prevent problems that are introduced
by version ambiguity. If an author develops and tests a script/package
against Rcpp 0.10.6, the best chance of making that script or
package work for other users is to use Rcpp 0.10.6.

This especially holds if there is a big time difference between the
author creating the pkg/script and someone using it. In practice most
Sweave/knitr scripts used for generating papers and articles cannot
be reproduced after a while because the dependency packages have
changed in the meantime. These problems can largely be mitigated with
a release cycle.

I am not arguing that anyone should put manual effort into testing
that packages work together. On the contrary: a system that separates
development from released branches prevents you from having to
continuously test all reverse dependencies for every package update.

My argument is simply that many problems introduced by version
ambiguity can be prevented if we can unite the entire R community
around using a single version of each CRAN package for every specific
release of R. Similar to how linux distributions use a single version
of each software package in a particular release of the distribution.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] [RFC] A case for freezing CRAN

2014-03-19 Thread Carl Boettiger
Dear list,

I'm curious what people would think of a more modest proposal at this time:

State the version of the dependencies used by the package authors when the
package was built.

Eventually CRAN could enforce that such a statement be present in the
DESCRIPTION. We encourage users to declare the versions of the packages they
use in publications, so why not have the same expectation of developers?
This would help address the problem of archived packages that Jeroen
raises, as it is currently impossible to reliably install archived
packages because their dependencies have since been updated and are no
longer compatible.  (Even if a package passes checks and installs, we have
no way of knowing if the upstream changes have introduced a bug.)  This
information would be relatively straightforward to capture, shouldn't
change the way anyone currently uses CRAN, and should address a major pain
point that anyone trying to install archived versions from CRAN has probably
encountered.  What am I overlooking?
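
As a rough illustration of what such a statement could look like (the
field name "BuiltWith" is hypothetical, not an existing DESCRIPTION
field, and the versions are examples only):

Package: examplePkg
Version: 0.1.0
Imports: Rcpp, XML
BuiltWith: R (3.0.3), Rcpp (0.11.1), XML (3.98-1.1)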

Carl


On Wed, Mar 19, 2014 at 11:36 AM, Spencer Graves <
spencer.gra...@structuremonitoring.com> wrote:

>   What about having this purpose met with something like an expansion
> of R-Forge?  We could have packages submitted to R-Forge rather than CRAN,
> and people who wanted the latest could get it from R-Forge.  If changes I
> make on R-Forge break a reverse dependency, emails explaining the problem
> are sent to both me and the maintainer for the package I broke.
>
>
>   The budget for R-Forge would almost certainly need to be increased:
>  They currently disable many of the tests they once ran.
>
>
>   Regarding budget, the R Project would get more donations if they
> asked for them and made it easier to contribute.  I've tried multiple times
> without success to find a way to donate.  I didn't try hard, but it
> shouldn't be hard ;-)  (And donations should be accepted in US dollars and
> Euros -- and maybe other currencies.) There should be a procedure whereby
> anyone could receive a pro forma invoice, which they can pay or ignore as
> they choose.  I mention this, because many grants could cover a reasonable
> fee provided they have an invoice.
>
>
>   Spencer Graves
>
>
> On 3/19/2014 10:59 AM, Jeroen Ooms wrote:
>
>> On Wed, Mar 19, 2014 at 5:52 AM, Duncan Murdoch wrote:
>>
>>> I don't see why CRAN needs to be involved in this effort at all.  A third
>>> party could take snapshots of CRAN at R release dates, and make those
>>> available to package users in a separate repository.  It is not hard to
>>> set a different repository than CRAN as the default location from which
>>> to obtain packages.
>>>
>> I am happy to see many people giving this some thought and engaging in the
>> discussion.
>>
>> Several have suggested that staging & freezing can simply be done by a
>> third party. This solution and its limitations are also described in the
>> paper [1] in the section titled "R: downstream staging and repackaging".
>>
>> If this would solve the problem without affecting CRAN, we would obviously
>> have done this already. In fact, as described in the paper and pointed out by
>> some people, initiatives such as Debian or Revolution Enterprise already
>> include a frozen library of R packages. Also companies like Google
>> maintain
>> their own internal repository with packages that are used throughout the
>> company.
>>
>> The problem with this approach is that when you use some 3rd party
>> package snapshot, your r/sweave scripts will still only be
>> reliable/reproducible for other users of that specific snapshot. E.g. for
>> the examples above, a script that is written in R 3.0 by a Debian user is
>> not guaranteed to work on R 3.0 in Google, or R 3.0 on some other 3rd
>> party
>> cran snapshot. Hence this solution merely redefines the problem from "this
>> script depends on pkgA 1.1 and pkgB 0.2.3" to "this script depends on
>> repository foo 2.0". And given that most users would still be pulling
>> packages straight from CRAN, it would still be terribly difficult to
>> reproduce a 5 year old sweave script from e.g. JSS.
>>
>> For this reason I believe the only effective place to organize this
>> staging
>> is all the way upstream, on CRAN. Imagine a world where your r/sweave
>> script would be reliable/reproducible, out of the box, on any system, any
>> platform in any company using R 3.0. No need to investigate which
>> specific packages or cran snapshot the author was using at the time of
>> writing the script, and trying to reconstruct such libraries for each
>> script you want to reproduce. No ambiguity about which package versions
>> are
>> used by R 3.0. However for better or worse, I think this could only be
>> accomplished with a cran release cycle (i.e. "universal snapshots")
>> accompanying the already existing r releases.
>>
>>
>>
>>> The only objection I can see to this is that it requires extra work by
>>> the third party, rather than extra work by the CRAN team.

Re: [Rd] [RFC] A case for freezing CRAN

2014-03-19 Thread Joshua Ulrich
On Wed, Mar 19, 2014 at 12:59 PM, Jeroen Ooms  wrote:
> On Wed, Mar 19, 2014 at 5:52 AM, Duncan Murdoch 
> wrote:
>
>> I don't see why CRAN needs to be involved in this effort at all.  A third
>> party could take snapshots of CRAN at R release dates, and make those
>> available to package users in a separate repository.  It is not hard to set
>> a different repository than CRAN as the default location from which to
>> obtain packages.
>>
>
> I am happy to see many people giving this some thought and engaging in the
> discussion.
>
> Several have suggested that staging & freezing can simply be done by a
> third party. This solution and its limitations are also described in the
> paper [1] in the section titled "R: downstream staging and repackaging".
>
> If this would solve the problem without affecting CRAN, we would obviously
> have done this already. In fact, as described in the paper and pointed out by
> some people, initiatives such as Debian or Revolution Enterprise already
> include a frozen library of R packages. Also companies like Google maintain
> their own internal repository with packages that are used throughout the
> company.
>
The suggested solution is not described in the referenced article.  It
was not suggested that it be the operating system's responsibility to
distribute snapshots, nor was it suggested to create binary
repositories for specific operating systems, nor was it suggested to
freeze only a subset of CRAN packages.

> The problem with this approach is that when you use some 3rd party
> package snapshot, your r/sweave scripts will still only be
> reliable/reproducible for other users of that specific snapshot. E.g. for
> the examples above, a script that is written in R 3.0 by a Debian user is
> not guaranteed to work on R 3.0 in Google, or R 3.0 on some other 3rd party
> cran snapshot. Hence this solution merely redefines the problem from "this
> script depends on pkgA 1.1 and pkgB 0.2.3" to "this script depends on
> repository foo 2.0". And given that most users would still be pulling
> packages straight from CRAN, it would still be terribly difficult to
> reproduce a 5 year old sweave script from e.g. JSS.
>
This can be solved by the third party making the repository public.

> For this reason I believe the only effective place to organize this staging
> is all the way upstream, on CRAN. Imagine a world where your r/sweave
> script would be reliable/reproducible, out of the box, on any system, any
> platform in any company using R 3.0. No need to investigate which
> specific packages or cran snapshot the author was using at the time of
> writing the script, and trying to reconstruct such libraries for each
> script you want to reproduce. No ambiguity about which package versions are
> used by R 3.0. However for better or worse, I think this could only be
> accomplished with a cran release cycle (i.e. "universal snapshots")
> accompanying the already existing r releases.
>
This could be done by a public third-party repository, independent of
CRAN.  However, you would need to find a way to actively _prevent_
people from installing newer versions of packages with the stable R
releases.

--
Joshua Ulrich  |  about.me/joshuaulrich
FOSS Trading  |  www.fosstrading.com

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] [RFC] A case for freezing CRAN

2014-03-19 Thread Spencer Graves
  What about having this purpose met with something like an 
expansion of R-Forge?  We could have packages submitted to R-Forge 
rather than CRAN, and people who wanted the latest could get it from 
R-Forge.  If changes I make on R-Forge break a reverse dependency, 
emails explaining the problem are sent to both me and the maintainer for 
the package I broke.



  The budget for R-Forge would almost certainly need to be 
increased:  They currently disable many of the tests they once ran.



  Regarding budget, the R Project would get more donations if they 
asked for them and made it easier to contribute.  I've tried multiple 
times without success to find a way to donate.  I didn't try hard, but 
it shouldn't be hard ;-)  (And donations should be accepted in US 
dollars and Euros -- and maybe other currencies.) There should be a 
procedure whereby anyone could receive a pro forma invoice, which they 
can pay or ignore as they choose.  I mention this, because many grants 
could cover a reasonable fee provided they have an invoice.



  Spencer Graves


On 3/19/2014 10:59 AM, Jeroen Ooms wrote:

On Wed, Mar 19, 2014 at 5:52 AM, Duncan Murdoch wrote:


I don't see why CRAN needs to be involved in this effort at all.  A third
party could take snapshots of CRAN at R release dates, and make those
available to package users in a separate repository.  It is not hard to set
a different repository than CRAN as the default location from which to
obtain packages.


I am happy to see many people giving this some thought and engaging in the
discussion.

Several have suggested that staging & freezing can simply be done by a
third party. This solution and its limitations are also described in the
paper [1] in the section titled "R: downstream staging and repackaging".

If this would solve the problem without affecting CRAN, we would obviously
have done this already. In fact, as described in the paper and pointed out by
some people, initiatives such as Debian or Revolution Enterprise already
include a frozen library of R packages. Also companies like Google maintain
their own internal repository with packages that are used throughout the
company.

The problem with this approach is that when you use some 3rd party
package snapshot, your r/sweave scripts will still only be
reliable/reproducible for other users of that specific snapshot. E.g. for
the examples above, a script that is written in R 3.0 by a Debian user is
not guaranteed to work on R 3.0 in Google, or R 3.0 on some other 3rd party
cran snapshot. Hence this solution merely redefines the problem from "this
script depends on pkgA 1.1 and pkgB 0.2.3" to "this script depends on
repository foo 2.0". And given that most users would still be pulling
packages straight from CRAN, it would still be terribly difficult to
reproduce a 5 year old sweave script from e.g. JSS.

For this reason I believe the only effective place to organize this staging
is all the way upstream, on CRAN. Imagine a world where your r/sweave
script would be reliable/reproducible, out of the box, on any system, any
platform in any company using R 3.0. No need to investigate which
specific packages or cran snapshot the author was using at the time of
writing the script, and trying to reconstruct such libraries for each
script you want to reproduce. No ambiguity about which package versions are
used by R 3.0. However for better or worse, I think this could only be
accomplished with a cran release cycle (i.e. "universal snapshots")
accompanying the already existing r releases.




The only objection I can see to this is that it requires extra work by the
third party, rather than extra work by the CRAN team. I don't think the
total amount of work required is much different.  I'm very unsympathetic to
proposals to dump work on others.


I am merely trying to discuss a technical issue in an attempt to improve
reliability of our software and reproducibility of papers created with R.

[[alternative HTML version deleted]]

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] [RFC] A case for freezing CRAN

2014-03-19 Thread Jeroen Ooms
On Wed, Mar 19, 2014 at 5:52 AM, Duncan Murdoch wrote:

> I don't see why CRAN needs to be involved in this effort at all.  A third
> party could take snapshots of CRAN at R release dates, and make those
> available to package users in a separate repository.  It is not hard to set
> a different repository than CRAN as the default location from which to
> obtain packages.
>

I am happy to see many people giving this some thought and engaging in the
discussion.

Several have suggested that staging & freezing can simply be done by a
third party. This solution and its limitations are also described in the
paper [1] in the section titled "R: downstream staging and repackaging".

If this would solve the problem without affecting CRAN, we would obviously
have done this already. In fact, as described in the paper and pointed out by
some people, initiatives such as Debian or Revolution Enterprise already
include a frozen library of R packages. Also companies like Google maintain
their own internal repository with packages that are used throughout the
company.

The problem with this approach is that when you use some 3rd party
package snapshot, your r/sweave scripts will still only be
reliable/reproducible for other users of that specific snapshot. E.g. for
the examples above, a script that is written in R 3.0 by a Debian user is
not guaranteed to work on R 3.0 in Google, or R 3.0 on some other 3rd party
cran snapshot. Hence this solution merely redefines the problem from "this
script depends on pkgA 1.1 and pkgB 0.2.3" to "this script depends on
repository foo 2.0". And given that most users would still be pulling
packages straight from CRAN, it would still be terribly difficult to
reproduce a 5 year old sweave script from e.g. JSS.

For this reason I believe the only effective place to organize this staging
is all the way upstream, on CRAN. Imagine a world where your r/sweave
script would be reliable/reproducible, out of the box, on any system, any
platform in any company using R 3.0. No need to investigate which
specific packages or cran snapshot the author was using at the time of
writing the script, and trying to reconstruct such libraries for each
script you want to reproduce. No ambiguity about which package versions are
used by R 3.0. However for better or worse, I think this could only be
accomplished with a cran release cycle (i.e. "universal snapshots")
accompanying the already existing r releases.



> The only objection I can see to this is that it requires extra work by the
> third party, rather than extra work by the CRAN team. I don't think the
> total amount of work required is much different.  I'm very unsympathetic to
> proposals to dump work on others.


I am merely trying to discuss a technical issue in an attempt to improve
reliability of our software and reproducibility of papers created with R.

[[alternative HTML version deleted]]

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] [RFC] A case for freezing CRAN

2014-03-19 Thread Geoff Jentry

using the identical version of each CRAN package. The bioconductor
project uses similar policies.


While I agree that this can be an issue, I don't think it is fair to
compare CRAN to BioC. Unless things have changed, the latter has a more
rigorous barrier to entry, which includes buy-in to various ideals (e.g.
interoperability with other BioC packages, making use of BioC constructs,
the official release cycle). All of that requires extra management
overhead (read: human effort), which, considering that CRAN isn't exactly
swimming in spare cycles, seems unlikely to happen.


It seems like one could set up a curated CRAN-a-like quite easily, 
advertise the heck out of it and let the "market" decide. That is, IMO, 
the beauty of open source.


-J

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] [RFC] A case for freezing CRAN

2014-03-19 Thread Hadley Wickham
> What would be more useful in terms of reproducibility is the capability of
> installing a specific version of a package from a repository using
> install.packages(), which would require archiving older versions in a
> coordinated fashion. I know CRAN archives old versions, but I am not aware
> if we can programmatically query the repository about this.

See devtools::install_version().

The main caveat is that you also need to be able to build the package,
and ensure you have dependencies that work with that version.
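
For instance (a sketch only; it assumes a working build toolchain and
that compatible versions of the dependencies are still installed):

library(devtools)
## fetch and build the older XML release from the CRAN archive
install_version("XML", version = "3.95-0.2",
                repos = "http://cran.r-project.org")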

Hadley


-- 
http://had.co.nz/

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] [RFC] A case for freezing CRAN

2014-03-19 Thread Dirk Eddelbuettel

Piling on:

On 19 March 2014 at 07:52, Joshua Ulrich wrote:
| There is nothing preventing you (or anyone else) from creating
| repositories that do what you suggest.  Create a CRAN mirror (or more
| than one) that only include the package versions you think they
| should.  Then have your production servers use it (them) instead of
| CRAN.
| 
| Better yet, make those repositories public.  If many people like your
| idea, they will use your new repositories instead of CRAN.  There is
| no reason to impose this change on all world-wide CRAN users.

On 19 March 2014 at 08:52, Duncan Murdoch wrote:
| I don't see why CRAN needs to be involved in this effort at all.  A 
| third party could take snapshots of CRAN at R release dates, and make 
| those available to package users in a separate repository.  It is not 
| hard to set a different repository than CRAN as the default location 
| from which to obtain packages.
| 
| The only objection I can see to this is that it requires extra work by 
| the third party, rather than extra work by the CRAN team. I don't think 
| the total amount of work required is much different.  I'm very 
| unsympathetic to proposals to dump work on others.


And to a first approximation some of those efforts already exist:

  -- 200+ r-cran-* packages in Debian proper

  -- 2000+ r-cran-* packages in Michael's c2d4u (via launchpad)

  -- 5000+ r-cran-* packages in Don's debian-r.debian.net

The only difference here is that Jeroen wants to organize source packages.
But that is just a matter of stacking them in directory trees and calling

setwd("/path/to/root/of/your/repo/version")
tools::write_PACKAGES(".", type = "source")

to create PACKAGES and PACKAGES.gz.
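
And pointing install.packages() at such a tree is just as simple (the
local path is illustrative):

install.packages("XML",
                 repos = "file:///path/to/root/of/your/repo/version",
                 type = "source")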

Dirk

-- 
Dirk Eddelbuettel | e...@debian.org | http://dirk.eddelbuettel.com

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] [RFC] A case for freezing CRAN

2014-03-19 Thread Kasper Daniel Hansen
Our experience in Bioconductor is that this is a pretty hard problem.

What the OP presumably wants is some guarantee that all packages on CRAN
work well together.  A good example is when Rcpp was updated and broke
other packages (quick note: the Rcpp developers do an incredible amount of
work to deal with this; it is almost impossible not to have a few days of
chaos).  Ensuring this is not a trivial task, and it requires some buy-in
both from the "repository" and from the developers.

For Bioconductor it is even harder as the dependency graph of Bioconductor
is much more involved than the one for CRAN, where most packages depend
only on a few other packages.  This is why we need to do this for Bioc.

Based on my experience with CRAN I am not sure I see a need for a
coordinated release (or rather, I can sympathize with the need, but I don't
think the effort is worth it).

What would be more useful in terms of reproducibility is the capability of
installing a specific version of a package from a repository using
install.packages(), which would require archiving older versions in a
coordinated fashion. I know CRAN archives old versions, but I am not aware
if we can programmatically query the repository about this.
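
One fragile way to query it today is to scrape the Archive directory
listing over HTTP (an illustration only, not an official API; the
function name is mine):

archived_versions <- function(pkg, repo = "http://cran.r-project.org") {
  ## the Archive area keeps one directory per package, holding tarballs
  ## like XML_3.95-0.1.tar.gz, XML_3.95-0.2.tar.gz, ...
  url <- sprintf("%s/src/contrib/Archive/%s/", repo, pkg)
  page <- readLines(url, warn = FALSE)
  hits <- regmatches(page,
                     gregexpr(paste0(pkg, "_[0-9][0-9.-]*\\.tar\\.gz"), page))
  sort(unique(unlist(hits)))
}
## archived_versions("XML")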

Best,
Kasper


On Wed, Mar 19, 2014 at 8:52 AM, Joshua Ulrich wrote:

> On Tue, Mar 18, 2014 at 3:24 PM, Jeroen Ooms 
> wrote:
> 
> > ## Summary
> >
> > Extending the r-release cycle to CRAN seems like a solution that would
> > be easy to implement. Package updates simply only get pushed to the
> > r-devel branches of cran, rather than r-release and r-release-old.
> > This separates development from production/use in a way that is common
> > sense in most open source communities. Benefits for R include:
> >
> Nothing is ever as simple as it seems (especially from the perspective
> of one who won't be doing the work).
>
> There is nothing preventing you (or anyone else) from creating
> repositories that do what you suggest.  Create a CRAN mirror (or more
> than one) that only include the package versions you think they
> should.  Then have your production servers use it (them) instead of
> CRAN.
>
> Better yet, make those repositories public.  If many people like your
> idea, they will use your new repositories instead of CRAN.  There is
> no reason to impose this change on all world-wide CRAN users.
>
> Best,
> --
> Joshua Ulrich  |  about.me/joshuaulrich
> FOSS Trading  |  www.fosstrading.com
>
> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>

[[alternative HTML version deleted]]

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] [RFC] A case for freezing CRAN

2014-03-19 Thread Duncan Murdoch
I don't see why CRAN needs to be involved in this effort at all.  A 
third party could take snapshots of CRAN at R release dates, and make 
those available to package users in a separate repository.  It is not 
hard to set a different repository than CRAN as the default location 
from which to obtain packages.


The only objection I can see to this is that it requires extra work by 
the third party, rather than extra work by the CRAN team. I don't think 
the total amount of work required is much different.  I'm very 
unsympathetic to proposals to dump work on others.


Duncan Murdoch

On 18/03/2014 4:24 PM, Jeroen Ooms wrote:

This came up again recently with an irreproducible paper. Below is an
attempt to make a case for extending the r-devel/r-release cycle to
CRAN packages. These suggestions are not in any way intended as
criticism of anyone or of the status quo.

The proposal described in [1] is to freeze a snapshot of CRAN along
with every release of R. In this design, updates for contributed
packages are treated the same as updates for base packages in the sense
that they are only published to the r-devel branch of CRAN and do not
affect users of "released" versions of R. Thereby all users, stacks
and applications using a particular version of R will by default be
using the identical version of each CRAN package. The bioconductor
project uses similar policies.

This system has several important advantages:

## Reproducibility

Currently r/sweave/knitr scripts are unstable because of ambiguity
introduced by constantly changing cran packages. This causes scripts
to break or change behavior when upstream packages are updated, which
makes reproducing old results extremely difficult.

A common counter-argument is that script authors should document
package versions used in the script using sessionInfo(). However, even
if authors do this manually, reconstructing the author's
environment from this information is cumbersome and often nearly
impossible: binary packages might no longer be available, dependencies
may conflict, etc. See [1] for a worked example. In practice, the
current system causes many results or documents generated with R not
to be reproducible, sometimes already after a few months.

In a system where contributed packages inherit the r-base release
cycle, scripts will behave the same across users/systems/time within a
given version of R. This severely reduces ambiguity of R behavior, and
has the potential of making reproducibility a natural part of the
language, rather than a tedious exercise.

## Repository Management

Just like scripts suffer from upstream changes, so do packages
depending on other packages. A particular package that has been
developed and tested against the current version of a particular
dependency is not guaranteed to work against *any future version* of
that dependency. Therefore, packages inevitably break over time as
their dependencies are updated.

One recent example is the Rcpp 0.11 release, which required all
reverse dependencies to be rebuilt/modified. This update caused some
serious disruption on our production servers. Initially we refrained
from updating Rcpp on these servers, to keep currently installed
packages that depend on Rcpp working. However soon after the
Rcpp 0.11 release, many other cran packages started to require Rcpp >=
0.11, and our users started complaining about not being able to
install those packages. This resulted in the impossible situation
where currently installed packages would not work with the new Rcpp,
but newly installed packages would not work with the old Rcpp.

Current CRAN policies blame this problem on package authors. However
as is explained in [1], this policy does not solve anything, is
unsustainable with growing repository size, and sets completely the
wrong incentives for contributing code. Progress comes with breaking
changes, and the system should be able to accommodate this. Much of
the trouble could have been prevented by a system that does not push
bleeding edge updates straight to end-users, but has a devel branch
where conflicts are resolved before publishing them in the next
r-release.

## Reliability

Another example, this time on a very small scale. We recently
discovered that R code plotting medal counts from the Sochi Olympics
generated different results for users on OSX than it did on
Linux/Windows. After some debugging, we narrowed it down to the XML
package. The application used the following code to scrape results
from the Sochi website:

XML::readHTMLTable("http://www.sochi2014.com/en/speed-skating", which=2, skip=1)

This code was developed and tested on mac, but results in a different
winner on windows/linux. This happens because the current version of
the XML package on CRAN is 3.98, but the latest mac binary is 3.95.
Apparently this new version of XML introduces a tiny change that
causes html-table-headers to become colnames, rather than a row in the
matrix, resulting in different medal counts.

Re: [Rd] [RFC] A case for freezing CRAN

2014-03-19 Thread Joshua Ulrich
On Tue, Mar 18, 2014 at 3:24 PM, Jeroen Ooms  wrote:

> ## Summary
>
> Extending the r-release cycle to CRAN seems like a solution that would
> be easy to implement. Package updates simply only get pushed to the
> r-devel branches of cran, rather than r-release and r-release-old.
> This separates development from production/use in a way that is common
> sense in most open source communities. Benefits for R include:
>
Nothing is ever as simple as it seems (especially from the perspective
of one who won't be doing the work).

There is nothing preventing you (or anyone else) from creating
repositories that do what you suggest.  Create a CRAN mirror (or more
than one) that only include the package versions you think they
should.  Then have your production servers use it (them) instead of
CRAN.

Better yet, make those repositories public.  If many people like your
idea, they will use your new repositories instead of CRAN.  There is
no reason to impose this change on all world-wide CRAN users.

Best,
--
Joshua Ulrich  |  about.me/joshuaulrich
FOSS Trading  |  www.fosstrading.com

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] [RFC] A case for freezing CRAN

2014-03-19 Thread Frank Harrell
To me it boils down to one simple question: is an update to a package on 
CRAN more likely to (1) fix a bug, (2) introduce a bug or downward 
incompatibility, or (3) add a new feature or fix a compatibility problem 
without introducing a bug?  I think the probability of (1) | (3) is much 
greater than the probability of (2), hence the current approach 
maximizes user benefit.


Frank
--
Frank E Harrell Jr, Professor and Chairman
Department of Biostatistics, School of Medicine, Vanderbilt University

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


[Rd] [RFC] A case for freezing CRAN

2014-03-18 Thread Jeroen Ooms
This came up again recently with an irreproducible paper. Below is an
attempt to make a case for extending the r-devel/r-release cycle to
CRAN packages. These suggestions are not in any way intended as
criticism of anyone or of the status quo.

The proposal described in [1] is to freeze a snapshot of CRAN along
with every release of R. In this design, updates for contributed
packages are treated the same as updates for base packages in the sense
that they are only published to the r-devel branch of CRAN and do not
affect users of "released" versions of R. Thereby all users, stacks
and applications using a particular version of R will by default be
using the identical version of each CRAN package. The bioconductor
project uses similar policies.
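
In user-facing terms, the effect would be something like this (the
repository URLs are purely illustrative):

## a released R defaults to the package branch frozen with its version:
options(repos = c(CRAN = "http://cran.r-project.org/3.0"))
## while r-devel users opt into the moving development branch:
options(repos = c(CRAN = "http://cran.r-project.org/devel"))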

This system has several important advantages:

## Reproducibility

Currently r/sweave/knitr scripts are unstable because of ambiguity
introduced by constantly changing cran packages. This causes scripts
to break or change behavior when upstream packages are updated, which
makes reproducing old results extremely difficult.

A common counter-argument is that script authors should document
package versions used in the script using sessionInfo(). However, even
if authors do this manually, reconstructing the author's
environment from this information is cumbersome and often nearly
impossible: binary packages might no longer be available, dependencies
may conflict, etc. See [1] for a worked example. In practice, the
current system causes many results or documents generated with R not
to be reproducible, sometimes already after a few months.
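
For concreteness, the mitigation in question is a one-liner, which also
makes its limits clear: it documents versions but does nothing to make
them installable again later.

## record, but do not preserve, the exact package versions used
writeLines(capture.output(sessionInfo()), "sessionInfo.txt")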

In a system where contributed packages inherit the r-base release
cycle, scripts will behave the same across users/systems/time within a
given version of R. This severely reduces ambiguity of R behavior, and
has the potential of making reproducibility a natural part of the
language, rather than a tedious exercise.

## Repository Management

Just like scripts suffer from upstream changes, so do packages
depending on other packages. A particular package that has been
developed and tested against the current version of a particular
dependency is not guaranteed to work against *any future version* of
that dependency. Therefore, packages inevitably break over time as
their dependencies are updated.

One recent example is the Rcpp 0.11 release, which required all
reverse dependencies to be rebuilt/modified. This update caused some
serious disruption on our production servers. Initially we refrained
from updating Rcpp on these servers, to keep currently installed
packages that depend on Rcpp working. However soon after the
Rcpp 0.11 release, many other cran packages started to require Rcpp >=
0.11, and our users started complaining about not being able to
install those packages. This resulted in the impossible situation
where currently installed packages would not work with the new Rcpp,
but newly installed packages would not work with the old Rcpp.
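
In DESCRIPTION terms the conflict looks like this (package names are
illustrative). A package built and tested against the old API declares

Imports: Rcpp

while a freshly updated package on CRAN declares

Imports: Rcpp (>= 0.11.0)

and a single installed library cannot satisfy both, because binaries
built against the old Rcpp stop working under 0.11.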

Current CRAN policies blame this problem on package authors. However
as is explained in [1], this policy does not solve anything, is
unsustainable with growing repository size, and sets completely the
wrong incentives for contributing code. Progress comes with breaking
changes, and the system should be able to accommodate this. Much of
the trouble could have been prevented by a system that does not push
bleeding edge updates straight to end-users, but has a devel branch
where conflicts are resolved before publishing them in the next
r-release.

## Reliability

Another example, this time on a very small scale. We recently
discovered that R code plotting medal counts from the Sochi Olympics
generated different results for users on OSX than it did on
Linux/Windows. After some debugging, we narrowed it down to the XML
package. The application used the following code to scrape results
from the Sochi website:

XML::readHTMLTable("http://www.sochi2014.com/en/speed-skating", which=2, skip=1)

This code was developed and tested on mac, but results in a different
winner on windows/linux. This happens because the current version of
the XML package on CRAN is 3.98, but the latest mac binary is 3.95.
Apparently this new version of XML introduces a tiny change that
causes html-table-headers to become colnames, rather than a row in the
matrix, resulting in different medal counts.

This example illustrates that we should never assume package versions
to be interchangeable. Any small bugfix release can have side effects
altering results. It is impossible to protect code against such
upstream changes using CMD check or unit testing. All R scripts and
packages are really only developed and tested for a single version of
their dependencies. Assuming anything else makes results
untrustworthy, and code unreliable.

## Summary

Extending the r-release cycle to CRAN seems like a solution that would
be easy to implement. Package updates simply only get pushed to the
r-devel branches of cran, rather than r-release