Re: [rb-general] Paper preprint: Reproducible genomics analysis pipelines with GNU Guix

2018-05-13 Thread Catonano
2018-05-13 7:07 GMT+02:00 Ricardo Wurmus :

>
> Catonano  writes:
>
> > Ricardo, I don't understand the problem you're raising here (I didn't read
> > the article yet, though)
> >
> > Would you mind elaborating on that ?
> >
> > Why would you want to record the environment ?
>
> I want to record the detected build environment so that I can restore it
> at execution time.  Autoconf provides macros that probe the environment
> and record the full path to detected tools.  For example, I’m looking
> for Samtools, and the user may provide a particular variant of Samtools
> at configure time.


Thanks for clarifying !

Let me vent some thoughts on the issue !

Under Guix, the way to provide a specific version of Samtools would be
to run the configuration in an environment that offers a specific Samtools
package, so that the configuration tool can pick that up

Under a traditional distro, it'd be to feed file paths to the configuration
tool

So, how much of the traditional way of doing things do we want to support,
in our pipelines ?

> I record the full path to the executable at
> configure time and embed that path in a configuration file that is read
> when the pipeline is run.
>
> This works fine for tools, but doesn’t work very well at all for modules
> in language environments.  Take R for example.  I can detect and record
> the location of the R and Rscript executables, but I cannot easily
> record the location of build-time R packages (such as r-deseq2) in a way
> that allows me to rebuild the environment at runtime.
>
> Instead of writing an Autoconf macro that records the exact location of
> each of the detected R packages and their dependencies I chose to solve
> the problem in Guix by wrapping the pipeline executables in R_LIBS_SITE,
> because I figured that on systems without Guix you aren’t likely to
> install R packages into separate unique locations anyway — on most
> systems R packages end up being installed to one and the same directory.
>
> I think the desire to restore the configured environment at runtime is
> valid and we do this all the time when we run binaries that have
> embedded absolute paths (to libraries or other tools).


I didn't mean to imply it's not valid
I was just trying to understand the concerns on the ground and the
context



> It’s just that
> it gets pretty awkward to do this for things like R packages or Python
> modules (or Guile modules for that matter).
>
> The Guix workflow language solves this problem by depending on Guix for
> software deployment.  For PiGx we picked Snakemake early on and it does
> not have a software deployment solution (it expects to either run inside
> a suitable environment that the user provides or to have access to
> pre-built Singularity application bundles).  I don’t like to treat
> pipelines like some sort of collection of scripts that must be invoked
> in a suitable environment.  I like to see pipelines as big software
> packages that should know about the environment they need, that can be
> configured like regular tools, and thus only require the packager to
> assemble the environment, not the end-user.
>

I understand your concern to consider pipelines as packages

But say, for example, that a pipeline gets distributed as a .deb package
with dependencies on R (or Guile) modules

Or, say, that a pipeline is distributed with a bundled guix.scm file
containing R modules (or Guile modules) as inputs

Would that break the idea of a pipeline as a package ?

I'm afraid that the idea of a pipeline as a package should be entrusted
to the package management tool rather than to the configuration tool

And the pipeline author shouldn't be assumed to work in isolation,
confident that any package management environment will be able to run their
pipeline smoothly

Pipeline authors should be concerned with where their pipeline sits in the
package graph; that shouldn't be a concern of the packager
only

Maybe the software authors should provide dependency information in a
standardized format (rdf ? ) and that should be leveraged by packagers in
order to prepare .deb packages or guix.scm files

And if you are a developer and you want to test the software with a
specific version of a dependency, then you should run the configuration
tool in an environment where that version of the dependency is available,
so that the configuration tool can pick that up

If you are on Guix, you will probably create that environment with the Guix
environment tool

If you are on Debian or Fedora, you will have to rely on those distros'
development tools



On traditional distros, you can install packages in your user folder or in
/opt or in other locations

And then, you can feed those to the configuration tool

On Guix, the conditions are different

The idea of pipelines as packages will be treated differently by the
configuration tool under Guix and the configuration tool under Debian/Fedora

So, in my 

Re: [rb-general] Paper preprint: Reproducible genomics analysis pipelines with GNU Guix

2018-05-13 Thread Ricardo Wurmus

Catonano  writes:

> Ricardo, I don't understand the problem you're raising here (I didn't read
> the article yet, though)
>
> Would you mind elaborating on that ?
>
> Why would you want to record the environment ?

I want to record the detected build environment so that I can restore it
at execution time.  Autoconf provides macros that probe the environment
and record the full path to detected tools.  For example, I’m looking
for Samtools, and the user may provide a particular variant of Samtools
at configure time.  I record the full path to the executable at
configure time and embed that path in a configuration file that is read
when the pipeline is run.
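
A minimal sketch of that record-and-restore step, written in Python rather
than Autoconf and meant only as an illustration. The function names, the JSON
file format, and the `overrides` parameter are assumptions made for this
example; they are not taken from PiGx or from its configure script.

```python
import json
import shutil
from pathlib import Path


def configure(tools, config_path, overrides=None):
    """Resolve each tool to an absolute path and save the result.

    `overrides` maps a tool name to a path supplied by the user at
    configure time (hypothetical parameter for this sketch).
    """
    overrides = overrides or {}
    resolved = {}
    for tool in tools:
        # A user-supplied path wins over whatever is found on PATH.
        path = overrides.get(tool) or shutil.which(tool)
        if path is None:
            raise RuntimeError(f"configure: required tool not found: {tool}")
        resolved[tool] = str(Path(path).resolve())
    Path(config_path).write_text(json.dumps(resolved, indent=2))
    return resolved


def load_tools(config_path):
    """Read the recorded absolute paths back when the pipeline runs."""
    return json.loads(Path(config_path).read_text())
```

At run time the pipeline would call `load_tools()` and invoke the recorded
path instead of searching PATH again, so a later change to the user's
environment does not change which Samtools is used.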

This works fine for tools, but doesn’t work very well at all for modules
in language environments.  Take R for example.  I can detect and record
the location of the R and Rscript executables, but I cannot easily
record the location of build-time R packages (such as r-deseq2) in a way
that allows me to rebuild the environment at runtime.

Instead of writing an Autoconf macro that records the exact location of
each of the detected R packages and their dependencies I chose to solve
the problem in Guix by wrapping the pipeline executables in R_LIBS_SITE,
because I figured that on systems without Guix you aren’t likely to
install R packages into separate unique locations anyway — on most
systems R packages end up being installed to one and the same directory.
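
To make the wrapping idea concrete, here is a small Python sketch. It is not
the wrapper that Guix generates; it only illustrates the principle of setting
R's site library search path (named R_LIBS_SITE in R's documentation) before
handing control to the real executable. The function names and the example
paths are hypothetical.

```python
import os


def wrapped_environment(site_libs, base_env=None):
    """Return a copy of the environment with R_LIBS_SITE set.

    Directories in `site_libs` are placed before any value that is
    already present, so the recorded libraries take precedence.
    """
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("R_LIBS_SITE", "")
    parts = list(site_libs) + ([existing] if existing else [])
    env["R_LIBS_SITE"] = os.pathsep.join(parts)
    return env


def run_wrapped(executable, args, site_libs):
    """Replace the current process with the real pipeline executable."""
    os.execve(executable, [executable, *args], wrapped_environment(site_libs))
```

The wrapper carries the whole library search path, so nothing has to be
recorded per package at configure time.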

I think the desire to restore the configured environment at runtime is
valid and we do this all the time when we run binaries that have
embedded absolute paths (to libraries or other tools).  It’s just that
it gets pretty awkward to do this for things like R packages or Python
modules (or Guile modules for that matter).

The Guix workflow language solves this problem by depending on Guix for
software deployment.  For PiGx we picked Snakemake early on and it does
not have a software deployment solution (it expects to either run inside
a suitable environment that the user provides or to have access to
pre-built Singularity application bundles).  I don’t like to treat
pipelines like some sort of collection of scripts that must be invoked
in a suitable environment.  I like to see pipelines as big software
packages that should know about the environment they need, that can be
configured like regular tools, and thus only require the packager to
assemble the environment, not the end-user.

--
Ricardo





Re: [rb-general] Paper preprint: Reproducible genomics analysis pipelines with GNU Guix

2018-05-11 Thread Catonano
2018-05-11 10:19 GMT+02:00 Ricardo Wurmus :

>
> Ludovic Courtès  writes:
>
> > Perhaps we could add to Autoconf-Archive (if it doesn’t have such things
> > already) macros to deal with the R and Python stuff you had to deal
> > with?  And then publish a simple template that people could use as a
> > starting point.
>
> I submitted my macros for R packages (and they have been accepted), but
> I actually don’t really like them because they are not as useful as it
> may seem.  While they do check for R packages in the environment at
> configure time, nothing is done to record the environment necessary to
> access these packages.
>
> That’s a general problem for software that depends on search path
> environment variables.  I can’t just record the location of each
> individual R package that was detected and use that to set up the
> environment at runtime.  The R packages have other runtime dependencies
> that would also need to be recorded.
>
> It’s not ideal.
>
> --
> Ricardo
>
>
>

Ricardo, I don't understand the problem you're raising here (I didn't read
the article yet, though)

Would you mind elaborating on that ?

Why would you want to record the environment ?

I have this tiny prototype that checks for the availability of the Guile
module "sqlite3" at configure time and writes one of these csexps
(https://gitlab.com/dustyweb/guile-csexps) to a file

(7:sqlite32:no)
(7:sqlite33:yes)

The first line is produced in an environment in which sqlite3 is not
available
The second one is produced in an environment in which sqlite3 is, well
guess what, available
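
For readers unfamiliar with the format, the following Python sketch shows how
such a record can be produced. It is not the Guile prototype; it checks for a
Python module as a stand-in for a Guile module, and the function names are
made up for this example. Each atom is written as its byte length, a colon,
and the bytes themselves, and a list is wrapped in parentheses.

```python
import importlib.util


def csexp(value):
    """Encode a string, bytes, or nested list as a canonical S-expression."""
    if isinstance(value, (list, tuple)):
        return b"(" + b"".join(csexp(v) for v in value) + b")"
    data = value.encode("utf-8") if isinstance(value, str) else bytes(value)
    # An atom is its length in bytes, a colon, then the raw bytes.
    return str(len(data)).encode("ascii") + b":" + data


def availability_record(module_name):
    """Return a csexp saying whether a top-level module can be found."""
    found = importlib.util.find_spec(module_name) is not None
    return csexp([module_name, "yes" if found else "no"])


print(csexp(["sqlite3", "no"]).decode())   # (7:sqlite32:no)
print(csexp(["sqlite3", "yes"]).decode())  # (7:sqlite33:yes)
```

The "7" is the length of "sqlite3", and the "2" or "3" that follows it is the
length of "no" or "yes".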

I produce such environments with the Guix "environment" command

I think csexps are cool because they are readable by humans

A user creating their pipeline can easily inspect the result of the
configuration phase

They could even paste excerpts of text on mailing lists, should they want
to ask for help

In my view, a build tool doesn't attempt to manage an environment

You could have sqlite3 because you set up a Guix environment, or because
you installed it with apt-get or dnf or manually

The build tool only worries about the availability, not how it's achieved

If every dependency is available (anyhow) it just builds

Because building and package management are supposed to be different
concerns.

If you have Guix, fine.
If you don't have Guix, then you're on your own; if you can manage, fine

This should address your concern about letting people treat their pipelines
as packages

Doesn't it ?

Is this approach not enough for you ?

May I ask why ?

For now it only tests Guile modules but it could obviously be generalized
to test for more things (library versions, data structure availability, along
the lines of what Autoconf does)

I'd love to be able to set up my (Guile) packages without having to deal
with the Autotools 


Re: [rb-general] Paper preprint: Reproducible genomics analysis pipelines with GNU Guix

2018-05-11 Thread Ricardo Wurmus

Ludovic Courtès  writes:

> Perhaps we could add to Autoconf-Archive (if it doesn’t have such things
> already) macros to deal with the R and Python stuff you had to deal
> with?  And then publish a simple template that people could use as a
> starting point.

I submitted my macros for R packages (and they have been accepted), but
I actually don’t really like them because they are not as useful as it
may seem.  While they do check for R packages in the environment at
configure time, nothing is done to record the environment necessary to
access these packages.

That’s a general problem for software that depends on search path
environment variables.  I can’t just record the location of each
individual R package that was detected and use that to set up the
environment at runtime.  The R packages have other runtime dependencies
that would also need to be recorded.
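
A toy Python sketch of why recording only the detected packages falls short:
what has to be restored at run time is the transitive closure of their
dependencies. The package names and the dependency map below are invented for
illustration and do not describe any real R package.

```python
def runtime_closure(detected, depends_on):
    """Return every package reachable from the detected ones."""
    closure = set()
    pending = list(detected)
    while pending:
        package = pending.pop()
        if package in closure:
            continue
        closure.add(package)
        # Queue this package's own dependencies for a visit.
        pending.extend(depends_on.get(package, ()))
    return closure


# Hypothetical dependency map, not real package metadata.
depends_on = {
    "pkg-a": ["pkg-b", "pkg-c"],
    "pkg-b": ["pkg-d"],
}

print(sorted(runtime_closure(["pkg-a"], depends_on)))
```

Configure only checked for pkg-a, yet four locations would have to be
recorded to rebuild the environment.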

It’s not ideal.

--
Ricardo




Re: [rb-general] Paper preprint: Reproducible genomics analysis pipelines with GNU Guix

2018-05-11 Thread Ludovic Courtès
Hello!

Ricardo Wurmus  writes:

> Ludovic Courtès  writes:

[...]

>> Given the intended audience, I wonder how we could provide a simpler
>> path to achieve the same goal.  It could be a set of Autoconf macros
>> leading to high-level ‘configure.ac’ files without any line of shell
>> code, or it could be Guix interpreting a top-level .scm or JSON file,
>> both of which would ideally be easier to write for bioinformaticians.
>
> I think a higher level “configure.ac” file would be of great help.  In
> general, independent of this particular use case.

Perhaps we could add to Autoconf-Archive (if it doesn’t have such things
already) macros to deal with the R and Python stuff you had to deal
with?  And then publish a simple template that people could use as a
starting point.

> There is a danger in pushing all of this work to Guix, though.  One of
> the great features of the Autotools suite is that users don’t need to
> know about it.  If we assume that users have Guix (which in our paper we
> only strongly encourage) we might as well have implemented the whole
> pipeline using the Guix Workflow Language.  This is, of course, a valid
> option, but the goal of the paper was to demonstrate a more general
> claim and approach to designing pipelines.  I wanted to encourage
> pipeline developers to treat their pipeline as a first-class package,
> not as some glue code that binds together tools in a specially crafted
> runtime environment.

Yes, that makes sense.

> I think that this alternative is worth exploring, though.  Building a
> complex pipeline with the Guix Workflow Language that addresses both
> deployment and execution order would be an interesting project; it would
> also be good to look into ways to make such a workflow available to
> users who do not have the ability or intention to install Guix.  An easy
> way is to bundle up the whole environment as one giant container blob,
> but I think we can do better.  I’d love to collaborate with other users
> of the GWL to see how far we can push it.

Would be nice, indeed.

Thanks,
Ludo’.



Re: [rb-general] Paper preprint: Reproducible genomics analysis pipelines with GNU Guix

2018-04-23 Thread Ludovic Courtès
Hello Ricardo & all!

Ricardo Wurmus  writes:

> I’m happy to announce that the group I’m working with has released a
> preprint of a paper on reproducibility with the title:
>
> Reproducible genomics analysis pipelines with GNU Guix
> https://www.biorxiv.org/content/early/2018/04/11/298653
>
> We built a collection of bioinformatics pipelines and packaged them with
> GNU Guix, and then looked at the degree to which the software achieves
> bit-reproducibility (spoiler: ~98%), analysed sources of non-determinism
> (e.g. time stamps), discussed experimental reproducibility at runtime
> (e.g. random number generators, kernel+glibc interface, etc) and
> commented on the idea of using “containers” (or application bundles)
> instead.

Very impressive piece of work!  I think it’s important to stress that
reproducible builds is a crucial foundation for reproducible
computational experiments, and this paper does a great job at this.

Also nice that you show you can have these bit-reproducible pipelines
formalized in Guix *and* produce a ready-to-use “container image.”

Hopefully we can soon address the remaining sources of non-determinism
shown in Table 3 (I think you already addressed some of them in the
meantime, didn’t you?).

The bit I’m less comfortable with is Autotools.  I do understand how it
helps capture configure-time dependencies, and how it generally helps
people package and use the software; I think it’s one of the best tools
for the job.  However it’s also hard to learn and, whether it’s
justified or not, it’s considered “scary.”

Given the intended audience, I wonder how we could provide a simpler
path to achieve the same goal.  It could be a set of Autoconf macros
leading to high-level ‘configure.ac’ files without any line of shell
code, or it could be Guix interpreting a top-level .scm or JSON file,
both of which would ideally be easier to write for bioinformaticians.

What are your thoughts on this?

Anyway, kudos on this, thank you!

Ludo’.



Re: [rb-general] Paper preprint: Reproducible genomics analysis pipelines with GNU Guix

2018-04-11 Thread Holger Levsen
hi again,

and extra kudos and thanks for releasing this under a free licence! \o/


-- 
cheers,
Holger




Re: [rb-general] Paper preprint: Reproducible genomics analysis pipelines with GNU Guix

2018-04-11 Thread Holger Levsen
On Wed, Apr 11, 2018 at 08:40:47PM +0200, Ricardo Wurmus wrote:
> > just one thing/question: in the keywords you have "reproducible
> Heh, it used to be “reproducible builds”, but the term was deemed too
> abstract for the audience of biologists, so it was decided to change it
> to “reproducible software”…

hehe.

> Lots of small compromises need to be made when writing a paper together,
> and that was one of them :)
 
I understand.

& thanks again, super cool!


-- 
cheers,
Holger




Re: [rb-general] Paper preprint: Reproducible genomics analysis pipelines with GNU Guix

2018-04-11 Thread Holger Levsen
Hi Ricardo,

On Wed, Apr 11, 2018 at 02:18:38PM +0200, Ricardo Wurmus wrote:
> I’m happy to announce that the group I’m working with has released a
> preprint of a paper on reproducibility with the title:
> 
> Reproducible genomics analysis pipelines with GNU Guix
> https://www.biorxiv.org/content/early/2018/04/11/298653
> 
> We built a collection of bioinformatics pipelines and packaged them with
> GNU Guix, and then looked at the degree to which the software achieves
> bit-reproducibility (spoiler: ~98%), analysed sources of non-determinism
> (e.g. time stamps), discussed experimental reproducibility at runtime
> (e.g. random number generators, kernel+glibc interface, etc) and
> commented on the idea of using “containers” (or application bundles)
> instead.

wow, just wow. very very nice to see that!

> The middle section is a bit heavy on genomics to showcase the features
> of the pipelines, but I think the introduction and the
> discussion/conclusion may be of general interest.

As you might guess I have just skimmed over the text but it's really
super cool to see reproducible builds used in science! and diffoscope,
too!

just one thing/question: in the keywords you have "reproducible
software" but not "reproducible builds", which is kind of our "marketing
term". Do you think you could squeeze that in?


-- 
cheers,
Holger, once again wishing he could read more (and more...)




Re: [rb-general] Paper preprint: Reproducible genomics analysis pipelines with GNU Guix

2018-04-11 Thread Ricardo Wurmus

Hi Holger,

thanks for your comments!

> just one thing/question: in the keywords you have "reproducible
> software" but not "reproducible builds", which is kind of our "marketing
> term". Do you think you could squeeze that in?

Heh, it used to be “reproducible builds”, but the term was deemed too
abstract for the audience of biologists, so it was decided to change it
to “reproducible software”…

Lots of small compromises need to be made when writing a paper together,
and that was one of them :)

--
Ricardo

GPG: BCA6 89B6 3655 3801 C3C6  2150 197A 5888 235F ACAC
https://elephly.net