Re: [Bioc-devel] Controlling vignette compilation order

Aaron Lun Mon, 24 Dec 2018 00:02:30 -0800

A working example of knitr caching across workflows is now available at 
https://github.com/LTLA/BiocWorkCache.


It uses “~/chipseq.log” as a log to demonstrate that the code in the 
most-upstream workflow (“test1.Rmd”) is indeed only executed once during the 
BUILD.
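
As a rough illustration of the logging idea (not the code used in the 
repository), the expensive chunk can append a line to a log file every time it 
actually runs; after the build, the number of matching lines shows how often 
it was executed. The helper names log_execution() and count_executions() are 
hypothetical, and the log path is whatever you choose to pass in.

```r
# Hypothetical helper: append one line to 'logfile' each time a step runs.
log_execution <- function(label, logfile) {
    cat(sprintf("%s executed at %s\n", label, format(Sys.time())),
        file = logfile, append = TRUE)
}

# Hypothetical helper: count how many times 'label' was recorded in the log.
count_executions <- function(label, logfile) {
    if (!file.exists(logfile)) {
        return(0L)
    }
    sum(startsWith(readLines(logfile), label))
}
```

Calling log_execution("test1", "~/chipseq.log") inside the cached chunk and 
checking count_executions("test1", "~/chipseq.log") after the build would then 
confirm whether the chunk ran once or several times.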

Note that the compilation of upstream vignettes involves a system call out to a 
separate R session. This avoids some difficult issues with caching when an Rmd 
file is compiled from within another Rmd file: calling rmarkdown::render() on 
the upstream vignette from within a downstream vignette does not generate a 
cache that is recognized when BUILD goes on to compile the upstream vignette.
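
A minimal sketch of such a system call, assuming Rscript from the running R 
installation and that rmarkdown is installed there. The helper names 
build_upstream_command() and compile_upstream() are hypothetical and are not 
taken from the repository above.

```r
# Hypothetical helper: build the Rscript arguments that render one Rmd file.
build_upstream_command <- function(rmd) {
    # deparse() turns the path into a quoted R string literal.
    expr <- sprintf("rmarkdown::render(%s)", deparse(rmd))
    c("-e", shQuote(expr))
}

# Hypothetical helper: compile an upstream vignette in a fresh R session,
# so that its cache is written independently of the calling vignette.
compile_upstream <- function(rmd) {
    rscript <- file.path(R.home("bin"), "Rscript")
    status <- system2(rscript, build_upstream_command(rmd))
    if (status != 0L) {
        stop("failed to compile upstream vignette: ", rmd)
    }
    invisible(status)
}
```

A downstream vignette would call compile_upstream("test1.Rmd") in an early 
chunk, before it tries to use anything that the upstream vignette produces.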

-A

> On 23 Dec 2018, at 01:24, Aaron Lun 
> <infinite.monkeys.with.keyboa...@gmail.com> wrote:
> 
> Yes, I had noticed the vignette.rds as well, and I figured that would be a 
> problem.
> 
> I just tried setting cache=TRUE in my vignettes, implemented such that 
> BUILDing each downstream vignette will also run all upstream vignettes on 
> which it depends (that haven’t already been compiled). If an upstream 
> vignette is run in this manner, it caches the results of each code chunk to 
> avoid repeated work when it gets compiled “for real” by R CMD BUILD.
> 
> This seems to work on initial inspection (the caches are produced for the 
> upstream vignettes upon running one downstream vignette). I’ll have to check 
> whether this plays nice with R CMD BUILD. I will probably have to write a 
> function to isolate the scope of the execution of each upstream vignette, to 
> avoid polluting the namespace and cache of each downstream vignette.
> 
> -A
> 
>> On 22 Dec 2018, at 19:22, Henrik Bengtsson <henrik.bengts...@gmail.com>
>> wrote:
>> 
>> On Sat, Dec 22, 2018 at 10:56 AM Michael Lawrence
>> <lawrence.mich...@gene.com> wrote:
>>> 
>>> Anything that eventually lands in inst/doc is a vignette, I think, so
>>> there might be a hack around that.
>> 
>> Just so this is not misread - it's *not* possible to just hack your
>> vignette "product" files (PDF or HTML) into inst/doc and think
>> you're good.  R keeps track of package vignettes in a "vignette
>> index", e.g.
>> 
>> > readRDS(system.file(package = "utils", "Meta", "vignette.rds"))
>>        File              Title        PDF        R Depends Keywords
>> 1 Sweave.Rnw Sweave User Manual Sweave.pdf Sweave.R   tools
>> 
>> which is created during 'R CMD build' by parsing and compiling the
>> vignettes
>> (https://github.com/wch/r-source/blob/tags/R-3-5-2/src/library/tools/R/build.R#L283-L393).
>> This vignette index is used to find package vignettes (e.g.
>> utils::vignette()) and build the HTML vignette index.
>> 
>> Also, one vignette source (e.g. Rnw, Rmd, ...) can only produce one
>> vignette product (PDF or HTML) in the vignette index.  You can output
>> other files (e.g. image files) in a relative folder that the vignette
>> references, which is why for instance non-self-contained HTML files
>> work.  Thus, one ad-hoc, not-so-nice hack that OP could do is to have
>> a single main vignette that produces and links to all child vignettes.
>> However, personally, I'd aim for using memoization/caching (to file)
>> such that each vignette can be compiled independently of the others
>> (and in any order), while still reusing intermediate
>> results/calculations produced by earlier vignettes.
>> 
>> /Henrik
>> 
>>> 
>>> On Fri, Dec 21, 2018 at 11:26 PM Aaron Lun
>>> <infinite.monkeys.with.keyboa...@gmail.com> wrote:
>>>> 
>>>> I gave it a shot:
>>>> 
>>>> https://github.com/LTLA/DrakeTest
>>>> 
>>>> This uses a single “controller” Rmd file to trigger drake::make(). Running 
>>>> this file will instruct Drake to compile all of the other vignettes 
>>>> following the desired dependency structure.
>>>> 
>>>> The current sticking point is that I need to move the Drake-controlled Rmd 
>>>> files out of “vignettes/”, otherwise they’ll just be compiled as usual 
>>>> without consideration of their dependencies. This causes problems as R CMD 
>>>> BUILD only recognizes the controller Rmd file as the sole vignette, and 
>>>> doesn’t retain or index the HTML files produced from the other Rmd files 
>>>> as side-effects of running the controller.
>>>> 
>>>> Are there any better ways to subvert the vignette building procedure to 
>>>> get the desired effect of running drake::make() and recognition of the 
>>>> resulting HTMLs as vignettes?
>>>> 
>>>> -A
>>>> 
>>>>> On 18 Dec 2018, at 17:41, Michael Lawrence
>>>>> <lawrence.mich...@gene.com> wrote:
>>>>> 
>>>>> Sounds like a use case for drake...
>>>>> 
>>>>> On Tue, Dec 18, 2018 at 6:58 AM Aaron Lun 
>>>>> <infinite.monkeys.with.keyboa...@gmail.com> wrote:
>>>>> 
>>>>> @Michael In this case, the resource produced by vignette X is a 
>>>>> SingleCellExperiment object containing the results of various processing 
>>>>> steps (normalization, clustering, etc.) described in that vignette.
>>>>> 
>>>>> I can imagine a lazy evaluation model for this, but it wouldn’t be 
>>>>> pretty. If I had another vignette Y that depended on the SCE produced by 
>>>>> vignette X, I would need Y to execute all of the steps in X if X hadn’t 
>>>>> already been run before Y. This gets us into the territory of 
>>>>> Makefile-like dependencies, which seems even more complicated than simply 
>>>>> specifying a compilation order.
>>>>> 
>>>>> You might ask why X and Y are split into two separate vignettes. The use 
>>>>> of different vignettes is motivated by the complexity of the workflows:
>>>>> 
>>>>> - Vignette 1 demonstrates core processing steps for one read-based 
>>>>> single-cell RNAseq dataset.
>>>>> - Vignette 2 demonstrates (slightly different) core steps for a UMI-based 
>>>>> dataset.
>>>>> - … so on for a bunch of other core steps for different types of data.
>>>>> - Vignette 6 demonstrates extra optional steps for the two SCEs produced 
>>>>> by vignettes 1 & 3.
>>>>> - … and so on for a bunch of other optional steps.
>>>>> 
>>>>> The separation of core and optional steps into separate documents is 
>>>>> desirable. From a pedagogical perspective, I would very much like to get 
>>>>> the reader through all the core steps before even considering the extra 
>>>>> steps, which would just be confusing if presented so early on. 
>>>>> Previously, everything was in a single document, which was difficult to 
>>>>> read (for users) and to debug (for me), especially because I had to use 
>>>>> contrived variable names to avoid clashes between different sections of 
>>>>> the workflow that did similar things.
>>>>> 
>>>>> @Martin I’ve been using BiocFileCache for all of the online resources 
>>>>> that are used in the workflow. However, this is only for my (and the 
>>>>> reader’s) convenience. I use a local cache rather than the system 
>>>>> default, to ensure that the downloaded files are removed after package 
>>>>> build. This is intentional as it forces the package builder to try to 
>>>>> re-download resources when compiling the vignette, thus ensuring the 
>>>>> validity of the URLs. For a similar reason, I would prefer not to cache 
>>>>> the result objects for use in different R sessions. I could imagine 
>>>>> caching the result objects for use by a different vignette in the same 
>>>>> build session, but this gets back to the problem of ensuring that the 
>>>>> result object is generated by one vignette before it is needed by another 
>>>>> vignette.
>>>>> 
>>>>> -A
>>>>> 
>>>>>> On 18 Dec 2018, at 14:14, Martin Morgan <mtmorgan.b...@gmail.com>
>>>>>> wrote:
>>>>>> 
>>>>>> Also perhaps using BiocFileCache so that the result object is only 
>>>>>> generated once, then cached for future (different session) use.
>>>>>> 
>>>>>> On 12/18/18, 8:35 AM, "Bioc-devel on behalf of Michael Lawrence" 
>>>>>> <bioc-devel-boun...@r-project.org on behalf of 
>>>>>> lawrence.mich...@gene.com> wrote:
>>>>>> 
>>>>>>   I would recommend against dependencies across vignettes. Ideally someone
>>>>>>   can pick up a vignette and execute the code independently of any other
>>>>>>   documentation. Perhaps you could move the code generating those shared
>>>>>>   resources to the package. They could behave lazily, only generating the
>>>>>>   resource if necessary, otherwise reusing it. That would also make it easy
>>>>>>   for people to write their own documents using those resources.
>>>>>> 
>>>>>>   Michael
>>>>>> 
>>>>>>   On Tue, Dec 18, 2018 at 5:22 AM Aaron Lun
>>>>>>   <infinite.monkeys.with.keyboa...@gmail.com> wrote:
>>>>>> 
>>>>>>> In a number of my workflow packages (e.g., simpleSingleCell), I rely on a
>>>>>>> specific compilation order for my vignettes. This is because some vignettes
>>>>>>> set up resources or objects that are to be used by later vignettes.
>>>>>>> 
>>>>>>> From what I understand, vignettes are compiled in alphanumeric ordering of
>>>>>>> their file names. As such, I give my vignettes fairly structured names,
>>>>>>> e.g., “work-1-reads.Rmd”, “work-2-umi.Rmd” and so on.
>>>>>>> 
>>>>>>> However, it becomes rather annoying when I want to add a new vignette in
>>>>>>> the middle somewhere. This results in some unnatural numberings, e.g.,
>>>>>>> “work-0”, “3b”, which are ugly and unintuitive. This is relevant as
>>>>>>> BiocStyle::Biocpkg() links between vignettes require you to use the
>>>>>>> destination vignette’s file name; so difficult names complicate linking,
>>>>>>> especially if the names continually change to reflect new orderings.
>>>>>>> 
>>>>>>> Is there an easier way to control vignette compilation order? WRE provides
>>>>>>> no (obvious) guidance, so I would like to know what non-standard hacks are
>>>>>>> known to work on the build machines. I can imagine something dirty whereby
>>>>>>> one “reference” vignette contains code to “rmarkdown::render” all other
>>>>>>> vignettes in the specified order… ugh.
>>>>>>> 
>>>>>>> -A
>>>>>>> 
>>>>>>> _______________________________________________
>>>>>>> Bioc-devel@r-project.org mailing list
>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>>>> 