Re: [Rd] Wish: a way to track progress of parallel operations

2024-03-25 Thread Stephen H. Dawson, DSL via R-devel
Thanks Ivan and Henrik for considering this work. It would be a valuable 
contribution.


Kindly,
*Stephen Dawson, DSL*
/Executive Strategy Consultant/
Business & Technology
+1 (865) 804-3454
http://www.shdawson.com



[R-pkg-devel] Check results on r-devel-windows claiming error but tests seem to pass?

2024-03-25 Thread Avraham Adler
I noticed that a few of my packages appear to be failing on
"r-devel-windows-x86_64". Specifically, Delaporte [1], lamW [2], and
revss [3]. However, checking the output of the tests shows that all
passed. Is this a hiccup or is there something that needs to be
changed? And why would my other two packages not suffer from this
(minimaxApprox [4] and Pade [5])? I'm a bit confused.

Thank you,

Avi

[1] https://cran.r-project.org/web/checks/check_results_Delaporte.html
[2] https://cran.r-project.org/web/checks/check_results_lamW.html
[3] https://cran.r-project.org/web/checks/check_results_revss.html
[4] https://cran.r-project.org/web/checks/check_results_minimaxApprox.html
[5] https://cran.r-project.org/web/checks/check_results_Pade.html

__
R-package-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-package-devel


Re: [Rd] Wish: a way to track progress of parallel operations

2024-03-25 Thread Henrik Bengtsson
Hello,

thanks for bringing this topic up, and it would be excellent if we
could come up with a generic solution for this in base R.  It is one
of the top frequently asked questions and requested features in
parallel processing, but also in sequential processing. We have also
seen lots of variants on how to attack the problem of reporting on
progress when running in parallel.

As the author of Futureverse (a parallel framework), I've been exposed to
these requests and I thought quite a bit about how we could solve this
problem. I'll outline my opinionated view and suggestions on this
below:

* Target a solution that works the same regardless of whether we run in
parallel or not, i.e. the code/API should look the same regardless of
using, say, parallel::parLapply(), parallel::mclapply(), or
base::lapply(). The solution should also work as-is in other parallel
frameworks.

* Consider who owns the control of whether progress updates should be
reported or not. I believe it's best to separate what the end-user and
the developer control.  I argue the end-user should be able to
decide whether they want to "see" progress updates or not, and the
developer should focus on where to report on progress, but not how and
when.

* In line with the previous comment, controlling progress reporting
via an argument (e.g. `.progress`) is not powerful enough. With such
an approach, one needs to make sure that the argument is exposed and
relayed throughout all nested function calls. If a package decides
to introduce such an argument, what should the default be? If they set
`.progress = TRUE`, then, all of a sudden, any code/packages that
depend on this function will start to show progress updates.
There are endless per-package versions of this on CRAN and
Bioconductor, and they rarely work in harmony.

* Consider accessibility as well as graphical user interfaces. This
means, don't assume progress is necessarily reported in the terminal.
I found it a good practice to never use the term "progress bar",
because that is too focused on how progress is reported.

* Let the end-user control how progress is reported, e.g. a progress
bar in the terminal, a progress bar in their favorite IDE/GUI,
OS-specific notifications, third-party notification services, auditory
output, etc.

The above objectives challenge you to take a step back and think about
what progress reporting is about, beyond the most immediate needs.
Based on these, I came up with the 'progressr' package
(https://progressr.futureverse.org/). FWIW, it was originally
meant to be a proof-of-concept proposal for a universal, generic
solution to this problem, but as the demands grew and the prototype
proved to be useful, I made it official.  Here is the gist:

* Motto: "The developer is responsible for providing progress updates,
but it’s only the end user who decides if, when, and how progress
should be presented. No exceptions will be allowed."

* It relies on R's condition system to signal progress. The developer
signals progress conditions. Condition handlers, which the end-user
controls, are used to report/render these progress updates. The
support for global condition handlers, introduced in R 4.0.0, makes
this much more convenient. It is useful to think of the condition
mechanism in R as a back channel for communication that operates
separately from the rest of the "communication" stream (calling
functions with arguments and returning values).

* For parallel processing, progress conditions can be relayed back to
the parent process via back channels in a "near-live" fashion, or at
the very end when the parallel task is completed. Technically,
progress conditions inherit from 'immediateCondition', which is a
special class indicating that such conditions are allowed to be
relayed immediately and out of order. It is possible to use the
existing PSOCK socket connections to send such 'immediateCondition' objects.

* No assumption is made on progress updates arriving in a certain
order. They are just a stream of "progress of this and that amount
was made" messages.

* There is a progress handler API. Using this API, various types of
progress reporting can be implemented. This allows anyone to implement
progress handlers in contributed R packages.

See https://progressr.futureverse.org/ for more details.
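
To make the condition-based idea concrete, here is a minimal sketch in
base R only. It is not how progressr is implemented; the class name
'progress_sketch' and the helpers signal_progress() and slow_square()
are made up for illustration. The developer only signals that progress
happened, and the caller decides whether and how to show it.

```r
# Developer side: signal a custom condition; say nothing about rendering.
# 'progress_sketch' is a made-up class name, not the one progressr uses.
signal_progress <- function(amount = 1L) {
  cond <- structure(
    list(message = "progress", call = NULL, amount = amount),
    class = c("progress_sketch", "condition")
  )
  signalCondition(cond)  # a no-op when nobody is listening
  invisible(NULL)
}

slow_square <- function(xs) {
  lapply(xs, function(x) {
    signal_progress()
    x^2
  })
}

# End-user side: opt in to progress and choose how to render it.
done <- 0L
res <- withCallingHandlers(
  slow_square(1:5),
  progress_sketch = function(cond) {
    # Only amounts are summed, so arrival order does not matter.
    done <<- done + cond$amount
    message(sprintf("%d of 5 steps done", done))
  }
)

# Without a handler the very same code runs silently.
quiet <- slow_square(1:5)
```

Here the handler prints a message per step, but it could equally well
drive a GUI widget or a notification service without any change to
slow_square(). In R >= 4.0.0 such a handler can presumably also be
registered once per session via globalCallingHandlers().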

> I would be happy to prepare code and documentation. If there is no time now, 
> we can return to it after R-4.4 is released.

I strongly recommend not rushing this. This is an important, big
problem that goes beyond the 'parallel' package. I think it would be a
disservice to introduce a '.progress' argument. As mentioned above, I
think a solution should work throughout the R ecosystem - all base-R
packages and beyond. I honestly think we could arrive at a solution
where base-R proposes a very light, yet powerful, progress API that
handles all of the above. The main task is to come up with a standard
API/protocol - then the implementation does not matter.

/Henrik


[Rd] Wish: a way to track progress of parallel operations

2024-03-25 Thread Ivan Krylov via R-devel
Hello R-devel,

A function to be run inside lapply() or one of its friends is trivial
to augment with side effects to show a progress bar. When the code is
intended to be run on a 'parallel' cluster, it generally cannot rely on
its own side effects to report progress.

I've found three approaches to progress bars for parallel processes on
CRAN:

 - Importing 'snow' (not 'parallel') internals like sendCall and
   implementing parallel processing on top of them (doSNOW). This has
   the downside of having to write higher-level code from scratch
   using undocumented interfaces.

 - Splitting the workload into length(cluster)-sized chunks and
   processing them in separate parLapply() calls between updating the
   progress bar (pbapply). This approach trades off parallelism against
   the precision of the progress information: the function has to wait
   until all chunk elements have been processed before updating the
   progress bar and submitting a new portion; dynamic load balancing
   becomes much less efficient.

 - Adding local side effects to the function and detecting them while
   the parallel function is running in a child process (parabar). A
   clever hack, but much harder to extend to distributed clusters.
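
For illustration, the chunking approach (the second item above) can be
sketched roughly as follows. This is not pbapply's actual code;
chunked_apply() and its arguments are hypothetical names. The apply
function is pluggable, so the sketch runs with plain lapply(), and a
cluster could be used by passing something like
function(x, f) parallel::parLapply(cl, x, f) instead.

```r
# Hypothetical helper: process X in chunks and report after each chunk.
chunked_apply <- function(X, FUN, chunk_size, on_progress,
                          apply_fun = lapply) {
  stopifnot(chunk_size >= 1L)
  total <- length(X)
  # Consecutive index chunks of at most 'chunk_size' elements each.
  chunks <- split(seq_along(X), ceiling(seq_along(X) / chunk_size))
  out <- vector("list", total)
  done <- 0L
  for (idx in chunks) {
    # Blocks until the whole chunk is finished: this is the trade-off
    # between parallelism and the precision of the progress information.
    out[idx] <- apply_fun(X[idx], FUN)
    done <- done + length(idx)
    on_progress(done, total)
  }
  out
}

seen <- integer(0)
res <- chunked_apply(
  1:10, function(x) x * 2, chunk_size = 4L,
  on_progress = function(done, total) seen <<- c(seen, done)
)
```

With chunk_size set to length(cl), every worker gets one element per
round, and a single slow element stalls the whole round.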

With recvData and recvOneData becoming exported in R-4.4 [*], another
approach becomes feasible: wrap the cluster object (and all nodes) into
another class, attach the progress callback as an attribute, and let
recvData / recvOneData call it. This makes it possible to give wrapped
cluster objects to unchanged code, but requires knowing the precise
number of chunks that the workload will be split into.

Could it be feasible to add an optional .progress argument after the
ellipsis to parLapply() and its friends? We can require it to be a
function accepting (done_chunk, total_chunks, ...). If not a new
argument, what other interfaces could be used to get accurate progress
information from staticClusterApply and dynamicClusterApply?
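
As a sketch of what such a callback could look like (parLapply() has no
.progress argument at present, so this only illustrates the proposed
(done_chunk, total_chunks, ...) signature; make_progress_callback() is
a hypothetical name):

```r
# Hypothetical factory for a callback with the proposed signature.
make_progress_callback <- function() {
  pb <- NULL
  function(done_chunk, total_chunks, ...) {
    if (is.null(pb)) {
      # Create the bar lazily, once the total is known.
      pb <<- utils::txtProgressBar(min = 0, max = total_chunks, style = 3)
    }
    utils::setTxtProgressBar(pb, done_chunk)
    if (done_chunk >= total_chunks) close(pb)
    invisible(done_chunk / total_chunks)
  }
}

# Simulate the calls a parallel apply function would make.
cb <- make_progress_callback()
fractions <- vapply(1:4, function(i) cb(i, 4), numeric(1))
```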

I understand that the default parLapply() behaviour is not very
amenable to progress tracking, but when running clusterMap(.scheduling
= 'dynamic') spanning multiple hours if not whole days, having progress
information sets the mind at ease.

I would be happy to prepare code and documentation. If there is no time
now, we can return to it after R-4.4 is released.

-- 
Best regards,
Ivan

[*] https://bugs.r-project.org/show_bug.cgi?id=18587

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Bioc-devel] Request to add maintainers for diffHic

2024-03-25 Thread Kern, Lori via Bioc-devel
Do Hannah and Gordon already have BiocCredentials accounts?

If so please let me know the email or userid.

If not I can create them and Hannah and Gordon should let me know which email 
they would like for access and if they have a github id they would like 
associated with the account.

Cheers,


Lori Shepherd - Kern

Bioconductor Core Team

Roswell Park Comprehensive Cancer Center

Department of Biostatistics & Bioinformatics

Elm & Carlton Streets

Buffalo, New York 14263



___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


[Bioc-devel] Request to add maintainers for diffHic

2024-03-25 Thread Aaron Lun
Could Hannah and Gordon (in cc) be given push access to Bioc's diffHic 
repository? Note, this is in addition to my current push access, as I 
will be responsible for the large body of C++ code still in the package.


Thanks,

-A



Re: [R-pkg-devel] How to store large data to be used in an R package?

2024-03-25 Thread Ivan Krylov via R-package-devel
On Mon, 25 Mar 2024 11:12:57 +0100
Jairo Hidalgo Migueles wrote:

> Specifically, this data consists of regression and random forest
> models crucial for making predictions within our R package.

Apologies for asking a silly question, but is there a chance that these
models are large by accident (e.g. because an object references a large
environment containing multiple copies of the training dataset)? Or
are there really more than a million weights required to make
predictions?
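
To illustrate what I mean by "large by accident", here is a small
sketch. The function name fit_in_function() is made up, and whether
your models are affected in the same way is only an assumption: a model
fitted inside a function keeps a reference to that function's
environment through its formula, so everything in that environment is
saved along with the model.

```r
fit_in_function <- function() {
  big <- rnorm(1e5)  # large local object that the model never uses
  d <- data.frame(x = 1:10, y = 1:10 + rnorm(10))
  lm(y ~ x, data = d)  # the formula remembers this function's environment
}

fit <- fit_in_function()
size_before <- length(serialize(fit, NULL))

# Point the stored environments away from the function's frame.
slim <- fit
attr(slim$terms, ".Environment") <- globalenv()
attr(attr(slim$model, "terms"), ".Environment") <- globalenv()
size_after <- length(serialize(slim, NULL))

c(before = size_before, after = size_after)
```

Replacing the environment is only safe when the formula does not refer
to anything defined in it, so please check that predict() still gives
the same answers afterwards.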

> Initially, I attempted to save these models as internal data within
> the package. While this approach maintains functionality, it has led
> to a package size exceeding 20 MB. I'm concerned that this would
> complicate submitting the package to CRAN in the future.

The policy mentions the possibility of having a separate large
data-only package. Since CRAN strives to archive all package versions,
this data-only package will have to be updated as rarely as possible.
You will need to ask CRAN for approval.

If there is a significant amount of core functionality inside your
package that does *not* require the large data (so that it can still
be installed and used without the data), you can publish the data-only
package yourself (e.g. using the 'drat' package), put it in Suggests
and link to it in the Additional_repositories field of your DESCRIPTION.
Alternatively, you can publish the data on Zenodo and offer to download
it on first use. Make sure to (1) use tools::R_user_dir to determine
where to put the files, (2) only download the files after the user
explicitly agrees to it and (3) test as much of your package
functionality as possible without requiring the data to be downloaded.
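
A rough sketch of the download-on-first-use idea, under stated
assumptions: the package name "mypkg", the file name and the URL are
placeholders, and get_model_path() is a hypothetical helper, not an
existing function.

```r
# Hypothetical helper; 'pkg', 'file' and 'url' are placeholders.
get_model_path <- function(pkg = "mypkg", file = "models.rds",
                           url = "https://example.org/models.rds",
                           ask = interactive()) {
  dir <- tools::R_user_dir(pkg, which = "data")
  path <- file.path(dir, file)
  if (file.exists(path)) return(path)  # already downloaded earlier

  # Never download silently: require explicit consent from the user.
  if (!ask) {
    stop("Model data not found; call this interactively to download it.")
  }
  ans <- utils::askYesNo(sprintf("Download model data to '%s'?", dir))
  if (!isTRUE(ans)) stop("Download declined; model data unavailable.")

  dir.create(dir, recursive = TRUE, showWarnings = FALSE)
  utils::download.file(url, path, mode = "wb")
  path
}
```

Functions that need the models would call get_model_path() and
readRDS() the result, while tests for everything else can run without
the data being present.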

-- 
Best regards,
Ivan



[R-pkg-devel] How to store large data to be used in an R package?

2024-03-25 Thread Jairo Hidalgo Migueles
Dear all,

I'm reaching out to seek some guidance regarding the storage of relatively
large data, ranging from 10-40 MB, intended for use within an R package.
Specifically, this data consists of regression and random forest models
crucial for making predictions within our R package.

Initially, I attempted to save these models as internal data within the
package. While this approach maintains functionality, it has led to a
package size exceeding 20 MB. I'm concerned that this would complicate
submitting the package to CRAN in the future.

I would greatly appreciate any suggestions or insights you may have on
alternative methods or best practices for efficiently storing and accessing
this data within our R package.

Jairo
