Re: [Bioc-devel] BiocParallel: Best standards for passing locally assigned variables/functions, e.g. a bpExport()?

2013-11-06 Thread Martin Morgan

On 11/04/2013 11:34 AM, Michael Lawrence wrote:

The dynamic nature of R limits the extent of these checks. But as Ryan has
noted, a simple sanity check goes a long way. If what he has done could be
extended to the rest of the search path (people always forget to attach
packages), I think we've hit the 80% with 20%. Got a 404 on that URL btw.


I added three issues to BiocParallel on github.

1. bpexport

2. a function to check for non-local use. I think this should use codetools (to 
avoid adding additional dependencies) but I'm a little flexible. Contributions 
welcome on github, especially as a pull request with code formatted 
consistently, a man page, and especially unit tests to provide a clear 
understanding of circumstances covered or not. Michel Lang's Recall and the 
implementation in foreach also sound relevant here.


3. integration of (2) into bplapply etc.

Please feel free to address these further on github.

Martin
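
A minimal sketch of the kind of check described in item 2, assuming it is built on codetools::findGlobals(). The helper name find_workspace_globals is hypothetical and not part of BiocParallel; it only reports names that exist directly in the given environment (by default the user's workspace), so package symbols are ignored.

```r
library(codetools)

# Hypothetical helper (not a BiocParallel function): return the names that
# FUN uses but that are defined only in `env`, by default the user's
# workspace. These are the names a fresh worker process would be missing.
find_workspace_globals <- function(FUN, env = globalenv()) {
  candidates <- findGlobals(FUN, merge = TRUE)
  # inherits = FALSE: ignore anything found further down the search path
  is_user_defined <- vapply(candidates, exists, logical(1),
                            envir = env, inherits = FALSE)
  unname(candidates[is_user_defined])
}

# Usage: the function below depends on a workspace variable.
workspace <- new.env()
assign("scale_factor", 10, envir = workspace)
f <- function(x) x * scale_factor
find_workspace_globals(f, env = workspace)
```

Because findGlobals() is a static heuristic, this is a sanity check only; it is subject to the limitations discussed elsewhere in this thread.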



Michael


On Mon, Nov 4, 2013 at 11:05 AM, Gabriel Becker gmbec...@ucdavis.edu wrote:


Hey guys,

Here is code that I have written which resolves library names into a full
list of symbols:

https://github.com/duncantl/CodeDepends/blob/forCRAN_0.3.5/R/librarySymbols.R

Note this does not require that the packages actually be loaded at the time
of the check, and does not load them (or rather, it loads them but does not
attach them, so no search path muddying occurs). You do need a list of
packages to check though (it adds the base ones automatically). It handles
dependencies and could be easily extended to handle suggests as well I think.

When CodeDepends gets pushed to CRAN (not my call and not high on my
priority list to push for currently) it will actually do exactly what you
want. (the forCRAN_0.3.5 branch already does and I believe it is
documented, so you could use devtools to install it now).

As a side note, I'm not sure that existence of a symbol is sufficient (it
certainly is necessary). What about situations where the symbol exists but
is stale compared to the value in the parent? Are we sure that can never
happen?

~G


On Mon, Nov 4, 2013 at 7:29 AM, Michel Lang michell...@gmail.com wrote:


You might want to consider using Recall() for recursion which should solve
this. Determining the required variables using heuristics such as codetools
will probably lead to some confusion when using functions which include calls
to, e.g., with():

f = function() {
   with(iris, Sepal.Length + Sepal.Width)
}
codetools::findGlobals(f)

I would suggest writing up some documentation on what the function's
environment contains and how to define variables accordingly - or why it
can generally be considered a good idea to pass everything essential as an
argument. Nevertheless a bpExport function would be a good addition for
some rare corner cases in my opinion.

Michel
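
To make the with() caveat concrete, here is a small sketch using only codetools::findGlobals(). The second function, g, is an illustrative rewrite (not from the thread) that takes the data as an argument, which is the style recommended above.

```r
library(codetools)

# Column names used inside with() look like free variables to the
# static analysis, although they are resolved inside the data frame.
f <- function() {
  with(iris, Sepal.Length + Sepal.Width)
}
f_globals <- findGlobals(f, merge = FALSE)
f_globals$variables   # contains "iris" and the two column names

# Passing the data as an argument and using `$` avoids the false positives.
g <- function(d) {
  d$Sepal.Length + d$Sepal.Width
}
g_globals <- findGlobals(g, merge = FALSE)
g_globals$variables   # no free variables reported
```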


2013/11/3 Henrik Bengtsson h...@biostat.ucsf.edu


Hi,

in BiocParallel, is there a suggested (or planned) best standard for
making *locally* assigned variables (e.g. functions) available to the
applied function when it runs in a separate R process (which will be
the most common use case)?  I understand that local variables
should be avoided and it's preferred to put as much as possible in
packages, but that's not always possible or very convenient.

EXAMPLE:

library('BiocParallel')
library('BatchJobs')

# Here I pick a recursive function to make the problem a bit harder, i.e.
# the function needs to call itself ("itself" = see below)
fib <- function(n=0) {
   if (n < 0) stop("Invalid 'n': ", n)
   if (n == 0 || n == 1) return(1)
   fib(n-2) + fib(n-1)
}

# Executing in the current R session
cluster.functions <- makeClusterFunctionsInteractive()
bpParams <- BatchJobsParam(cluster.functions=cluster.functions)
register(bpParams)
values <- bplapply(0:9, FUN=fib)
## SubmitJobs |++| 100% (00:00:00)
## Waiting [S:0 R:0 D:10 E:0] |+++| 100% (00:00:00)


# Executing in a separate R process, where fib() is not defined
# (not specific to BiocParallel)
cluster.functions <- makeClusterFunctionsLocal()
bpParams <- BatchJobsParam(cluster.functions=cluster.functions)
register(bpParams)
values <- bplapply(0:9, FUN=fib)
## SubmitJobs |++| 100% (00:00:00)
## Waiting [S:0 R:0 D:10 E:0] |+++| 100% (00:00:00)
Error in LastError$store(results = results, is.error = !ok, throw.error = TRUE) :
   Errors occurred during execution. First error message:
Error in FUN(...): could not find function "fib"
[...]


# The following illustrates that the solution is not always
# straightforward.
# (not specific to BiocParallel; must have been discussed previously)
values <- bplapply(0:9, FUN=function(n, fib) {
   fib(n)
}, fib=fib)
Error in LastError$store(results = results, is.error = !ok,
throw.error = TRUE) :
   Errors occurred during execution. First error message:
Error in fib(n): could not find function "fib"
[...]

# Workaround; make fib() aware of 

Re: [Bioc-devel] BiocParallel: Best standards for passing locally assigned variables/functions, e.g. a bpExport()?

2013-11-05 Thread luke-tierney

The 'foreach' framework does this sort of analysis using codetools at
least in part. You may be able to build on what they have.

luke

On Mon, 4 Nov 2013, Ryan wrote:



On 11/4/13, 11:05 AM, Gabriel Becker wrote:

As a side note, I'm not sure that existence of a symbol is sufficient (it
certainly is necessary). What about situations where the symbol exists but
is stale compared to the value in the parent? Are we sure that can never
happen?
I think this is a different issue. We want to detect when a function depends 
on variables outside that function in the user's workspace, or variables 
defined in a package that the user has loaded. I think we can assume that R 
child processes will be of the same version with the same set of installed 
packages, so package-defined variables will not have different values in 
child processes. For user variables, I think the goal should be to prevent 
(or at least highly discourage) dependencies on them entirely, so I don't 
think it matters what their value may be in the child. I realize this is 
somewhat counter to the question that started this thread, which was about 
exporting variables to the children, but I think it is the most 
straightforward approach. As I believe someone noted earlier in the thread, 
Henrik's original problem of a recursive function is properly solved by using 
the Recall function.


-Ryan
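
A short sketch of the Recall() approach mentioned above, using base R only. Setting the function's environment to baseenv() is an assumption made here to roughly imitate a worker process in which the user's workspace (and therefore the name fib) is not available; it is not what BiocParallel or BatchJobs do internally.

```r
# Recursive Fibonacci that never refers to its own name.
fib <- function(n = 0) {
  if (n < 0) stop("Invalid 'n': ", n)
  if (n == 0 || n == 1) return(1)
  Recall(n - 2) + Recall(n - 1)
}

# Imitate a worker: the copy can see base R but not the workspace,
# so a lookup of the name `fib` from inside it would fail.
worker_copy <- fib
environment(worker_copy) <- baseenv()

values <- vapply(0:9, worker_copy, numeric(1))
```

The recursion still works because Recall() re-invokes the function that is currently running instead of looking it up by name.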

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel



--
Luke Tierney
Chair, Statistics and Actuarial Science
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa                  Phone:  319-335-3386
Department of Statistics and        Fax:    319-335-3017
   Actuarial Science
241 Schaeffer Hall                  email:  luke-tier...@uiowa.edu
Iowa City, IA 52242                 WWW:    http://www.stat.uiowa.edu



Re: [Bioc-devel] BiocParallel: Best standards for passing locally assigned variables/functions, e.g. a bpExport()?

2013-11-04 Thread Ryan
Actually, the check that I proposed is only supposed to check for usage 
of user-defined variables, not variables from packages. Truthfully, 
though, I guess I'm not the right person to work on this, since in 
practice I use forked processes for the vast majority of my inside-R 
parallelization, so I never have to worry about things being undefined 
in the forked subprocess. Therefore I can't really dogfood any of the 
stuff that might be implemented as a result of this thread.


-Ryan

On Mon Nov  4 03:48:23 2013, Michael Lawrence wrote:

So what is the best practice for ensuring that something is actually
visible to the worker? If the worker needs functionality from a
package, should the namespace be explicitly referenced via ::?  Lazy
users might want to include library() calls in the worker function.
This proposed check will then throw an exception. Probably a good
thing, but is there a way for a user to declare imported namespaces?
 I know that BatchJobs allows for passing a list of packages to be
loaded via library() on the worker. That is leveraging the search path
to make sure everything is visible and is a reasonable compromise (::
is always an option). We could essentially reimplement the search path
if we wanted isolation, but the worker is already isolated. Anyway,
somehow those types of declarations should be taken into account.

Moving back to the general discussion, for complex operations, it's
easiest to have the worker in a package. In that case, the worker will
likely rely on other functions, and the cleanest way to get those
functions to the worker is to have them installed as a package. At
least with BatchJobs, when the worker is inside a package namespace,
that namespace is automatically loaded (but not attached), so all
functions are automatically visible, without any extra work by me.

Michael
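
As a small illustration of the `::` option mentioned above: a worker function that names its package dependencies explicitly does not rely on the worker's search path. The function name standard_error is made up for this sketch, and lapply() stands in for a parallel apply so the example runs in a single session.

```r
# Explicit namespace reference: works whether or not 'stats' is attached
# in the process that evaluates the function.
standard_error <- function(x) {
  stats::sd(x) / sqrt(length(x))
}

samples <- list(c(1, 2, 3), c(2, 4, 6))
results <- lapply(samples, standard_error)
```

The trade-off is verbosity; the alternative discussed above is to declare the packages to be loaded on the worker.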


On Sun, Nov 3, 2013 at 10:46 PM, Ryan r...@thompsonclan.org wrote:

Ok, here is my attempt at a function to get the list of
user-defined free variables that a function refers to:

https://gist.github.com/DarwinAwardWinner/7298557

It uses codetools, so it is subject to the limitations of that
package, but for simple examples, it successfully detects when a
function refers to something in the global env.


On Sun Nov  3 21:14:29 2013, Gabriel Becker wrote:

Ryan (et al),

FYI:

> f
function() {
x = rnorm(x)
x
}
> findGlobals(f)
[1] "="     "{"     "rnorm"

x should be in the list of globals but it isn't.

~G

> sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] codetools_0.2-8
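
The behaviour reported above can be reproduced with the sketch below. The likely explanation, stated here as an assumption about how findGlobals() works, is that any name assigned anywhere in the function body is treated as local, even if it is read before the assignment takes effect. The function names assigns_x and reads_x are illustrative.

```r
library(codetools)

assigns_x <- function() {
  x = rnorm(x)   # reads the outer x, then creates a local x
  x
}

reads_x <- function() {
  rnorm(x)       # only reads x
}

# merge = FALSE separates function names from variable names.
assigned <- findGlobals(assigns_x, merge = FALSE)$variables  # "x" is not reported
read_only <- findGlobals(reads_x, merge = FALSE)$variables   # "x" is reported
```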



On Sun, Nov 3, 2013 at 5:37 PM, Ryan r...@thompsonclan.org wrote:

Looking at the codetools package, I think findGlobals is
basically exactly what we want here, right? As you say, there are
necessarily limitations due to R being a dynamic language, but the
goal is to catch common errors, not stop people from tricking the
check.

I think I'll try to code something up soon.

-Ryan


On 11/3/13, 5:10 PM, Gabriel Becker wrote:

Henrik,

See https://github.com/duncantl/CodeDepends (as used by
https://github.com/gmbecker/RCacheSuite). It will identify
necessarily defined symbols (input variables) for code that is
not doing certain tricks (e.g. get(), mixing data.frame columns and
global variables in formulas, etc.).

Tierney's codetools package also does things along these lines
but there are some situations where it has trouble. I can give
more detail if desired.

~G


On Sun, Nov 3, 2013 at 3:04 PM, Ryan r...@thompsonclan.org wrote:

Another potential easy 

Re: [Bioc-devel] BiocParallel: Best standards for passing locally assigned variables/functions, e.g. a bpExport()?

2013-11-04 Thread Gabriel Becker
Weird, I guess it needs to be logged in or something. I don't know if the
issue is that it's in a non-master branch or what. The repo is fully public
and the forCRAN_0.3.5 branch definitely exists on github.

I started chrome (where I'm not logged into github) and got the same 404
error but after navigating to the file by going to the repo and changing
the branch and navigating to the file, it now works even when I quit chrome
and restart it. I don't know if it needed me to do that or if there was an
intermittent problem that is now fixed.

Anyway, here is the raw code, the link for which seems to work (in a
browser where I'm not logged into github). If it still doesn't I can just
attach the file here if you want. It doesn't rely on any of the rest of the
CodeDepends machinery.

https://raw.github.com/duncantl/CodeDepends/forCRAN_0.3.5/R/librarySymbols.R


~G



Re: [Bioc-devel] BiocParallel: Best standards for passing locally assigned variables/functions, e.g. a bpExport()?

2013-11-04 Thread Ryan Thompson
The code that I wrote intentionally avoids checking for package variables,
since I consider that a separate problem. Package variables can be provided
to the child by loading the package, whereas user-defined variables must be
serialized in the parent and sent to the child.

I think I could fairly easily adapt the same code to return a list of all
packages that a function depends on.

-Ryan

Re: [Bioc-devel] BiocParallel: Best standards for passing locally assigned variables/functions, e.g. a bpExport()?

2013-11-04 Thread Gabriel Becker
Ryan,

I agree that in some sense it is a different problem, but my point is with
a different approach we can easily answer both. The code I posted returns a
named character vector of symbol names with package name being the name.

This makes it a trivial lookup to determine both a) what symbols aren't
available in any of the packages and b) what packages provide the remaining
required symbols. No extra work required.

You do have to give it a list of packages to check, but it is easy to write
a wrapper that automatically passes it all currently attached packages if
desired (a combination of search() and gsub() would be a quick and dirty
way to do this).
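
The "quick and dirty" wrapper input described above might look like the following sketch. The function name attached_packages is hypothetical, and sub() is used in place of gsub() because each entry has only one prefix to strip.

```r
# Names of all currently attached packages, derived from the search path.
attached_packages <- function() {
  entries <- search()
  # Keep only "package:xyz" entries (drops ".GlobalEnv", "Autoloads", ...).
  pkgs <- entries[startsWith(entries, "package:")]
  sub("^package:", "", pkgs)
}

attached_packages()
```

The result could then be handed to whatever function resolves package names into symbols.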

All that said, I'm simply trying to help. If you guys don't want to use my
code/approach that is your prerogative as I'm not currently working on
BiocParallel myself.

~G





Re: [Bioc-devel] BiocParallel: Best standards for passing locally assigned variables/functions, e.g. a bpExport()?

2013-11-04 Thread Ryan


On 11/4/13, 11:05 AM, Gabriel Becker wrote:

As a side note, I'm not sure that existence of a symbol is sufficient (it
certainly is necessary). What about situations where the symbol exists but
is stale compared to the value in the parent? Are we sure that can never
happen?
I think this is a different issue. We want to detect when a function 
depends on variables outside that function in the user's workspace, or 
variables defined in a package that the user has loaded. I think we can 
assume that R child processes will be of the same version with the same 
set of installed packages, so package-defined variables will not have 
different values in child processes. For user variables, I think the 
goal should be to prevent (or at least highly discourage) dependencies 
on them entirely, so I don't think it matters what their value may be in 
the child. I realize this is somewhat counter to the question that 
started this thread, which was about exporting variables to the 
children, but I think it is the most straightforward approach. As I 
believe someone noted earlier in the thread, Henrik's original problem 
of a recursive function is properly solved by using the Recall function.


-Ryan



Re: [Bioc-devel] BiocParallel: Best standards for passing locally assigned variables/functions, e.g. a bpExport()?

2013-11-03 Thread Michael Lawrence
An analog to clusterExport is a good idea. To make it even easier, we could
have a dynamic environment based on object tables that would catch missing
symbols and download them from the parent thread. But maybe there's some
benefit to being explicit?

Michael
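
For reference, this is roughly what the explicit clusterExport() pattern looks like with the base 'parallel' package, which is the behaviour an analog in BiocParallel would mirror. The sketch assumes that two local PSOCK workers can be started; envir = environment() is passed so that fib is taken from wherever this code is evaluated.

```r
library(parallel)

fib <- function(n = 0) {
  if (n < 0) stop("Invalid 'n': ", n)
  if (n == 0 || n == 1) return(1)
  fib(n - 2) + fib(n - 1)
}

cl <- makeCluster(2)   # PSOCK workers: fresh R processes without fib()

# Explicitly copy fib into each worker's global environment.
clusterExport(cl, "fib", envir = environment())

values <- parLapply(cl, 0:9, function(n) fib(n))
stopCluster(cl)
```

Being explicit costs one extra line, but it makes visible exactly which objects are shipped to the workers.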


On Sun, Nov 3, 2013 at 12:39 PM, Henrik Bengtsson h...@biostat.ucsf.edu wrote:

 Hi,

 in BiocParallel, is there a suggested (or planned) best standard for
 making *locally* assigned variables (e.g. functions) available to the
 applied function when it runs in a separate R process (which will be
 the most common use case)?  I understand that local variables
 should be avoided and it's preferred to put as much as possible in
 packages, but that's not always possible or very convenient.

 EXAMPLE:

 library('BiocParallel')
 library('BatchJobs')

 # Here I pick a recursive function to make the problem a bit harder, i.e.
 # the function needs to call itself ("itself" = see below)
 fib <- function(n=0) {
   if (n < 0) stop("Invalid 'n': ", n)
   if (n == 0 || n == 1) return(1)
   fib(n-2) + fib(n-1)
 }

 # Executing in the current R session
 cluster.functions <- makeClusterFunctionsInteractive()
 bpParams <- BatchJobsParam(cluster.functions=cluster.functions)
 register(bpParams)
 values <- bplapply(0:9, FUN=fib)
 ## SubmitJobs |++| 100% (00:00:00)
 ## Waiting [S:0 R:0 D:10 E:0] |+++| 100% (00:00:00)


 # Executing in a separate R process, where fib() is not defined
 # (not specific to BiocParallel)
 cluster.functions <- makeClusterFunctionsLocal()
 bpParams <- BatchJobsParam(cluster.functions=cluster.functions)
 register(bpParams)
 values <- bplapply(0:9, FUN=fib)
 ## SubmitJobs |++| 100% (00:00:00)
 ## Waiting [S:0 R:0 D:10 E:0] |+++| 100% (00:00:00)
 Error in LastError$store(results = results, is.error = !ok, throw.error = TRUE) :
   Errors occurred during execution. First error message:
 Error in FUN(...): could not find function "fib"
 [...]


 # The following illustrates that the solution is not always
 # straightforward.
 # (not specific to BiocParallel; must have been discussed previously)
 values <- bplapply(0:9, FUN=function(n, fib) {
   fib(n)
 }, fib=fib)
 Error in LastError$store(results = results, is.error = !ok,
 throw.error = TRUE) :
   Errors occurred during execution. First error message:
 Error in fib(n): could not find function "fib"
 [...]

 # Workaround; make fib() aware of itself
 # (this is something the user needs to do, and would be very
 #  hard for BiocParallel et al. to automate.  BTW, should all
 #  recursive functions be implemented this way?).
 fib <- function(n=0) {
   if (n < 0) stop("Invalid 'n': ", n)
   if (n == 0 || n == 1) return(1)
   fib <- sys.function() # Make function aware of itself
   fib(n-2) + fib(n-1)
 }
 values <- bplapply(0:9, FUN=function(n, fib) {
   fib(n)
 }, fib=fib)


 WISHLIST:
 Considering the above recursive issue solved, a slightly more explicit
 and standardized solution is then:

 values <- bplapply(0:9, FUN=function(n, BPGLOBALS=NULL) {
   for (name in names(BPGLOBALS)) assign(name, BPGLOBALS[[name]])
   fib(n)
 }, BPGLOBALS=list(fib=fib))

 Could the above be generalized into something as neat as:

 bpExport(fib)
 values <- bplapply(0:9, FUN=function(n) {
   BiocParallel::bpImport(fib)
   fib(n)
 })

 or ideally just (analogously to parallel::clusterExport()):

 bpExport(fib)
 values <- bplapply(0:9, FUN=fib)

 /Henrik
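
One way to read the wishlist above is as "ship the needed objects together with the function". The sketch below is not a BiocParallel API: bundle_globals is a hypothetical helper that places the named objects in a single environment and makes that environment the enclosure of the worker function and of every bundled function, so that recursive calls resolve without the user's workspace. Using baseenv() as the parent is an assumption made here to imitate a worker that cannot see the workspace.

```r
# Hypothetical helper (not part of BiocParallel).
bundle_globals <- function(FUN, globals, parent = globalenv()) {
  env <- new.env(parent = parent)
  for (name in names(globals)) {
    value <- globals[[name]]
    # Re-home closures so that they look up their own globals in `env`.
    if (is.function(value) && !is.primitive(value)) {
      environment(value) <- env
    }
    assign(name, value, envir = env)
  }
  environment(FUN) <- env
  FUN
}

fib <- function(n = 0) {
  if (n < 0) stop("Invalid 'n': ", n)
  if (n == 0 || n == 1) return(1)
  fib(n - 2) + fib(n - 1)
}

# The worker and the bundled fib() see only base R plus the bundle.
worker <- bundle_globals(function(n) fib(n), list(fib = fib),
                         parent = baseenv())
values <- vapply(0:9, worker, numeric(1))
```

Since a closure is serialized together with its (non-global) enclosing environment, a function prepared this way would carry fib along when sent to another process, at the cost of copying the bundled objects for every task.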



Re: [Bioc-devel] BiocParallel: Best standards for passing locally assigned variables/functions, e.g. a bpExport()?

2013-11-03 Thread Henrik Bengtsson
On Sun, Nov 3, 2013 at 1:29 PM, Michael Lawrence
lawrence.mich...@gene.com wrote:
 An analog to clusterExport is a good idea. To make it even easier, we could
 have a dynamic environment based on object tables that would catch missing
 symbols and download them from the parent thread. But maybe there's some
 benefit to being explicit?

A first step to fully automate this would be to provide some (opt
in/out) mechanism for code inspection and warn about non-defined
objects (cf. 'R CMD check').  That is of course major work, but will
certainly spare the community/users 1000's of hours in troubleshooting
and the mailing lists from "why doesn't my parallel code work?"
messages.  Such protection may be better suited for the 'parallel'
package though.  Unfortunately, it's beyond my skills/time to pull
such a thing together.

/Henrik


 Michael


 On Sun, Nov 3, 2013 at 12:39 PM, Henrik Bengtsson h...@biostat.ucsf.edu
 wrote:

 Hi,

 in BiocParallel, are there suggested (or planned) best standards for
 making *locally* assigned variables (e.g. functions) available to the
 applied function when it runs in a separate R process (which will be
 the most common use case)?  I understand that local variables
 should be avoided and it's preferred to put as much as possible in
 packages, but that's not always possible or very convenient.

 EXAMPLE:

 library('BiocParallel')
 library('BatchJobs')

 # Here I pick a recursive function to make the problem a bit harder, i.e.
 # the function needs to call itself (itself = see below)
 fib <- function(n=0) {
   if (n < 0) stop("Invalid 'n': ", n)
   if (n == 0 || n == 1) return(1)
   fib(n-2) + fib(n-1)
 }

 # Executing in the current R session
 cluster.functions <- makeClusterFunctionsInteractive()
 bpParams <- BatchJobsParam(cluster.functions=cluster.functions)
 register(bpParams)
 values <- bplapply(0:9, FUN=fib)
 ## SubmitJobs |++| 100% (00:00:00)
 ## Waiting [S:0 R:0 D:10 E:0] |+++| 100% (00:00:00)


 # Executing in a separate R process, where fib() is not defined
 # (not specific to BiocParallel)
 cluster.functions <- makeClusterFunctionsLocal()
 bpParams <- BatchJobsParam(cluster.functions=cluster.functions)
 register(bpParams)
 values <- bplapply(0:9, FUN=fib)
 ## SubmitJobs |++| 100% (00:00:00)
 ## Waiting [S:0 R:0 D:10 E:0] |+++| 100% (00:00:00)
 Error in LastError$store(results = results, is.error = !ok,
 throw.error = TRUE) :
   Errors occurred during execution. First error message:
 Error in FUN(...): could not find function "fib"
 [...]


 # The following illustrates that the solution is not always
 # straightforward.
 # (not specific to BiocParallel; must have been discussed previously)
 values <- bplapply(0:9, FUN=function(n, fib) {
   fib(n)
 }, fib=fib)
 Error in LastError$store(results = results, is.error = !ok,
 throw.error = TRUE) :
   Errors occurred during execution. First error message:
 Error in fib(n): could not find function "fib"
 [...]

 # Workaround; make fib() aware of itself
 # (this is something the user needs to do, and would be very
 #  hard for BiocParallel et al. to automate.  BTW, should all
 #  recursive functions be implemented this way?).
 fib <- function(n=0) {
   if (n < 0) stop("Invalid 'n': ", n)
   if (n == 0 || n == 1) return(1)
   fib <- sys.function() # Make function aware of itself
   fib(n-2) + fib(n-1)
 }
 values <- bplapply(0:9, FUN=function(n, fib) {
   fib(n)
 }, fib=fib)


 WISHLIST:
 Considering the above recursive issue solved, a slightly more explicit
 and standardized solution is then:

 values <- bplapply(0:9, FUN=function(n, BPGLOBALS=NULL) {
   for (name in names(BPGLOBALS)) assign(name, BPGLOBALS[[name]])
   fib(n)
 }, BPGLOBALS=list(fib=fib))

 Could the above be generalized into something as neat as:

 bpExport(fib)
 values <- bplapply(0:9, FUN=function(n) {
   BiocParallel::bpImport(fib)
   fib(n)
 })

 or ideally just (analogously to parallel::clusterExport()):

 bpExport(fib)
 values <- bplapply(0:9, FUN=fib)

 /Henrik



Re: [Bioc-devel] BiocParallel: Best standards for passing locally assigned variables/functions, e.g. a bpExport()?

2013-11-03 Thread Ryan
Here's an easy thing we can add to BiocParallel in the short term. The 
following code defines a wrapper function withBPExtraErrorText that 
simply appends an additional message to the end of any error that looks 
like it is about a missing variable. We could wrap every evaluation in 
a similar tryCatch to at least provide a more informative error message 
when a subprocess has a missing variable.


-Ryan

withBPExtraErrorText <- function(expr) {
    tryCatch({
        expr
    }, simpleError = function(err) {
        if (grepl("^object '(.*)' not found$", err$message, perl=TRUE)) {
            ## It is an error due to a variable not found.
            err$message <- paste0(err$message,
                ". Maybe you forgot to export this variable from the ",
                "main R session using \"bpexport\"?")
        }
        stop(err)
    })
}

x <- 5

## Succeeds
withBPExtraErrorText(x)

## Fails with more informative error message
withBPExtraErrorText(y)
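
A note on the pattern above: the regular expression only matches R's "object '...' not found" message, whereas the failing bplapply() calls earlier in this thread report "could not find function". A minimal sketch of a predicate that covers both cases follows. The name isMissingSymbolError() is hypothetical (it is not a BiocParallel function), and the patterns assume R's untranslated English messages.

```r
## Hypothetical helper (not part of BiocParallel): does an error
## message look like a missing variable or a missing function?
## Assumes R's untranslated English error messages.
isMissingSymbolError <- function(msg) {
    patterns <- c("^object '.*' not found$",
                  "^could not find function \".*\"$")
    any(vapply(patterns, grepl, logical(1), x = msg))
}

isMissingSymbolError("object 'y' not found")
isMissingSymbolError("could not find function \"fib\"")
isMissingSymbolError("non-numeric argument to binary operator")
```

The first two calls should be recognized and the third should not. A predicate like this could replace the single grepl() call in the wrapper if both kinds of error should get the extra hint.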





Re: [Bioc-devel] BiocParallel: Best standards for passing locally assigned variables/functions, e.g. a bpExport()?

2013-11-03 Thread Ryan
Another potential easy step: if FUN is a function defined in the
user's workspace, we automatically export that function under the same
name in the children. This would make recursive functions just work, but
it might be a bit too magical.
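
One way this could work for recursive functions, sketched under assumptions rather than as a proposed implementation: bind the function to its own name in a small environment that travels with it. The helper name bindSelf() is made up for illustration, and the serialize()/unserialize() round trip only stands in for sending the function to another R process.

```r
## Hypothetical helper: give FUN an environment in which it is bound
## to its own name, so recursive calls no longer depend on the
## user's workspace.
bindSelf <- function(FUN, name) {
    env <- new.env(parent = baseenv())
    environment(FUN) <- env
    assign(name, FUN, envir = env)
    FUN
}

fib <- function(n = 0) {
    if (n < 0) stop("Invalid 'n': ", n)
    if (n == 0 || n == 1) return(1)
    fib(n - 2) + fib(n - 1)
}

portableFib <- bindSelf(fib, "fib")
rm(fib)  # the workspace copy is gone, as it would be on a worker

## Round trip through serialization, roughly what happens when a
## function is sent to another R process.
shipped <- unserialize(serialize(portableFib, NULL))
shipped(9)
```

Note that this only carries the function itself. Any other workspace objects it refers to would still have to be exported separately.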



Re: [Bioc-devel] BiocParallel: Best standards for passing locally assigned variables/functions, e.g. a bpExport()?

2013-11-03 Thread Gabriel Becker
Henrik,

See https://github.com/duncantl/CodeDepends (as used by
https://github.com/gmbecker/RCacheSuite). It will identify necessarily
defined symbols (input variables) for code that is not doing certain tricks
(e.g. get(), mixing data.frame columns and global variables in formulas, etc.).

Tierney's codetools package also does things along these lines but there
are some situations where it has trouble. I can give more detail if desired.

~G



Re: [Bioc-devel] BiocParallel: Best standards for passing locally assigned variables/functions, e.g. a bpExport()?

2013-11-03 Thread Ryan
I guess all we need to do is to detect whether a function would try to 
access a free variable in the user's workspace, and warn/error if so. 
It looks like CodeDepends could do that. I could try to come up with an 
implementation. I guess we would add CodeDepends as an optional 
dependency for BiocParallel, and only do the checks if CodeDepends is 
available.
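
A minimal sketch of the "only check if the package is available" idea, using requireNamespace(). It uses codetools as a stand-in for the analysis package because codetools is a recommended package that is normally installed with R; the same guard would apply to CodeDepends. The function name warnOnFreeVariables() is hypothetical.

```r
## Hypothetical helper: warn about free variables in FUN, but only if
## the optional analysis package is installed.
warnOnFreeVariables <- function(FUN) {
    if (!requireNamespace("codetools", quietly = TRUE))
        return(invisible(FALSE))    # check skipped
    vars <- codetools::findGlobals(FUN, merge = FALSE)$variables
    if (length(vars) > 0L)
        warning("FUN refers to free variable(s): ",
                paste(vars, collapse = ", "))
    invisible(TRUE)                 # check was performed
}

offset <- 3
msg <- tryCatch(warnOnFreeVariables(function(x) x + offset),
                warning = function(w) conditionMessage(w))
msg
```

Returning FALSE when the check is skipped lets a caller tell "no problems found" apart from "not checked".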



Re: [Bioc-devel] BiocParallel: Best standards for passing locally assigned variables/functions, e.g. a bpExport()?

2013-11-03 Thread Gabriel Becker
Ryan (et al),

FYI:

> f
function() {
x = rnorm(x)
x
}
> findGlobals(f)
[1] "="     "{"     "rnorm"

x should be in the list of globals but it isn't.

~G

> sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] codetools_0.2-8
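
Regarding the findGlobals() output above, a short sketch of why x goes unreported, as far as I understand codetools: any symbol that is assigned somewhere in the function body is treated as a local variable, so a use of x before its assignment is not flagged. The function g() below is made up for contrast.

```r
library(codetools)

## x is read (as the argument to rnorm) before it is assigned, but
## because it is assigned somewhere in the body it counts as local.
f <- function() {
    x = rnorm(x)
    x
}

## Same computation, but the result goes to a different name, so x
## is never assigned and is reported as a global variable.
g <- function() {
    y = rnorm(x)
    y
}

findGlobals(f, merge = FALSE)$variables
findGlobals(g, merge = FALSE)$variables
```

With merge = FALSE the result is split into $functions and $variables, which makes it easier to see that only g() reports x.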



On Sun, Nov 3, 2013 at 5:37 PM, Ryan r...@thompsonclan.org wrote:

  Looking at the codetools package, I think findGlobals is basically
 exactly what we want here, right? As you say, there are necessarily
 limitations due to R being a dynamic language, but the goal is to catch
 common errors, not stop people from tricking the check.

 I think I'll try to code something up soon.

 -Ryan



Re: [Bioc-devel] BiocParallel: Best standards for passing locally assigned variables/functions, e.g. a bpExport()?

2013-11-03 Thread Ryan
Ok, here is my attempt at a function to get the list of user-defined 
free variables that a function refers to:


https://gist.github.com/DarwinAwardWinner/7298557

It uses codetools, so it is subject to the limitations of that package, 
but for simple examples, it successfully detects when a function refers 
to something in the global env.
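
The gist itself is not reproduced here. As a rough illustration of the idea only (this is not the gist's code), the following sketch keeps those symbols reported by codetools::findGlobals() that are defined directly in a given environment, by default the global environment. The name findUserGlobals() and its envir argument are assumptions made for this example.

```r
## Hypothetical helper: which symbols used by FUN are defined
## directly in `envir` (the user's workspace by default), as opposed
## to coming from attached packages?
findUserGlobals <- function(FUN, envir = globalenv()) {
    globals <- codetools::findGlobals(FUN, merge = TRUE)
    inWorkspace <- vapply(globals, exists, logical(1),
                          envir = envir, inherits = FALSE)
    globals[inWorkspace]
}

scale_factor <- 10
rescale <- function(x) x * scale_factor

## `*` lives in base, so only scale_factor is reported.
findUserGlobals(rescale, envir = environment())
```

Anything this returns would need to be exported to the workers before FUN can run there. It inherits the codetools limitations discussed elsewhere in this thread.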



On Sun Nov  3 14:01:48 2013, Henrik Bengtsson wrote:

On Sun, Nov 3, 2013 at 1:29 PM, Michael Lawrence
lawrence.mich...@gene.com wrote:

An analog to clusterExport is a good idea. To make it even easier, we could
have a dynamic environment based on object tables that would catch missing
symbols and download them from the parent thread. But maybe there's some
benefit to being explicit?


A first step to fully automate this would be to provide some (opt
in/out) mechanism for code inspection and warn about non-defined
objects (cf. 'R CMD check').  That is of course major work, but will
certainly spare the community/users 1000's of hours in troubleshooting
and the mailing lists from "why doesn't my parallel code work?"
messages.  Such protection may be better suited for the 'parallel'
package though.