Re: [Bioc-devel] BiocParallel: Best standards for passing locally assigned variables/functions, e.g. a bpExport()?
On 11/04/2013 11:34 AM, Michael Lawrence wrote:
> The dynamic nature of R limits the extent of these checks. But as Ryan has noted, a simple sanity check goes a long way. If what he has done could be extended to the rest of the search path (people always forget to attach packages), I think we've hit the 80% with 20%.
>
> Got a 404 on that URL btw.

I added three issues to BiocParallel on github.

1. bpexport

2. a function to check for non-local use. I think this should use codetools (to avoid adding additional dependencies) but I'm a little flexible. Contributions welcome on github, especially as a pull request with code formatted consistently, a man page, and especially unit tests to provide a clear understanding of circumstances covered or not. Michel Lang's Recall and the implementation in foreach also sound relevant here.

3. integration of (2) into bplapply etc.

Please feel free to address these further on github.

Martin

> Michael
>
> On Mon, Nov 4, 2013 at 11:05 AM, Gabriel Becker gmbec...@ucdavis.edu wrote:
>> Hey guys,
>>
>> Here is code that I have written which resolves library names into a full list of symbols:
>>
>> https://github.com/duncantl/CodeDepends/blob/forCRAN_0.3.5/R/librarySymbols.R
>>
>> Note this does not require that the packages actually be loaded at the time of the check, and does not load them (or rather, it loads them but does not attach them, so no searchpath muddying occurs). You do need a list of packages to check though (it adds the base ones automatically). It handles dependencies and could be easily extended to handle suggests as well, I think.
>>
>> When CodeDepends gets pushed to CRAN (not my call and not high on my priority list to push for currently) it will actually do exactly what you want. (The forCRAN_0.3.5 branch already does and I believe it is documented, so you could use devtools to install it now.)
>>
>> As a side note, I'm not sure that existence of a symbol is sufficient (it certainly is necessary). What about situations where the symbol exists but is stale compared to the value in the parent? Are we sure that can never happen?
>>
>> ~G
>>
>> On Mon, Nov 4, 2013 at 7:29 AM, Michel Lang michell...@gmail.com wrote:
>>> You might want to consider using Recall() for recursion, which should solve this. Determining the required variables using heuristics such as those in codetools will probably lead to some confusion when using functions which include calls to, e.g., with():
>>>
>>> f = function() {
>>>   with(iris, Sepal.Length + Sepal.Width)
>>> }
>>> codetools:::findGlobals(f)
>>>
>>> I would suggest writing up some documentation on what the function's environment contains and how to define variables accordingly, or why it can generally be considered a good idea to pass everything essential as an argument. Nevertheless, a bpExport function would be a good addition for some rare corner cases in my opinion.
>>>
>>> Michel
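The concern about with() can be reproduced in a few lines. This is only a sketch of the behaviour under discussion, assuming the codetools package is installed and the datasets package is available; the variable names `globals` and `false_positives` are made up for illustration.

```r
library(codetools)

# Michel's example: Sepal.Length and Sepal.Width are columns of iris that
# with() resolves at run time; they are not variables in any enclosing
# environment.
f <- function() {
  with(iris, Sepal.Length + Sepal.Width)
}

globals <- findGlobals(f)
print(globals)

# Static analysis does not know what with() does to its second argument,
# so the column names show up as if they were free variables.
false_positives <- intersect(globals, names(datasets::iris))
print(false_positives)
```

A check built on findGlobals() would therefore ask the user to export two names that never need exporting, which is the kind of confusion the message above warns about.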
Re: [Bioc-devel] BiocParallel: Best standards for passing locally assigned variables/functions, e.g. a bpExport()?
The 'foreach' framework does this sort of analysis using codetools at least in part. You may be able to build on what they have.

luke

On Mon, 4 Nov 2013, Ryan wrote:
> On 11/4/13, 11:05 AM, Gabriel Becker wrote:
>> As a side note, I'm not sure that existence of a symbol is sufficient (it certainly is necessary). What about situations where the symbol exists but is stale compared to the value in the parent? Are we sure that can never happen?
>
> I think this is a different issue. We want to detect when a function depends on variables outside that function in the user's workspace, or variables defined in a package that the user has loaded. I think we can assume that R child processes will be of the same version with the same set of installed packages, so package-defined variables will not have different values in child processes. For user variables, I think the goal should be to prevent (or at least highly discourage) dependencies on them entirely, so I don't think it matters what their value may be in the child. I realize this is somewhat counter to the question that started this thread, which was about exporting variables to the children, but I think it is the most straightforward approach.
>
> As I believe someone noted earlier in the thread, Henrik's original problem of a recursive function is properly solved by using the Recall function.
>
> -Ryan

--
Luke Tierney
Chair, Statistics and Actuarial Science
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa                  Phone:  319-335-3386
Department of Statistics and        Fax:    319-335-3017
   Actuarial Science
241 Schaeffer Hall                  email:  luke-tier...@uiowa.edu
Iowa City, IA 52242                 WWW:    http://www.stat.uiowa.edu

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel
Re: [Bioc-devel] BiocParallel: Best standards for passing locally assigned variables/functions, e.g. a bpExport()?
Actually, the check that I proposed is only supposed to check for usage of user-defined variables, not variables from packages. Truthfully, though, I guess I'm not the right person to work on this, since in practice I use forked processes for the vast majority of my inside-R parallelization, so I never have to worry about things being undefined in the forked subprocess. Therefore I can't really dogfood any of the stuff that might be implemented as a result of this thread.

-Ryan

On Mon Nov 4 03:48:23 2013, Michael Lawrence wrote:
> So what is the best practice for ensuring that something is actually visible to the worker? If the worker needs functionality from a package, should the namespace be explicitly referenced via ::? Lazy users might want to include library() calls in the worker function. This proposed check will then throw an exception. Probably a good thing, but is there a way for a user to declare imported namespaces? I know that BatchJobs allows for passing a list of packages to be loaded via library() on the worker. That is leveraging the search path to make sure everything is visible and is a reasonable compromise (:: is always an option). We could essentially reimplement the search path if we wanted isolation, but the worker is already isolated. Anyway, somehow those types of declarations should be taken into account.
>
> Moving back to the general discussion, for complex operations, it's easiest to have the worker in a package. In that case, the worker will likely rely on other functions, and the cleanest way to get those functions to the worker is to have them installed as a package. At least with BatchJobs, when the worker is inside a package namespace, that namespace is automatically loaded (but not attached), so all functions are automatically visible, without any extra work by me.
>
> Michael
>
> On Sun, Nov 3, 2013 at 10:46 PM, Ryan r...@thompsonclan.org wrote:
>> Ok, here is my attempt at a function to get the list of user-defined free variables that a function refers to:
>>
>> https://gist.github.com/DarwinAwardWinner/7298557
>>
>> It uses codetools, so it is subject to the limitations of that package, but for simple examples, it successfully detects when a function refers to something in the global env.
>>
>> On Sun Nov 3 21:14:29 2013, Gabriel Becker wrote:
>>> Ryan (et al),
>>>
>>> FYI:
>>>
>>>     f <- function() {
>>>         x = rnorm(x)
>>>         x
>>>     }
>>>     findGlobals(f)
>>>     [1] "="     "{"     "rnorm"
>>>
>>> x should be in the list of globals but it isn't.
>>>
>>> ~G
>>>
>>>     sessionInfo()
>>>     R version 3.0.2 (2013-09-25)
>>>     Platform: x86_64-pc-linux-gnu (64-bit)
>>>
>>>     locale:
>>>      [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>>      [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>>      [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>>      [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>>>      [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>>     [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>
>>>     attached base packages:
>>>     [1] stats     graphics  grDevices utils     datasets  methods   base
>>>
>>>     other attached packages:
>>>     [1] codetools_0.2-8
>>>
>>> On Sun, Nov 3, 2013 at 5:37 PM, Ryan r...@thompsonclan.org wrote:
>>>> Looking at the codetools package, I think findGlobals is basically exactly what we want here, right? As you say, there are necessarily limitations due to R being a dynamic language, but the goal is to catch common errors, not stop people from tricking the check. I think I'll try to code something up soon.
>>>>
>>>> -Ryan
>>>>
>>>> On 11/3/13, 5:10 PM, Gabriel Becker wrote:
>>>>> Henrik,
>>>>>
>>>>> See https://github.com/duncantl/CodeDepends (as used by https://github.com/gmbecker/RCacheSuite). It will identify necessarily defined symbols (input variables) for code that is not doing certain tricks (e.g. get(), mixing data.frame columns and global variables in formulas, etc.).
>>>>>
>>>>> Tierney's codetools package also does things along these lines but there are some situations where it has trouble. I can give more detail if desired.
>>>>>
>>>>> ~G
>>>>>
>>>>> On Sun, Nov 3, 2013 at 3:04 PM, Ryan r...@thompsonclan.org wrote:
>>>>>> Another potential easy
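The check described in this message (user-defined free variables only) and the counterexample quoted in it can both be sketched with codetools. This is not the code from the gist; `user_globals` is a hypothetical helper written for illustration, and it assumes that "user-defined" means "bound directly in the function's enclosing environment".

```r
library(codetools)

# Hypothetical helper: names that `fun` uses and that are bound directly in
# its enclosing environment (normally the user's workspace), as opposed to
# names that come from packages further up the search path.
user_globals <- function(fun) {
  env <- environment(fun)
  globals <- findGlobals(fun)
  is_user_defined <- vapply(
    globals,
    function(name) exists(name, envir = env, inherits = FALSE),
    logical(1)
  )
  unname(globals[is_user_defined])
}

scale_factor <- 10
rescale <- function(x) x * scale_factor
worker <- function(x) rescale(x) + 1

print(user_globals(worker))   # what `worker` needs from the workspace
print(user_globals(rescale))  # what `rescale` needs in turn

# The quoted counterexample: `x` is read before it is assigned, so it is a
# free variable at that point, but findGlobals() sees the assignment and
# treats `x` as local throughout the function.
g <- function() {
  x = rnorm(x)
  x
}
print("x" %in% findGlobals(g))
```

Note that the helper is not transitive: `worker` depends on `rescale`, which depends on `scale_factor`, so a real check would have to follow that chain as well.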
Re: [Bioc-devel] BiocParallel: Best standards for passing locally assigned variables/functions, e.g. a bpExport()?
Weird, I guess it needs to be logged in or something. I don't know if the issue is that it's in a non-master branch or what. The repo is fully public and the forCRAN_0.3.5 branch definitely exists on github.

I started Chrome (where I'm not logged into github) and got the same 404 error, but after navigating to the file by going to the repo, changing the branch and browsing to the file, it now works even when I quit Chrome and restart it. I don't know if it needed me to do that or if there was an intermittent problem that is now fixed.

Anyway, here is the raw code, the link for which seems to work (in a browser where I'm not logged into github). If it still doesn't, I can just attach the file here if you want. It doesn't rely on any of the rest of the CodeDepends machinery.

https://raw.github.com/duncantl/CodeDepends/forCRAN_0.3.5/R/librarySymbols.R

~G

On Mon, Nov 4, 2013 at 11:34 AM, Michael Lawrence lawrence.mich...@gene.com wrote:
> The dynamic nature of R limits the extent of these checks. But as Ryan has noted, a simple sanity check goes a long way. If what he has done could be extended to the rest of the search path (people always forget to attach packages), I think we've hit the 80% with 20%.
>
> Got a 404 on that URL btw.
>
> Michael
Re: [Bioc-devel] BiocParallel: Best standards for passing locally assigned variables/functions, e.g. a bpExport()?
The code that I wrote intentionally avoids checking for package variables, since I consider that a separate problem. Package variables can be provided to the child by loading the package, whereas user-defined variables must be serialized in the parent and sent to the child.

I think I could fairly easily adapt the same code to return a list of all packages that a function depends on.

-Ryan

On Nov 4, 2013 11:35 AM, Michael Lawrence lawrence.mich...@gene.com wrote:
> The dynamic nature of R limits the extent of these checks. But as Ryan has noted, a simple sanity check goes a long way. If what he has done could be extended to the rest of the search path (people always forget to attach packages), I think we've hit the 80% with 20%.
>
> Got a 404 on that URL btw.
>
> Michael
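The distinction drawn in this message, package variables that arrive by loading the package versus user variables that have to be serialized, can be sketched by walking the function's lookup path. `global_origins` is a hypothetical helper, not part of BiocParallel or the gist, and the sketch assumes the default set of attached packages (including stats).

```r
library(codetools)

# Hypothetical helper: for every global name used by `fun`, report the name
# of the first environment on its lookup path that defines it.
global_origins <- function(fun) {
  globals <- findGlobals(fun)
  vapply(globals, function(name) {
    env <- environment(fun)
    while (!identical(env, emptyenv())) {
      if (exists(name, envir = env, inherits = FALSE)) {
        return(environmentName(env))
      }
      env <- parent.env(env)
    }
    NA_character_  # not found anywhere: would fail even in the parent
  }, character(1))
}

offset <- 3
worker <- function(n) median(rnorm(n)) + offset

origins <- global_origins(worker)
print(origins)

# Names provided by attached packages: load these packages on the worker.
packages <- unique(sub("^package:", "",
                       grep("^package:", origins, value = TRUE)))

# Names found elsewhere (the user's workspace): serialize and send these.
found <- !is.na(origins)
to_export <- names(origins)[found &
                            !grepl("^package:", origins) &
                            origins != "base"]

print(packages)
print(to_export)
```

The split relies on attached package environments being named `package:<name>` and the base environment being named `base`; anything else that is found is treated as user-defined.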
Re: [Bioc-devel] BiocParallel: Best standards for passing locally assigned variables/functions, e.g. a bpExport()?
Ryan,

I agree that in some sense it is a different problem, but my point is that with a different approach we can easily answer both. The code I posted returns a named character vector of symbol names, with the package name being the name of each element. This makes it a trivial lookup to determine both a) what symbols aren't available in any of the packages and b) what packages provide the remaining required symbols. No extra work required.

You do have to give it a list of packages to check, but it is easy to write a wrapper that automatically passes it all currently attached packages if desired (a combination of search() and gsub() would be a quick and dirty way to do this).

All that said, I'm simply trying to help. If you guys don't want to use my code/approach, that is your prerogative, as I'm not currently working on BiocParallel myself.

~G

On Mon, Nov 4, 2013 at 11:54 AM, Ryan Thompson r...@thompsonclan.org wrote:
> The code that I wrote intentionally avoids checking for package variables, since I consider that a separate problem. Package variables can be provided to the child by loading the package, whereas user-defined variables must be serialized in the parent and sent to the child.
>
> I think I could fairly easily adapt the same code to return a list of all packages that a function depends on.
>
> -Ryan
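The lookup described in this message can be approximated with base R alone. This is not the librarySymbols.R code: it is a rough sketch that only covers packages already attached (the linked code is said to work without attaching), and `symbol_table`, `needed` and `my_local_helper` are names invented for the example.

```r
# Attached packages, derived from the search path as suggested above.
attached <- sub("^package:", "", grep("^package:", search(), value = TRUE))

# Symbol names as values, providing package as the name of each element.
symbol_table <- unlist(lapply(attached, function(pkg) {
  symbols <- ls(paste0("package:", pkg), all.names = TRUE)
  stats::setNames(symbols, rep(pkg, length(symbols)))
}))

# Symbols that some worker function is assumed to need.
needed <- c("rnorm", "median", "my_local_helper")

# a) symbols not available in any attached package
missing_symbols <- setdiff(needed, symbol_table)

# b) packages that provide the remaining symbols
available <- setdiff(needed, missing_symbols)
providers <- unique(names(symbol_table)[match(available, symbol_table)])

print(missing_symbols)
print(providers)
```

Whatever ends up in `missing_symbols` is either user-defined, and so a candidate for exporting, or lives in a package that has not been attached.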
Re: [Bioc-devel] BiocParallel: Best standards for passing locally assigned variables/functions, e.g. a bpExport()?
On 11/4/13, 11:05 AM, Gabriel Becker wrote:
> As a side note, I'm not sure that existence of a symbol is sufficient (it certainly is necessary). What about situations where the symbol exists but is stale compared to the value in the parent? Are we sure that can never happen?

I think this is a different issue. We want to detect when a function depends on variables outside that function in the user's workspace, or variables defined in a package that the user has loaded. I think we can assume that R child processes will be of the same version with the same set of installed packages, so package-defined variables will not have different values in child processes. For user variables, I think the goal should be to prevent (or at least highly discourage) dependencies on them entirely, so I don't think it matters what their value may be in the child. I realize this is somewhat counter to the question that started this thread, which was about exporting variables to the children, but I think it is the most straightforward approach.

As I believe someone noted earlier in the thread, Henrik's original problem of a recursive function is properly solved by using the Recall function.

-Ryan
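For reference, the Recall() suggestion applied to the fib example from the start of the thread looks roughly as follows. The worker is only simulated here, by copying the function through serialization and removing the original binding; no BiocParallel or BatchJobs call is involved, and `shipped` is a made-up name.

```r
fib <- function(n = 0) {
  if (n < 0) stop("Invalid 'n': ", n)
  if (n == 0 || n == 1) return(1)
  # Recall() calls the function that is currently executing, whatever name
  # (if any) it is bound to in the calling process.
  Recall(n - 2) + Recall(n - 1)
}

# Simulate a worker that receives the function but has no `fib` binding:
# copy it through serialization and drop the original name.
shipped <- unserialize(serialize(fib, NULL))
rm(fib)

values <- vapply(0:9, shipped, numeric(1))
print(values)
```

Because the function no longer refers to itself by name, nothing has to be exported for the recursion to work.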
Re: [Bioc-devel] BiocParallel: Best standards for passing locally assigned variables/functions, e.g. a bpExport()?
An analog to clusterExport is a good idea. To make it even easier, we could have a dynamic environment based on object tables that would catch missing symbols and download them from the parent thread. But maybe there's some benefit to being explicit?

Michael

On Sun, Nov 3, 2013 at 12:39 PM, Henrik Bengtsson h...@biostat.ucsf.edu wrote:
> Hi,
>
> in BiocParallel, are there suggested (or planned) best standards for making *locally* assigned variables (e.g. functions) available to the applied function when it runs in a separate R process (which will be the most common use case)? I understand that local variables should be avoided and that it's preferred to put as much as possible in packages, but that's not always possible or very convenient.
>
> EXAMPLE:
>
> library('BiocParallel')
> library('BatchJobs')
>
> # Here I pick a recursive function to make the problem a bit harder, i.e.
> # the function needs to call itself ("itself" = see below)
> fib <- function(n=0) {
>   if (n < 0) stop("Invalid 'n': ", n)
>   if (n == 0 || n == 1) return(1)
>   fib(n-2) + fib(n-1)
> }
>
> # Executing in the current R session
> cluster.functions <- makeClusterFunctionsInteractive()
> bpParams <- BatchJobsParam(cluster.functions=cluster.functions)
> register(bpParams)
> values <- bplapply(0:9, FUN=fib)
> ## SubmitJobs |++| 100% (00:00:00)
> ## Waiting [S:0 R:0 D:10 E:0] |+++| 100% (00:00:00)
>
> # Executing in a separate R process, where fib() is not defined
> # (not specific to BiocParallel)
> cluster.functions <- makeClusterFunctionsLocal()
> bpParams <- BatchJobsParam(cluster.functions=cluster.functions)
> register(bpParams)
> values <- bplapply(0:9, FUN=fib)
> ## SubmitJobs |++| 100% (00:00:00)
> ## Waiting [S:0 R:0 D:10 E:0] |+++| 100% (00:00:00)
> Error in LastError$store(results = results, is.error = !ok, throw.error = TRUE) :
>   Errors occurred during execution. First error message:
> Error in FUN(...): could not find function "fib"
> [...]
>
> # The following illustrates that the solution is not always straightforward.
> # (not specific to BiocParallel; must have been discussed previously)
> values <- bplapply(0:9, FUN=function(n, fib) {
>   fib(n)
> }, fib=fib)
> Error in LastError$store(results = results, is.error = !ok, throw.error = TRUE) :
>   Errors occurred during execution. First error message:
> Error in fib(n): could not find function "fib"
> [...]
>
> # Workaround; make fib() aware of itself
> # (this is something the user needs to do, and would be very
> # hard for BiocParallel et al. to automate. BTW, should all
> # recursive functions be implemented this way?).
> fib <- function(n=0) {
>   if (n < 0) stop("Invalid 'n': ", n)
>   if (n == 0 || n == 1) return(1)
>   fib <- sys.function()  # Make function aware of itself
>   fib(n-2) + fib(n-1)
> }
> values <- bplapply(0:9, FUN=function(n, fib) {
>   fib(n)
> }, fib=fib)
>
> WISHLIST:
> Considering the above recursive issue solved, a slightly more explicit and standardized solution is then:
>
> values <- bplapply(0:9, FUN=function(n, BPGLOBALS=NULL) {
>   for (name in names(BPGLOBALS)) assign(name, BPGLOBALS[[name]])
>   fib(n)
> }, BPGLOBALS=list(fib=fib))
>
> Could the above be generalized into something as neat as:
>
> bpExport(fib)
> values <- bplapply(0:9, FUN=function(n) {
>   BiocParallel::bpImport(fib)
>   fib(n)
> })
>
> or ideally just (analogously to parallel::clusterExport()):
>
> bpExport(fib)
> values <- bplapply(0:9, FUN=fib)
>
> /Henrik
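One way to read the BPGLOBALS idea is to bundle the exported objects into the worker function's own environment, so that they travel with it when it is serialized. The sketch below is only an illustration of that idea: `with_exports` is a hypothetical helper and not a BiocParallel function, the worker is simulated with serialize()/unserialize(), and re-homing the exported functions is only safe if they were defined at top level (it would discard any closure state they carry).

```r
# Hypothetical helper: attach named exports to FUN by giving it a private
# environment that holds them. Exported closures are re-homed into the same
# environment so that they can find each other (and themselves).
with_exports <- function(FUN, ...) {
  env <- list2env(list(...), parent = globalenv())
  for (name in ls(env, all.names = TRUE)) {
    value <- get(name, envir = env)
    if (is.function(value) && !is.primitive(value)) {
      environment(value) <- env
      assign(name, value, envir = env)
    }
  }
  environment(FUN) <- env
  FUN
}

fib <- function(n = 0) {
  if (n < 0) stop("Invalid 'n': ", n)
  if (n == 0 || n == 1) return(1)
  fib(n - 2) + fib(n - 1)
}

worker <- with_exports(function(n) fib(n), fib = fib)

# Simulate shipping to a process where `fib` is not defined.
shipped <- unserialize(serialize(worker, NULL))
rm(fib)

values <- vapply(0:9, shipped, numeric(1))
print(values)
```

Compared with the BPGLOBALS argument, the worker function's signature stays unchanged and the recursive function does not need the sys.function() workaround.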
Re: [Bioc-devel] BiocParallel: Best standards for passing locally assigned variables/functions, e.g. a bpExport()?
On Sun, Nov 3, 2013 at 1:29 PM, Michael Lawrence lawrence.mich...@gene.com wrote:

An analog to clusterExport is a good idea. To make it even easier, we could have a dynamic environment based on object tables that would catch missing symbols and download them from the parent thread. But maybe there's some benefit to being explicit?

A first step to fully automate this would be to provide some (opt in/out) mechanism for code inspection and warn about non-defined objects (cf. 'R CMD check'). That is of course major work, but will certainly spare the community/users 1000's of hours in troubleshooting and the mailing lists from "why doesn't my parallel code work?" messages. Such protection may be better suited for the 'parallel' package though. Unfortunately, it's beyond my skills/time to pull such a thing together.

/Henrik

Michael

On Sun, Nov 3, 2013 at 12:39 PM, Henrik Bengtsson h...@biostat.ucsf.edu wrote:

Hi,

in BiocParallel, is there a suggested (or planned) best standard for making *locally* assigned variables (e.g. functions) available to the applied function when it runs in a separate R process (which will be the most common use case)? I understand that local variables should be avoided and that it's preferred to put as much as possible in packages, but that's not always possible or very convenient.

EXAMPLE:

    library('BiocParallel')
    library('BatchJobs')

    # Here I pick a recursive function to make the problem a bit harder, i.e.
    # the function needs to call itself (itself = see below)
    fib <- function(n=0) {
      if (n < 0) stop("Invalid 'n': ", n)
      if (n == 0 || n == 1) return(1)
      fib(n-2) + fib(n-1)
    }

    # Executing in the current R session
    cluster.functions <- makeClusterFunctionsInteractive()
    bpParams <- BatchJobsParam(cluster.functions=cluster.functions)
    register(bpParams)
    values <- bplapply(0:9, FUN=fib)
    ## SubmitJobs |++| 100% (00:00:00)
    ## Waiting [S:0 R:0 D:10 E:0] |+++| 100% (00:00:00)

    # Executing in a separate R process, where fib() is not defined
    # (not specific to BiocParallel)
    cluster.functions <- makeClusterFunctionsLocal()
    bpParams <- BatchJobsParam(cluster.functions=cluster.functions)
    register(bpParams)
    values <- bplapply(0:9, FUN=fib)
    ## SubmitJobs |++| 100% (00:00:00)
    ## Waiting [S:0 R:0 D:10 E:0] |+++| 100% (00:00:00)
    Error in LastError$store(results = results, is.error = !ok, throw.error = TRUE) :
      Errors occurred during execution. First error message:
    Error in FUN(...): could not find function "fib"
    [...]

    # The following illustrates that the solution is not always straightforward.
    # (not specific to BiocParallel; must have been discussed previously)
    values <- bplapply(0:9, FUN=function(n, fib) {
      fib(n)
    }, fib=fib)
    Error in LastError$store(results = results, is.error = !ok, throw.error = TRUE) :
      Errors occurred during execution. First error message:
    Error in fib(n): could not find function "fib"
    [...]

    # Workaround; make fib() aware of itself
    # (this is something the user needs to do, and would be very
    # hard for BiocParallel et al. to automate. BTW, should all
    # recursive functions be implemented this way?).
    fib <- function(n=0) {
      if (n < 0) stop("Invalid 'n': ", n)
      if (n == 0 || n == 1) return(1)
      fib <- sys.function()  # Make function aware of itself
      fib(n-2) + fib(n-1)
    }
    values <- bplapply(0:9, FUN=function(n, fib) {
      fib(n)
    }, fib=fib)

WISHLIST:

Considering the above recursive issue solved, a slightly more explicit and standardized solution is then:

    values <- bplapply(0:9, FUN=function(n, BPGLOBALS=NULL) {
      for (name in names(BPGLOBALS)) assign(name, BPGLOBALS[[name]])
      fib(n)
    }, BPGLOBALS=list(fib=fib))

Could the above be generalized into something as neat as:

    bpExport(fib)
    values <- bplapply(0:9, FUN=function(n) {
      BiocParallel::bpImport(fib)
      fib(n)
    })

or ideally just (analogously to parallel::clusterExport()):

    bpExport(fib)
    values <- bplapply(0:9, FUN=fib)

/Henrik

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel
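For comparison, the parallel::clusterExport() pattern that the wish list alludes to can be sketched as follows. This is only a minimal sketch using the 'parallel' package that ships with R, not BiocParallel; bpExport() and bpImport() remain proposals in this thread, and the two-worker local cluster is an arbitrary assumption made for illustration.

```r
library(parallel)

fib <- function(n = 0) {
  if (n < 0) stop("Invalid 'n': ", n)
  if (n == 0 || n == 1) return(1)
  fib(n - 2) + fib(n - 1)
}

# Two local worker processes (an arbitrary choice for this sketch).
cl <- makeCluster(2)

# Copy fib() into each worker's global environment under the same name,
# so that the recursive calls inside fib() can be resolved there.
clusterExport(cl, "fib", envir = environment())

values <- parLapply(cl, 0:9, fib)
stopCluster(cl)

print(unlist(values))
```

The export step is explicit here, which is the trade-off Michael raises: the user has to remember it, but nothing is copied to the workers behind their back.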
Re: [Bioc-devel] BiocParallel: Best standards for passing locally assigned variables/functions, e.g. a bpExport()?
Here's an easy thing we can add to BiocParallel in the short term. The following code defines a wrapper function withBPExtraErrorText that simply appends an additional message to the end of any error that looks like it is about a missing variable. We could wrap every evaluation in a similar tryCatch to at least provide a more informative error message when a subprocess has a missing variable.

-Ryan

    withBPExtraErrorText <- function(expr) {
      tryCatch({
        expr
      }, simpleError = function(err) {
        if (grepl("^object '(.*)' not found$", err$message, perl=TRUE)) {
          ## It is an error due to a variable not found.
          err$message <- paste0(err$message, ". Maybe you forgot to export this variable from the main R session using \"bpexport\"?")
        }
        stop(err)
      })
    }

    x <- 5
    ## Succeeds
    withBPExtraErrorText(x)
    ## Fails with more informative error message
    withBPExtraErrorText(y)

On Sun Nov 3 14:01:48 2013, Henrik Bengtsson wrote:
[...]
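Note that the failure in Henrik's example is reported as "could not find function", which the "object ... not found" pattern in Ryan's wrapper would not match. A variant that covers both cases might look like the sketch below. The name withExportHint and the wording of the hint are made up for illustration, and the pattern assumes English error messages, so translated messages would pass through unchanged.

```r
# Ask R for English messages where message translation is in effect.
Sys.setenv(LANGUAGE = "en")

# Matches both missing objects and missing functions (English messages only).
missing_symbol_pattern <-
  "^object '.*' not found$|^could not find function \".*\"$"

# Hypothetical helper: evaluate expr and add a hint to "missing symbol" errors.
withExportHint <- function(expr) {
  tryCatch(expr, error = function(err) {
    msg <- conditionMessage(err)
    if (grepl(missing_symbol_pattern, msg)) {
      msg <- paste0(msg, ". Maybe you forgot to export this symbol ",
                    "from the main R session?")
    }
    # Re-signal the error, keeping the original call for context.
    stop(simpleError(msg, call = conditionCall(err)))
  })
}

# Capture the augmented message instead of aborting the script.
hint <- tryCatch(withExportHint(undefinedHelperFunction(1)),
                 error = function(e) conditionMessage(e))
print(hint)

# Unrelated errors pass through unchanged.
other <- tryCatch(withExportHint(stop("boom")),
                  error = function(e) conditionMessage(e))
print(other)
```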
Re: [Bioc-devel] BiocParallel: Best standards for passing locally assigned variables/functions, e.g. a bpExport()?
Another potential easy step: if FUN is a function in the user's workspace, we automatically export that function under the same name in the children. This would make recursive functions just work, but it might be a bit too magical.

On 11/3/13, 2:38 PM, Ryan wrote:
[...]
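The "export FUN under its own name" idea can be illustrated without any cluster at all. In the sketch below a fresh environment stands in for a worker's empty workspace; run_in_fresh_env is a hypothetical helper written for this illustration, not a BiocParallel function, and a real implementation would have to do the equivalent on the remote side.

```r
fib <- function(n = 0) {
  if (n < 0) stop("Invalid 'n': ", n)
  if (n == 0 || n == 1) return(1)
  fib(n - 2) + fib(n - 1)
}

# Hypothetical helper: apply FUN as if it ran in an empty workspace.
run_in_fresh_env <- function(X, FUN) {
  # Record how FUN was written in the call before touching FUN itself.
  fun_expr <- substitute(FUN)
  # Stand-in for a worker's workspace: only base R is visible from it.
  child <- new.env(parent = baseenv())
  environment(FUN) <- child
  # If FUN was passed by name, define it under that name in the "worker",
  # so a recursive function can find itself there.
  if (is.name(fun_expr)) {
    assign(as.character(fun_expr), FUN, envir = child)
  }
  lapply(X, FUN)
}

result <- run_in_fresh_env(0:5, fib)
print(unlist(result))
```

Without the assign() step, the call fails with the same "could not find function" error as in the original example, since the copy of fib() can no longer see the user's workspace.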
Re: [Bioc-devel] BiocParallel: Best standards for passing locally assigned variables/functions, e.g. a bpExport()?
Henrik,

See https://github.com/duncantl/CodeDepends (as used by https://github.com/gmbecker/RCacheSuite). It will identify necessarily defined symbols (input variables) for code that is not doing certain tricks (e.g. get(), mixing data.frame columns and global variables in formulas, etc.). Tierney's codetools package also does things along these lines, but there are some situations where it has trouble. I can give more detail if desired.

~G

On Sun, Nov 3, 2013 at 3:04 PM, Ryan r...@thompsonclan.org wrote:
[...]
Re: [Bioc-devel] BiocParallel: Best standards for passing locally assigned variables/functions, e.g. a bpExport()?
I guess all we need to do is to detect whether a function would try to access a free variable in the user's workspace, and warn/error if so. It looks like CodeDepends could do that. I could try to come up with an implementation. I guess we would add CodeDepends as an optional dependency for BiocParallel, and only do the checks if CodeDepends is available.

On Sun Nov 3 17:10:45 2013, Gabriel Becker wrote:
[...]
Re: [Bioc-devel] BiocParallel: Best standards for passing locally assigned variables/functions, e.g. a bpExport()?
Ryan (et al),

FYI:

    > f
    function() {
      x = rnorm(x)
      x
    }
    > findGlobals(f)
    [1] "="     "{"     "rnorm"

"x" should be in the list of globals but it isn't.

~G

    > sessionInfo()
    R version 3.0.2 (2013-09-25)
    Platform: x86_64-pc-linux-gnu (64-bit)

    locale:
     [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
     [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
     [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
     [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
     [9] LC_ADDRESS=C               LC_TELEPHONE=C
    [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

    attached base packages:
    [1] stats     graphics  grDevices utils     datasets  methods   base

    other attached packages:
    [1] codetools_0.2-8

On Sun, Nov 3, 2013 at 5:37 PM, Ryan r...@thompsonclan.org wrote:

Looking at the codetools package, I think findGlobals is basically exactly what we want here, right? As you say, there are necessarily limitations due to R being a dynamic language, but the goal is to catch common errors, not stop people from tricking the check. I think I'll try to code something up soon.

-Ryan

On 11/3/13, 5:10 PM, Gabriel Becker wrote:
[...]
Re: [Bioc-devel] BiocParallel: Best standards for passing locally assigned variables/functions, e.g. a bpExport()?
Ok, here is my attempt at a function to get the list of user-defined free variables that a function refers to:

https://gist.github.com/DarwinAwardWinner/7298557

It uses codetools, so it is subject to the limitations of that package, but for simple examples, it successfully detects when a function refers to something in the global env.

On Sun Nov 3 21:14:29 2013, Gabriel Becker wrote:
[...]
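To make the kind of check discussed in this thread concrete, here is a minimal sketch built on codetools::findGlobals(). It is not the code from the gist linked above; workspace_globals is a hypothetical name, and the check is skipped when codetools is not installed, in the spirit of the optional-dependency idea. As Gabriel's example shows, a name that is also assigned inside the function (x = rnorm(x)) would be missed.

```r
# Hypothetical helper: names that FUN refers to and that are defined
# directly in `envir` (by default the user's workspace).
workspace_globals <- function(FUN, envir = globalenv()) {
  if (!requireNamespace("codetools", quietly = TRUE)) {
    return(character(0))  # check skipped when codetools is unavailable
  }
  candidates <- codetools::findGlobals(FUN, merge = TRUE)
  # Keep only names bound in `envir` itself, which drops base functions
  # such as `*` that are found further along the search path.
  Filter(function(name) exists(name, envir = envir, inherits = FALSE),
         candidates)
}

# A separate environment stands in for the user's workspace here.
workspace <- new.env()
assign("scale_factor", 2, envir = workspace)
scale_it <- function(x) x * scale_factor

found <- workspace_globals(scale_it, envir = workspace)
print(found)
```

A caller such as bplapply could warn when the returned vector is non-empty, since those are the names that would have to be exported to the workers.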