[Bioc-devel] BiocParallel: Best standards for passing locally assigned variables/functions, e.g. a bpExport()?
Hi,

in BiocParallel, is there a suggested (or planned) best standard for making *locally* assigned variables (e.g. functions) available to the applied function when it runs in a separate R process (which will be the most common use case)? I understand that local variables should be avoided and that it's preferred to put as much as possible in packages, but that's not always possible or very convenient.

EXAMPLE:

library('BiocParallel')
library('BatchJobs')

# Here I pick a recursive function to make the problem a bit harder, i.e.
# the function needs to call itself (itself = see below)
fib <- function(n=0) {
  if (n < 0) stop("Invalid 'n': ", n)
  if (n == 0 || n == 1) return(1)
  fib(n-2) + fib(n-1)
}

# Executing in the current R session
cluster.functions <- makeClusterFunctionsInteractive()
bpParams <- BatchJobsParam(cluster.functions=cluster.functions)
register(bpParams)
values <- bplapply(0:9, FUN=fib)
## SubmitJobs |++| 100% (00:00:00)
## Waiting [S:0 R:0 D:10 E:0] |+++| 100% (00:00:00)

# Executing in a separate R process, where fib() is not defined
# (not specific to BiocParallel)
cluster.functions <- makeClusterFunctionsLocal()
bpParams <- BatchJobsParam(cluster.functions=cluster.functions)
register(bpParams)
values <- bplapply(0:9, FUN=fib)
## SubmitJobs |++| 100% (00:00:00)
## Waiting [S:0 R:0 D:10 E:0] |+++| 100% (00:00:00)
Error in LastError$store(results = results, is.error = !ok, throw.error = TRUE) :
  Errors occurred during execution. First error message:
Error in FUN(...): could not find function "fib"
[...]

# The following illustrates that the solution is not always straightforward.
# (not specific to BiocParallel; must have been discussed previously)
values <- bplapply(0:9, FUN=function(n, fib) { fib(n) }, fib=fib)
Error in LastError$store(results = results, is.error = !ok, throw.error = TRUE) :
  Errors occurred during execution. First error message:
Error in fib(n): could not find function "fib"
[...]

# Workaround; make fib() aware of itself
# (this is something the user needs to do, and would be very
# hard for BiocParallel et al. to automate. BTW, should all
# recursive functions be implemented this way?).
fib <- function(n=0) {
  if (n < 0) stop("Invalid 'n': ", n)
  if (n == 0 || n == 1) return(1)
  fib <- sys.function() # Make function aware of itself
  fib(n-2) + fib(n-1)
}
values <- bplapply(0:9, FUN=function(n, fib) { fib(n) }, fib=fib)

WISHLIST:
Considering the above recursive issue solved, a slightly more explicit and standardized solution is then:

values <- bplapply(0:9, FUN=function(n, BPGLOBALS=NULL) {
  for (name in names(BPGLOBALS)) assign(name, BPGLOBALS[[name]])
  fib(n)
}, BPGLOBALS=list(fib=fib))

Could the above be generalized into something as neat as:

bpExport("fib")
values <- bplapply(0:9, FUN=function(n) {
  BiocParallel::bpImport("fib")
  fib(n)
})

or ideally just (analogously to parallel::clusterExport()):

bpExport("fib")
values <- bplapply(0:9, FUN=fib)

/Henrik

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel
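The BPGLOBALS idea in this message can be sketched without touching the body of the applied function, by placing the exported objects in the function's enclosing environment so that they are serialized along with it. This is only a sketch in base R: attachGlobals() is a hypothetical helper name, not a BiocParallel function, and the serialize()/unserialize() round trip merely stands in for shipping the function to another R process.

```r
# Hypothetical helper (not part of BiocParallel): returns a copy of FUN whose
# enclosing environment holds the listed objects, so that they travel with FUN
# when it is serialized and sent to another R process.
attachGlobals <- function(FUN, globals, parent = globalenv()) {
  env <- new.env(parent = parent)
  for (name in names(globals)) {
    obj <- globals[[name]]
    # Re-home exported closures so they can see each other (and themselves).
    if (is.function(obj) && !is.primitive(obj)) environment(obj) <- env
    assign(name, obj, envir = env)
  }
  environment(FUN) <- env
  FUN
}

fib <- function(n = 0) {
  if (n < 0) stop("Invalid 'n': ", n)
  if (n == 0 || n == 1) return(1)
  fib(n - 2) + fib(n - 1)
}

worker <- attachGlobals(function(n) fib(n), globals = list(fib = fib))

# Simulate shipping the function elsewhere and losing the local fib().
shipped <- unserialize(serialize(worker, connection = NULL))
rm(fib)
values <- vapply(0:9, shipped, numeric(1))
```

Because the copy of fib() stored in the new environment also has that environment as its enclosure, the recursive call resolves without the sys.function() workaround.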
Re: [Bioc-devel] BiocParallel: Best standards for passing locally assigned variables/functions, e.g. a bpExport()?
An analog to clusterExport is a good idea. To make it even easier, we could have a dynamic environment based on object tables that would catch missing symbols and download them from the parent thread. But maybe there's some benefit to being explicit?

Michael

On Sun, Nov 3, 2013 at 12:39 PM, Henrik Bengtsson h...@biostat.ucsf.edu wrote:
[...]
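For reference, this is roughly what the explicit clusterExport() route looks like with the 'parallel' package, which is the behavior a bpExport() would mirror. It is a sketch assuming two local PSOCK workers can be started on the machine; nothing here is BiocParallel API.

```r
library(parallel)

fib <- function(n = 0) {
  if (n < 0) stop("Invalid 'n': ", n)
  if (n == 0 || n == 1) return(1)
  fib(n - 2) + fib(n - 1)
}

cl <- makeCluster(2)  # two local worker processes (assumption: sockets allowed)
# Copy fib() by name into the global environment of every worker.
clusterExport(cl, varlist = "fib", envir = environment())
values <- unlist(parLapply(cl, 0:9, function(n) fib(n)))
stopCluster(cl)
```

Being explicit costs one extra line, but it makes it obvious which objects are copied to the workers and when.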
Re: [Bioc-devel] BiocParallel: Best standards for passing locally assigned variables/functions, e.g. a bpExport()?
On Sun, Nov 3, 2013 at 1:29 PM, Michael Lawrence lawrence.mich...@gene.com wrote:
[...]

A first step to fully automate this would be to provide some (opt in/out) mechanism for code inspection and warn about non-defined objects (cf. 'R CMD check'). That is of course major work, but will certainly spare the community/users 1000's of hours in troubleshooting and the mailing lists from "why doesn't my parallel code work" messages. Such protection may be better suited for the 'parallel' package though. Unfortunately, it's beyond my skills/time to pull such a thing together.

/Henrik
Re: [Bioc-devel] BiocParallel: Best standards for passing locally assigned variables/functions, e.g. a bpExport()?
Here's an easy thing we can add to BiocParallel in the short term. The following code defines a wrapper function withBPExtraErrorText that simply appends an additional message to the end of any error that looks like it is about a missing variable. We could wrap every evaluation in a similar tryCatch to at least provide a more informative error message when a subprocess has a missing variable.

-Ryan

withBPExtraErrorText <- function(expr) {
  tryCatch({
    expr
  }, simpleError = function(err) {
    if (grepl("^object '(.*)' not found$", err$message, perl=TRUE)) {
      ## It is an error due to a variable not found.
      err$message <- paste0(err$message, ". Maybe you forgot to export this variable from the main R session using \"bpexport\"?")
    }
    stop(err)
  })
}

x <- 5
## Succeeds
withBPExtraErrorText(x)
## Fails with more informative error message
withBPExtraErrorText(y)

On Sun Nov 3 14:01:48 2013, Henrik Bengtsson wrote:
[...]
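Note that the error in the original report reads "could not find function", which the pattern in the wrapper above would not match. A hedged variant that covers both messages might look like the following; withExportHint() and the hint text are illustrative names, not existing BiocParallel code, and the patterns assume R's English error messages.

```r
# Hypothetical variant of the wrapper; name and hint text are illustrative.
withExportHint <- function(expr) {
  tryCatch(expr, error = function(err) {
    msg <- conditionMessage(err)
    # Assumes untranslated (English) R error messages.
    missingObject <- grepl("^object '.*' not found$", msg)
    missingFunction <- grepl("^could not find function \".*\"$", msg)
    if (missingObject || missingFunction) {
      err$message <- paste0(msg, ". Maybe it was not exported from the main R session?")
    }
    stop(err)  # re-signal the (possibly amended) condition
  })
}

# A call to a function that does not exist gets the extra hint ...
res <- tryCatch(withExportHint(undefinedHelper(1)), error = conditionMessage)
# ... while unrelated errors pass through unchanged.
other <- tryCatch(withExportHint(stop("boom")), error = conditionMessage)
```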
Re: [Bioc-devel] BiocParallel: Best standards for passing locally assigned variables/functions, e.g. a bpExport()?
Another potential easy step: if FUN is a function in the user's workspace, we automatically export that function under the same name in the children. This would make recursive functions just work, but it might be a bit too magical.

On 11/3/13, 2:38 PM, Ryan wrote:
[...]
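One way this automatic export could be sketched in base R is to look at how FUN was passed and, if it was a bare symbol, bind the function under that name in a small environment that travels with it. bindSelf() is a hypothetical name, and the serialize()/unserialize() round trip only imitates sending the function to a child process.

```r
# Hypothetical helper: if FUN was passed as a bare symbol (e.g. bindSelf(fib)),
# make the function visible to itself under that name.
bindSelf <- function(FUN) {
  expr <- substitute(FUN)
  # Anonymous functions, calls and primitives are returned untouched.
  if (!is.symbol(expr) || is.primitive(FUN)) return(FUN)
  env <- new.env(parent = environment(FUN))
  environment(FUN) <- env
  assign(as.character(expr), FUN, envir = env)
  FUN
}

fib <- function(n = 0) {
  if (n < 0) stop("Invalid 'n': ", n)
  if (n == 0 || n == 1) return(1)
  fib(n - 2) + fib(n - 1)
}

worker <- bindSelf(fib)
shipped <- unserialize(serialize(worker, connection = NULL))
rm(fib)  # the workspace copy is gone, as it would be in a child process
result <- shipped(6)
```

This only handles the function calling itself; any other workspace objects it uses would still need to be exported separately.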
Re: [Bioc-devel] BiocParallel: Best standards for passing locally assigned variables/functions, e.g. a bpExport()?
Henrik,

See https://github.com/duncantl/CodeDepends (as used by https://github.com/gmbecker/RCacheSuite). It will identify necessarily defined symbols (input variables) for code that is not doing certain tricks (e.g. get(), mixing data.frame columns and global variables in formulas, etc.).

Tierney's codetools package also does things along these lines but there are some situations where it has trouble. I can give more detail if desired.

~G

On Sun, Nov 3, 2013 at 3:04 PM, Ryan r...@thompsonclan.org wrote:
[...]
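The kind of trick mentioned here is easy to reproduce. The sketch below uses codetools rather than CodeDepends (a deliberate swap, since only codetools calls are used in this thread), and the function names are made up for the illustration: a variable referenced by symbol is reported, while the same variable fetched through get() with a string is invisible to static analysis.

```r
library(codetools)

direct <- function() threshold + 1           # refers to 'threshold' by symbol
indirect <- function() get("threshold") + 1  # refers to it only through a string

# merge = FALSE separates variables from functions in the result.
directVars <- findGlobals(direct, merge = FALSE)$variables
indirectVars <- findGlobals(indirect, merge = FALSE)$variables
```

Here directVars contains "threshold" while indirectVars is empty, so any automatic export built on this kind of inspection would miss the second case.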
Re: [Bioc-devel] BiocParallel: Best standards for passing locally assigned variables/functions, e.g. a bpExport()?
I guess all we need to do is to detect whether a function would try to access a free variable in the user's workspace, and warn/error if so. It looks like CodeDepends could do that. I could try to come up with an implementation. I guess we would add CodeDepends as an optional dependency for BiocParallel, and only do the checks if CodeDepends is available.

On Sun Nov 3 17:10:45 2013, Gabriel Becker wrote:
[...]
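The "only check if the package is available" pattern could be sketched as follows. Assumptions: codetools stands in for CodeDepends here, since its findGlobals() interface is the one discussed in this thread, and warnOnFreeVariables() is a hypothetical name rather than anything in BiocParallel.

```r
# Hypothetical check; silently skipped when the optional package is missing.
warnOnFreeVariables <- function(FUN) {
  if (!requireNamespace("codetools", quietly = TRUE)) {
    return(invisible(character(0)))
  }
  vars <- codetools::findGlobals(FUN, merge = FALSE)$variables
  if (length(vars) > 0) {
    warning("FUN refers to free variable(s) that may be missing in workers: ",
            paste(vars, collapse = ", "))
  }
  invisible(vars)
}

# 'offset' is not an argument or local variable, so it is flagged.
flagged <- suppressWarnings(warnOnFreeVariables(function(n) n + offset))
```

A real check would also want to ignore names that resolve to attached packages, which this sketch does not attempt.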
Re: [Bioc-devel] BiocParallel: Best standards for passing locally assigned variables/functions, e.g. a bpExport()?
Ryan (et al), FYI: f function() { x = rnorm(x) x } findGlobals(f) [1] = { rnorm x should be in the list of globals but it isn't. ~G sessionInfo() R version 3.0.2 (2013-09-25) Platform: x86_64-pc-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] codetools_0.2-8 On Sun, Nov 3, 2013 at 5:37 PM, Ryan r...@thompsonclan.org wrote: Looking at the codetools package, I think findGlobals is basically exactly what we want here, right? As you say, there are necessarily limitations due to R being a dynamic language, but the goal is to catch common errors, not stop people from tricking the check. I think I'll try to code something up soon. -Ryan On 11/3/13, 5:10 PM, Gabriel Becker wrote: Henrik, See https://github.com/duncantl/CodeDepends (as used by used by https://github.com/gmbecker/RCacheSuite). It will identify necessarily defined symbols (input variables) for code that is not doing certain tricks (eg get(), mixing data.frame columns and gobal variables in formulas, etc ). Tierney's codetools package also does things along these lines but there are some situations where it has trouble. I can give more detail if desired. ~G On Sun, Nov 3, 2013 at 3:04 PM, Ryan r...@thompsonclan.org wrote: Another potential easy step we can do is that if FUN function in the user's workspace, we automatically export that function under the same name in the children. This would make recursive functions just work, but it might be a bit too magical. On 11/3/13, 2:38 PM, Ryan wrote: Here's an easy thing we can add to BiocParallel in the short term. 
The following code defines a wrapper function withBPExtraErrorText that simply appends an additional message to the end of any error that looks like it is about a missing variable. We could wrap every evaluation in a similar tryCatch to at least provide a more informative error message when a subprocess has a missing variable. -Ryan withBPExtraErrorText - function(expr) { tryCatch({ expr }, simpleError = function(err) { if (grepl(^object '(.*)' not found$, err$message, perl=TRUE)) { ## It is an error due to a variable not found. err$message - paste0(err$message, . Maybe you forgot to export this variable from the main R session using \bpexport\?) } stop(err) }) } x - 5 ## Succeeds withBPExtraErrorText(x) ## Fails with more informative error message withBPExtraErrorText(y) On Sun Nov 3 14:01:48 2013, Henrik Bengtsson wrote: On Sun, Nov 3, 2013 at 1:29 PM, Michael Lawrence lawrence.mich...@gene.com wrote: An analog to clusterExport is a good idea. To make it even easier, we could have a dynamic environment based on object tables that would catch missing symbols and download them from the parent thread. But maybe there's some benefit to being explicit? A first step to fully automate this would be to provide some (opt in/out) mechanism for code inspection and warn about non-defined objects (cf. 'R CMD check'). That is of course major work, but will certainly spare the community/users 1000's of hours in troubleshooting and the mailing lists from why doesn't my parallel code not work messages. Such protection may be better suited for the 'parallel' package though. Unfortunately, it's beyond my skills/time to pull such a thing together. /Henrik Michael On Sun, Nov 3, 2013 at 12:39 PM, Henrik Bengtsson h...@biostat.ucsf.edu wrote: Hi, in BiocParallel, is there a suggested (or planned) best standards for making *locally* assigned variables (e.g. functions) available to the applied function when it runs in a separate R process (which will be the most common use case)? 
I understand that local variables should be avoided and that it's preferred to put as much as possible in packages, but that's not always possible or very convenient.

EXAMPLE:

    library('BiocParallel')
    library('BatchJobs')

    # Here I pick a recursive function to make the problem a bit harder, i.e.
    # the function needs to call itself ("itself" = see below)
    fib <- function(n=0) {
      if (n < 0) stop("Invalid 'n': ", n)
      if (n == 0 || n == 1) return(1)
      fib(n-2) + fib(n-1)
    }

    # Executing in the current R session
    cluster.functions <- makeClusterFunctionsInteractive()
    bpParams <- BatchJobsParam(cluster.functions=cluster.functions)
    register(bpParams)
    values <- bplapply(0:9, FUN=fib)
    ## SubmitJobs |++| 100% (00:00:00)
    ## Waiting [S:0 R:0 D:10 E:0] |+++| 100% (00:00:00)

    # Executing in a separate R
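For readers following the thread, the explicit approach Michael alludes to ("an analog to clusterExport") can be sketched with the base 'parallel' package. This is only an illustration of the idea under the assumption of a two-worker local PSOCK cluster; it is not BiocParallel code, and no bpExport() function is assumed to exist.

```r
library(parallel)

# Same recursive function as in the example above
fib <- function(n = 0) {
  if (n < 0) stop("Invalid 'n': ", n)
  if (n == 0 || n == 1) return(1)
  fib(n - 2) + fib(n - 1)
}

# Two separate R worker processes on the local machine (an assumption
# made for this sketch; any cluster type would do)
cl <- makeCluster(2)

# Explicitly copy 'fib' into each worker's global environment, so that
# the recursive calls inside fib() can find it there
clusterExport(cl, "fib", envir = environment())

values <- parLapply(cl, 0:9, fib)

stopCluster(cl)
```

The export step is what makes the recursive lookup succeed on the workers; being explicit costs one extra line but makes the dependency visible in the calling code.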
Re: [Bioc-devel] BiocParallel: Best standards for passing locally assigned variables/functions, e.g. a bpExport()?
Ok, here is my attempt at a function to get the list of user-defined free variables that a function refers to:

https://gist.github.com/DarwinAwardWinner/7298557

It uses codetools, so it is subject to the limitations of that package, but for simple examples, it successfully detects when a function refers to something in the global env.

On Sun Nov 3 21:14:29 2013, Gabriel Becker wrote:

Ryan (et al),

FYI:

    f <- function() {
        x = rnorm(x)
        x
    }
    findGlobals(f)
    [1] "="     "{"     "rnorm"

"x" should be in the list of globals but it isn't.

~G

    sessionInfo()
    R version 3.0.2 (2013-09-25)
    Platform: x86_64-pc-linux-gnu (64-bit)

    locale:
     [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
     [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
     [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
     [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
     [9] LC_ADDRESS=C               LC_TELEPHONE=C
    [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

    attached base packages:
    [1] stats     graphics  grDevices utils     datasets  methods   base

    other attached packages:
    [1] codetools_0.2-8

On Sun, Nov 3, 2013 at 5:37 PM, Ryan r...@thompsonclan.org wrote:

Looking at the codetools package, I think findGlobals is basically exactly what we want here, right? As you say, there are necessarily limitations due to R being a dynamic language, but the goal is to catch common errors, not to stop people from tricking the check. I think I'll try to code something up soon.

-Ryan

On 11/3/13, 5:10 PM, Gabriel Becker wrote:

Henrik,

See https://github.com/duncantl/CodeDepends (as used by https://github.com/gmbecker/RCacheSuite). It will identify necessarily defined symbols (input variables) for code that is not doing certain tricks (e.g. get(), mixing data.frame columns and global variables in formulas, etc.). Tierney's codetools package also does things along these lines, but there are some situations where it has trouble. I can give more detail if desired.
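To make the codetools idea discussed above concrete, here is a minimal sketch of how findGlobals() could be used to list the user-defined objects a function depends on. The helper name user_globals is hypothetical and this is not the code from the gist; it assumes that "user-defined" means "defined directly in the function's enclosing environment" (normally the global environment). It also inherits the limitation Gabriel points out, shown in the last lines.

```r
library(codetools)

# Hypothetical helper: free symbols used by 'fun' that are defined
# directly in its enclosing environment. Symbols from base or attached
# packages (e.g. "*", "{") are dropped because inherits = FALSE.
user_globals <- function(fun, envir = environment(fun)) {
  candidates <- findGlobals(fun, merge = TRUE)
  Filter(function(name) exists(name, envir = envir, inherits = FALSE),
         candidates)
}

scale_factor <- 2

fib <- function(n = 0) {
  if (n < 0) stop("Invalid 'n': ", n)
  if (n == 0 || n == 1) return(1)
  fib(n - 2) + fib(n - 1)
}

scaled_fib <- function(n) fib(n) * scale_factor

# Candidate names to export before running scaled_fib in another process
user_globals(scaled_fib)

# Gabriel's caveat: 'x' is read before it is assigned locally, but
# findGlobals() treats it as a local variable and does not report it
f <- function() {
  x = rnorm(x)
  x
}
user_globals(f)
```

Note that the check is not recursive: it reports that scaled_fib needs fib, but it would take a second pass over fib to discover anything fib itself depends on.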