[Bioc-devel] BiocParallel: Best standards for passing locally assigned variables/functions, e.g. a bpExport()?

2013-11-03 Thread Henrik Bengtsson
Hi,

in BiocParallel, is there a suggested (or planned) standard for
making *locally* assigned variables (e.g. functions) available to the
applied function when it runs in a separate R process (which will be
the most common use case)?  I understand that local variables
should be avoided and that it's preferred to put as much as possible in
packages, but that's not always possible or very convenient.

EXAMPLE:

library('BiocParallel')
library('BatchJobs')

# Here I pick a recursive function to make the problem a bit harder, i.e.
# the function needs to call itself ("itself" = see below)
fib <- function(n=0) {
  if (n < 0) stop("Invalid 'n': ", n)
  if (n == 0 || n == 1) return(1)
  fib(n-2) + fib(n-1)
}

# Executing in the current R session
cluster.functions <- makeClusterFunctionsInteractive()
bpParams <- BatchJobsParam(cluster.functions=cluster.functions)
register(bpParams)
values <- bplapply(0:9, FUN=fib)
## SubmitJobs |++| 100% (00:00:00)
## Waiting [S:0 R:0 D:10 E:0] |+++| 100% (00:00:00)


# Executing in a separate R process, where fib() is not defined
# (not specific to BiocParallel)
cluster.functions <- makeClusterFunctionsLocal()
bpParams <- BatchJobsParam(cluster.functions=cluster.functions)
register(bpParams)
values <- bplapply(0:9, FUN=fib)
## SubmitJobs |++| 100% (00:00:00)
## Waiting [S:0 R:0 D:10 E:0] |+++| 100% (00:00:00)
Error in LastError$store(results = results, is.error = !ok, throw.error = TRUE) :
  Errors occurred during execution. First error message:
Error in FUN(...): could not find function "fib"
[...]


# The following illustrates that the solution is not always straightforward.
# (not specific to BiocParallel; must have been discussed previously)
values <- bplapply(0:9, FUN=function(n, fib) {
  fib(n)
}, fib=fib)
Error in LastError$store(results = results, is.error = !ok, throw.error = TRUE) :
  Errors occurred during execution. First error message:
Error in fib(n): could not find function "fib"
[...]

# Workaround; make fib() aware of itself
# (this is something the user needs to do, and would be very
#  hard for BiocParallel et al. to automate.  BTW, should all
#  recursive functions be implemented this way?).
fib <- function(n=0) {
  if (n < 0) stop("Invalid 'n': ", n)
  if (n == 0 || n == 1) return(1)
  fib <- sys.function() # Make function aware of itself
  fib(n-2) + fib(n-1)
}
values <- bplapply(0:9, FUN=function(n, fib) {
  fib(n)
}, fib=fib)
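
For reference, the failure and the workaround can be reproduced without any cluster. The following is only a minimal sketch: removing the name `fib` from the workspace stands in for a worker process that receives the function object but has no binding called `fib`. The names `worker_copy`, `fib_recall` and `renamed` are illustrative and not part of BiocParallel. Base R's `Recall()` is shown as an alternative to the `sys.function()` assignment.

```r
# Plain recursive definition: the body looks up the *name* 'fib' at call time.
fib <- function(n = 0) {
  if (n < 0) stop("Invalid 'n': ", n)
  if (n == 0 || n == 1) return(1)
  fib(n - 2) + fib(n - 1)
}

worker_copy <- fib  # the function object that would be shipped to a worker
rm(fib)             # mimic the worker: no binding named 'fib' exists there

# The recursive call fails because the name 'fib' can no longer be resolved;
# 'res' holds the error message instead of a number.
res <- tryCatch(worker_copy(3), error = function(e) conditionMessage(e))

# Recall() re-invokes the currently running function without using its name.
fib_recall <- function(n = 0) {
  if (n < 0) stop("Invalid 'n': ", n)
  if (n == 0 || n == 1) return(1)
  Recall(n - 2) + Recall(n - 1)
}

renamed <- fib_recall
rm(fib_recall)      # the original name is gone, recursion still works
values <- vapply(0:9, renamed, numeric(1))
```

With `Recall()` the function no longer depends on the name it was assigned to, which is the same property that the `sys.function()` line provides.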


WISHLIST:
Considering the above recursive issue solved, a slightly more explicit
and standardized solution is then:

values <- bplapply(0:9, FUN=function(n, BPGLOBALS=NULL) {
  for (name in names(BPGLOBALS)) assign(name, BPGLOBALS[[name]])
  fib(n)
}, BPGLOBALS=list(fib=fib))

Could the above be generalized into something as neat as:

bpExport(fib)
values <- bplapply(0:9, FUN=function(n) {
  BiocParallel::bpImport(fib)
  fib(n)
})

or ideally just (analogously to parallel::clusterExport()):

bpExport(fib)
values <- bplapply(0:9, FUN=fib)

/Henrik
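
As a rough illustration of the wishlist, the export step could be folded into a wrapper around the applied function. This is a sketch under assumptions, not an existing BiocParallel feature: `with_globals` is a hypothetical helper, and it is exercised here with a plain `lapply()` after removing `fib` from the workspace, rather than with a real separate process.

```r
# Hypothetical helper (not a BiocParallel function): bundle 'globals' with FUN
# so that the pair can be passed around as a single closure.
# FUN is assumed to be a closure, not a primitive.
with_globals <- function(FUN, globals = list()) {
  force(FUN)
  force(globals)
  function(...) {
    # Fresh environment per call, inheriting from the global environment
    # of whichever R process evaluates the call.
    env <- new.env(parent = globalenv())
    for (name in names(globals)) {
      value <- globals[[name]]
      # Re-home exported closures so that their free variables, including a
      # recursive reference to themselves, are looked up in 'env' first.
      if (is.function(value) && !is.primitive(value)) environment(value) <- env
      assign(name, value, envir = env)
    }
    environment(FUN) <- env
    FUN(...)
  }
}

fib <- function(n = 0) {
  if (n < 0) stop("Invalid 'n': ", n)
  if (n == 0 || n == 1) return(1)
  fib(n - 2) + fib(n - 1)
}

FUN <- with_globals(function(n) fib(n), globals = list(fib = fib))
rm(fib)  # the wrapped function no longer needs 'fib' in the workspace
values <- unlist(lapply(0:9, FUN))
```

Because the exported `fib` is re-homed into the same environment that holds its own binding, the recursive call resolves without the `sys.function()` workaround.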

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] BiocParallel: Best standards for passing locally assigned variables/functions, e.g. a bpExport()?

2013-11-03 Thread Michael Lawrence
An analog to clusterExport is a good idea. To make it even easier, we could
have a dynamic environment based on object tables that would catch missing
symbols and download them from the parent thread. But maybe there's some
benefit to being explicit?

Michael
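
For comparison, this is how the existing `parallel::clusterExport()` handles the same example. It is a small sketch assuming two local PSOCK workers can be started on the machine; it does not use BiocParallel or BatchJobs at all.

```r
library(parallel)

fib <- function(n = 0) {
  if (n < 0) stop("Invalid 'n': ", n)
  if (n == 0 || n == 1) return(1)
  fib(n - 2) + fib(n - 1)
}

cl <- makeCluster(2)  # two PSOCK workers, i.e. separate R processes

# Assign 'fib' in each worker's global environment. When the code runs at
# top level, omitting this call makes the recursive calls fail on the workers.
clusterExport(cl, "fib", envir = environment())

values <- unlist(parLapply(cl, 0:9, fib))
stopCluster(cl)
```

The export is explicit and by name, which is the behaviour a `bpExport()` analog would presumably mirror.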




Re: [Bioc-devel] BiocParallel: Best standards for passing locally assigned variables/functions, e.g. a bpExport()?

2013-11-03 Thread Henrik Bengtsson
On Sun, Nov 3, 2013 at 1:29 PM, Michael Lawrence
<lawrence.mich...@gene.com> wrote:
> An analog to clusterExport is a good idea. To make it even easier, we could
> have a dynamic environment based on object tables that would catch missing
> symbols and download them from the parent thread. But maybe there's some
> benefit to being explicit?

A first step to fully automate this would be to provide some (opt
in/out) mechanism for code inspection and warn about non-defined
objects (cf. 'R CMD check').  That is of course major work, but will
certainly spare the community/users 1000's of hours in troubleshooting
and the mailing lists from "why doesn't my parallel code work"
messages.  Such protection may be better suited for the 'parallel'
package though.  Unfortunately, it's beyond my skills/time to pull
such a thing together.

/Henrik




Re: [Bioc-devel] BiocParallel: Best standards for passing locally assigned variables/functions, e.g. a bpExport()?

2013-11-03 Thread Ryan
Here's an easy thing we can add to BiocParallel in the short term. The 
following code defines a wrapper function withBPExtraErrorText that 
simply appends an additional message to the end of any error that looks 
like it is about a missing variable. We could wrap every evaluation in 
a similar tryCatch to at least provide a more informative error message 
when a subprocess has a missing variable.


-Ryan

withBPExtraErrorText <- function(expr) {
   tryCatch({
      expr
   }, simpleError = function(err) {
      if (grepl("^object '(.*)' not found$", err$message, perl=TRUE)) {
         ## It is an error due to a variable not found.
         err$message <- paste0(err$message, ". Maybe you forgot to export this variable from the main R session using \"bpexport\"?")
      }
      stop(err)
   })
}

x <- 5

## Succeeds
withBPExtraErrorText(x)

## Fails with more informative error message
withBPExtraErrorText(y)
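
Note that the failures reported earlier in this thread are of the form `could not find function "fib"`, which the pattern above does not match. A possible variant that covers both kinds of message is sketched below. The name `withMissingSymbolHint` and the hint text are made up for illustration, and the patterns assume R's English-language error messages.

```r
# Hypothetical variant of the wrapper above; the name is illustrative.
withMissingSymbolHint <- function(expr) {
  tryCatch(expr, error = function(err) {
    msg <- conditionMessage(err)
    # These patterns match R's English-language messages only.
    missing_object   <- grepl("^object '.*' not found$", msg)
    missing_function <- grepl("^could not find function \".*\"$", msg)
    if (missing_object || missing_function) {
      err$message <- paste0(
        msg, ". Maybe you forgot to export it from the main R session?")
    }
    stop(err)  # re-signal the (possibly amended) condition
  })
}

# Calling a function that does not exist triggers the extended message.
res <- tryCatch(withMissingSymbolHint(undefinedFunction123(1)),
                error = conditionMessage)

# Unrelated errors pass through unchanged.
other <- tryCatch(withMissingSymbolHint(stop("something else")),
                  error = conditionMessage)
```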





Re: [Bioc-devel] BiocParallel: Best standards for passing locally assigned variables/functions, e.g. a bpExport()?

2013-11-03 Thread Ryan
Another potential easy step we can take is that if FUN is a function in the 
user's workspace, we automatically export that function under the same 
name in the children. This would make recursive functions just work, but 
it might be a bit too magical.
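
To make the idea concrete, detecting that case could look roughly like the sketch below. `exportable_name` is a hypothetical helper, not BiocParallel code: it returns the name to export when FUN was supplied as a bare symbol bound to a function in the caller's environment, and NULL otherwise.

```r
# Hypothetical helper: if FUN was supplied as a bare name that is bound to a
# function in the caller's environment, return that name, otherwise NULL.
exportable_name <- function(FUN, envir = parent.frame()) {
  expr <- substitute(FUN)              # the unevaluated argument expression
  if (!is.symbol(expr)) return(NULL)   # e.g. an anonymous function
  name <- as.character(expr)
  found <- exists(name, envir = envir, mode = "function", inherits = FALSE)
  if (found) name else NULL
}

fib <- function(n = 0) {
  if (n < 0) stop("Invalid 'n': ", n)
  if (n == 0 || n == 1) return(1)
  fib(n - 2) + fib(n - 1)
}

name_fib  <- exportable_name(fib)             # "fib"
name_anon <- exportable_name(function(n) n)   # NULL: not a bare name
name_sum  <- exportable_name(sum)             # NULL: not defined by the caller
```

Only names defined directly in the calling environment qualify, so functions that come from packages are left alone.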



Re: [Bioc-devel] BiocParallel: Best standards for passing locally assigned variables/functions, e.g. a bpExport()?

2013-11-03 Thread Gabriel Becker
Henrik,

See https://github.com/duncantl/CodeDepends (as used by
https://github.com/gmbecker/RCacheSuite). It will identify necessarily
defined symbols (input variables) for code that is not doing certain tricks
(e.g. get(), mixing data.frame columns and global variables in formulas, etc.).

Tierney's codetools package also does things along these lines but there
are some situations where it has trouble. I can give more detail if desired.

~G



Re: [Bioc-devel] BiocParallel: Best standards for passing locally assigned variables/functions, e.g. a bpExport()?

2013-11-03 Thread Ryan
I guess all we need to do is to detect whether a function would try to 
access a free variable in the user's workspace, and warn/error if so. 
It looks like CodeDepends could do that. I could try to come up with an 
implementation. I guess we would add CodeDepends as an optional 
dependency for BiocParallel, and only do the checks if CodeDepends is 
available.
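
A rough sketch of such an optional check follows. It swaps in `codetools::findGlobals()` instead of CodeDepends, purely because codetools ships with standard R installations. `unresolved_globals` is a hypothetical name, and the check only compares against base R, so functions from other attached packages (e.g. stats) would also be reported.

```r
# Hypothetical check (not part of BiocParallel): list the names that FUN uses
# but that are neither its own locals nor defined in base R.
unresolved_globals <- function(FUN) {
  # Optional dependency: silently skip the check if codetools is missing.
  if (!requireNamespace("codetools", quietly = TRUE)) return(character(0))
  globals <- codetools::findGlobals(FUN, merge = TRUE)
  in_base <- vapply(globals, exists, logical(1),
                    envir = baseenv(), inherits = FALSE)
  globals[!in_base]
}

fib <- function(n = 0) {
  if (n < 0) stop("Invalid 'n': ", n)
  if (n == 0 || n == 1) return(1)
  fib(n - 2) + fib(n - 1)
}

missing_names <- unresolved_globals(fib)
if (length(missing_names) > 0) {
  message("FUN refers to names that may be undefined on the workers: ",
          paste(missing_names, collapse = ", "))
}
```

For `fib`, the only name reported is `fib` itself, which is exactly the symbol that was missing in the BatchJobs example.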



Re: [Bioc-devel] BiocParallel: Best standards for passing locally assigned variables/functions, e.g. a bpExport()?

2013-11-03 Thread Gabriel Becker
Ryan (et al),

FYI:

> f
function() {
x = rnorm(x)
x
}
> findGlobals(f)
[1] "="     "{"     "rnorm"

x should be in the list of globals but it isn't.
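
The limitation can be checked directly. The contrast with a function that only reads `x` is added here for illustration, and `g` is a made-up name.

```r
library(codetools)

# 'x' is read before it is assigned, so its first value must come from
# outside the function.
f <- function() {
  x = rnorm(x)
  x
}

# findGlobals() treats a name that is assigned anywhere in the body as local,
# so 'x' is not reported for f.
x_reported_for_f <- "x" %in% findGlobals(f)

# When 'x' is only read and never assigned, it is reported as a global.
g <- function() rnorm(x)
x_reported_for_g <- "x" %in% findGlobals(g)
```

A check built on `findGlobals()` would therefore miss cases like `f`, where a global is read and then shadowed by a local of the same name.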

~G

> sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics  grDevices utils datasets  methods   base

other attached packages:
[1] codetools_0.2-8



On Sun, Nov 3, 2013 at 5:37 PM, Ryan <r...@thompsonclan.org> wrote:

> Looking at the codetools package, I think findGlobals is basically
> exactly what we want here, right? As you say, there are necessarily
> limitations due to R being a dynamic language, but the goal is to catch
> common errors, not stop people from tricking the check.
>
> I think I'll try to code something up soon.
>
> -Ryan


 On 11/3/13, 5:10 PM, Gabriel Becker wrote:

  Henrik,

 See https://github.com/duncantl/CodeDepends (as used by used by
 https://github.com/gmbecker/RCacheSuite). It will identify necessarily
 defined symbols (input variables) for code that is not doing certain tricks
 (eg get(), mixing data.frame columns and gobal variables in formulas, etc ).

  Tierney's codetools package also does things along these lines but there
 are some situations where it has trouble. I can give more detail if desired.

  ~G


 On Sun, Nov 3, 2013 at 3:04 PM, Ryan r...@thompsonclan.org wrote:

 Another easy step we could take: if FUN is a function in the
 user's workspace, we automatically export that function under the same name
 in the children. This would make recursive functions just work, but it
 might be a bit too magical.


 On 11/3/13, 2:38 PM, Ryan wrote:

 Here's an easy thing we can add to BiocParallel in the short term. The
 following code defines a wrapper function withBPExtraErrorText that
 simply appends an additional message to the end of any error that looks
 like it is about a missing variable. We could wrap every evaluation in a
 similar tryCatch to at least provide a more informative error message when
 a subprocess has a missing variable.

 -Ryan

 withBPExtraErrorText <- function(expr) {
     tryCatch({
         expr
     }, simpleError = function(err) {
         if (grepl("^object '(.*)' not found$", err$message, perl=TRUE)) {
             ## It is an error due to a variable not found.
             err$message <- paste0(err$message, ". Maybe you forgot to ",
                 "export this variable from the main R session using ",
                 "\"bpexport\"?")
         }
         stop(err)
     })
 }

 x <- 5

 ## Succeeds
 withBPExtraErrorText(x)

 ## Fails with more informative error message
 withBPExtraErrorText(y)
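
Note that the failure in the original fib() example is reported as "could not find function", which the "object ... not found" pattern does not match. A hedged variant that covers both cases might look like this. It assumes R's messages are in English, and withExportHint is a hypothetical name:

```r
# Sketch only: append a hint to errors that look like a missing
# variable or a missing function, then re-signal the error.
withExportHint <- function(expr) {
    tryCatch(expr, error = function(err) {
        msg <- conditionMessage(err)
        missingObject <- grepl("object '.*' not found", msg)
        missingFunction <- grepl("could not find function", msg, fixed = TRUE)
        if (missingObject || missingFunction) {
            err$message <- paste0(msg, ". Maybe you forgot to export ",
                                  "this from the main R session?")
        }
        stop(err)
    })
}

## Succeeds and returns the value unchanged
withExportHint(1 + 1)

## Fails with the hint appended (noSuchFunction123 is deliberately undefined)
res <- try(withExportHint(noSuchFunction123()), silent = TRUE)
cat(res)
```

Matching on message text is fragile, so this is a stopgap rather than a real check.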



 On Sun Nov  3 14:01:48 2013, Henrik Bengtsson wrote:

 On Sun, Nov 3, 2013 at 1:29 PM, Michael Lawrence
 lawrence.mich...@gene.com wrote:

 An analog to clusterExport is a good idea. To make it even easier, we
 could have a dynamic environment based on object tables that would catch
 missing symbols and download them from the parent thread. But maybe
 there's some benefit to being explicit?
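
For reference, a sketch of the explicit approach using clusterExport() from the 'parallel' package, which a BiocParallel analog might resemble. It assumes a local socket cluster can be started on this machine. Nothing here is BiocParallel API:

```r
library(parallel)

fib <- function(n = 0) {
    if (n < 0) stop("Invalid 'n': ", n)
    if (n == 0 || n == 1) return(1)
    fib(n - 2) + fib(n - 1)
}

# One worker in a separate R process, where fib() is not defined yet.
cl <- makeCluster(1)

# Explicitly copy fib into the worker's global environment under the
# same name, so the recursive calls can find it there.
clusterExport(cl, "fib", envir = environment())

values <- parLapply(cl, 0:9, fib)
stopCluster(cl)

print(unlist(values))
```

The export is by name, which is why the recursive calls inside fib() resolve on the worker.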


 A first step to fully automate this would be to provide some (opt
 in/out) mechanism for code inspection and warn about non-defined
 objects (cf. 'R CMD check').  That is of course major work, but will
 certainly spare the community/users 1000's of hours in troubleshooting
 and the mailing lists from "why doesn't my parallel code work"
 messages.  Such protection may be better suited for the 'parallel'
 package though.  Unfortunately, it's beyond my skills/time to pull
 such a thing together.

 /Henrik


 Michael


 On Sun, Nov 3, 2013 at 12:39 PM, Henrik Bengtsson h...@biostat.ucsf.edu wrote:


 Hi,

 in BiocParallel, are there suggested (or planned) best standards for
 making *locally* assigned variables (e.g. functions) available to the
 applied function when it runs in a separate R process (which will be
 the most common use case)?  I understand that local variables
 should be avoided and it's preferred to put as much as possible in
 packages, but that's not always possible or very convenient.

 EXAMPLE:

 library('BiocParallel')
 library('BatchJobs')

 # Here I pick a recursive function to make the problem a bit harder,
 # i.e. the function needs to call itself (itself = see below)
 fib <- function(n=0) {
   if (n < 0) stop("Invalid 'n': ", n)
   if (n == 0 || n == 1) return(1)
   fib(n-2) + fib(n-1)
 }

 # Executing in the current R session
 cluster.functions <- makeClusterFunctionsInteractive()
 bpParams <- BatchJobsParam(cluster.functions=cluster.functions)
 register(bpParams)
 values <- bplapply(0:9, FUN=fib)
 ## SubmitJobs |++| 100% (00:00:00)
 ## Waiting [S:0 R:0 D:10 E:0] |+++| 100% (00:00:00)


 # Executing in a separate R 

Re: [Bioc-devel] BiocParallel: Best standards for passing locally assigned variables/functions, e.g. a bpExport()?

2013-11-03 Thread Ryan
Ok, here is my attempt at a function to get the list of user-defined 
free variables that a function refers to:


https://gist.github.com/DarwinAwardWinner/7298557

It uses codetools, so it is subject to the limitations of that package, 
but for simple examples, it successfully detects when a function refers 
to something in the global env.
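
The idea can be sketched as follows. This is not the code in the gist, only a minimal illustration under the assumption that a "user-defined" variable is one that exists directly in a given environment. userGlobals is a hypothetical name:

```r
library(codetools)

# Hypothetical helper: names of free variables/functions used by `fun`
# that are defined directly in `envir` (not inherited from packages).
userGlobals <- function(fun, envir = globalenv()) {
    globals <- findGlobals(fun, merge = TRUE)
    Filter(function(name) exists(name, envir = envir, inherits = FALSE),
           globals)
}

# A stand-in for the user's workspace, to keep the example self-contained.
workspace <- new.env()
assign("offset", 10, envir = workspace)

addOffset <- function(x) x + offset

found <- userGlobals(addOffset, envir = workspace)
print(found)
```

Base functions such as `+` are filtered out because they do not live in the workspace environment itself.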


On Sun Nov  3 21:14:29 2013, Gabriel Becker wrote:

Ryan (et al),

FYI:

> f
function() {
    x = rnorm(x)
    x
}
> findGlobals(f)
[1] "="     "{"     "rnorm"

x should be in the list of globals but it isn't.

~G

> sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] codetools_0.2-8



On Sun, Nov 3, 2013 at 5:37 PM, Ryan r...@thompsonclan.org wrote:

Looking at the codetools package, I think findGlobals is
basically exactly what we want here, right? As you say, there are
necessarily limitations due to R being a dynamic language, but the
goal is to catch common errors, not stop people from tricking the
check.

I think I'll try to code something up soon.

-Ryan


On 11/3/13, 5:10 PM, Gabriel Becker wrote:

Henrik,

See https://github.com/duncantl/CodeDepends (as used by
https://github.com/gmbecker/RCacheSuite). It will identify
necessarily defined symbols (input variables) for code that is
not doing certain tricks (e.g. get(), mixing data.frame columns and
global variables in formulas, etc.).

Tierney's codetools package also does things along these lines
but there are some situations where it has trouble. I can give
more detail if desired.

~G


On Sun, Nov 3, 2013 at 3:04 PM, Ryan r...@thompsonclan.org wrote:

Another easy step we could take: if FUN is a function
in the user's workspace, we automatically export that
function under the same name in the children. This would make
recursive functions just work, but it might be a bit too
magical.


On 11/3/13, 2:38 PM, Ryan wrote:

Here's an easy thing we can add to BiocParallel in the
short term. The following code defines a wrapper function
withBPExtraErrorText that simply appends an additional
message to the end of any error that looks like it is
about a missing variable. We could wrap every evaluation
in a similar tryCatch to at least provide a more
informative error message when a subprocess has a missing
variable.

-Ryan

withBPExtraErrorText <- function(expr) {
    tryCatch({
        expr
    }, simpleError = function(err) {
        if (grepl("^object '(.*)' not found$", err$message, perl=TRUE)) {
            ## It is an error due to a variable not found.
            err$message <- paste0(err$message, ". Maybe you forgot to ",
                "export this variable from the main R session using ",
                "\"bpexport\"?")
        }
        stop(err)
    })
}

x <- 5

## Succeeds
withBPExtraErrorText(x)

## Fails with more informative error message
withBPExtraErrorText(y)



On Sun Nov  3 14:01:48 2013, Henrik Bengtsson wrote:

On Sun, Nov 3, 2013 at 1:29 PM, Michael Lawrence
lawrence.mich...@gene.com wrote:

An analog to clusterExport is a good idea. To make it even easier, we
could have a dynamic environment based on object tables that would catch
missing symbols and download them from the parent thread. But maybe
there's some benefit to being explicit?


A first step to fully automate this would be to provide some (opt
in/out) mechanism for code inspection and warn about non-defined
objects (cf. 'R CMD check').  That is of course major work, but will
certainly spare the community/users 1000's of hours in troubleshooting
and the mailing lists from "why doesn't my parallel code work"
messages.  Such protection may be better suited for the 'parallel'
package though.