Re: [Bioc-devel] Memory usage for bplapply

2019-01-06 Thread Martin Morgan
From the earlier example, whether the worker sees all the data or not depends 
on whether it is in the environment of FUN, the object sent to the worker.
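
For instance (a small sketch, not from the original thread; `bigData` and `make_fun` are made-up names): a FUN whose enclosing environment contains the data drags that data to every worker, while a FUN defined in the global environment does not.

library(BiocParallel)

bigData <- matrix(rnorm(1e6), ncol = 100)  # made-up large object on the manager

## FUN defined in .GlobalEnv: only the (small) function itself is serialized
f_global <- function(i) i^2
res1 <- bplapply(1:4, f_global)

## FUN created inside another function: its enclosing environment (which holds
## `data`) is serialized along with it, so every worker also receives the matrix
make_fun <- function(data) function(i) i + nrow(data)
res2 <- bplapply(1:4, make_fun(bigData))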

I don't really know about packages and forked processes. I'd bet that the 
vector allocations are essentially constant, but that the S-expressions that 
point to the symbols do actually get modified, e.g., when the user creates a 
symbol that references a package symbol (possibly incrementing the NAMED status 
of the S-expression), or even when the garbage collector comes along and decides 
that the S-expression in the package should be moved to a different generation.

Be sure to understand the difference (maybe you do) between the environment in 
which the function is defined and the environment in which it is called. Also 
note that as you restrict the environment in which a function is defined, you 
restrict the operations that the function can perform; the reason a function foo 
in a package can call another function bar in the same package is that bar is 
defined in the same environment as foo, whereas

    > local(1 + 2, envir = emptyenv())
    Error in 1 + 2 : could not find function "+"

Usually the bigger problem is that one serializes large data on the manager and 
sends it to the worker (e.g., reading chunks of a BAM file on the manager and 
sending each chunk to the worker) rather than arranging to do the heavy IO on 
the worker (e.g., sending instructions that the worker is supposed to read 
chromosome 1 from disk).
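
A rough sketch of that second pattern (the file name and range are made up; assumes an indexed BAM file and the Rsamtools / GenomicAlignments packages): send only the instructions, and let each worker do its own IO.

library(BiocParallel)
library(Rsamtools)
library(GenomicAlignments)

bam <- "sample.bam"              # hypothetical indexed BAM file
chroms <- paste0("chr", 1:22)    # one job per chromosome

readChrom <- function(chrom, bam) {
    ## the worker, not the manager, reads the (large) file from disk;
    ## the end coordinate is just an arbitrary large bound
    param <- ScanBamParam(which = GRanges(chrom, IRanges(1, 536870912)))
    aln <- readGAlignments(bam, param = param)
    length(aln)                  # return something small to the manager
}

counts <- bplapply(chroms, readChrom, bam = bam)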

I think if one is worrying about memory at this level, then it's time to get a 
bigger computer!

Martin

On 1/6/19, 9:48 PM, "Shian Su"  wrote:

Re: [Bioc-devel] Memory usage for bplapply

2019-01-06 Thread Shian Su
Can I get an indication here about what is expected to consume memory under the 
fork and socket models, as well as patterns to mitigate excessive memory consumption?

When using sockets, the model is that of multiple communicating machines 
running on their own memory, so it makes sense that memory usage is duplicated 
for loaded packages and the parent environment. But is the whole data object 
duplicated, or only the portion of the tasks assigned to a thread? I.e., with 
4 MB of packages, 4 MB of parent environment, and 4 MB of data to run bplapply 
over, is each thread going to consume 12 MB or 9 MB of memory? It is unclear to 
me whether the data object operated on should be thought of as part of the 
parent environment.

When using forks, the model is that of multiple processes running on shared 
memory. This is specific to macOS and Unix variants, and I believe the model is 
meant to share memory until a write operation causes variables to be copied. I 
also believe R’s internal memory management can potentially touch all the 
variables and cause copies, so the worst-case scenario is that everything is 
copied. What’s unclear is whether this applies to loaded packages: are they 
under the supervision of a garbage collector? So as per the previous scenario, 
from the second thread onwards, do we expect up to (0 + 4 + 1) MB, (4 + 4 + 1) MB 
or (4 + 4 + 4) MB of memory usage? Maybe even the ideal scenario of (0 + 0 + 1)?

With regard to patterns for using memory efficiently, is it sufficient to keep 
the parent environment as compact as possible? Are there clever ways to use 
local() for this?

Kind regards,
Shian

On 6 Jan 2019, at 9:24 am, Martin Morgan <mtmorgan.b...@gmail.com> wrote:

In one R session I did library(SummarizedExperiment) and then saved search(). 
In another R session I loaded the packages on the search path in reverse order, 
recording pryr::mem_used() after each. I ended up with

 mem_used
methods   25870312
datasets  30062016
utils 30062136
grDevices 30062256
graphics  30062376
stats 30062496
stats432262992
parallel  32495080
BiocGenerics  38903928
S4Vectors 59586928
IRanges  100171896
GenomeInfoDb 113791328
GenomicRanges154729400
Biobase  163335336
matrixStats  163518520
BiocParallel 167373512
DelayedArray 280812736
SummarizedExperiment 317386656

Each of the Bioconductor dependencies of SummarizedExperiment contributes to the 
overall size. Two dependencies (Biobase, DelayedArray) look a little 
unnecessary to me (they do not provide functionality that must be used by 
SummarizedExperiment) but removing them only reduces the total footprint to 
about 300MB. Somehow it makes sense that a package like SummarizedExperiment 
uses the data structures defined in other packages, and that it has a complex 
dependency graph. It is surprising how large the final footprint is.

One possible way to avoid at least some of the cost is to list 
SummarizedExperiment in the Imports: field of the DESCRIPTION file, but not to 
mention SummarizedExperiment in the NAMESPACE. Use SummarizedExperiment::assay() in the 
code. I think this has complicated side effects, e.g., adding methods to the 
imported methods table in your package (look for ".__T__" and ".__C__" (generic 
and class definitions) in ls(parent.env(getNamespace(, that 
indirectly increase the size of your package.

I'm not exactly sure what you mean in your second paragraph, maybe a specific 
example (if necessary, create a small package on github) would help. It sounds 
like you're saying that even with doSNOW() there are additional costs to 
loading your package on the worker compared to in the master...

Martin


Re: [Bioc-devel] Memory usage for bplapply

2019-01-05 Thread Martin Morgan
In one R session I did library(SummarizedExperiment) and then saved search(). 
In another R session I loaded the packages on the search path in reverse order, 
recording pryr::mem_used() after each. I ended up with

  mem_used
methods   25870312
datasets  30062016
utils 30062136
grDevices 30062256
graphics  30062376
stats 30062496
stats432262992
parallel  32495080
BiocGenerics  38903928
S4Vectors 59586928
IRanges  100171896
GenomeInfoDb 113791328
GenomicRanges154729400
Biobase  163335336
matrixStats  163518520
BiocParallel 167373512
DelayedArray 280812736
SummarizedExperiment 317386656
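
The measurement was approximately the following (a reconstruction, not the original code; the package list is taken from the table above):

pkgs <- c("methods", "datasets", "utils", "grDevices", "graphics", "stats",
          "stats4", "parallel", "BiocGenerics", "S4Vectors", "IRanges",
          "GenomeInfoDb", "GenomicRanges", "Biobase", "matrixStats",
          "BiocParallel", "DelayedArray", "SummarizedExperiment")

mem <- vapply(pkgs, function(pkg) {
    suppressPackageStartupMessages(library(pkg, character.only = TRUE))
    as.numeric(pryr::mem_used())
}, numeric(1))

data.frame(mem_used = mem, row.names = pkgs)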

Each of the Bioconductor dependencies of SummarizedExperiment contributes to the 
overall size. Two dependencies (Biobase, DelayedArray) look a little 
unnecessary to me (they do not provide functionality that must be used by 
SummarizedExperiment) but removing them only reduces the total footprint to 
about 300MB. Somehow it makes sense that a package like SummarizedExperiment 
uses the data structures defined in other packages, and that it has a complex 
dependency graph. It is surprising how large the final footprint is.

One possible way to avoid at least some of the cost is to list 
SummarizedExperiment in the Imports: field of the DESCRIPTION file, but not to 
mention SummarizedExperiment in the NAMESPACE. Use SummarizedExperiment::assay() in the 
code. I think this has complicated side effects, e.g., adding methods to the 
imported methods table in your package (look for ".__T__" and ".__C__" (generic 
and class definitions) in ls(parent.env(getNamespace(, that 
indirectly increase the size of your package.
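
For example (a sketch; "mypkg" is a placeholder for your own package name): call the function fully qualified in your package code, and inspect the package's imports environment for the ".__T__" / ".__C__" entries mentioned above.

## DESCRIPTION:  Imports: SummarizedExperiment
## NAMESPACE:    no importFrom(SummarizedExperiment, ...) directive
getAssay <- function(se) SummarizedExperiment::assay(se)

## what ended up in the imports environment of an installed package
imports <- parent.env(getNamespace("mypkg"))
grep("^\\.__[TC]__", ls(imports, all.names = TRUE), value = TRUE)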

I'm not exactly sure what you mean in your second paragraph, maybe a specific 
example (if necessary, create a small package on github) would help. It sounds 
like you're saying that even with doSNOW() there are additional costs to 
loading your package on the worker compared to in the master...

Martin

On 1/5/19, 2:44 PM, "Lulu Chen"  wrote:

Hi Martin,

Thanks for your explanation, which helps me understand BiocParallel much better.

I compared memory usage in my code before packaging (using doSNOW) and after 
packaging (using BiocParallel) and found that the increased memory is caused by 
the attached packages, especially 'SummarizedExperiment'. As required to support 
the common Bioconductor class, I used importFrom(SummarizedExperiment, assay). 
After deleting this, each thread saves nearly 200 MB. I opened a new R session 
and found
> pryr::mem_used()
38.5 MB
> library(SummarizedExperiment)
> pryr::mem_used()
314 MB
(I am still using R 3.5.2, not sure whether anything has changed in the 
development version). I think it should be an issue. A lot of packages import 
SummarizedExperiment just for support and never know it can cause such a problem.

My package still imports other packages, e.g. limma and fdrtool. Checked with 
pryr::mem_used() as above, each adds only 1~2 MB. I also checked my_package in a 
new session, which is around 5 MB. However, each thread in parallel computation 
still increases by much more than 5 MB. I did a simulation: in my old code with 
doSNOW, I just inserted "require('my_package')" into the foreach loop and kept 
the other code the same. I used 20 cores and 1000 jobs. Each thread still 
increases by 20~30 MB. I don't know if there is anything else that causes extra 
cost for each thread. Thanks!


Best,
Lulu

Re: [Bioc-devel] Memory usage for bplapply

2019-01-05 Thread Lulu Chen
Hi Martin,

Thanks for your explanation, which helps me understand BiocParallel much better.

I compared memory usage in my code before packaging (using doSNOW) and after 
packaging (using BiocParallel) and found that the increased memory is caused by 
the attached packages, especially 'SummarizedExperiment'. As required to support 
the common Bioconductor class, I used importFrom(SummarizedExperiment, assay). 
After deleting this, each thread saves nearly 200 MB. I opened a new R session 
and found
> pryr::mem_used()
38.5 MB
> library(SummarizedExperiment)
> pryr::mem_used()
314 MB
(I am still using R 3.5.2, not sure whether anything has changed in the 
development version). I think it should be an issue. A lot of packages import 
SummarizedExperiment just for support and never know it can cause such a problem.

My package still imports other packages, e.g. limma and fdrtool. Checked with 
pryr::mem_used() as above, each adds only 1~2 MB. I also checked my_package in a 
new session, which is around 5 MB. However, each thread in parallel computation 
still increases by much more than 5 MB. I did a simulation: in my old code with 
doSNOW, I just inserted "require('my_package')" into the foreach loop and kept 
the other code the same. I used 20 cores and 1000 jobs. Each thread still 
increases by 20~30 MB. I don't know if there is anything else that causes extra 
cost for each thread. Thanks!

Best,
Lulu

On Fri, Jan 4, 2019 at 2:38 PM Martin Morgan 
wrote:

> Memory use can be complicated to understand.
>
> library(BiocParallel)
>
> v <- replicate(100, rnorm(1), simplify=FALSE)
> bplapply(v, sum)
>
> by default, bplapply splits 100 jobs (each element of the list) equally
> between the number of cores available, and sends just the necessary data to
> the cores. Again by default, the jobs are sent 'en masse' to the cores, so
> if there were 10 cores (and hence 10 tasks), the first core would receive
> the first 10 jobs and 10 x 1 elements, and so on. The memory used to
> store v on the workers would be approximately the size of v, # of workers *
> jobs per worker * job size = 10 * 10 * 1.
>
> If memory were particularly tight, or if computation time for each job was
> highly variable, it might be advantageous to send jobs one at a time, by
> setting the number of tasks equal to the number of jobs, SnowParam(workers =
> 10, tasks = length(v)). Then the amount of memory used to store v would
> only be # of workers * 1  * 1; this is generally slower, because there
> is much more communication between the manager and the workers.
>
> m <- matrix(rnorm(100 * 1), 100, 1)
> bplapply(seq_len(nrow(m)), function(i, m) sum(m[i, ]), m)
>
> Here bplapply doesn't know how to send just some rows to the workers, so
> each worker gets a complete copy of m. This would be expensive.
>
> f <- function(x) sum(x)
>
> g <- function() {
> v <- replicate(100, rnorm(1), simplify=FALSE)
> bplapply(v, f)
> }
>
> this has the same memory consequences as above, the function `f()` is
> defined in the .GlobalEnv, so only the function definition (small) is sent
> to the workers.
>
> h <- function() {
> f <- function(x) sum(x)
> v <- replicate(100, rnorm(1), simplify=FALSE)
> bplapply(v, f)
> }
>
>  This is expensive. The function `f()` is defined in the body of the
> function `h()`. So the workers receive both the function f and the
> environment in which it is defined. The environment includes v, so each worker
> receives a slice of v (for f() to operate on) AND an entire copy of v
> (because it is in the environment where `f()` was defined). A
> similar cost would be paid in a package, if the package defined large data
> objects at load time.
>
> For more guidance, it might be helpful to provide a simplified example of
> what you did with doSNOW, and what you do with BiocParallel.
>
> Hope that helps,
>
> Martin
>
> On 1/3/19, 11:52 PM, "Bioc-devel on behalf of Lulu Chen" <
> bioc-devel-boun...@r-project.org on behalf of luluc...@vt.edu> wrote:
>
> Dear all,
>
> I met a memory issue for bplapply with SnowParam(). I need to calculate
> something from a large matrix many, many times. But from the discussions in
> https://support.bioconductor.org/p/92587, I learned that bplapply copied
> the current and parent environment to each worker thread. That means the
> large matrix in my package will be copied so many times. Do you have better
> suggestions for the Windows platform?
>
> Before I tried to package my code, I used the doSNOW package with foreach
> %dopar%. It seems to consume less memory in each core (almost the size of
> the matrix the task needs). But bplapply seems to copy more than just the
> objects in the current environment and the environment one level above. I am
> very confused, and just guess it was copying everything.
>
> Thanks for any help!
> Best,
> Lulu
>

Re: [Bioc-devel] Memory usage for bplapply

2019-01-04 Thread Martin Morgan
Memory use can be complicated to understand.

library(BiocParallel)

v <- replicate(100, rnorm(1), simplify=FALSE)
bplapply(v, sum)

by default, bplapply splits the 100 jobs (each element of the list) equally among 
the number of cores available, and sends just the necessary data to the cores. 
Again by default, the jobs are sent 'en masse' to the cores, so if there were 
10 cores (and hence 10 tasks), the first core would receive the first 10 jobs 
and 10 x 1 elements, and so on. The memory used to store v on the workers 
would be approximately the size of v: # of workers * jobs per worker * job 
size = 10 * 10 * 1.

If memory were particularly tight, or if computation time for each job were 
highly variable, it might be advantageous to send jobs one at a time, by 
setting the number of tasks equal to the number of jobs, SnowParam(workers = 10, 
tasks = length(v)). Then the amount of memory used to store v would only be # 
of workers * 1 * 1; this is generally slower, because there is much more 
communication between the manager and the workers.
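
In code, the second option looks roughly like this (a sketch of the parameters just described):

v <- replicate(100, rnorm(1), simplify=FALSE)

## one task per job: each element of v is shipped only when a worker is ready
## for it, at the cost of more manager / worker communication
param <- SnowParam(workers = 10, tasks = length(v))
res <- bplapply(v, sum, BPPARAM = param)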

m <- matrix(rnorm(100 * 1), 100, 1)
bplapply(seq_len(nrow(m)), function(i, m) sum(m[i, ]), m)

Here bplapply doesn't know how to send just some rows to the workers, so each 
worker gets a complete copy of m. This would be expensive.
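
One workaround (a sketch, not part of the original message) is to split the rows on the manager, so that each worker receives only its own slice of m rather than the whole matrix:

## one chunk of rows per worker; bplapply then sends each chunk only to the
## worker that processes it
nworkers <- 10
grp    <- cut(seq_len(nrow(m)), nworkers, labels = FALSE)
slices <- lapply(split(seq_len(nrow(m)), grp), function(i) m[i, , drop = FALSE])

res <- bplapply(slices, rowSums)
rowsums <- unlist(res, use.names = FALSE)   # per-row sums, in the original order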

f <- function(x) sum(x)

g <- function() {
v <- replicate(100, rnorm(1), simplify=FALSE)
bplapply(v, f)
}

this has the same memory consequences as above: the function `f()` is defined 
in the .GlobalEnv, so only the (small) function definition is sent to the 
workers.

h <- function() {
f <- function(x) sum(x)
v <- replicate(100, rnorm(1), simplify=FALSE)
bplapply(v, f)
}

This is expensive. The function `f()` is defined in the body of the function 
`h()`, so the workers receive both the function f and the environment in which 
it is defined. The environment includes v, so each worker receives a slice of v 
(for f() to operate on) AND an entire copy of v (because it is in the 
environment where `f()` was defined). A similar cost would be paid in a 
package, if the package defined large data objects at load time.
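
If the helper really must be created inside another function, one workaround (a sketch, not from the original message) is to reset its enclosing environment before handing it to bplapply, so the frame containing v is not serialized:

h2 <- function() {
    f <- function(x) sum(x)
    ## drop the reference to h2()'s frame (which will hold v); f only needs
    ## base functions, so the global environment is enough
    environment(f) <- globalenv()
    v <- replicate(100, rnorm(1), simplify=FALSE)
    bplapply(v, f)
}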

For more guidance, it might be helpful to provide a simplified example of what 
you did with doSNOW, and what you do with BiocParallel.

Hope that helps,

Martin

On 1/3/19, 11:52 PM, "Bioc-devel on behalf of Lulu Chen" 
 wrote:

Dear all,

I met a memory issue for bplapply with SnowParam(). I need to calculate
something from a large matrix many, many times. But from the discussions in
https://support.bioconductor.org/p/92587, I learned that bplapply copied
the current and parent environment to each worker thread. That means the
large matrix in my package will be copied so many times. Do you have better
suggestions for the Windows platform?

Before I tried to package my code, I used the doSNOW package with foreach
%dopar%. It seems to consume less memory in each core (almost the size of
the matrix the task needs). But bplapply seems to copy more than just the
objects in the current environment and the environment one level above. I am
very confused, and just guess it was copying everything.

Thanks for any help!
Best,
Lulu


___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel
