Re: [Bioc-devel] GPos slower than GRanges ?

2018-02-09 Thread Hervé Pagès

Hi Charles,

On 02/08/2018 08:03 PM, Charles Plessy wrote:

Hello,

I have just discovered the GPos class, and I would like to use it in
my "CAGEr" package, where for the moment I store single-nucleotide
positions of transcription start sites in GRanges of width 1.

But a simple microbenchmark suggests that, although GPos are more
memory-efficient, they also may be more CPU-hungry, at least
with the "range" function.

Is there a way to optimise, or is it better to stay with
GRanges of width 1 when memory is not an issue ?


gpos1 <- GPos(c("chr1:44-53", "chr1:5-10", "chr2:2-5"))



granges1 <- GRanges(gpos1)



microbenchmark::microbenchmark(range(granges1), range(gpos1))

Unit: milliseconds
            expr      min       lq    mean   median       uq      max neval cld
 range(granges1) 21.42761 21.97009 24.1627 22.24532 22.92655 179.9715   100  a
    range(gpos1) 30.11515 30.84472 32.8824 31.36639 32.19281 104.3027   100   b


Timing such small objects is not really meaningful.

GPos objects are optimized to perform well when they contain long runs
of consecutive positions. For example:

  gpos2 <- GPos(GRanges("chr1", successiveIRanges(rep(990, 2000), gapwidth=10)))

  gr2 <- as(gpos2, "GRanges")

  microbenchmark(range(gpos2), range(gr2))
  # Unit: milliseconds
  #          expr      min       lq     mean   median       uq      max neval cld
  #  range(gpos2) 102.4948 111.9229 137.5418 116.0058 134.2129 239.0805   100  a
  #    range(gr2) 111.3651 118.2075 154.2758 133.3702 211.2164 232.4975   100   b


  microbenchmark(coverage(gpos2), coverage(gr2))
  # Unit: milliseconds
  #             expr       min       lq     mean   median       uq      max neval cld
  #  coverage(gpos2)  98.09502 106.3827 143.7039 111.9778 138.1875 304.8126   100  a
  #    coverage(gr2) 152.82492 168.9123 204.8362 175.1129 189.7343 363.9795   100   b

so not a big difference but a small advantage for GPos.

However, a big advantage for GPos in terms of memory footprint:

  object.size(gpos2)
  # 26520 bytes
  object.size(gr2)
  # 15849120 bytes
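
A rough base-R sketch of where a gap of this size can come from, assuming the compact representation keeps one start and one width per run of consecutive positions while the expanded one keeps one entry per position. The variable names are illustrative, and this is not the GenomicRanges implementation; it only mimics the layout produced by successiveIRanges(rep(990, 2000), gapwidth=10).

```r
# Illustrative only: plain integer vectors, no Bioconductor classes.
# 2000 runs of 990 consecutive positions, separated by gaps of 10.
n_runs    <- 2000L
run_width <- 990L
gap_width <- 10L

# Compact form (assumed): one start and one width per run.
run_starts <- 1L + (seq_len(n_runs) - 1L) * (run_width + gap_width)
run_widths <- rep(run_width, n_runs)

# Expanded form: one integer per individual position.
all_pos <- rep(run_starts, run_widths) + sequence(run_widths) - 1L

length(all_pos)   # 1980000 positions in total

# Compare the sizes of the two representations.
compact_bytes  <- as.numeric(object.size(run_starts)) +
                  as.numeric(object.size(run_widths))
expanded_bytes <- as.numeric(object.size(all_pos))
expanded_bytes / compact_bytes
```

The ratio grows with the length of the runs, so data made of isolated positions would gain little or nothing from the compact form.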

Anyway, if memory is not an issue, then it won't make much difference
whether you use GRanges or GPos.

Cheers,
H.








--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpa...@fredhutch.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


[Bioc-devel] GPos slower than GRanges ?

2018-02-08 Thread Charles Plessy
Hello,

I have just discovered the GPos class, and I would like to use it in
my "CAGEr" package, where for the moment I store single-nucleotide
positions of transcription start sites in GRanges of width 1.

But a simple microbenchmark suggests that, although GPos are more
memory-efficient, they also may be more CPU-hungry, at least
with the "range" function.

Is there a way to optimise, or is it better to stay with
GRanges of width 1 when memory is not an issue ?

> gpos1 <- GPos(c("chr1:44-53", "chr1:5-10", "chr2:2-5"))

> granges1 <- GRanges(gpos1)

> microbenchmark::microbenchmark(range(granges1), range(gpos1))
Unit: milliseconds
            expr      min       lq    mean   median       uq      max neval cld
 range(granges1) 21.42761 21.97009 24.1627 22.24532 22.92655 179.9715   100  a
    range(gpos1) 30.11515 30.84472 32.8824 31.36639 32.19281 104.3027   100   b

> sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 9 (stretch)

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.7.0
LAPACK: /usr/lib/lapack/liblapack.so.3.7.0

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C               LC_TIME=en_GB.UTF-8
 [4] LC_COLLATE=en_GB.UTF-8     LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                  LC_ADDRESS=C
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] GenomicRanges_1.31.16 GenomeInfoDb_1.15.5   IRanges_2.13.22       S4Vectors_0.17.30
[5] BiocGenerics_0.25.2

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.14            XVector_0.19.8          MASS_7.3-47             splines_3.4.3
 [5] zlibbioc_1.24.0         munsell_0.4.3           lattice_0.20-35         colorspace_1.3-2
 [9] rlang_0.1.4             multcomp_1.4-8          plyr_1.8.4              tools_3.4.3
[13] grid_3.4.3              gtable_0.2.0            TH.data_1.0-8           survival_2.41-3
[17] yaml_2.1.15             lazyeval_0.2.1          tibble_1.3.4            Matrix_1.2-12
[21] GenomeInfoDbData_0.99.1 ggplot2_2.2.1           codetools_0.2-15        microbenchmark_1.4-2.1
[25] bitops_1.0-6            RCurl_1.95-4.10         sandwich_2.4-0          compiler_3.4.3
[29] scales_0.5.0            mvtnorm_1.0-6           zoo_1.8-0

(I have also made a benchmark on "real" data, which confirmed the test above)

Have a nice day,

Charles

-- 
Charles Plessy, Ph.D. – RIKEN Center for Life Science Technologies
Division of Genomic Technologies – Genomics Miniaturization Technology Unit
1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama, Kanagawa 230-0045 Japan
■■□―― http://population-transcriptomics.org ――□■■
