[Rd] R 3.0.0 memory use

2013-04-14 Thread Tim Hesterberg
I did some benchmarking of data frame code, and
it appears that R 3.0.0 is far worse than earlier versions of R
in terms of how many large objects it allocates space for,
for data frame operations - creation, subscripting, subscript replacement.
For a data frame with n rows, it makes either 2 or 4 extra copies of
all of:
8n bytes (e.g. double precision)
24n bytes
32n bytes
E.g., for as.data.frame(numeric vector), instead of allocations
totalling ~8n bytes, it allocates 33 times that much.

Here, compare columns 3 and 5
(columns 2 and 4 are with the dataframe package).

# Summary
#                            R-2.14.2         R-2.15.3         R-3.0.0
#                            w/o      with    w/o      with    w/o
#   as.data.frame(y)         3        1       1        1       5;4;4
#   data.frame(y)            7        3       4        2       6;2;2
#   data.frame(y, z)         7 each   3 each  4        2       8;4;4
#   as.data.frame(l)         8        3       5        2       9;4;4
#   data.frame(l)            13       5       8        3       12;4;4
#   d$z <- z                 3,2      1,1     3,1      2,1     7;4;4,1
#   d[["z"]] <- z            4,3      1,1     3,1      2,1     7;4;4,1
#   d[, "z"] <- z            6,4,2    2,2,1   4,2,2    3,2,1   8;4;4,2,2
#   d["z"] <- z              6,5,2    2,2,1   4,2,2    3,2,1   8;4;4,2,2
#   d["z"] <- list(z=z)      6,3,2    2,2,1   4,2,2    3,2,1   8;4;4,2,2
#   d["z"] <- Z #list(z=z)   6,2,2    2,1,1   4,1,2    3,1,1   8;4;4,1,2
#   a <- d["y"]              2        1       2        1       6;4;4
#   a <- d[, "y", drop=F]    2        1       2        1       6;4;4

# Where two numbers are given, they refer to:
#   (copies of the old data frame),
#   (copies of the new column)
# A third number refers to numbers of
#   (copies made of an integer vector of row names)

# For R 3.0.0, I'm getting astounding results - many more copies,
# and also some copies of larger objects; in addition to the data
# vectors of size 80K and 160K, also 240K and 320K.
# Where three numbers are given in form a;c;d, they refer to
#   (copies of 80K; 240K; 320K)
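
As a rough check on the "33 times" figure, the sketch below multiplies the 5;4;4 counts for as.data.frame(y) by the allocation sizes. The exact byte sizes (80040, 240048, 320040) are an assumption here: they are the values Rprofmem reports for n = 10^4 elsewhere in this thread, standing in for the rounded 80K, 240K and 320K above.

```r
# Allocation sizes in bytes (assumed: the exact values Rprofmem reports for n = 10^4)
sizes  <- c(80040, 240048, 320040)
# The "5;4;4" entry for as.data.frame(y) under R 3.0.0
counts <- c(5, 4, 4)

total <- sum(sizes * counts)   # all bytes allocated by the call
ratio <- total / sizes[1]      # relative to one copy of the data vector
print(c(total = total, ratio = ratio))
```

The ratio comes out just under 33, which is consistent with the figure quoted for as.data.frame(numeric vector).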

The benchmarks are at
http://www.timhesterberg.net/r-packages/memory.R

I'm using versions of R I installed from source on a Linux box, using e.g.
./configure --prefix=(my path) --enable-memory-profiling --with-readline=no
make
make install
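
Whether a given build was configured with --enable-memory-profiling can be checked from within R. A minimal sketch, assuming the output file (a temporary file here) and the 10^4-byte threshold are free choices rather than anything the benchmark script requires:

```r
# TRUE only if R was built with --enable-memory-profiling
has_profmem <- capabilities("profmem")

if (has_profmem) {
  out <- tempfile()                  # hypothetical output file
  y <- 1:10^4 + 0.0
  Rprofmem(out, threshold = 10^4)    # log large-vector allocations above 10^4 bytes
  d <- as.data.frame(y)
  Rprofmem(NULL)                     # stop logging
  print(readLines(out, warn = FALSE))
} else {
  message("This build of R has no memory profiling support.")
}
```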

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] R 3.0.0 memory use

2013-04-14 Thread luke-tierney

There were a couple of bug fixes to somewhat obscure compound
assignment related bugs that required bumping up internal reference
counts. It's possible that one or more of these are responsible. If so
it is unavoidable for now, but it's worth finding out for sure. With
some stripped down test examples it should be possible to identify
when things changed. I won't have time to look for some time, but if
someone else wanted to nail this down that would be useful.

Best,

luke




--
Luke Tierney
Chair, Statistics and Actuarial Science
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa                  Phone:  319-335-3386
Department of Statistics and        Fax:    319-335-3017
   Actuarial Science
241 Schaeffer Hall                  email:  luke-tier...@uiowa.edu
Iowa City, IA 52242                 WWW:    http://www.stat.uiowa.edu



Re: [Rd] R 3.0.0 memory use

2013-04-14 Thread Martin Morgan



I can't quite tell from Tim's script what he's documenting. In R-2.15.3 I have

  > Rprofmem(); Rprofmem(NULL); readLines("Rprofmem.out", warn=FALSE)
  character(0)

(or sometimes [1] "new page:new page:\"Rprofmem\" ")

whereas in R-3.0.0

  > Rprofmem(); Rprofmem(NULL); readLines("Rprofmem.out", warn=FALSE)
  [1] "320040 :80040 :240048 :320040 :80040 :240048 :"

I think these are the allocations Tim is seeing. They're from the parser (see
below) rather than as.data.frame. For Tim's example

  > y <- 1:10^4 + 0.0
  > Rprofmem(); d <- as.data.frame(y); Rprofmem(NULL); readLines("Rprofmem.out")
  [1] "320040 :80040 :240048 :320040 :80040 :240048 :80040 :\"as.data.frame.numeric\" \"as.data.frame\" "
  [2] "320040 :80040 :240048 :320040 :80040 :240048 :"

only the allocation 80040 is from as.data.frame (from the call stack output).
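
One way to separate allocations that carry a call stack from those that do not is to split each Rprofmem line on its "<bytes> :" markers. The helper below is a hypothetical sketch (parse_rprofmem is not part of R), and it assumes the line layout seen in the output above, where only the last allocation on a line is followed by a stack.

```r
# Hypothetical helper: turn Rprofmem log lines into one row per allocation.
parse_rprofmem <- function(lines) {
  rows <- lapply(lines, function(line) {
    # Each allocation is logged as "<bytes> :"
    marks <- regmatches(line, gregexpr("[0-9]+ :", line))[[1]]
    if (length(marks) == 0) return(NULL)   # e.g. "new page:" entries
    bytes <- as.numeric(sub(" :", "", marks, fixed = TRUE))
    # Text after the last marker is the call stack (empty if none was logged)
    stack <- sub(".*[0-9]+ :", "", line)
    stack <- gsub("^\\s+|\\s+$", "", stack)
    data.frame(bytes = bytes,
               stack = c(rep("", length(bytes) - 1), stack),
               stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

# The two lines from the R-3.0.0 output above
lines <- c(
  '320040 :80040 :240048 :320040 :80040 :240048 :80040 :"as.data.frame.numeric" "as.data.frame" ',
  '320040 :80040 :240048 :320040 :80040 :240048 :'
)
allocs <- parse_rprofmem(lines)
with_stack <- allocs[allocs$stack != "", ]
print(with_stack)
```

Only the rows in with_stack can be attributed to R-level calls; the remaining rows were logged with an empty stack.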

Under R -d gdb

  (gdb) b R_OutputStackTrace
  (gdb) r
  > Rprofmem(); Rprofmem(NULL)

  Breakpoint 1, R_OutputStackTrace (file=0xbd43f0) at /home/mtmorgan/src/R-3-0-branch/src/main/memory.c:3434
  3434  {
  (gdb) bt
  #0  R_OutputStackTrace (file=0xbd43f0) at /home/mtmorgan/src/R-3-0-branch/src/main/memory.c:3434
  #1  0x7792ff83 in R_ReportAllocation (size=320040) at /home/mtmorgan/src/R-3-0-branch/src/main/memory.c:3456
  #2  Rf_allocVector (type=13, length=8) at /home/mtmorgan/src/R-3-0-branch/src/main/memory.c:2478
  #3  0x7790bedf in growData () at gram.y:3391

and the memory allocations are from these lines in the parser gram.y

PROTECT( bigger = allocVector( INTSXP, data_size * DATA_ROWS ) ) ;
PROTECT( biggertext = allocVector( STRSXP, data_size ) );

I'm not sure why these show up under R 3.0.0, though.

$ R-2-15-branch/bin/R --version
R version 2.15.3 Patched (2013-03-13 r62579) -- "Security Blanket"
Copyright (C) 2013 The R Foundation for Statistical Computing
ISBN 3-900051-07-0
Platform: x86_64-unknown-linux-gnu (64-bit)

R-3-0-branch$ bin/R --version
R version 3.0.0 Patched (2013-04-14 r62579) -- "Masked Marvel"
Copyright (C) 2013 The R Foundation for Statistical Computing
Platform: x86_64-unknown-linux-gnu (64-bit)

Martin






Re: [Rd] R 3.0.0 memory use

2013-04-14 Thread Tim Hesterberg
When I change the data set size, the extra allocations do
not change in size. This supports Luke and Martin's diagnosis.

The extra allocations are either 2 or 4 allocations each of size
 80040
240048
320040

Details (you may skip):

(Fresh session of R 3.0.0)
> y <- 1:10^4 + 0.0
> Rprofmem("temp.out", threshold = 10^4)
> d <- as.data.frame(y)
> Rprofmem(NULL); system("cat temp.out")
320040 :80040 :240048 :320040 :80040 :240048 :80040 :"as.data.frame.numeric" "as.data.frame" 
320040 :80040 :240048 :320040 :80040 :240048 : 
> # Try increasing size by a factor of 10
> y <- 1:10^5 + 0.0
> Rprofmem("temp.out", threshold = 10^4)
> d <- as.data.frame(y)
> Rprofmem(NULL); system("cat temp.out")
320040 :80040 :240048 :320040 :80040 :240048 :800040 :"as.data.frame.numeric" "as.data.frame" 
320040 :80040 :240048 :320040 :80040 :240048 : 

The number of allocations shown, of different sizes:

Size     3.0.0   3.0.0   2.15.3  2.15.3
         first   second  first   second
240048   4       4       0       0
320040   4       4       0       0
 80040   5       4       1       0
800040   0       1       0       1

So it looks like both R 2.15.3 and R 3.0.0 are making one copy of the data;
only R 3.0.0 shows the extra fixed-size allocations.
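
The R 3.0.0 columns of the table above can be re-derived by tallying the sizes in the two logs. A small sketch, with the sizes typed in by hand from the R 3.0.0 output earlier in this message (the variable names are only for illustration):

```r
# Allocation sizes from the R 3.0.0 logs, in the order they were reported
small <- c(320040, 80040, 240048, 320040, 80040, 240048, 80040,    # n = 10^4
           320040, 80040, 240048, 320040, 80040, 240048)
large <- c(320040, 80040, 240048, 320040, 80040, 240048, 800040,   # n = 10^5
           320040, 80040, 240048, 320040, 80040, 240048)

sizes <- sort(union(small, large))
counts <- data.frame(
  bytes = sizes,
  n_1e4 = as.integer(table(factor(small, levels = sizes))),
  n_1e5 = as.integer(table(factor(large, levels = sizes)))
)
print(counts)
```

Sizes whose counts are the same in both runs do not depend on n; only the single data-vector allocation moves from 80040 to 800040 bytes.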

(Fresh session of R 2.15.3)
> y <- 1:10^4 + 0.0
> Rprofmem("temp.out", threshold = 10^4)
> d <- as.data.frame(y)
> Rprofmem(NULL); system("cat temp.out")
80040 :"as.data.frame.numeric" "as.data.frame" 
> # Increase size by factor of 10
> y <- 1:10^5 + 0.0
> Rprofmem("temp.out", threshold = 10^4)
> d <- as.data.frame(y)
> Rprofmem(NULL); system("cat temp.out")
800040 :"as.data.frame.numeric" "as.data.frame" 



