Re: [Rd] serialize() to via temporary file is heaps faster than doing it directly (on Windows)

2008-08-29 Thread Henrik Bengtsson
I just want to re-post this thread in case it slipped through the
"summer sieve" of someone who might be interested and/or has a real
solution beyond my serialize2() patch.

Cheers

Henrik

On Thu, Jul 24, 2008 at 8:10 PM, Henrik Bengtsson <[EMAIL PROTECTED]> wrote:
> Hi,
>
> FYI, I just noticed that on Windows (but not Linux) it is more than an
> order of magnitude (about 50x in the example below) faster to serialize()
> an object to a temporary file and then read it back than to serialize to
> a raw vector directly.  This affects, for instance, how fast
> digest::digest() can provide a checksum.
>
> Example:
> x <- 1:1e7;
> t1 <- system.time(raw1 <- serialize(x, connection=NULL));
> print(t1);
> #    user  system elapsed
> #  174.23  129.35  304.70  ## 5 minutes
> t2 <- system.time(raw2 <- serialize2(x, connection=NULL));
> print(t2);
> #    user  system elapsed
> #    2.19    0.18    5.72  ## 5 seconds
> print(t1/t2);
> #      user    system   elapsed
> #  79.55708  718.6     53.26923
> stopifnot(identical(raw1, raw2));
>
> where serialize2() serializes to a file and reads the result back:
>
> serialize2 <- function(object, connection, ...) {
>   if (is.null(connection)) {
>     # It is faster to serialize to a temporary file and read it back
>     pathname <- tempfile();
>     con <- file(pathname, open="wb");
>     on.exit({
>       if (!is.null(con))
>         close(con);
>       if (file.exists(pathname))
>         file.remove(pathname);
>     });
>     base::serialize(object, connection=con, ...);
>     close(con);
>     con <- NULL;
>     fileSize <- file.info(pathname)$size;
>     readBin(pathname, what="raw", n=fileSize);
>   } else {
>     base::serialize(object, connection=connection, ...);
>   }
> } # serialize2()
>
> The above benchmarking was done in a fresh R v2.7.1 session on WinXP Pro:
>
>> sessionInfo()
> R version 2.7.1 Patched (2008-06-27 r46012)
> i386-pc-mingw32
>
> locale:
> LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252
>
> attached base packages:
> [1] stats graphics  grDevices utils datasets  methods   base
>
>
> When I do the same on a Linux machine there is no difference:
>
>> sessionInfo()
> R version 2.7.1 (2008-06-23)
> x86_64-unknown-linux-gnu
>
> locale:
> LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats graphics  grDevices utils datasets  methods   base
>
> Is there an obvious reason (and an obvious fix) for this?
>
> Cheers
>
> Henrik
>

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


[Rd] serialize() to via temporary file is heaps faster than doing it directly (on Windows)

2008-07-24 Thread Henrik Bengtsson
Hi,

FYI, I just noticed that on Windows (but not Linux) it is more than an
order of magnitude (about 50x in the example below) faster to serialize()
an object to a temporary file and then read it back than to serialize to
a raw vector directly.  This affects, for instance, how fast
digest::digest() can provide a checksum.

Example:
x <- 1:1e7;
t1 <- system.time(raw1 <- serialize(x, connection=NULL));
print(t1);
#    user  system elapsed
#  174.23  129.35  304.70  ## 5 minutes
t2 <- system.time(raw2 <- serialize2(x, connection=NULL));
print(t2);
#    user  system elapsed
#    2.19    0.18    5.72  ## 5 seconds
print(t1/t2);
#      user    system   elapsed
#  79.55708  718.6     53.26923
stopifnot(identical(raw1, raw2));

where serialize2() serializes to a file and reads the result back:

serialize2 <- function(object, connection, ...) {
  if (is.null(connection)) {
    # It is faster to serialize to a temporary file and read it back
    pathname <- tempfile();
    con <- file(pathname, open="wb");
    on.exit({
      if (!is.null(con))
        close(con);
      if (file.exists(pathname))
        file.remove(pathname);
    });
    base::serialize(object, connection=con, ...);
    close(con);
    con <- NULL;
    fileSize <- file.info(pathname)$size;
    readBin(pathname, what="raw", n=fileSize);
  } else {
    base::serialize(object, connection=connection, ...);
  }
} # serialize2()

The above benchmarking was done in a fresh R v2.7.1 session on WinXP Pro:

> sessionInfo()
R version 2.7.1 Patched (2008-06-27 r46012)
i386-pc-mingw32

locale:
LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252

attached base packages:
[1] stats graphics  grDevices utils datasets  methods   base


When I do the same on a Linux machine there is no difference:

> sessionInfo()
R version 2.7.1 (2008-06-23)
x86_64-unknown-linux-gnu

locale:
LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics  grDevices utils datasets  methods   base

Is there an obvious reason (and an obvious fix) for this?
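
One way to narrow this down might be to time both routes at several object sizes and see whether the direct route slows down faster than linearly. The following is only a minimal sketch: serializeViaFile() is a hypothetical helper (a compact variant of serialize2() above, not part of base R), and the sizes are arbitrary and kept small so that it finishes quickly even on an affected system.

```r
# Hypothetical helper (not in base R): serialize to a temporary file,
# then read the bytes back as a raw vector.
serializeViaFile <- function(object) {
  pathname <- tempfile()
  on.exit(unlink(pathname), add = TRUE)   # always remove the temporary file
  con <- file(pathname, open = "wb")
  tryCatch(serialize(object, connection = con), finally = close(con))
  readBin(pathname, what = "raw", n = file.info(pathname)$size)
}

# Arbitrary sizes, chosen only for illustration.
sizes <- c(1e4, 1e5, 1e6)

# One row per size: elapsed seconds for the direct and the via-file route.
timings <- t(sapply(sizes, function(n) {
  x <- seq_len(n)
  direct  <- system.time(raw1 <- serialize(x, connection = NULL))[["elapsed"]]
  viaFile <- system.time(raw2 <- serializeViaFile(x))[["elapsed"]]
  stopifnot(identical(raw1, raw2))        # both routes must give the same bytes
  c(n = n, direct = direct, viaFile = viaFile)
}))
print(timings)
```

If the "direct" column grows much faster than the "viaFile" column as n increases, that would suggest the cost lies in how the in-memory result is built up rather than in the serialization itself, but that is only a guess on my part.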

Cheers

Henrik
