Re: [Rd] [.data.frame speedup

2008-07-05 Thread Martin Maechler
 TH == Tim Hesterberg [EMAIL PROTECTED]
 on Thu, 3 Jul 2008 17:04:24 -0700 writes:

TH I made a couple of a changes from the previous version:
TH - don't use functions anyMissing or notSorted (which aren't in base R)
TH - don't check for dup.row.names attribute (need to modify other 
functions
TH before that is useful)
TH I have not tested this with a wide variety of inputs; I'm assuming that
TH you have some regression tests.

yes, we do (and they are part of the R source).

TH Here are the file differences.  Let me know if you'd like a different
TH format.


TH $ diff -c dataframe.R dataframe2.R
TH *** dataframe.RThu Jul  3 15:48:12 2008
TH --- dataframe2.RThu Jul  3 16:36:46 2008
...

context diff is fine (I typically use '-u' but that's not important).

From your patch,
I've currently ended in this patch :

--- dataframe.R.~19~2008-07-03 02:13:21.000163000 +0200
+++ dataframe.R 2008-07-05 13:02:33.29000 +0200
@@ -579,14 +579,18 @@
 ## row names might have NAs.
 if(is.null(rows)) rows - attr(xx, row.names)
 rows - rows[i]
-   if((ina - any(is.na(rows))) | (dup - any(duplicated(rows {
-   ## both will coerce integer 'rows' to character:
-   if (!dup  is.character(rows)) dup - NA %in% rows
-   if(ina)
-   rows[is.na(rows)] - NA
-   if(dup)
-   rows - make.unique(as.character(rows))
-   }
+
+   ## Do not want to check for duplicates if don't need to
+   noDuplicateRowNames -
+   (is.logical(i) || (li - length(i))  2 ||
+(is.numeric(i)  (min(0, i, na.rm=TRUE)  0 ||
+  (!any(is.na(i))  all(i[-li]  i[-1L])
+   ## TODO: is.unsorted(., strict=FALSE/TRUE)
+   if(any(is.na(rows)))
+   rows[is.na(rows)] - NA # coerces to integer
+   if(!noDuplicateRowNames  any(duplicated(rows)))
+   rows - make.unique(as.character(rows)) # coerces to integer
+
 ## new in 1.8.0  -- might have duplicate columns
 if(any(duplicated(nm - names(x names(x) - make.unique(nm)
 if(is.null(rows)) rows - attr(xx, row.names)[i]


TH Here's some code for testing, and timings

.

I've rationalized (wrote functions) and slightly extended your
tests, they are now public at
   ftp://ftp.stat.math.ethz.ch/U/maechler/R/data.frame-TH-ex.R

Unfortunately, they show that the speedup is negative in some
cases, e.g. for the 'i - 1:n' case for n - 1000 or 1.
I've replicated every system.time() 12 times, to get a sense of
the precision, and that's still the conclusion.

In other words, your proposed  'noDuplicateRowNames'
computations are sometimes more expensive than the duplicated(.)
call they replace.

To me, that means that the whole exercise was probbaly in vain:
We are not making the code more complicated if it's not a
uniform improvement.  Too bad.

Martin Maechler, ETH Zurich

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] [.data.frame speedup

2008-07-03 Thread Martin Maechler
 TH == Tim Hesterberg [EMAIL PROTECTED]
 on Tue, 1 Jul 2008 15:23:53 -0700 writes:

TH There is a bug in the standard version of [.data.frame;
TH it mixes up handling duplicates and NAs when subscripting rows.

TH x - data.frame(x=1:3, y=2:4, row.names=c(a,b,NA))
TH y - x[c(2:3, NA),]
TH y

TH It creates a data frame with duplicate rows, but won't print.

and that's a bug, indeed
(introduced to R version 2.5.0, when the [.data.frame  code was much
optimized for speed, with quite some care), and I have commited
a fix (and a regression test) to both R-devel and R-patched.

Thanks a lot for the bug report, Tim!

Now about your newly proposed code:
I'm sorry to say that it looks so much different from the source
code in
  https://svn.r-project.org/R/trunk/src/library/base/R/dataframe.R
that I don't think we would accept it as a substitute, easily.

Could you try to provide a minimal patch against the source code
and also a selfcontained example that exhibits the speed gain
you are aiming for ?

Best regards,
Martin Maechler, ETH Zurich

[.]


TH On Tue, Jul 1, 2008 at 11:20 AM, Tim Hesterberg [EMAIL PROTECTED]
TH wrote:

 Below is a version of [.data.frame that is faster
 for subscripting rows of large data frames; it avoids calling
 duplicated(rows)
 if there is no need to check for duplicate row names, when:
 i is logical
 attr(x, dup.row.names) is not NULL (S+ compatibility)
 i is numeric and negative
 i is strictly increasing
 

TH [[alternative HTML version deleted]]

TH __
TH R-devel@r-project.org mailing list
TH https://stat.ethz.ch/mailman/listinfo/r-devel

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] [.data.frame speedup

2008-07-03 Thread Tim Hesterberg
I made a couple of a changes from the previous version:
 - don't use functions anyMissing or notSorted (which aren't in base R)
 - don't check for dup.row.names attribute (need to modify other functions
   before that is useful)
I have not tested this with a wide variety of inputs; I'm assuming that
you have some regression tests.

Here are the file differences.  Let me know if you'd like a different
format.

$ diff -c dataframe.R dataframe2.R
*** dataframe.RThu Jul  3 15:48:12 2008
--- dataframe2.RThu Jul  3 16:36:46 2008
***
*** 530,535 
--- 530,541 
  x - .Call(R_copyDFattr, xx, x, PACKAGE=base)
  oldClass(x) - attr(x, row.names) - NULL

+ # Do not want to check for duplicates if don't need to
+ noDuplicateRowNames - (is.logical(i) ||
+ length(i)  2 ||
+ (is.numeric(i)  min(i, 0, na.rm=TRUE)  0)
||
+ (!any(is.na(i))  all(i[-length(i)]i[-1])))
+
  if(!missing(j)) { # df[i, j]
  x - x[j]
  cols - names(x)  # needed for 'drop'
***
*** 579,592 
  ## row names might have NAs.
  if(is.null(rows)) rows - attr(xx, row.names)
  rows - rows[i]
! if((ina - any(is.na(rows))) | (dup - any(duplicated(rows {
! ## both will coerce integer 'rows' to character:
! if (!dup  is.character(rows)) dup - NA %in% rows
! if(ina)
! rows[is.na(rows)] - NA
! if(dup)
! rows - make.unique(as.character(rows))
! }
  ## new in 1.8.0  -- might have duplicate columns
  if(any(duplicated(nm - names(x names(x) - make.unique(nm)
  if(is.null(rows)) rows - attr(xx, row.names)[i]
--- 585,594 
  ## row names might have NAs.
  if(is.null(rows)) rows - attr(xx, row.names)
  rows - rows[i]
! if(any(is.na(rows)))
!   rows[is.na(rows)] - NA # coerces to integer
! if(!noDuplicateRowNames  any(duplicated(rows)))
!   rows - make.unique(as.character(rows)) # coerces to integer
  ## new in 1.8.0  -- might have duplicate columns
  if(any(duplicated(nm - names(x names(x) - make.unique(nm)
  if(is.null(rows)) rows - attr(xx, row.names)[i]



Here's some code for testing, and timings

# Use:
# R --no-init-file --no-site-file

x - data.frame(a=1:4, b=2:5)

# Run these commands with the default and new versions of [.data.frame
trace(duplicated)
trace(make.unique)
x[2:1]
x[1]
x[1:2]
x[1:3, ]# save one call to duplicated(rows)
x[c(T,F,F,T), ] # save one call to duplicated(rows)
x[-1,]  # save one call to duplicated(rows)
x[-(1:2),]  # save one call to duplicated(rows)
x[3:1, ]
x[c(1,3,2,4,3), ]
untrace(duplicated)
untrace(make.unique)


# Timings
# Run one of these lines, then everything afterward
n - 10^5
n - 10^6
n - 10^7

y - data.frame(a=1:n, b=1:n)

i - 1:n
system.time(temp - y[i, ])
#   n   old new
#   10^5.128.052
#   10^6.237.591
#   10^73.102.882

i - rep(TRUE, n)
system.time(temp - y[i, ])
#   n   old new
#   10^5.157.053
#   10^6.787.449
#   10^73.799   2.138

i - -1
system.time(temp - y[i, ])
#   n   old new
#   10^5.157.051
#   10^6.614.497
#   10^74.163   2.482

i - rep(1:(n/2), 2) # expect no speedup for this case
system.time(temp - y[i, ])
#   n   old new
#   10^5.559.782
#   10^66.066   6.078

# Times shown are the user times reported by system.time

# The time savings are mostly quite substantial in the
# cases I expect a savings.

# I've noticed a lot of variability in results from system.time,
# so I don't view these as very accurate, and I don't worry
# much about the cases where the time appears worse.


On Thu, Jul 3, 2008 at 1:08 PM, Martin Maechler [EMAIL PROTECTED]
wrote:

  TH == Tim Hesterberg [EMAIL PROTECTED]
  on Tue, 1 Jul 2008 15:23:53 -0700 writes:

TH There is a bug in the standard version of [.data.frame;
TH it mixes up handling duplicates and NAs when subscripting rows.

TH x - data.frame(x=1:3, y=2:4, row.names=c(a,b,NA))
TH y - x[c(2:3, NA),]
TH y

TH It creates a data frame with duplicate rows, but won't print.

 and that's a bug, indeed
 (introduced to R version 2.5.0, when the [.data.frame  code was much
 optimized for speed, with quite some care), and I have commited
 a fix (and a regression test) to both R-devel and R-patched.

 Thanks a lot for the bug report, Tim!

 Now about your newly proposed code:
 I'm sorry to say that it looks so much different from the source
 code in
  https://svn.r-project.org/R/trunk/src/library/base/R/dataframe.R
 that I don't think we would accept it as a substitute, easily.

 Could you try to provide a minimal patch against the source code
 and also a selfcontained example that 

Re: [Rd] [.data.frame speedup

2008-07-01 Thread Tim Hesterberg
There is a bug in the standard version of [.data.frame;
it mixes up handling duplicates and NAs when subscripting rows.
  x - data.frame(x=1:3, y=2:4, row.names=c(a,b,NA))
  y - x[c(2:3, NA),]
  y
It creates a data frame with duplicate rows, but won't print.

In the previous message I included a version of [.data.frame;
it fails for the same example, for a different reason.  Here
is a fix.


subscript.data.frame -
function (x, i, j,
  drop = if (missing(i)) TRUE else length(cols) == 1)
{
  # This version of [.data.frame avoid wasting time enforcing unique
  # row names if possible.
  mdrop - missing(drop)
  Narg - nargs() - (!mdrop)
  if (Narg  3) {
if (!mdrop)
  warning(drop argument will be ignored)
if (missing(i))
  return(x)
if (is.matrix(i))
  return(as.matrix(x)[i])
y - NextMethod([)
cols - names(y)
if (!is.null(cols)  any(is.na(cols)))
  stop(undefined columns selected)
if (any(duplicated(cols)))
  names(y) - make.unique(cols)
return(structure(y, class = oldClass(x),
 row.names = .row_names_info(x, 0L)))
  }
  if (missing(i)) {
if (missing(j)  drop  length(x) == 1L)
  return(.subset2(x, 1L))
y - if (missing(j))
  x
else .subset(x, j)
if (drop  length(y) == 1L)
  return(.subset2(y, 1L))
cols - names(y)
if (any(is.na(cols)))
  stop(undefined columns selected)
if (any(duplicated(cols)))
  names(y) - make.unique(cols)
nrow - .row_names_info(x, 2L)
if (drop  !mdrop  nrow == 1L)
  return(structure(y, class = NULL, row.names = NULL))
else return(structure(y, class = oldClass(x),
  row.names = .row_names_info(x, 0L)))
  }
  xx - x
  cols - names(xx)
  x - vector(list, length(x))
  x - .Call(R_copyDFattr, xx, x, PACKAGE = base)
  oldClass(x) - attr(x, row.names) - NULL
  # Do not want to check for duplicates if don't need to
  noDuplicateRowNames - (is.logical(i) ||
  (!is.null(attr(x, dup.row.names))) ||
  (is.numeric(i)  min(i, 0, na.rm=TRUE)  0) ||
  (!anyMissing(i)  !notSorted(i, strict = TRUE)))
  if (!missing(j)) {
x - x[j]
cols - names(x)
if (drop  length(x) == 1L) {
  if (is.character(i)) {
rows - attr(xx, row.names)
i - pmatch(i, rows, duplicates.ok = TRUE)
  }
  xj - .subset2(.subset(xx, j), 1L)
  return(if (length(dim(xj)) != 2L) xj[i] else xj[i,
 , drop = FALSE])
}
if (any(is.na(cols)))
  stop(undefined columns selected)
nxx - structure(seq_along(xx), names = names(xx))
sxx - match(nxx[j], seq_along(xx))
  }
  else sxx - seq_along(x)
  rows - NULL
  if (is.character(i)) {
rows - attr(xx, row.names)
i - pmatch(i, rows, duplicates.ok = TRUE)
  }
  for (j in seq_along(x)) {
xj - xx[[sxx[j]]]
x[[j]] - if (length(dim(xj)) != 2L)
  xj[i]
else xj[i, , drop = FALSE]
  }
  if (drop) {
n - length(x)
if (n == 1L)
  return(x[[1L]])
if (n  1L) {
  xj - x[[1L]]
  nrow - if (length(dim(xj)) == 2L)
dim(xj)[1L]
  else length(xj)
  drop - !mdrop  nrow == 1L
}
else drop - FALSE
  }
  if (!drop) {
if (is.null(rows))
  rows - attr(xx, row.names)
rows - rows[i]
if(any(is.na(rows)))
  rows[is.na(rows)] - NA
if(!noDuplicateRowNames  any(duplicated(rows)))
rows - make.unique(as.character(rows))
if (any(duplicated(nm - names(x
  names(x) - make.unique(nm)
if (is.null(rows))
  rows - attr(xx, row.names)[i]
attr(x, row.names) - rows
oldClass(x) - oldClass(xx)
  }
  x
}

# That requires anyMissing from the splus2R package,
# plus notSorted (or a version of is.unsorted with argument 'strict' added).

notSorted - function(x, decreasing = FALSE, strict = FALSE, na.rm = FALSE){
  # return TRUE if x is not sorted
  # If decreasing=FALSE, check for sort in increasing order
  # If strict=TRUE, ties correspond to not being sorted
  n - length(x)
  if(length(n)  2)
return(FALSE)
  if(!is.atomic(x) || (!na.rm  any(is.na(x
return(NA)
  if(na.rm  any(ii - is.na(x)))
x - x[!ii]
  if(decreasing){
ifelse1(strict,
any(x[-1] = x[-n]),
any(x[-1]   x[-n]))
  } else { # check for sort in increasing order
ifelse1(strict,
any(x[-1] = x[-n]),
any(x[-1]   x[-n]))
  }
}


On Tue, Jul 1, 2008 at 11:20 AM, Tim Hesterberg [EMAIL PROTECTED]
wrote:

 Below is a version of [.data.frame that is faster
 for subscripting rows of large data frames; it avoids calling
 duplicated(rows)
 if there is no need to check for duplicate row names, when:
 i is logical
 attr(x, dup.row.names) is not NULL (S+ compatibility)
 i is numeric and negative
 i is strictly increasing


[[alternative HTML version deleted]]

__
R-devel@r-project.org mailing list

Re: [Rd] [.data.frame speedup

2008-07-01 Thread Tim Hesterberg
Here is a revised version of notSorted; change argument order (to be more
like
is.unsorted) and fix blunder.

notSorted - function(x, na.rm = FALSE, decreasing = FALSE, strict = FALSE){
  # return TRUE if x is not sorted
  # If decreasing=FALSE, check for sort in increasing order
  # If strict=TRUE, ties correspond to not being sorted
  n - length(x)
  if(n  2)
return(FALSE)
  if(!is.atomic(x) || (!na.rm  any(is.na(x
return(NA)
  if(na.rm  any(ii - is.na(x))){
x - x[!ii]
n - length(x)
  }
  if(decreasing){
ifelse1(strict,
any(x[-1] = x[-n]),
any(x[-1]   x[-n]))
  } else { # check for sort in increasing order
ifelse1(strict,
any(x[-1] = x[-n]),
any(x[-1]   x[-n]))
  }
}


On Tue, Jul 1, 2008 at 3:23 PM, Tim Hesterberg [EMAIL PROTECTED]
wrote:

 There is a bug in the standard version of [.data.frame;
 it mixes up handling duplicates and NAs when subscripting rows.
   x - data.frame(x=1:3, y=2:4, row.names=c(a,b,NA))
   y - x[c(2:3, NA),]
   y
 It creates a data frame with duplicate rows, but won't print.

 In the previous message I included a version of [.data.frame;
 it fails for the same example, for a different reason.  Here
 is a fix.


 subscript.data.frame -
 function (x, i, j,
   drop = if (missing(i)) TRUE else length(cols) == 1)
 {
   # This version of [.data.frame avoid wasting time enforcing unique
   # row names if possible.

   mdrop - missing(drop)
   Narg - nargs() - (!mdrop)
   if (Narg  3) {
 if (!mdrop)
   warning(drop argument will be ignored)
 if (missing(i))
   return(x)
 if (is.matrix(i))
   return(as.matrix(x)[i])
 y - NextMethod([)
 cols - names(y)
 if (!is.null(cols)  any(is.na(cols)))
   stop(undefined columns selected)
 if (any(duplicated(cols)))
   names(y) - make.unique(cols)
 return(structure(y, class = oldClass(x),
  row.names = .row_names_info(x, 0L)))
   }
   if (missing(i)) {
 if (missing(j)  drop  length(x) == 1L)
   return(.subset2(x, 1L))
 y - if (missing(j))
   x
 else .subset(x, j)
 if (drop  length(y) == 1L)
   return(.subset2(y, 1L))
 cols - names(y)
 if (any(is.na(cols)))
   stop(undefined columns selected)
 if (any(duplicated(cols)))
   names(y) - make.unique(cols)
 nrow - .row_names_info(x, 2L)
 if (drop  !mdrop  nrow == 1L)
   return(structure(y, class = NULL, row.names = NULL))
 else return(structure(y, class = oldClass(x),
   row.names = .row_names_info(x, 0L)))
   }
   xx - x
   cols - names(xx)
   x - vector(list, length(x))
   x - .Call(R_copyDFattr, xx, x, PACKAGE = base)
   oldClass(x) - attr(x, row.names) - NULL
   # Do not want to check for duplicates if don't need to
   noDuplicateRowNames - (is.logical(i) ||
   (!is.null(attr(x, dup.row.names))) ||
   (is.numeric(i)  min(i, 0, na.rm=TRUE)  0) ||
   (!anyMissing(i)  !notSorted(i, strict = TRUE)))

   if (!missing(j)) {
 x - x[j]
 cols - names(x)
 if (drop  length(x) == 1L) {
   if (is.character(i)) {
 rows - attr(xx, row.names)
 i - pmatch(i, rows, duplicates.ok = TRUE)
   }
   xj - .subset2(.subset(xx, j), 1L)
   return(if (length(dim(xj)) != 2L) xj[i] else xj[i,
  , drop = FALSE])
 }
 if (any(is.na(cols)))
   stop(undefined columns selected)
 nxx - structure(seq_along(xx), names = names(xx))
 sxx - match(nxx[j], seq_along(xx))
   }
   else sxx - seq_along(x)
   rows - NULL
   if (is.character(i)) {
 rows - attr(xx, row.names)
 i - pmatch(i, rows, duplicates.ok = TRUE)
   }
   for (j in seq_along(x)) {
 xj - xx[[sxx[j]]]
 x[[j]] - if (length(dim(xj)) != 2L)
   xj[i]
 else xj[i, , drop = FALSE]
   }
   if (drop) {
 n - length(x)
 if (n == 1L)
   return(x[[1L]])
 if (n  1L) {
   xj - x[[1L]]
   nrow - if (length(dim(xj)) == 2L)
 dim(xj)[1L]
   else length(xj)
   drop - !mdrop  nrow == 1L
 }
 else drop - FALSE
   }
   if (!drop) {
 if (is.null(rows))
   rows - attr(xx, row.names)
 rows - rows[i]
 if(any(is.na(rows)))
   rows[is.na(rows)] - NA
 if(!noDuplicateRowNames  any(duplicated(rows)))
 rows - make.unique(as.character(rows))
 if (any(duplicated(nm - names(x
   names(x) - make.unique(nm)
 if (is.null(rows))
   rows - attr(xx, row.names)[i]
 attr(x, row.names) - rows
 oldClass(x) - oldClass(xx)
   }
   x
 }

 # That requires anyMissing from the splus2R package,
 # plus notSorted (or a version of is.unsorted with argument 'strict'
 added).

 (first version of notSorted is omitted)


 On Tue, Jul 1, 2008 at 11:20 AM, Tim Hesterberg [EMAIL PROTECTED]
 wrote:

 Below is a version of [.data.frame that is faster
 for subscripting rows of large data frames; it avoids calling