Re: [Rd] [.data.frame speedup
TH == Tim Hesterberg [EMAIL PROTECTED] on Thu, 3 Jul 2008 17:04:24 -0700 writes: TH I made a couple of a changes from the previous version: TH - don't use functions anyMissing or notSorted (which aren't in base R) TH - don't check for dup.row.names attribute (need to modify other functions TH before that is useful) TH I have not tested this with a wide variety of inputs; I'm assuming that TH you have some regression tests. yes, we do (and they are part of the R source). TH Here are the file differences. Let me know if you'd like a different TH format. TH $ diff -c dataframe.R dataframe2.R TH *** dataframe.RThu Jul 3 15:48:12 2008 TH --- dataframe2.RThu Jul 3 16:36:46 2008 ... context diff is fine (I typically use '-u' but that's not important). From your patch, I've currently ended in this patch : --- dataframe.R.~19~2008-07-03 02:13:21.000163000 +0200 +++ dataframe.R 2008-07-05 13:02:33.29000 +0200 @@ -579,14 +579,18 @@ ## row names might have NAs. if(is.null(rows)) rows - attr(xx, row.names) rows - rows[i] - if((ina - any(is.na(rows))) | (dup - any(duplicated(rows { - ## both will coerce integer 'rows' to character: - if (!dup is.character(rows)) dup - NA %in% rows - if(ina) - rows[is.na(rows)] - NA - if(dup) - rows - make.unique(as.character(rows)) - } + + ## Do not want to check for duplicates if don't need to + noDuplicateRowNames - + (is.logical(i) || (li - length(i)) 2 || +(is.numeric(i) (min(0, i, na.rm=TRUE) 0 || + (!any(is.na(i)) all(i[-li] i[-1L]) + ## TODO: is.unsorted(., strict=FALSE/TRUE) + if(any(is.na(rows))) + rows[is.na(rows)] - NA # coerces to integer + if(!noDuplicateRowNames any(duplicated(rows))) + rows - make.unique(as.character(rows)) # coerces to integer + ## new in 1.8.0 -- might have duplicate columns if(any(duplicated(nm - names(x names(x) - make.unique(nm) if(is.null(rows)) rows - attr(xx, row.names)[i] TH Here's some code for testing, and timings . I've rationalized (wrote functions) and slightly extended your tests, they are now public at ftp://ftp.stat.math.ethz.ch/U/maechler/R/data.frame-TH-ex.R Unfortunately, they show that the speedup is negative in some cases, e.g. for the 'i - 1:n' case for n - 1000 or 1. I've replicated every system.time() 12 times, to get a sense of the precision, and that's still the conclusion. In other words, your proposed 'noDuplicateRowNames' computations are sometimes more expensive than the duplicated(.) call they replace. To me, that means that the whole exercise was probbaly in vain: We are not making the code more complicated if it's not a uniform improvement. Too bad. Martin Maechler, ETH Zurich __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] [.data.frame speedup
TH == Tim Hesterberg [EMAIL PROTECTED] on Tue, 1 Jul 2008 15:23:53 -0700 writes: TH There is a bug in the standard version of [.data.frame; TH it mixes up handling duplicates and NAs when subscripting rows. TH x - data.frame(x=1:3, y=2:4, row.names=c(a,b,NA)) TH y - x[c(2:3, NA),] TH y TH It creates a data frame with duplicate rows, but won't print. and that's a bug, indeed (introduced to R version 2.5.0, when the [.data.frame code was much optimized for speed, with quite some care), and I have commited a fix (and a regression test) to both R-devel and R-patched. Thanks a lot for the bug report, Tim! Now about your newly proposed code: I'm sorry to say that it looks so much different from the source code in https://svn.r-project.org/R/trunk/src/library/base/R/dataframe.R that I don't think we would accept it as a substitute, easily. Could you try to provide a minimal patch against the source code and also a selfcontained example that exhibits the speed gain you are aiming for ? Best regards, Martin Maechler, ETH Zurich [.] TH On Tue, Jul 1, 2008 at 11:20 AM, Tim Hesterberg [EMAIL PROTECTED] TH wrote: Below is a version of [.data.frame that is faster for subscripting rows of large data frames; it avoids calling duplicated(rows) if there is no need to check for duplicate row names, when: i is logical attr(x, dup.row.names) is not NULL (S+ compatibility) i is numeric and negative i is strictly increasing TH [[alternative HTML version deleted]] TH __ TH R-devel@r-project.org mailing list TH https://stat.ethz.ch/mailman/listinfo/r-devel __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] [.data.frame speedup
I made a couple of a changes from the previous version: - don't use functions anyMissing or notSorted (which aren't in base R) - don't check for dup.row.names attribute (need to modify other functions before that is useful) I have not tested this with a wide variety of inputs; I'm assuming that you have some regression tests. Here are the file differences. Let me know if you'd like a different format. $ diff -c dataframe.R dataframe2.R *** dataframe.RThu Jul 3 15:48:12 2008 --- dataframe2.RThu Jul 3 16:36:46 2008 *** *** 530,535 --- 530,541 x - .Call(R_copyDFattr, xx, x, PACKAGE=base) oldClass(x) - attr(x, row.names) - NULL + # Do not want to check for duplicates if don't need to + noDuplicateRowNames - (is.logical(i) || + length(i) 2 || + (is.numeric(i) min(i, 0, na.rm=TRUE) 0) || + (!any(is.na(i)) all(i[-length(i)]i[-1]))) + if(!missing(j)) { # df[i, j] x - x[j] cols - names(x) # needed for 'drop' *** *** 579,592 ## row names might have NAs. if(is.null(rows)) rows - attr(xx, row.names) rows - rows[i] ! if((ina - any(is.na(rows))) | (dup - any(duplicated(rows { ! ## both will coerce integer 'rows' to character: ! if (!dup is.character(rows)) dup - NA %in% rows ! if(ina) ! rows[is.na(rows)] - NA ! if(dup) ! rows - make.unique(as.character(rows)) ! } ## new in 1.8.0 -- might have duplicate columns if(any(duplicated(nm - names(x names(x) - make.unique(nm) if(is.null(rows)) rows - attr(xx, row.names)[i] --- 585,594 ## row names might have NAs. if(is.null(rows)) rows - attr(xx, row.names) rows - rows[i] ! if(any(is.na(rows))) ! rows[is.na(rows)] - NA # coerces to integer ! if(!noDuplicateRowNames any(duplicated(rows))) ! rows - make.unique(as.character(rows)) # coerces to integer ## new in 1.8.0 -- might have duplicate columns if(any(duplicated(nm - names(x names(x) - make.unique(nm) if(is.null(rows)) rows - attr(xx, row.names)[i] Here's some code for testing, and timings # Use: # R --no-init-file --no-site-file x - data.frame(a=1:4, b=2:5) # Run these commands with the default and new versions of [.data.frame trace(duplicated) trace(make.unique) x[2:1] x[1] x[1:2] x[1:3, ]# save one call to duplicated(rows) x[c(T,F,F,T), ] # save one call to duplicated(rows) x[-1,] # save one call to duplicated(rows) x[-(1:2),] # save one call to duplicated(rows) x[3:1, ] x[c(1,3,2,4,3), ] untrace(duplicated) untrace(make.unique) # Timings # Run one of these lines, then everything afterward n - 10^5 n - 10^6 n - 10^7 y - data.frame(a=1:n, b=1:n) i - 1:n system.time(temp - y[i, ]) # n old new # 10^5.128.052 # 10^6.237.591 # 10^73.102.882 i - rep(TRUE, n) system.time(temp - y[i, ]) # n old new # 10^5.157.053 # 10^6.787.449 # 10^73.799 2.138 i - -1 system.time(temp - y[i, ]) # n old new # 10^5.157.051 # 10^6.614.497 # 10^74.163 2.482 i - rep(1:(n/2), 2) # expect no speedup for this case system.time(temp - y[i, ]) # n old new # 10^5.559.782 # 10^66.066 6.078 # Times shown are the user times reported by system.time # The time savings are mostly quite substantial in the # cases I expect a savings. # I've noticed a lot of variability in results from system.time, # so I don't view these as very accurate, and I don't worry # much about the cases where the time appears worse. On Thu, Jul 3, 2008 at 1:08 PM, Martin Maechler [EMAIL PROTECTED] wrote: TH == Tim Hesterberg [EMAIL PROTECTED] on Tue, 1 Jul 2008 15:23:53 -0700 writes: TH There is a bug in the standard version of [.data.frame; TH it mixes up handling duplicates and NAs when subscripting rows. TH x - data.frame(x=1:3, y=2:4, row.names=c(a,b,NA)) TH y - x[c(2:3, NA),] TH y TH It creates a data frame with duplicate rows, but won't print. and that's a bug, indeed (introduced to R version 2.5.0, when the [.data.frame code was much optimized for speed, with quite some care), and I have commited a fix (and a regression test) to both R-devel and R-patched. Thanks a lot for the bug report, Tim! Now about your newly proposed code: I'm sorry to say that it looks so much different from the source code in https://svn.r-project.org/R/trunk/src/library/base/R/dataframe.R that I don't think we would accept it as a substitute, easily. Could you try to provide a minimal patch against the source code and also a selfcontained example that
Re: [Rd] [.data.frame speedup
There is a bug in the standard version of [.data.frame; it mixes up handling duplicates and NAs when subscripting rows. x - data.frame(x=1:3, y=2:4, row.names=c(a,b,NA)) y - x[c(2:3, NA),] y It creates a data frame with duplicate rows, but won't print. In the previous message I included a version of [.data.frame; it fails for the same example, for a different reason. Here is a fix. subscript.data.frame - function (x, i, j, drop = if (missing(i)) TRUE else length(cols) == 1) { # This version of [.data.frame avoid wasting time enforcing unique # row names if possible. mdrop - missing(drop) Narg - nargs() - (!mdrop) if (Narg 3) { if (!mdrop) warning(drop argument will be ignored) if (missing(i)) return(x) if (is.matrix(i)) return(as.matrix(x)[i]) y - NextMethod([) cols - names(y) if (!is.null(cols) any(is.na(cols))) stop(undefined columns selected) if (any(duplicated(cols))) names(y) - make.unique(cols) return(structure(y, class = oldClass(x), row.names = .row_names_info(x, 0L))) } if (missing(i)) { if (missing(j) drop length(x) == 1L) return(.subset2(x, 1L)) y - if (missing(j)) x else .subset(x, j) if (drop length(y) == 1L) return(.subset2(y, 1L)) cols - names(y) if (any(is.na(cols))) stop(undefined columns selected) if (any(duplicated(cols))) names(y) - make.unique(cols) nrow - .row_names_info(x, 2L) if (drop !mdrop nrow == 1L) return(structure(y, class = NULL, row.names = NULL)) else return(structure(y, class = oldClass(x), row.names = .row_names_info(x, 0L))) } xx - x cols - names(xx) x - vector(list, length(x)) x - .Call(R_copyDFattr, xx, x, PACKAGE = base) oldClass(x) - attr(x, row.names) - NULL # Do not want to check for duplicates if don't need to noDuplicateRowNames - (is.logical(i) || (!is.null(attr(x, dup.row.names))) || (is.numeric(i) min(i, 0, na.rm=TRUE) 0) || (!anyMissing(i) !notSorted(i, strict = TRUE))) if (!missing(j)) { x - x[j] cols - names(x) if (drop length(x) == 1L) { if (is.character(i)) { rows - attr(xx, row.names) i - pmatch(i, rows, duplicates.ok = TRUE) } xj - .subset2(.subset(xx, j), 1L) return(if (length(dim(xj)) != 2L) xj[i] else xj[i, , drop = FALSE]) } if (any(is.na(cols))) stop(undefined columns selected) nxx - structure(seq_along(xx), names = names(xx)) sxx - match(nxx[j], seq_along(xx)) } else sxx - seq_along(x) rows - NULL if (is.character(i)) { rows - attr(xx, row.names) i - pmatch(i, rows, duplicates.ok = TRUE) } for (j in seq_along(x)) { xj - xx[[sxx[j]]] x[[j]] - if (length(dim(xj)) != 2L) xj[i] else xj[i, , drop = FALSE] } if (drop) { n - length(x) if (n == 1L) return(x[[1L]]) if (n 1L) { xj - x[[1L]] nrow - if (length(dim(xj)) == 2L) dim(xj)[1L] else length(xj) drop - !mdrop nrow == 1L } else drop - FALSE } if (!drop) { if (is.null(rows)) rows - attr(xx, row.names) rows - rows[i] if(any(is.na(rows))) rows[is.na(rows)] - NA if(!noDuplicateRowNames any(duplicated(rows))) rows - make.unique(as.character(rows)) if (any(duplicated(nm - names(x names(x) - make.unique(nm) if (is.null(rows)) rows - attr(xx, row.names)[i] attr(x, row.names) - rows oldClass(x) - oldClass(xx) } x } # That requires anyMissing from the splus2R package, # plus notSorted (or a version of is.unsorted with argument 'strict' added). notSorted - function(x, decreasing = FALSE, strict = FALSE, na.rm = FALSE){ # return TRUE if x is not sorted # If decreasing=FALSE, check for sort in increasing order # If strict=TRUE, ties correspond to not being sorted n - length(x) if(length(n) 2) return(FALSE) if(!is.atomic(x) || (!na.rm any(is.na(x return(NA) if(na.rm any(ii - is.na(x))) x - x[!ii] if(decreasing){ ifelse1(strict, any(x[-1] = x[-n]), any(x[-1] x[-n])) } else { # check for sort in increasing order ifelse1(strict, any(x[-1] = x[-n]), any(x[-1] x[-n])) } } On Tue, Jul 1, 2008 at 11:20 AM, Tim Hesterberg [EMAIL PROTECTED] wrote: Below is a version of [.data.frame that is faster for subscripting rows of large data frames; it avoids calling duplicated(rows) if there is no need to check for duplicate row names, when: i is logical attr(x, dup.row.names) is not NULL (S+ compatibility) i is numeric and negative i is strictly increasing [[alternative HTML version deleted]] __ R-devel@r-project.org mailing list
Re: [Rd] [.data.frame speedup
Here is a revised version of notSorted; change argument order (to be more like is.unsorted) and fix blunder. notSorted - function(x, na.rm = FALSE, decreasing = FALSE, strict = FALSE){ # return TRUE if x is not sorted # If decreasing=FALSE, check for sort in increasing order # If strict=TRUE, ties correspond to not being sorted n - length(x) if(n 2) return(FALSE) if(!is.atomic(x) || (!na.rm any(is.na(x return(NA) if(na.rm any(ii - is.na(x))){ x - x[!ii] n - length(x) } if(decreasing){ ifelse1(strict, any(x[-1] = x[-n]), any(x[-1] x[-n])) } else { # check for sort in increasing order ifelse1(strict, any(x[-1] = x[-n]), any(x[-1] x[-n])) } } On Tue, Jul 1, 2008 at 3:23 PM, Tim Hesterberg [EMAIL PROTECTED] wrote: There is a bug in the standard version of [.data.frame; it mixes up handling duplicates and NAs when subscripting rows. x - data.frame(x=1:3, y=2:4, row.names=c(a,b,NA)) y - x[c(2:3, NA),] y It creates a data frame with duplicate rows, but won't print. In the previous message I included a version of [.data.frame; it fails for the same example, for a different reason. Here is a fix. subscript.data.frame - function (x, i, j, drop = if (missing(i)) TRUE else length(cols) == 1) { # This version of [.data.frame avoid wasting time enforcing unique # row names if possible. mdrop - missing(drop) Narg - nargs() - (!mdrop) if (Narg 3) { if (!mdrop) warning(drop argument will be ignored) if (missing(i)) return(x) if (is.matrix(i)) return(as.matrix(x)[i]) y - NextMethod([) cols - names(y) if (!is.null(cols) any(is.na(cols))) stop(undefined columns selected) if (any(duplicated(cols))) names(y) - make.unique(cols) return(structure(y, class = oldClass(x), row.names = .row_names_info(x, 0L))) } if (missing(i)) { if (missing(j) drop length(x) == 1L) return(.subset2(x, 1L)) y - if (missing(j)) x else .subset(x, j) if (drop length(y) == 1L) return(.subset2(y, 1L)) cols - names(y) if (any(is.na(cols))) stop(undefined columns selected) if (any(duplicated(cols))) names(y) - make.unique(cols) nrow - .row_names_info(x, 2L) if (drop !mdrop nrow == 1L) return(structure(y, class = NULL, row.names = NULL)) else return(structure(y, class = oldClass(x), row.names = .row_names_info(x, 0L))) } xx - x cols - names(xx) x - vector(list, length(x)) x - .Call(R_copyDFattr, xx, x, PACKAGE = base) oldClass(x) - attr(x, row.names) - NULL # Do not want to check for duplicates if don't need to noDuplicateRowNames - (is.logical(i) || (!is.null(attr(x, dup.row.names))) || (is.numeric(i) min(i, 0, na.rm=TRUE) 0) || (!anyMissing(i) !notSorted(i, strict = TRUE))) if (!missing(j)) { x - x[j] cols - names(x) if (drop length(x) == 1L) { if (is.character(i)) { rows - attr(xx, row.names) i - pmatch(i, rows, duplicates.ok = TRUE) } xj - .subset2(.subset(xx, j), 1L) return(if (length(dim(xj)) != 2L) xj[i] else xj[i, , drop = FALSE]) } if (any(is.na(cols))) stop(undefined columns selected) nxx - structure(seq_along(xx), names = names(xx)) sxx - match(nxx[j], seq_along(xx)) } else sxx - seq_along(x) rows - NULL if (is.character(i)) { rows - attr(xx, row.names) i - pmatch(i, rows, duplicates.ok = TRUE) } for (j in seq_along(x)) { xj - xx[[sxx[j]]] x[[j]] - if (length(dim(xj)) != 2L) xj[i] else xj[i, , drop = FALSE] } if (drop) { n - length(x) if (n == 1L) return(x[[1L]]) if (n 1L) { xj - x[[1L]] nrow - if (length(dim(xj)) == 2L) dim(xj)[1L] else length(xj) drop - !mdrop nrow == 1L } else drop - FALSE } if (!drop) { if (is.null(rows)) rows - attr(xx, row.names) rows - rows[i] if(any(is.na(rows))) rows[is.na(rows)] - NA if(!noDuplicateRowNames any(duplicated(rows))) rows - make.unique(as.character(rows)) if (any(duplicated(nm - names(x names(x) - make.unique(nm) if (is.null(rows)) rows - attr(xx, row.names)[i] attr(x, row.names) - rows oldClass(x) - oldClass(xx) } x } # That requires anyMissing from the splus2R package, # plus notSorted (or a version of is.unsorted with argument 'strict' added). (first version of notSorted is omitted) On Tue, Jul 1, 2008 at 11:20 AM, Tim Hesterberg [EMAIL PROTECTED] wrote: Below is a version of [.data.frame that is faster for subscripting rows of large data frames; it avoids calling