Re: [Rd] Question re: NA, NaNs in R
This isn't quite what you were asking, but it might inform your choice. R doesn't try to maintain the distinction between NA and NaN when doing calculations, e.g.:

> NA + NaN
[1] NA
> NaN + NA
[1] NaN

So for the aggregate package, I didn't attempt to treat them differently. The aggregate package is available at http://www.timhesterberg.net/r-packages

Here is the inst/doc/missingValues.txt file from that package:
--
Copyright 2012 Google Inc. All Rights Reserved.
Author: Tim Hesterberg roc...@google.com
Distributed under GPL 2 or later.

Handling of missing values and not-a-numbers.

Here I'll note how this package handles missing values. I do it the way R handles them, rather than the more strict way that S+ does.

First, for terminology:
  NaN = not-a-number, e.g. the result of 0/0
  NA  = missing value or "true missing value", e.g. survey non-response
  xx  = I'll use this for the union of those: a missing value of any kind.

For background, at the hardware level there is an IEEE standard that specifies that certain bit patterns are NaN, and that operations involving a NaN result in another NaN. That standard says nothing about missing values, which are important in statistics. So what R and S+ do is to pick one of the NaN bit patterns and declare it to be NA. In other words, the NA bit pattern is a subset of the NaN bit patterns.

At the user level, the reverse seems to hold. You can assign either NA or NaN to an object. But:
  is.na(x)  returns TRUE for both
  is.nan(x) returns TRUE for NaN and FALSE for NA
Based on that, you'd think that NaN is a subset of NA. To tell whether something is a true missing value, do:
  (is.na(x) & !is.nan(x))

The S+ convention is that any operation involving NA results in NA; otherwise, any operation involving NaN results in NaN. The R convention is that any operation involving xx results in an xx: a missing value of any kind results in another missing value of any kind.
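The NA/NaN tests described above can be sketched at the R prompt (a minimal illustration; the vector x is hypothetical):

```r
x <- c(1, NA, NaN, 0/0)          # 0/0 is another NaN
is.na(x)                         # TRUE for NA and for both NaNs
is.nan(x)                        # TRUE only for the NaNs
trueNA <- is.na(x) & !is.nan(x)  # TRUE only for the true missing value
```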
R considers NA and NaN equivalent for testing purposes: all.equal(NA_real_, NaN) gives TRUE.

Some R functions follow the S+ convention, e.g. the Math2 functions in src/main/arithmetic.c use this macro:

#define if_NA_Math2_set(y,a,b) \
        if      (ISNA (a) || ISNA (b)) y = NA_REAL; \
        else if (ISNAN(a) || ISNAN(b)) y = R_NaN;

Other R functions, like the basic arithmetic operations + - / * ^, do not (search for PLUSOP in src/main/arithmetic.c); they just let the hardware do the calculations. As a result, you can get odd results like:

> is.nan(NA_real_ + NaN)
[1] FALSE
> is.nan(NaN + NA_real_)
[1] TRUE

The R help files help(is.na) and help(is.nan) suggest that computations involving NA and NaN are indeterminate. It is faster to use the R convention; most operations are just handled by the hardware, without extra work. In cases like sum(x, na.rm=TRUE), the help file specifies that both NA and NaN are removed.

There is one NA but multiple NaNs. And please re-read 'man memcmp': your cast is wrong.

On 10/02/2014 06:52, Kevin Ushey wrote:
Hi R-devel,

I have a question about the differentiation between NA and NaN values as implemented in R. In arithmetic.c, we have:

int R_IsNA(double x)
{
    if (isnan(x)) {
        ieee_double y;
        y.value = x;
        return (y.word[lw] == 1954);
    }
    return 0;
}

ieee_double is just used for type punning so we can check the final bits and see if they're equal to 1954; if they are, x is NA; if they're not, x is NaN (as defined for R_IsNaN).

My question is -- I can see a substantial increase in speed (on my computer, in certain cases) if I replace this check with:

int R_IsNA(double x)
{
    return memcmp(
        (char*)(&x),
        (char*)(&NA_REAL),
        sizeof(double)
    ) == 0;
}

IIUC, there is only one bit pattern used to encode R NA values, so this should be safe. But I would like to be sure: is there any guarantee that the different functions in R would return NA as identical to the bit pattern defined for NA_REAL, for a given architecture? Similarly for NaN value(s) and R_NaN?
My guess is that it is possible some functions used internally by R might encode NaN values differently, i.e., setting the lower word to a value different from 1954 (hence being NaN, but potentially not identical to R_NaN), or perhaps this is architecture-dependent. However, NA should be one specific bit pattern (?). And I wonder if there is any guarantee that the different functions used in R would return an NaN value identical to R_NaN (which appears to be the IEEE NaN)?

(interested parties can see + run a simple benchmark from the gist at https://gist.github.com/kevinushey/8911432)

Thanks,
Kevin

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

--
Brian D. Ripley, rip...@stats.ox.ac.uk
Professor
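The claim that NA is one particular NaN bit pattern can be checked at the R level by comparing raw bytes (a sketch; the byte order depends on platform endianness):

```r
na.bytes  <- writeBin(NA_real_, raw())   # the 8 bytes of R's NA
nan.bytes <- writeBin(NaN, raw())        # the 8 bytes of the default NaN
identical(na.bytes, nan.bytes)           # FALSE: the NaN payloads differ
```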
[Rd] Suggest adding a testing keyword
I suggest adding this to R_HOME/doc/KEYWORDS.db:

Programming|testing: Software testing

and add a corresponding entry in R_HOME/doc/KEYWORDS.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] R 3.0.0 memory use
I did some benchmarking of data frame code, and it appears that R 3.0.0 is far worse than earlier versions of R in terms of how many large objects it allocates space for, for data frame operations - creation, subscripting, subscript replacement.

For a data frame with n rows, it makes either 2 or 4 extra copies of all of:
  8n bytes (e.g. double precision)
  24n bytes
  32n bytes

E.g., for as.data.frame(numeric vector), instead of allocations totalling ~8n bytes, it allocates 33 times that much. Here, compare columns 3 and 5 (columns 2 and 4 are with the dataframe package).

# Summary
#                         R-2.14.2      R-2.15.3      R-3.0.0
#                         w/o    with   w/o    with   w/o
# as.data.frame(y)        3      1      1      1      5;4;4
# data.frame(y)           7      3      4      2      6;2;2
# data.frame(y, z)        7 each 3 each 4      2      8;4;4
# as.data.frame(l)        8      3      5      2      9;4;4
# data.frame(l)           13     5      8      3      12;4;4
# d$z <- z                3,2    1,1    3,1    2,1    7;4;4,1
# d[["z"]] <- z           4,3    1,1    3,1    2,1    7;4;4,1
# d[, "z"] <- z           6,4,2  2,2,1  4,2,2  3,2,1  8;4;4,2,2
# d["z"] <- z             6,5,2  2,2,1  4,2,2  3,2,1  8;4;4,2,2
# d["z"] <- list(z=z)     6,3,2  2,2,1  4,2,2  3,2,1  8;4;4,2,2
# d["z"] <- Z #list(z=z)  6,2,2  2,1,1  4,1,2  3,1,1  8;4;4,1,2
# a <- d["y"]             2      1      2      1      6;4;4
# a <- d[, "y", drop=F]   2      1      2      1      6;4;4
# Where two numbers are given, they refer to:
#   (copies of the old data frame), (copies of the new column)
# A third number refers to the number of
#   (copies made of an integer vector of row names)
# For R 3.0.0, I'm getting astounding results - many more copies,
# and also some copies of larger objects; in addition to the data
# vectors of size 80K and 160K, also 240K and 320K.
# Where three numbers are given in form a;c;d, they refer to
#   (copies of 80K; 240K; 320K)

The benchmarks are at http://www.timhesterberg.net/r-packages/memory.R

I'm using versions of R I installed from source on a Linux box, using e.g.
  ./configure --prefix=(my path) --enable-memory-profiling --with-readline=no
  make
  make install
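Copies of an individual vector can also be watched with tracemem(), on builds where it is enabled (a sketch of the kind of check behind these counts; the try() guards against builds compiled without memory profiling):

```r
y <- as.numeric(1:10^4)
invisible(try(tracemem(y), silent = TRUE))   # print a line each time y is duplicated
d <- as.data.frame(y)                        # each "tracemem[...]" line is one copy
invisible(try(untracemem(y), silent = TRUE))
```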
Re: [Rd] R 3.0.0 memory use
When I change the data set size, the extra allocations do not change in size. This supports Luke and Martin's diagnosis.

The extra allocations are either 2 or 4 allocations each of size
  80040
  240048
  320040

Details (you may skip):

(Fresh session of R 3.0.0)
> y <- 1:10^4 + 0.0
> Rprofmem("temp.out", threshold = 10^4)
> d <- as.data.frame(y)
> Rprofmem(NULL); system("cat temp.out")
320040 :80040 :240048 :320040 :80040 :240048 :80040 :"as.data.frame.numeric" "as.data.frame"
320040 :80040 :240048 :320040 :80040 :240048 :

> # Try increasing size by a factor of 10
> y <- 1:10^5 + 0.0
> Rprofmem("temp.out", threshold = 10^4)
> d <- as.data.frame(y)
> Rprofmem(NULL); system("cat temp.out")
320040 :80040 :240048 :320040 :80040 :240048 :800040 :"as.data.frame.numeric" "as.data.frame"
320040 :80040 :240048 :320040 :80040 :240048 :

The number of allocations shown, of different sizes:

         3.0.0   3.0.0    2.15.3   2.15.3
         first   second   first    second
240048   4       4        0        0
320040   4       4        0        0
80040    5       4        1        0
800040   0       1        0        1

So it looks like both R 2.15.3 and R 3.0.0 are making one copy of the data, plus extra allocations.

(Fresh session of R 2.15.3)
> y <- 1:10^4 + 0.0
> Rprofmem("temp.out", threshold = 10^4)
> d <- as.data.frame(y)
> Rprofmem(NULL); system("cat temp.out")
80040 :"as.data.frame.numeric" "as.data.frame"

> # Increase size by factor of 10
> y <- 1:10^5 + 0.0
> Rprofmem("temp.out", threshold = 10^4)
> d <- as.data.frame(y)
> Rprofmem(NULL); system("cat temp.out")
800040 :"as.data.frame.numeric" "as.data.frame"

On Sun, 14 Apr 2013 19:15:45 -0700 Martin Morgan mtmor...@fhcrc.org wrote:
On 04/14/2013 07:11 PM, luke-tier...@uiowa.edu wrote:
There were a couple of bug fixes to somewhat obscure compound assignment related bugs that required bumping up internal reference counts. It's possible that one or more of these are responsible. If so it is unavoidable for now, but it's worth finding out for sure. With some stripped down test examples it should be possible to identify when things changed.
I won't have time to look for some time, but if someone else wanted to nail this down that would be useful.

I can't quite tell from Tim's script what he's documenting. In R-2.15.3 I have

> Rprofmem(); Rprofmem(NULL); readLines("Rprofmem.out", warn=FALSE)
character(0)

(or sometimes [1] "new page:new page:\"Rprofmem\"") whereas in R-3.0.0

> Rprofmem(); Rprofmem(NULL); readLines("Rprofmem.out", warn=FALSE)
[1] "320040 :80040 :240048 :320040 :80040 :240048 :"

I think these are the allocations Tim is seeing. They're from the parser (see below) rather than as.data.frame. For Tim's example

> y <- 1:10^4 + 0.0
> Rprofmem(); d <- as.data.frame(y); Rprofmem(NULL); readLines("Rprofmem.out")
[1] "320040 :80040 :240048 :320040 :80040 :240048 :80040 :\"as.data.frame.numeric\" \"as.data.frame\""
[2] "320040 :80040 :240048 :320040 :80040 :240048 :"

only the allocation 80040 is from as.data.frame (from the call stack output). Under R -d gdb

(gdb) b R_OutputStackTrace
(gdb) r
> Rprofmem(); Rprofmem(NULL)
Breakpoint 1, R_OutputStackTrace (file=0xbd43f0) at /home/mtmorgan/src/R-3-0-branch/src/main/memory.c:3434
3434    {
(gdb) bt
#0  R_OutputStackTrace (file=0xbd43f0) at /home/mtmorgan/src/R-3-0-branch/src/main/memory.c:3434
#1  0x7792ff83 in R_ReportAllocation (size=320040) at /home/mtmorgan/src/R-3-0-branch/src/main/memory.c:3456
#2  Rf_allocVector (type=13, length=8) at /home/mtmorgan/src/R-3-0-branch/src/main/memory.c:2478
#3  0x7790bedf in growData () at gram.y:3391

and the memory allocations are from these lines in the parser gram.y

PROTECT( bigger = allocVector( INTSXP, data_size * DATA_ROWS ) ) ;
PROTECT( biggertext = allocVector( STRSXP, data_size ) );

I'm not sure why these show up under R 3.0.0, though.
$ R-2-15-branch/bin/R --version
R version 2.15.3 Patched (2013-03-13 r62579) -- "Security Blanket"
Copyright (C) 2013 The R Foundation for Statistical Computing
ISBN 3-900051-07-0
Platform: x86_64-unknown-linux-gnu (64-bit)

R-3-0-branch$ bin/R --version
R version 3.0.0 Patched (2013-04-14 r62579) -- "Masked Marvel"
Copyright (C) 2013 The R Foundation for Statistical Computing
Platform: x86_64-unknown-linux-gnu (64-bit)

Martin

Best,
luke

On Sun, 14 Apr 2013, Tim Hesterberg wrote:
I did some benchmarking of data frame code, and it appears that R 3.0.0 is far worse than earlier versions of R in terms of how many large objects it allocates space for, for data frame operations - creation, subscripting, subscript replacement.
Re: [Rd] Suggest adding a 'pivot' argument to qr.R
On Sep 11, 2012, at 16:02, Warnes, Gregory wrote:
On 9/7/12 2:42 PM, peter dalgaard pda...@gmail.com wrote:
On Sep 7, 2012, at 17:16, Tim Hesterberg wrote:

I suggest adding a 'pivot' argument to qr.R, to obtain columns in the same order as the original x, so that

a <- qr(x)
qr.Q(a) %*% qr.R(a, pivot=TRUE)

returns x.

That would come spiraling down in flames the first time someone tried to use backsolve on it, wouldn't it? I mean, a major point of QR is that R is triangular; it doesn't make much sense to permute the columns without retaining the pivoting permutation.

As I understand Tim's proposal, the pivot argument defaults to FALSE, so the new behavior would only be activated at the user's request.

Sure. I'm just saying that I see little use for the un-pivoted qr.R because, generically, the first thing you want to do with qr.R is to invert it, which is easier when it is triangular.

Greg Warnes is correct; I propose keeping the default FALSE, for backward compatibility.

My use for the pivoted R is in computing a covariance matrix, using

R <- qr.R(QR, pivot = TRUE)
Rinv <- ginverse(R)
covTerm <- Rinv %*% t(Rinv)

But I see that lm() and glm() use chol2inv; that may be preferable.
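The pivot bookkeeping behind this proposal can be sketched with a small singular matrix, where pivoting actually occurs; with the current qr.R, it is X[, pivot], not X, that equals QR:

```r
x <- cbind(int = 1, b1 = rep(1:0, each = 3), b2 = rep(0:1, each = 3),
           c1 = rep(c(1, 0, 0), 2), c2 = rep(c(0, 1, 0), 2),
           c3 = rep(c(0, 0, 1), 2))      # singular: b2 and c3 are redundant
a <- qr(x)
R <- qr.R(a)                             # columns come back in pivoted order
reconstructed <- qr.Q(a) %*% R
all.equal(x[, a$pivot], reconstructed)   # pivoted columns of x match QR
```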
[Rd] Need to tell R CMD check that a function qr.R is not a method
When creating a package, I would like a way to tell R that a function with a period in its name is not a method. I'm writing a package now with a modified version of qr.R. R CMD check gives warnings:

* checking S3 generic/method consistency ... WARNING
qr:
  function(x, ...)
qr.R:
  function(qr, complete, pivot)
See section 'Generic functions and methods' of the 'Writing R Extensions' manual.

* checking Rd \usage sections ... NOTE
S3 methods shown with full name in documentation object 'QR.Auxiliaries':
  qr.R
The \usage entries for S3 methods should use the \method markup and not their full name.
See the chapter 'Writing R documentation files' in the 'Writing R Extensions' manual.
[Rd] suggest that as.double( something double ) not make a copy
I've been playing with passing arguments to .C(), and found that replacing as.double(x) with

if (is.double(x)) x else as.double(x)

saves time and avoids one copy, in the case that x is already double. I suggest modifying as.double to avoid the extra copy and just return x when x is already double. Similarly for as.integer, etc.
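The conditional form can be wrapped as a small helper (asDouble is a hypothetical name, not an existing function):

```r
asDouble <- function(x) if (is.double(x)) x else as.double(x)

x <- rnorm(10)
identical(asDouble(x), x)   # TRUE: already double, returned unchanged
is.double(asDouble(1:10))   # TRUE: integer input is converted
```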
[Rd] Add DUP = FALSE when tabulate() calls .C("R_tabulate"
In base/R/tabulate.R, tabulate() calls .C("R_tabulate"; I suggest adding DUP = FALSE to that call.
Re: [Rd] Suggested improvement for src/library/base/man/qraux.Rd
Here is a modified version of qraux.Rd, an edited version of R-2.14.0/src/library/base/man/qraux.Rd. This gives some details and an example for the case of pivoting. In this case, it is not true that X = QR; rather X[, pivot] = QR. It may save some other people bugs and time to have this information.

Tim Hesterberg
--
% File src/library/base/man/qraux.Rd
% Part of the R package, http://www.R-project.org
% Copyright 1995-2007 R Core Development Team
% Distributed under GPL 2 or later
\name{QR.Auxiliaries}
\title{Reconstruct the Q, R, or X Matrices from a QR Object}
\usage{
qr.X(qr, complete = FALSE, ncol =)
qr.Q(qr, complete = FALSE, Dvec =)
qr.R(qr, complete = FALSE)
}
\alias{qr.X}
\alias{qr.Q}
\alias{qr.R}
\arguments{
\item{qr}{object representing a QR decomposition. This will typically have come from a previous call to \code{\link{qr}} or \code{\link{lsfit}}.}
\item{complete}{logical expression of length 1. Indicates whether an arbitrary orthogonal completion of the \eqn{\bold{Q}} or \eqn{\bold{X}} matrices is to be made, or whether the \eqn{\bold{R}} matrix is to be completed by binding zero-value rows beneath the square upper triangle.}
\item{ncol}{integer in the range \code{1:nrow(qr$qr)}. The number of columns to be in the reconstructed \eqn{\bold{X}}. The default when \code{complete} is \code{FALSE} is the first \code{min(ncol(X), nrow(X))} columns of the original \eqn{\bold{X}} from which the qr object was constructed. The default when \code{complete} is \code{TRUE} is a square matrix with the original \eqn{\bold{X}} in the first \code{ncol(X)} columns and an arbitrary orthogonal completion (unitary completion in the complex case) in the remaining columns.}
\item{Dvec}{vector (not matrix) of diagonal values. Each column of the returned \eqn{\bold{Q}} will be multiplied by the corresponding diagonal value.
Defaults to all \code{1}s.}
}
\description{
Returns the original matrix from which the object was constructed or the components of the decomposition.
}
\value{
\code{qr.X} returns \eqn{\bold{X}}, the original matrix from which the qr object was constructed, provided \code{ncol(X) <= nrow(X)}. If \code{complete} is \code{TRUE} or the argument \code{ncol} is greater than \code{ncol(X)}, additional columns from an arbitrary orthogonal (unitary) completion of \code{X} are returned.

\code{qr.Q} returns part or all of \bold{Q}, the order-nrow(X) orthogonal (unitary) transformation represented by \code{qr}. If \code{complete} is \code{TRUE}, \bold{Q} has \code{nrow(X)} columns. If \code{complete} is \code{FALSE}, \bold{Q} has \code{ncol(X)} columns. When \code{Dvec} is specified, each column of \bold{Q} is multiplied by the corresponding value in \code{Dvec}.

\code{qr.R} returns \bold{R}. This may be pivoted, e.g. if \code{a <- qr(x)} then \code{x[, a$pivot]} = \bold{QR}. The number of rows of \bold{R} is either \code{nrow(X)} or \code{ncol(X)} (and may depend on whether \code{complete} is \code{TRUE} or \code{FALSE}).
}
\seealso{
\code{\link{qr}}, \code{\link{qr.qy}}.
}
\examples{
p <- ncol(x <- LifeCycleSavings[, -1]) # not the 'sr'
qrstr <- qr(x)   # dim(x) == c(n,p)
qrstr$rank       # = 4 = p
Q <- qr.Q(qrstr) # dim(Q) == dim(x)
R <- qr.R(qrstr) # dim(R) == ncol(x)
X <- qr.X(qrstr) # X == x
range(X - as.matrix(x)) # ~ 6e-12
## X == Q \%*\% R if there has been no pivoting, as here.
Q \%*\% R

# example of pivoting
x <- cbind(int=1, b1=rep(1:0, each=3), b2=rep(0:1, each=3),
           c1=rep(c(1,0,0), 2), c2=rep(c(0,1,0), 2), c3=rep(c(0,0,1), 2))
# singular, columns b2 and c3 are extra
a <- qr(x)
qr.R(a) # columns are int b1 c1 c2 b2 c3
a$pivot
all.equal(x, qr.Q(a) \%*\% qr.R(a))            # no
all.equal(x[, a$pivot], qr.Q(a) \%*\% qr.R(a)) # yes
}
\keyword{algebra}
\keyword{array}
Re: [Rd] median and data frames
I also favor deprecating mean.data.frame. One possible exception would be for a single-column data frame. But even here I'd say no, lest people expect the same behavior for median, var, ...

Pat's suggestion of using stop() would work nicely for mean (but omit paste - stop handles that).

Tim Hesterberg

If Martin's proposal is accepted, does that mean that the median method for data frames would be something like:

function (x, ...) {
    stop(paste("you probably mean to use the command: sapply(",
               deparse(substitute(x)), ", median)", sep=""))
}

Pat

On 29/04/2011 15:25, Martin Maechler wrote:
>>> Paul Johnson pauljoh...@gmail.com on Thu, 28 Apr 2011 00:20:27 -0500 writes:
On Wed, Apr 27, 2011 at 12:44 PM, Patrick Burns pbu...@pburns.seanet.com wrote:

Here are some data frames:

df3.2 <- data.frame(1:3, 7:9)
df4.2 <- data.frame(1:4, 7:10)
df3.3 <- data.frame(1:3, 7:9, 10:12)
df4.3 <- data.frame(1:4, 7:10, 10:13)
df3.4 <- data.frame(1:3, 7:9, 10:12, 15:17)
df4.4 <- data.frame(1:4, 7:10, 10:13, 15:18)

Now here are some commands and their answers:

> median(df4.4)
[1]  8.5 11.5
> median(df3.2[c(1,2,3),])
[1] 2 8
> median(df3.2[c(1,3,2),])
[1]  2 NA
Warning message:
In mean.default(X[[2L]], ...) : argument is not numeric or logical: returning NA

The sessionInfo is below, but it looks to me like the present behavior started in 2.10.0. Sometimes it gets the right answer. I'd be grateful to hear how it does that -- I can't figure it out.

Hello, Pat. Nice poetry there! I think I have an actual answer, as opposed to the usual crap I spew. I would agree if you said median.data.frame ought to be written to work columnwise, similar to mean.data.frame. apply and sapply always give the correct answer:

> apply(df3.3, 2, median)
  X1.3   X7.9 X10.12
     2      8     11

[...]

exactly. mean.data.frame is now implemented as

mean.data.frame <- function(x, ...) sapply(x, mean, ...)

exactly. My personal opinion is that mean.data.frame() should never have been written. People should know, or learn, to use apply functions for such a task.
The unfortunate fact that mean.data.frame() exists makes people think that median.data.frame() should too, and then

var.data.frame()
sd.data.frame()
mad.data.frame()
min.data.frame()
max.data.frame()
... ...

all just in order *not* to have to know sapply(). No, rather not. My vote is for deprecating mean.data.frame().

Martin

--
Patrick Burns
pbu...@pburns.seanet.com
twitter: @portfolioprobe
http://www.portfolioprobe.com/blog
http://www.burns-stat.com
(home of 'Some hints for the R beginner' and 'The R Inferno')
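The sapply() idiom the thread recommends, applied to one of Pat's data frames:

```r
df3.2 <- data.frame(1:3, 7:9)
sapply(df3.2, median)   # columnwise medians, with column names attached
```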
Re: [Rd] matrixStats: Extend to arrays too (Was: Re: Suggestion: Adding quick rowMin and rowMax functions to base package)
For consistency with rowSums, colSums, rowMeans, etc., the names should be colMins, colMaxs, rowMins, rowMaxs. This is also consistent with S+.

FYI, the rowSums naming convention was chosen to avoid conflict with rowsum (which computes column sums!).

Tim Hesterberg

A well-designed API generalized to work with arrays should probably borrow ideas from how argument 'MARGIN' of apply() works, and how argument 'dims' of rowSums() works (though I must say the latter seems a bit ad hoc at first sight, given the name of the function). There may also be something to learn from the 'reshape' package and so on.

I'd also recommend looking at plyr::aaply, which fixes a few things that have always annoyed me about apply - namely that it is not idempotent/identical to aperm when the summary function is the identity.

Hadley
--
Assistant Professor / Dobelman Family Junior Chair
Department of Statistics / Rice University
http://had.co.nz/
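Until such functions exist in base, row minima can be computed either with apply() (general, MARGIN-based) or with pmin() across columns (vectorized); a sketch:

```r
m <- matrix(c(3, 1, 4, 1, 5, 9), nrow = 2)
rowMins1 <- apply(m, 1, min)                 # general but loops over rows
rowMins2 <- do.call(pmin, as.data.frame(m))  # vectorized across columns
```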
Re: [Rd] aperm() should retain class of input object
Having aperm() return an object of the same class is dangerous; there are undoubtedly classes for which that is not appropriate, producing an illegal object for that class or quietly giving incorrect results. Three alternatives are to:

* add the keep.class option but with default FALSE
* make aperm a generic function
  - without a keep.class argument
  - with a ... argument
  - methods for classes like table could have keep.class = TRUE
* make aperm a generic function
  - without a keep.class argument
  - with a ... argument
  - default method have keep.class = TRUE

The third option would give the proposed behavior by default, but allow a way out for classes where the behavior is wrong. This puts the burden on a class author to realize the potential problem with aperm, so my preference is one of the first two options.

aperm() was designed for multidimensional arrays, but is also useful for table objects, particularly with the lattice, vcd and vcdExtra packages. But aperm() was designed and implemented before other related object classes were conceived, and I propose a small tune-up to make it more generally useful.

The problem is that aperm() always returns an object of class 'array', which causes problems for methods designed for table objects. It also requires some package writers to implement both .array and .table methods for the same functionality, usually one in terms of the other. Some examples of unexpected, and initially perplexing, results (when only methods for one class are implemented) are shown below.
> library(vcd)
> pairs(UCBAdmissions, shade=TRUE)
> UCB <- aperm(UCBAdmissions, c(2, 1, 3))
> # UCB is now an array, not a table
> pairs(UCB, shade=TRUE)
There were 50 or more warnings (use warnings() to see the first 50)
> # fix it, to get pairs.table
> class(UCB) <- "table"
> pairs(UCB, shade=TRUE)

Of course, I can define a new function, tperm(), that does what I think should be the expected behavior:

# aperm, for table objects
tperm <- function(a, perm, resize = TRUE) {
    result <- aperm(a, perm, resize)
    class(result) <- class(a)
    result
}

But I think it is more natural to include this functionality in aperm() itself. Thus, I propose the following revision of base::aperm(), at the R level:

aperm <- function (a, perm, resize = TRUE, keep.class=TRUE) {
    if (missing(perm)) perm <- integer(0L)
    result <- .Internal(aperm(a, perm, resize))
    if (keep.class) class(result) <- class(a)
    result
}

I don't think this would break any existing code, except where someone depended on coercion to an array. The drop-in replacement for aperm would set keep.class=FALSE by default, but I think TRUE is more natural.

FWIW, here are the methods for table and array objects from my current (non-representative) session:

> methods(class="table")
 [1] as.data.frame.table barchart.table*     cloud.table*        contourplot.table*  dotplot.table*
 [6] head.table*         levelplot.table*    pairs.table*        plot.table*         print.table
[11] summary.table       tail.table*
Non-visible functions are asterisked

> methods(class="array")
[1] anyDuplicated.array as.data.frame.array as.raster.array*    barchart.array*     contourplot.array*  dotplot.array*
[7] duplicated.array    levelplot.array*    unique.array

--
Michael Friendly     Email: friendly AT yorku DOT ca
Professor, Psychology Dept.
York University      Voice: 416 736-5115 x66249 Fax: 416 736-5814
4700 Keele Street    Web: http://www.datavis.ca
Toronto, ONT M3J 1P3 CANADA
Re: [Rd] Using sample() to sample one value from a single value?
On Wed, Nov 3, 2010 at 3:54 PM, Henrik Bengtsson h...@biostat.ucsf.edu wrote:

Hi, consider this one as an FYI, or a seed for further discussion. I am aware that many traps on sample() have been reported over the years. I know that these are also documented in help(sample). Still I got bitten by this while writing ...

All of the above makes sense when one studies the code of sample(), but sample() is indeed dangerous, e.g. imagine how many bootstrap estimates out there quietly get incorrect.

Nonparametric bootstrapping from a sample of size 1 is always incorrect. If you draw a single observation from a sample of size 1, you get that same observation back. This implies zero sampling variability, which is wrong. If this single sample represents one stratum or sample in a larger problem, it would contribute zero variability to the overall result, again wrong.

In general, the ordinary bootstrap underestimates variability in small samples. For a sample mean, the ordinary bootstrap corresponds to using an estimate of variance equal to (1/n) sum((x - mean(x))^2), instead of a divisor of n-1. In stratified and multi-sample applications the downward bias is similarly (n-1)/n. Three remedies are:

* draw bootstrap samples of size n-1
* bootknife sampling - omit one observation (a jackknife sample), then draw a bootstrap sample of size n from that
* bootstrap from a kernel density estimate, with kernel covariance equal to (empirical covariance with divisor n-1) / n.

The latter two are described in Hesterberg, Tim C. (2004), "Unbiasing the Bootstrap - Bootknife Sampling vs. Smoothing", Proceedings of the Section on Statistics and the Environment, American Statistical Association, 2924-2930. http://home.comcast.net/~timhesterberg/articles/JSM04-bootknife.pdf

All three are undefined for samples of size 1. You need to go to some other bootstrap, e.g. a parametric bootstrap with variability estimated from other data.
Tim Hesterberg
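The bootknife remedy described above can be sketched in a few lines (bootknife here is a hypothetical helper name, not the paper's code):

```r
# One bootknife resample: drop one observation at random (a jackknife
# sample), then draw n values with replacement from the remaining n-1.
bootknife <- function(x) {
  n <- length(x)
  sample(x[-sample.int(n, 1)], size = n, replace = TRUE)
}

set.seed(1)
b <- bootknife(rnorm(20))   # length 20, drawn from 19 of the 20 values
```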
[Rd] suggest enhancement to segments and arrows to facilitate horizontal and vertical segments
I suggest a simple enhancement to segments() and arrows() to facilitate drawing horizontal and vertical segments -- set default values for the second x and y arguments equal to the first set. This is handy, especially when the expressions for coordinates are long. Compare:

Segments:
function (x0, y0, x1 = x0, y1 = y0, col = par("fg"), lty = par("lty"),
---
function (x0, y0, x1, y1, col = par("fg"), lty = par("lty"),

Arrows:
function (x0, y0, x1 = x0, y1 = y0, length = 0.25, angle = 30, code = 2,
---
function (x0, y0, x1, y1, length = 0.25, angle = 30, code = 2,

Tim Hesterberg
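Until base changes, the proposed defaults can be had with a thin wrapper (vhsegments is a hypothetical name):

```r
# segments() with the suggested defaults: x1 and y1 fall back to x0 and y0,
# so horizontal and vertical segments need only the coordinate that changes.
vhsegments <- function(x0, y0, x1 = x0, y1 = y0, ...)
  segments(x0, y0, x1, y1, ...)

pdf(NULL); plot.new(); plot.window(c(0, 10), c(0, 10))
vhsegments(1, 5, x1 = 9)   # horizontal segment: y1 defaults to y0 = 5
vhsegments(5, 1, y1 = 9)   # vertical segment:   x1 defaults to x0 = 5
dev.off()
```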
[Rd] Faster as.data.frame: save a copy by doing names(x) <- NULL only if needed
A number of as.data.frame methods do

names(x) <- NULL

Replacing that with

if (!is.null(names(x))) names(x) <- NULL

appears to save making one copy of the data (based on tracemem and Rprofmem in a copy of R compiled with --enable-memory-profiling) and gives a modest but consistent boost in speed, e.g.:

#           old                      new
#           user  system elapsed     user  system elapsed
# integer   3.412 0.060  3.472       2.788 0.020  2.809
# numeric   6.212 0.160  6.374       4.852 0.080  5.132
# logical   3.484 0.052  3.699       2.808 0.028  2.834
# factor    4.433 0.020  4.547       2.929 0.020  2.964

These visible methods can be modified as noted above:

as.data.frame.Date
as.data.frame.POSIXct
as.data.frame.complex
as.data.frame.difftime
as.data.frame.factor
as.data.frame.integer
as.data.frame.logical
as.data.frame.numeric
as.data.frame.numeric_version
as.data.frame.ordered
as.data.frame.raw
as.data.frame.vector

Here's the timing code (run in a copy of R without memory profiling):

x <- 1:10^4                                  # integer
system.time(for(i in 1:10^4) y <- as.data.frame(x), gc=TRUE)
x <- x + 0.0                                 # numeric
system.time(for(i in 1:10^4) y <- as.data.frame(x), gc=TRUE)
x <- rep(c(TRUE,FALSE), length = 10^4)       # logical
system.time(for(i in 1:10^4) y <- as.data.frame(x), gc=TRUE)
x <- factor(rep(letters[1:10], length=10^4)) # factor
system.time(for(i in 1:10^4) y <- as.data.frame(x), gc=TRUE)

I have not done timings where the inputs have names; that is rare in my experience.
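The guarded form generalizes into a tiny helper (stripNames is a hypothetical name):

```r
# Remove names only when present, so already-unnamed vectors
# are returned untouched (avoiding one copy).
stripNames <- function(x) {
  if (!is.null(names(x))) names(x) <- NULL
  x
}

stripNames(c(a = 1, b = 2))   # names dropped
stripNames(1:3)               # no names: returned as-is
```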
Re: [Rd] (PR#8192) [ subscripting sometimes loses names
... Simon, no, the drop=FALSE argument has nothing to do with what Christian was talking about. The kind of thing he meant is PR#8192, Subject: "[ subscripting sometimes loses names": http://bugs.r-project.org/cgi-bin/R/wishlist?id=8192

In R, subscripting with "[" USUALLY retains names, but R has various edge cases where it (IMNSHO) inappropriately discards them. This occurs with both .Primitive("[") and [.data.frame. This has been known for years, but I have not yet tried digging into R's implementation to see where and how the names are actually getting lost. Incidentally, versions of S-Plus since approximately S-Plus 6.0 back in 2001 show similar buggy edge-case behavior. Older versions of S-Plus, c. S-Plus 3.3 and earlier, had the correct, name-preserving behavior. I presume that the original Bell Labs S had correct name-preserving behavior, and that the S-Plus developers broke it sometime along the way.

(Later comments on the thread pointed out the difference between x[,1] for matrices and data frames.)

I rewrote the S-PLUS data frame code around then, to fix various inconsistencies and improve efficiency. This was probably my change, and I would do it again. Note that the components of a data frame do not have names attached to them; the row names are a separate object. Extracting a component vector or matrix from a data frame should not attach names to the result, because of:

* memory (attaching row names to an object can more than double the size of the object),
* speed,
* some objects cannot take names, and attaching them could change the class and other behavior of an object, and
* the names are usually/often (depending on the user) meaningless, artifacts of an early design decision that all data frames have row names.
Data frames differ from matrices in two ways that matter here:

* columns in matrices are all the same kind, and are simple objects (numeric, etc.), whereas components of data frames can be nearly arbitrary objects, and
* row names get added to a data frame whether a user wants them or not, whereas row names on a matrix have to be specified.

A historical note - unique row names on data frames were a design decision made when people worked with small data frames, and they are convenient for small data frames. But they are a problem for large data frames. I was writing for all users, not just those with small data frames and meaningful names.

I like R's 'automatic' row names. This is a big help when working with huge data frames (and I do this often, at Google). But this doesn't go far enough; subscripting and other operations sometimes convert the automatic names to real names, and check/enforce uniqueness, which is a big waste of time when working with large data frames. I'll comment more on this in a new thread.

Tim Hesterberg
[Rd] non-duplicate names in data frames
I wrote on another thread (with subject "[ subscripting sometimes loses names"): I like R's 'automatic' row names. This is a big help working with huge data frames (and I do this often, at Google). But this doesn't go far enough; subscripting and other operations sometimes convert the automatic names to real names, and check/enforce uniqueness, which is a big waste of time when working with large data frames. I'll comment more on this in a new thread.

I propose (and have begun writing, in my copious spare time):
* an optional argument to data.frame and other data frame creation code,
* resulting in an attribute added to the data.frame,
* so that subscripting and other operations on the data frame
* always keep artificial row names, and
* do not have to check for unique row names in the result.

My current thoughts, comments welcome:

Argument name and component name: 'dup.row.names'
  0 or FALSE or NULL - current behavior, require unique names
  1 or TRUE - duplicates allowed (when subscripting etc.)
  2 - always automatic (when subscripting etc.)

Option maxRowNames, default say 10^4. Any data frame with more rows than this has dup.row.names default to 2.

The name 'dup.row.names' is for consistency with S+; there the options are NULL, F or T.

Tim Hesterberg
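For concreteness, here is a hypothetical sketch of how the proposed interface might look from the user's side; none of this is implemented in R, and the argument handling shown is purely illustrative of the proposal above:

```r
# Hypothetical only -- 'dup.row.names' is the *proposed* argument/attribute,
# not an existing feature of data.frame():
# d <- data.frame(a = 1:3, b = 4:6, dup.row.names = 2)
# attr(d, "dup.row.names")   # 2: subscripting always keeps automatic row names
# d[c(1, 1, 2), ]            # no uniqueness check, no make.unique() cost
```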
Re: [Rd] ifelse
Others have commented on why this holds. There is an alternative, 'ifelse1', part of the splus2R package, that does what you'd like here.

Tim Hesterberg

I find it slightly surprising that ifelse(TRUE, character(0), ) returns NA instead of character(0). -- Heikki Kaskelma
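Briefly, the NA arises because ifelse() builds its result from the test vector and then fills in elements taken from 'yes'; with yes = character(0) there is no element to fill in, so the slot stays NA. A quick check (ifelse1 is in the splus2R package, so the second call is shown commented out):

```r
ifelse(TRUE, character(0), "no")      # NA: yes[1] does not exist
# library(splus2R)
# ifelse1(TRUE, character(0), "no")   # character(0): returns 'yes' itself
```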
Re: [Rd] [.data.frame speedup
I made a couple of changes from the previous version:
- don't use functions anyMissing or notSorted (which aren't in base R)
- don't check for the dup.row.names attribute (need to modify other functions before that is useful)

I have not tested this with a wide variety of inputs; I'm assuming that you have some regression tests. Here are the file differences. Let me know if you'd like a different format.

$ diff -c dataframe.R dataframe2.R
*** dataframe.R	Thu Jul  3 15:48:12 2008
--- dataframe2.R	Thu Jul  3 16:36:46 2008
***************
*** 530,535 ****
--- 530,541 ----
      x <- .Call("R_copyDFattr", xx, x, PACKAGE="base")
      oldClass(x) <- attr(x, "row.names") <- NULL
+     # Do not want to check for duplicates if don't need to
+     noDuplicateRowNames <- (is.logical(i) ||
+         length(i) < 2 ||
+         (is.numeric(i) && min(i, 0, na.rm=TRUE) < 0) ||
+         (!any(is.na(i)) && all(i[-length(i)] < i[-1])))
+ 
      if(!missing(j)) { # df[i, j]
          x <- x[j]
          cols <- names(x) # needed for 'drop'
***************
*** 579,592 ****
      ## row names might have NAs.
      if(is.null(rows)) rows <- attr(xx, "row.names")
      rows <- rows[i]
!     if((ina <- any(is.na(rows))) | (dup <- any(duplicated(rows)))) {
!         ## both will coerce integer 'rows' to character:
!         if (!dup && is.character(rows)) dup <- "NA" %in% rows
!         if(ina)
!             rows[is.na(rows)] <- "NA"
!         if(dup)
!             rows <- make.unique(as.character(rows))
!     }
      ## new in 1.8.0 -- might have duplicate columns
      if(any(duplicated(nm <- names(x)))) names(x) <- make.unique(nm)
      if(is.null(rows)) rows <- attr(xx, "row.names")[i]
--- 585,594 ----
      ## row names might have NAs.
      if(is.null(rows)) rows <- attr(xx, "row.names")
      rows <- rows[i]
!     if(any(is.na(rows)))
!         rows[is.na(rows)] <- NA # coerces to integer
!     if(!noDuplicateRowNames && any(duplicated(rows)))
!         rows <- make.unique(as.character(rows)) # coerces to character
      ## new in 1.8.0 -- might have duplicate columns
      if(any(duplicated(nm <- names(x)))) names(x) <- make.unique(nm)
      if(is.null(rows)) rows <- attr(xx, "row.names")[i]

Here's some code for testing, and timings:

# Use:
# R --no-init-file --no-site-file
x <- data.frame(a=1:4, b=2:5)
# Run these commands with the default and new versions of [.data.frame
trace(duplicated)
trace(make.unique)
x[2:1]
x[1]
x[1:2]
x[1:3, ]        # save one call to duplicated(rows)
x[c(T,F,F,T), ] # save one call to duplicated(rows)
x[-1, ]         # save one call to duplicated(rows)
x[-(1:2), ]     # save one call to duplicated(rows)
x[3:1, ]
x[c(1,3,2,4,3), ]
untrace(duplicated)
untrace(make.unique)

# Timings
# Run one of these lines, then everything afterward
n <- 10^5
n <- 10^6
n <- 10^7
y <- data.frame(a=1:n, b=1:n)

i <- 1:n
system.time(temp <- y[i, ])
# n     old    new
# 10^5  .128   .052
# 10^6  .237   .591
# 10^7  3.102  .882

i <- rep(TRUE, n)
system.time(temp <- y[i, ])
# n     old    new
# 10^5  .157   .053
# 10^6  .787   .449
# 10^7  3.799  2.138

i <- -1
system.time(temp <- y[i, ])
# n     old    new
# 10^5  .157   .051
# 10^6  .614   .497
# 10^7  4.163  2.482

i <- rep(1:(n/2), 2) # expect no speedup for this case
system.time(temp <- y[i, ])
# n     old    new
# 10^5  .559   .782
# 10^6  6.066  6.078

# Times shown are the user times reported by system.time
# The time savings are mostly quite substantial in the
# cases I expect a savings.
# I've noticed a lot of variability in results from system.time,
# so I don't view these as very accurate, and I don't worry
# much about the cases where the time appears worse.

On Thu, Jul 3, 2008 at 1:08 PM, Martin Maechler [EMAIL PROTECTED] wrote:

TH == Tim Hesterberg [EMAIL PROTECTED] on Tue, 1 Jul 2008 15:23:53 -0700 writes:

TH There is a bug in the standard version of [.data.frame;
TH it mixes up handling duplicates and NAs when subscripting rows.
TH x <- data.frame(x=1:3, y=2:4, row.names=c("a","b",NA))
TH y <- x[c(2:3, NA),]
TH y
TH It creates a data frame with duplicate rows, but won't print.
and that's a bug, indeed (introduced in R version 2.5.0, when the [.data.frame code was much optimized for speed, with quite some care), and I have committed a fix (and a regression test) to both R-devel and R-patched. Thanks a lot for the bug report, Tim! Now about your newly proposed code: I'm sorry to say that it looks so much different from the source code in https://svn.r-project.org/R/trunk/src/library/base/R/dataframe.R that I don't think we would accept it as a substitute, easily. Could you try to provide a minimal patch against the source code and also a self-contained example
[Rd] [.data.frame speedup
Below is a version of [.data.frame that is faster for subscripting rows of large data frames; it avoids calling duplicated(rows) if there is no need to check for duplicate row names, when:
- i is logical
- attr(x, "dup.row.names") is not NULL (S+ compatibility)
- i is numeric and negative
- i is strictly increasing

"[.data.frame" <- function (x, i, j,
    drop = if (missing(i)) TRUE else length(cols) == 1)
{
    # This version of [.data.frame avoids wasting time enforcing unique
    # row names.
    mdrop <- missing(drop)
    Narg <- nargs() - (!mdrop)
    if (Narg < 3) {
        if (!mdrop) warning("drop argument will be ignored")
        if (missing(i)) return(x)
        if (is.matrix(i)) return(as.matrix(x)[i])
        y <- NextMethod("[")
        cols <- names(y)
        if (!is.null(cols) && any(is.na(cols)))
            stop("undefined columns selected")
        if (any(duplicated(cols))) names(y) <- make.unique(cols)
        return(structure(y, class = oldClass(x),
                         row.names = .row_names_info(x, 0L)))
    }
    if (missing(i)) {
        if (missing(j) && drop && length(x) == 1L)
            return(.subset2(x, 1L))
        y <- if (missing(j)) x else .subset(x, j)
        if (drop && length(y) == 1L) return(.subset2(y, 1L))
        cols <- names(y)
        if (any(is.na(cols))) stop("undefined columns selected")
        if (any(duplicated(cols))) names(y) <- make.unique(cols)
        nrow <- .row_names_info(x, 2L)
        if (drop && !mdrop && nrow == 1L)
            return(structure(y, class = NULL, row.names = NULL))
        else return(structure(y, class = oldClass(x),
                              row.names = .row_names_info(x, 0L)))
    }
    xx <- x
    cols <- names(xx)
    x <- vector("list", length(x))
    x <- .Call("R_copyDFattr", xx, x, PACKAGE = "base")
    oldClass(x) <- attr(x, "row.names") <- NULL
    # Do not want to check for duplicates if don't need to
    noDuplicateRowNames <- (is.logical(i) ||
        (!is.null(attr(x, "dup.row.names"))) ||
        (is.numeric(i) && min(i, 0, na.rm=TRUE) < 0) ||
        (!notSorted(i, strict = TRUE)))
    if (!missing(j)) {
        x <- x[j]
        cols <- names(x)
        if (drop && length(x) == 1L) {
            if (is.character(i)) {
                rows <- attr(xx, "row.names")
                i <- pmatch(i, rows, duplicates.ok = TRUE)
            }
            xj <- .subset2(.subset(xx, j), 1L)
            return(if (length(dim(xj)) != 2L) xj[i]
                   else xj[i, , drop = FALSE])
        }
        if (any(is.na(cols))) stop("undefined columns selected")
        nxx <- structure(seq_along(xx), names = names(xx))
        sxx <- match(nxx[j], seq_along(xx))
    }
    else sxx <- seq_along(x)
    rows <- NULL
    if (is.character(i)) {
        rows <- attr(xx, "row.names")
        i <- pmatch(i, rows, duplicates.ok = TRUE)
    }
    for (j in seq_along(x)) {
        xj <- xx[[sxx[j]]]
        x[[j]] <- if (length(dim(xj)) != 2L) xj[i]
                  else xj[i, , drop = FALSE]
    }
    if (drop) {
        n <- length(x)
        if (n == 1L) return(x[[1L]])
        if (n > 1L) {
            xj <- x[[1L]]
            nrow <- if (length(dim(xj)) == 2L) dim(xj)[1L]
                    else length(xj)
            drop <- !mdrop && nrow == 1L
        }
        else drop <- FALSE
    }
    if (!drop) {
        if (is.null(rows)) rows <- attr(xx, "row.names")
        rows <- rows[i]
        if ((ina <- any(is.na(rows))) |
            (dup <- !noDuplicateRowNames && any(duplicated(rows)))) {
            if (ina) rows[is.na(rows)] <- NA
            if (dup) rows <- make.unique(as.character(rows))
        }
        if (any(duplicated(nm <- names(x)))) names(x) <- make.unique(nm)
        if (is.null(rows)) rows <- attr(xx, "row.names")[i]
        attr(x, "row.names") <- rows
        oldClass(x) <- oldClass(xx)
    }
    x
}
Re: [Rd] [.data.frame speedup
There is a bug in the standard version of [.data.frame; it mixes up handling duplicates and NAs when subscripting rows.

x <- data.frame(x=1:3, y=2:4, row.names=c("a","b",NA))
y <- x[c(2:3, NA),]
y

It creates a data frame with duplicate rows, but won't print. In the previous message I included a version of [.data.frame; it fails for the same example, for a different reason. Here is a fix.

subscript.data.frame <- function (x, i, j,
    drop = if (missing(i)) TRUE else length(cols) == 1)
{
    # This version of [.data.frame avoids wasting time enforcing unique
    # row names if possible.
    mdrop <- missing(drop)
    Narg <- nargs() - (!mdrop)
    if (Narg < 3) {
        if (!mdrop) warning("drop argument will be ignored")
        if (missing(i)) return(x)
        if (is.matrix(i)) return(as.matrix(x)[i])
        y <- NextMethod("[")
        cols <- names(y)
        if (!is.null(cols) && any(is.na(cols)))
            stop("undefined columns selected")
        if (any(duplicated(cols))) names(y) <- make.unique(cols)
        return(structure(y, class = oldClass(x),
                         row.names = .row_names_info(x, 0L)))
    }
    if (missing(i)) {
        if (missing(j) && drop && length(x) == 1L)
            return(.subset2(x, 1L))
        y <- if (missing(j)) x else .subset(x, j)
        if (drop && length(y) == 1L) return(.subset2(y, 1L))
        cols <- names(y)
        if (any(is.na(cols))) stop("undefined columns selected")
        if (any(duplicated(cols))) names(y) <- make.unique(cols)
        nrow <- .row_names_info(x, 2L)
        if (drop && !mdrop && nrow == 1L)
            return(structure(y, class = NULL, row.names = NULL))
        else return(structure(y, class = oldClass(x),
                              row.names = .row_names_info(x, 0L)))
    }
    xx <- x
    cols <- names(xx)
    x <- vector("list", length(x))
    x <- .Call("R_copyDFattr", xx, x, PACKAGE = "base")
    oldClass(x) <- attr(x, "row.names") <- NULL
    # Do not want to check for duplicates if don't need to
    noDuplicateRowNames <- (is.logical(i) ||
        (!is.null(attr(x, "dup.row.names"))) ||
        (is.numeric(i) && min(i, 0, na.rm=TRUE) < 0) ||
        (!anyMissing(i) && !notSorted(i, strict = TRUE)))
    if (!missing(j)) {
        x <- x[j]
        cols <- names(x)
        if (drop && length(x) == 1L) {
            if (is.character(i)) {
                rows <- attr(xx, "row.names")
                i <- pmatch(i, rows, duplicates.ok = TRUE)
            }
            xj <- .subset2(.subset(xx, j), 1L)
            return(if (length(dim(xj)) != 2L) xj[i]
                   else xj[i, , drop = FALSE])
        }
        if (any(is.na(cols))) stop("undefined columns selected")
        nxx <- structure(seq_along(xx), names = names(xx))
        sxx <- match(nxx[j], seq_along(xx))
    }
    else sxx <- seq_along(x)
    rows <- NULL
    if (is.character(i)) {
        rows <- attr(xx, "row.names")
        i <- pmatch(i, rows, duplicates.ok = TRUE)
    }
    for (j in seq_along(x)) {
        xj <- xx[[sxx[j]]]
        x[[j]] <- if (length(dim(xj)) != 2L) xj[i]
                  else xj[i, , drop = FALSE]
    }
    if (drop) {
        n <- length(x)
        if (n == 1L) return(x[[1L]])
        if (n > 1L) {
            xj <- x[[1L]]
            nrow <- if (length(dim(xj)) == 2L) dim(xj)[1L]
                    else length(xj)
            drop <- !mdrop && nrow == 1L
        }
        else drop <- FALSE
    }
    if (!drop) {
        if (is.null(rows)) rows <- attr(xx, "row.names")
        rows <- rows[i]
        if(any(is.na(rows)))
            rows[is.na(rows)] <- NA
        if(!noDuplicateRowNames && any(duplicated(rows)))
            rows <- make.unique(as.character(rows))
        if (any(duplicated(nm <- names(x)))) names(x) <- make.unique(nm)
        if (is.null(rows)) rows <- attr(xx, "row.names")[i]
        attr(x, "row.names") <- rows
        oldClass(x) <- oldClass(xx)
    }
    x
}

# That requires anyMissing from the splus2R package,
# plus notSorted (or a version of is.unsorted with argument 'strict' added).

notSorted <- function(x, decreasing = FALSE, strict = FALSE, na.rm = FALSE){
    # return TRUE if x is not sorted
    # If decreasing=FALSE, check for sort in increasing order
    # If strict=TRUE, ties correspond to not being sorted
    n <- length(x)
    if(length(n) < 2) return(FALSE)
    if(!is.atomic(x) || (!na.rm && any(is.na(x)))) return(NA)
    if(na.rm && any(ii <- is.na(x))) x <- x[!ii]
    if(decreasing){
        ifelse1(strict, any(x[-1] >= x[-n]), any(x[-1] > x[-n]))
    } else { # check for sort in increasing order
        ifelse1(strict, any(x[-1] <= x[-n]), any(x[-1] < x[-n]))
    }
}

On Tue, Jul 1, 2008 at 11:20 AM, Tim Hesterberg [EMAIL PROTECTED] wrote: Below is a version of [.data.frame that is faster for subscripting rows of large data frames; it avoids calling duplicated(rows) if there is no need to check for duplicate row names, when: i is logical; attr(x, "dup.row.names") is not NULL (S+ compatibility); i is numeric and negative; i is strictly increasing
Re: [Rd] [.data.frame speedup
Here is a revised version of notSorted; it changes the argument order (to be more like is.unsorted) and fixes a blunder.

notSorted <- function(x, na.rm = FALSE, decreasing = FALSE, strict = FALSE){
    # return TRUE if x is not sorted
    # If decreasing=FALSE, check for sort in increasing order
    # If strict=TRUE, ties correspond to not being sorted
    n <- length(x)
    if(n < 2) return(FALSE)
    if(!is.atomic(x) || (!na.rm && any(is.na(x)))) return(NA)
    if(na.rm && any(ii <- is.na(x))){
        x <- x[!ii]
        n <- length(x)
    }
    if(decreasing){
        ifelse1(strict, any(x[-1] >= x[-n]), any(x[-1] > x[-n]))
    } else { # check for sort in increasing order
        ifelse1(strict, any(x[-1] <= x[-n]), any(x[-1] < x[-n]))
    }
}

On Tue, Jul 1, 2008 at 3:23 PM, Tim Hesterberg [EMAIL PROTECTED] wrote: There is a bug in the standard version of [.data.frame; it mixes up handling duplicates and NAs when subscripting rows.

x <- data.frame(x=1:3, y=2:4, row.names=c("a","b",NA))
y <- x[c(2:3, NA),]
y

It creates a data frame with duplicate rows, but won't print. In the previous message I included a version of [.data.frame; it fails for the same example, for a different reason. Here is a fix.

(subscript.data.frame and its supporting comments are quoted verbatim from the previous message and omitted here; the first version of notSorted is omitted)

On Tue, Jul 1, 2008 at 11:20 AM, Tim Hesterberg [EMAIL PROTECTED] wrote: Below is a version of [.data.frame that is faster for subscripting rows of large data frames; it avoids calling
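The revised notSorted() can be exercised as follows; since it calls splus2R's ifelse1(), a minimal stand-in for the two-branch case (my addition, not from the thread) is included so the snippet is self-contained:

```r
# Minimal stand-in for splus2R::ifelse1, two-branch case only:
ifelse1 <- function(test, x, y) if (test) x else y

# assuming notSorted() as defined in the message above:
notSorted(c(1, 2, 3))                     # FALSE: sorted increasing
notSorted(c(1, 3, 2))                     # TRUE
notSorted(c(1, 1, 2), strict = TRUE)      # TRUE: ties count as not sorted
notSorted(c(3, 2, 1), decreasing = TRUE)  # FALSE: sorted decreasing
```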
Re: [Rd] (PR#11537) help (using ?) does not handle trailing whitespace
By whitespace, I mean either a space or tab (preceding the newline). I'm using ESS: ess-version's value is 5.3.6, GNU Emacs 21.4.1 (i486-pc-linux-gnu, X toolkit, Xaw3d scroll bars) of 2007-08-28 on terranova, modified by Debian. I have the following in my .emacs:

(load "ess-5.3.6/lisp/ess-site")
(setq ess-tab-always-indent nil)
(setq ess-fancy-comments nil)

I have not edited ess-site.el.

On Fri, May 30, 2008 at 12:26 PM, Prof Brian Ripley [EMAIL PROTECTED] wrote: We don't know how to reproduce this: 'whitespace' is not specific enough. R's tokenizer breaks input at spaces, so a space would never be part of that expression. And tabs don't even get to the parser in interactive use, and you cannot mean a newline. So exactly what do you mean by 'whitespace'? The character in your email as received here is an ASCII space, and that is used to end the token on all my systems. That's not to say that you didn't type something else that looks like a space (e.g. a nbspace) since email systems are fickle. None of my guesses worked, so we need precise reproduction instructions.

On Thu, 29 May 2008, [EMAIL PROTECTED] wrote: ?agrep results in: No documentation for 'agrep ' in specified packages and libraries: you could try 'help.search("agrep ")'. There is white space after agrep that ? doesn't ignore.
--please do not edit the information below--
Version:
platform = i486-pc-linux-gnu
arch = i486
os = linux-gnu
system = i486, linux-gnu
status =
major = 2
minor = 7.0
year = 2008
month = 04
day = 22
svn rev = 45424
language = R
version.string = R version 2.7.0 (2008-04-22)

Locale: LC_CTYPE=en_US;LC_NUMERIC=C;LC_TIME=C;LC_COLLATE=C;LC_MONETARY=C;LC_MESSAGES=en_US;LC_PAPER=en_US;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US;LC_IDENTIFICATION=C

Search Path: .GlobalEnv, package:stats, package:graphics, package:grDevices, package:utils, package:datasets, package:showStructure, package:Rcode, package:splus2R, package:methods, Autoloads, package:base

-- Brian D. Ripley, [EMAIL PROTECTED]
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK, Fax: +44 1865 272595
Re: [Rd] Standard method for S4 object
It depends on what the object is to be used for. If you want users to be able to operate with the object as if it were a normal vector, to do things like mean(x), cos(x), etc., then the list would be very long indeed; for example, there are 225 methods for the S4 'bdVector' class (in S-PLUS), plus additional methods defined for inheriting classes. In cases like this you might prefer using an S3 class, using attributes rather than slots for auxiliary information, so that you don't need to write so many methods.

Tim Hesterberg

I am defining a new class. Shortly, I will submit a package with it. Before that, I would like to know if there is a kind of non-official list of what methods a new S4 object should have. More precisely, personally, I use 'print', 'summary' and 'plot' a lot. So for my new class, I define these 3 methods and, of course, a get and a set for each slot. What else? Are there other methods that an R user can reasonably expect? Some minimum basic tools... Thanks Christophe
Re: [Rd] Standard method for S4 object
Tim Hesterberg wrote: It depends on what the object is to be used for. If you want users to be able to operate with the object as if it were a normal vector, to do things like mean(x), cos(x), etc. then the list would be very long indeed; for example, there are 225 methods for the S4 'bdVector' class (in S-PLUS), plus additional methods defined for inheriting classes.

This somehow undermines the whole idea of inheritance. If you do not inherit, then you are just implementing a class that mimics another one from scratch. However, the question then is not about standard methods any more; it's about the methods of the class that you mimic.

My experience with S4 classes is primarily with classes that had to be implemented from scratch; there was nothing one could inherit from - bdFrame and bdVector in library(bigdata), miVariable in library(missing) (sorry, these are S-PLUS only). Actually, for miVariable we considered an S3 class + attributes, but in this case we decided that we did NOT want operations like mean(x) to work without going through a method specifically for the class.

... example of Image class omitted here

In cases like this you might prefer using an S3 class, using attributes rather than slots for auxiliary information, so that you don't need to write so many methods.

The reasoning here is not really clear. Could you please explain why this is better?

Three examples. First is bs:

library(splines)
bsx <- bs(1:99, knots = 10 * 2:6)
showStructure(bsx)
  numeric[99,8]  S3 class: bs basis
  attributes: dimnames
  degree          scalar               class: integer
  knots           numeric[ length 5]   class: numeric
  Boundary.knots  numeric[ length 2]   class: integer
  intercept       logical[ length 1]   class: logical

(I plan to add showStructure to library(splus2R) shortly.)

This is an S3 class, a matrix plus some additional attributes. Everything that works for a matrix works for this object, without needing additional classes.
A second example is label:

library(Hmisc)
age <- c(21,65,43)
label(age) <- "Age in Years"
showStructure(age)
  numeric[ length 3]  S3 class: labelled
  label  character[ length 1]  class: character

cos(age)
Age in Years
[1] -0.5477293 -0.5624539  0.5551133

Another S3 class, basically any object plus a label attribute. There are a few methods for this class; otherwise it works out of the box.

The third is lm - a list with an S3 class. Functions that operate on lists work fine without extra methods. And you can add extra components without needing to define a new class (I've done this in library(resample)).
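A minimal sketch of the S3 pattern being advocated in these examples - an ordinary vector plus an attribute and a class, so existing numeric functions keep working; the class name 'measured' and attribute name 'units' are illustrative choices of mine, not from any package:

```r
age <- c(21, 65, 43)
attr(age, "units") <- "years"             # auxiliary info as an attribute
class(age) <- c("measured", class(age))   # illustrative S3 class

mean(age)   # 43: numeric methods still work, no new method needed

# Add behavior only where you want it, e.g. a print method:
print.measured <- function(x, ...) {
    cat("Measurement in", attr(x, "units"), "\n")
    print(as.vector(x), ...)   # as.vector drops the extra attributes
    invisible(x)
}
```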
Re: [Rd] Standard method for S4 object
Hi Oleg, If there is a class to inherit from, then my point about an S4 class requiring lots of methods is moot. I think it would come down then to whether one prefers flexibility (advantage S3) or a definite structure for use with C/C++ (advantage S4). Tim

well, I am not arguing that there are situations when one needs to rewrite everything from scratch. However, it is always worth at least considering inheritance if there is a candidate to inherit from. It saves a lot of work. Anyway, your examples of S3 class usage are obviously valid in the sense that they are indeed S3 methods providing the desired functionality. However, I still do not see WHY using attributes with S3 is better than slots and S4 for structures like those inherited from 'array' or similar. S3 gives more freedom in assigning new attributes, but this freedom also means that one has little control over the structure of an object, making it, for example, more difficult to use with C/C++ code. Are there any specific benefits in not using S4 and slots (apart from some known performance issues)?
[Rd] Sampling with unequal probabilities
This is in followup to a thread on R-help with subject "Sampling". I claim that R does the wrong thing by default when sampling with unequal probabilities without replacement - the selection probabilities are not proportional to 'prob' for any draw after the first. I suggest that R do what S-PLUS now does (though you're free to choose a better implementation). What S-PLUS now does is:

If 'replace==TRUE' then sample with replacement. Otherwise, sample without replacement or with minimal replacement, according to the value of argument 'minimal'. The default is 'minimal = length(prob) > 1'. One can specify 'minimal = FALSE' for backward compatibility. In the case of sampling with minimal replacement, duplicates may occur whenever 'max(size*prob) > 1', and are guaranteed if 'max(size*prob) >= 2'. You can think of drawing 'trunc(size*prob)' observations deterministically, then drawing the remaining 'size - sum(trunc(size*prob))' observations without replacement, with an adjusted prob vector.

The algorithm I used is relatively simple. It is one of the Brewer and Hanif algorithms (though I don't recall if they used the final random shuffle). Here's one description, or you may prefer the description in Pedro J. Saavedra (2005) "Comparison of Two Weighting Schemes for Sampling with Minimal Replacement" http://www.amstat.org/Sections/Srms/Proceedings/y2005/Files/JSM2005-000882.pdf

In the case of minimal = TRUE (sampling with minimal replacement), with unequal probabilities:
* scale prob to sum to 1,
* randomly sort the observations along with prob,
* let cprob = cumsum(prob),
* draw a systematic sample of size 'size' in (0,1): uniformVector <- (1:size - runif(1))/size
* observation i is selected if cprob[i-1] < uniformVector[j] <= cprob[i] for any j; in the case size*max(prob) > 1, the number of times the observation is selected is the number of j's for which the inequalities hold,
* the selected observations are randomly sorted again.
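The bulleted algorithm can be sketched in a few lines of R. This is my illustration of the description above (the function name is mine), not the actual S-PLUS implementation:

```r
sampleMinimalReplacement <- function(n, size, prob) {
  # Sampling with minimal replacement, per the steps described above.
  prob <- prob / sum(prob)                # scale prob to sum to 1
  ord <- sample(n)                        # randomly sort the observations
  cprob <- cumsum(prob[ord])              # cumulative probabilities
  u <- (1:size - runif(1)) / size         # systematic sample in (0, 1)
  # observation i is selected once for each u[j] with cprob[i-1] < u[j] <= cprob[i];
  # counting the cprob values strictly below u[j] finds that i
  idx <- vapply(u, function(uj) sum(cprob < uj), integer(1)) + 1L
  chosen <- ord[idx]
  chosen[sample.int(length(chosen))]      # random final shuffle
}

# Expected counts are size*prob; an observation with size*prob >= 2
# is guaranteed to repeat:
table(sampleMinimalReplacement(4, 3, c(0.7, 0.1, 0.1, 0.1)))
```

The explicit `sample.int(length(chosen))` shuffle avoids `sample()`'s surprise with a length-one vector, where `sample(x)` would be read as `sample(1:x)`.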
Tim Hesterberg

Disclaimer - these are my opinions, not Insightful's.
Re: [Rd] S3 vs S4 for a simple package
Would you like existing functions such as mean, range, sum, colSums, dim, apply, length, and many more to operate on the array of numbers? If so, use an S3 class. If you would like to effectively disable such functions, to prevent them from working on the object unless you write a method that specifies exactly how the function should operate on the class, then either use an S4 class, or an S3 class where the array is one component of a list. An S3 class also allows for flexibility - you can add attributes, or list components, without breaking things.

As for reassurance - I use S3 classes for almost everything, happily. The one time I chose to use an S4 class I later regretted it. This was for objects containing multiple imputations, where I wanted to prevent functions like mean() from working on the original data without filling in imputations. The regret was because we later realized that in some cases we wanted to add a call attribute or component/slot so that update() would work. If it had been an S3 object we could have done so, but as an S4 object we would have broken existing objects of the class.

Tim Hesterberg

Disclaimer - this is my personal opinion, not my employer's.

I am writing a package and need to decide whether to use S3 or S4. I have a single class, multipol; this needs methods for [ and [<- and I also need a print (or show) method and methods for the arithmetic operations + - * / ^. In S4, an object of class multipol has one slot that holds an array. Objects of class multipol require specific arithmetic operations; a,b being multipols means that a+b and a*b are defined in peculiar ways that make sense in the context of the package. I can also add and multiply by scalars (vectors of length one). My impression is that S3 is perfectly adequate for this task, although I've not yet finalized the coding. S4 seems to be overkill for such a simple system. Can anyone give me some motivation for persisting with S4? Or indeed reassure me that S3 is a good design decision?
Re: [Rd] hasNA() / anyNA()?
S-PLUS has an anyMissing() function, for which the default method is:

anyMissing.default <- function(x){
    (length(which.na(x)) > 0)
}

This is more efficient than any(is.na(x)) in the usual case that there are few or no missing values. There are methods for vectors that drop to C code, and methods for data frames and other classes. The code below seems to presume a list, and would be very slow for vectors. For reasons of consistency between S-PLUS and R, I would ask that an R function be called anyMissing rather than hasNA or anyNA.

Tim Hesterberg

Is there a hasNA() / an anyNA() function in R? Of course,

hasNA <- function(x) { any(is.na(x)); }

would do, but that would scan all elements in 'x' and then do the test. I'm looking for a more efficient implementation that returns TRUE at the first NA, e.g.

hasNA <- function(x) {
    for (kk in seq(along=x)) {
        if (is.na(x[kk])) return(TRUE);
    }
    FALSE;
}

Cheers, Henrik
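For what it's worth, this thread predates it, but base R has since gained anyNA() (added in R 3.1.0), which is exactly the early-exit scan requested here, implemented in C:

```r
x <- c(1, 2, NA, rep(0, 1e6))
anyNA(x)        # TRUE: can stop at the first NA found
any(is.na(x))   # TRUE: but is.na() first builds a full logical vector
```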
Re: [Rd] comment causes browser() to exit (PR#9063)
If I'm not mistaken, this works as documented. ...

Thanks for the response. The behavior with return is as documented -- hence my earlier enhancement request. The behavior with a comment is contrary to the documentation, hence this bug report.

On the second point -- help(browser) says: "Anything else entered at the browser prompt is interpreted as an R expression to be evaluated in the calling environment: ..." whereas the actual behavior is to interpret a comment as equivalent to c, and to exit the browser.

On the first point -- I would like to argue that the behavior of return (and comments) should be changed, at least as a user option. Here is a typical piece of code from a function; note the use of comments and blank lines to improve readability:

# If statistic isn't a function or name of function or expression,
# store it as an expression to pass to fit.func.
substitute.stat <- substitute(statistic)
if(!is.element(mode(substitute.stat), c("name", "function")))
    statistic <- substitute.stat

# Get name of data.
data.name <- substitute(data)
if(!is.name(data.name))
    data.name <- "data"
is.df.data <- is.data.frame(data)

# How many observations, or subjects if sampling by subject?
n <- nObservations <- numRows(data)  # n will change later if by subject

# Save group or subject arguments?
if(is.null(save.subject)) save.subject <- (n = 1)
if(is.null(save.group)) save.group <- (n = 1)

If I now stick a browser() in that function, and throw a line at a time from the source file to R, it exits whenever I throw a blank line or comment. I try to remember to skip the blank lines and comments, but I sometimes forget, and get very annoyed when I have to start over. I could use c in some contexts, but not others:
* I often want to evaluate code that is not part of the defined function.
* I sometimes change objects and want to go evaluate some lines that were previously evaluated.

In the enhancement request I requested an option to turn off the current behavior of return.
I personally would just change the default behavior, and have both blank lines and comments do nothing. This is simpler, and people can always use 'c' to quit the browser.

This behavior of browser() is the most annoying thing I've found about using R. As I anticipate using R a lot in the future, I would very much appreciate it being changed. I spent a fair amount of time trying to see if I could change it myself, but gave up.

Tim Hesterberg

Andy Liaw wrote:

If I'm not mistaken, this works as documented. As an example (typed directly into the Rgui console on WinXP):

    > f <- function() {
    +   browser()
    +   cat("I'm here!\n")
    +   cat("I'm still here!\n")
    + }
    > f()
    Called from: f()
    Browse[1]> ## where to?
    I'm here!
    I'm still here!

which I think is what you saw. However:

    > f()
    Called from: f()
    Browse[1]> n
    debug: cat("I'm here!\n")
    Browse[1]> ##
    I'm here!
    debug: cat("I'm still here!\n")
    Browse[1]>
    I'm still here!

From ?browser:

    c     (or just return) exit the browser and continue execution at
          the next statement.
    cont  synonym for c.
    n     enter the step-through debugger. This changes the meaning of
          c: see the documentation for debug.

My interpretation of this is that, if the first thing typed (or pasted) in is something like a null statement (e.g., return, empty line, or comment), it's the same as 'c', but if the null statement follows 'n', then it behaves differently.

    platform       i386-pc-mingw32
    arch           i386
    os             mingw32
    system         i386, mingw32
    status
    major          2
    minor          3.1
    year           2006
    month          06
    day            01
    svn rev        38247
    language       R
    version.string Version 2.3.1 (2006-06-01)

Andy

From: [EMAIL PROTECTED]

I'm trying to step through some code using browser(), executing one line at a time. Unfortunately, whenever I execute a comment line, the browser exits. I previously reported a similar problem with blank lines. These problems are a strong incentive to write poor code -- uncommented code with no blank lines to improve readability -- so that I can use browser() without it exiting at inconvenient times.
Additional detail:

(1) I'm running R inside emacs, with R in one buffer and a file containing code in another, using a macro to copy and paste a line at a time from the file to the R buffer.

(2) The browser() call is inside a function. Right now the lines I'm sending to the browser are not part of the function, though usually they are.

--please do not edit the information below--
Version:
 platform = i386-pc-mingw32
 arch = i386
 os = mingw32
 system = i386, mingw32
 status =
 major
Re: [Rd] Open .ssc .S ... files in R (PR#8690)
I think it would be good to make the change in the Mac gui too. This would help people on the Mac who work on multiple platforms, or try scripts from other people.

I forgot to mention one other extension, .t, an extension often used for tests to be processed using do.test(). However, this is less common and could easily be excluded; people can use "All files" for this.

Thanks, Tim

On 3/17/2006 2:19 PM, [EMAIL PROTECTED] wrote:

- Quick summary: In the File:Open dialog, please change "S files (*.q)" to "S files (*.q, *.ssc, *.S)" and show the corresponding files (including .SSC and .s files).

I'll make this change in the Windows Rgui. Is this an issue in the Mac gui too?

Duncan Murdoch

- Background

This is motivated by the following query to R-help:

Date: Thu, 16 Mar 2006 22:44:11 -0600
From: xpRt.wannabe [EMAIL PROTECTED]
Subject: [R] Is there a way to view S-PLUS script files in R
To: r-help@stat.math.ethz.ch

Dear List,

I have some S-PLUS script files (.ssc). Does there exist an R function/command that can read such files? I simply want to view the code and practice in R to help me learn the subject matter. Any help would be greatly appreciated.

    platform i386-pc-mingw32
    arch     i386
    os       mingw32
    system   i386, mingw32
    status
    major    2
    minor    2.1
    year     2005
    month    12
    day      20
    svn rev  36812
    language R

I responded: You can open them in R. On Windows, File:Open Script, change "Files of type" to "All Files", then open the .ssc file.

So there is a workaround. But it is odd that the "S files" option doesn't actually include what are probably the most common S files.
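For the underlying R-help question, an .ssc script can also be inspected from the R prompt without touching the GUI dialog at all. A sketch using base R functions; the file name here is hypothetical:

```r
# Display an S-PLUS script in R's pager without running it:
file.show("myAnalysis.ssc")

# Or read it into a character vector, one element per line,
# for searching or editing from R:
code <- readLines("myAnalysis.ssc")

# If the script happens also to be valid R code, it can be run directly:
source("myAnalysis.ssc")
```

Since S-PLUS and R dialects differ, source() may fail on S-PLUS-specific constructs; file.show() and readLines() are safe regardless, as they only treat the file as text.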
Thanks, Tim Hesterberg

--please do not edit the information below--
Version:
 platform = i386-pc-mingw32
 arch = i386
 os = mingw32
 system = i386, mingw32
 status =
 major = 2
 minor = 2.1
 year = 2005
 month = 12
 day = 20
 svn rev = 36812
 language = R

Windows XP Professional (build 2600) Service Pack 2.0

Locale: LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252

Search Path:
 .GlobalEnv, package:glmpath, package:survival, package:splines, package:methods, package:stats, package:graphics, package:grDevices, package:utils, package:datasets, Autoloads, package:base

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel