Re: [Rd] suggestion for extending ?as.factor
> "PS" == Petr Savicky on Fri, 8 May 2009 18:10:56 +0200 writes:

    PS> On Fri, May 08, 2009 at 05:14:48PM +0200, Petr Savicky wrote:
    >> Let me suggest to consider the following modification, where match() is done
    >> on the strings, not on the original values.
    >>   levels <- unique(as.character(sort(unique(x))))
    >>   x <- as.character(x)
    >>   f <- match(x, levels)

    PS> An alternative solution is
    PS>   ind <- order(x)
    PS>   x <- as.character(x) # or any other conversion to character
    PS>   levels <- unique(x[ind]) # get unique levels ordered by the original values
    PS>   f <- match(x, levels)

(slightly but not much more complicated though).

Yes, indeed that brings us back to (something like) the original
"use factor(format(x)) ..." suggestion, which would have been fine if there
hadn't been the issue of ordering, exactly what you've addressed before.

    PS> The advantage of this over the suggestion from my previous email is that
    PS> the string conversion is applied only once. The conversion need not be only
    PS> as.character(). There may be other choices specified by a parameter. I have
    PS> strong objections against the existing implementation of as.character(),
    PS> but still I think that as.character() should be the default for factor()
    PS> for the sake of consistency of the R language.

The biggest advantage of reverting to something simple like that would be
that it is really simple. My first tests with (a variation of) the above
indicate favorable results. More on this on Monday.

If we'd revert to such a solution, we'd have to get back to Peter's point
that table(.) should, in his view, be more tolerant than as.character()
about "almost equality". For compatibility reasons, we could also return to
the reasoning that useR should use {something like} table(signif(x, 14))
instead of table(x) for numeric x in "typical" cases.

Martin

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
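A minimal sketch of the order-then-match variant quoted above, for illustration only. The sample vector and the names `lab`, `lev`, and `codes` are assumptions, not part of the original posts. It only shows that the levels follow the numeric order of the values rather than the sort order of their labels.

```r
# Illustrative data (assumed, not from the thread)
x <- c(10, 2, 2, 33, 10)

ind   <- order(x)            # ordering by the original numeric values
lab   <- as.character(x)     # the conversion to character is applied once
lev   <- unique(lab[ind])    # unique labels, kept in numeric order
codes <- match(lab, lev)     # integer codes pointing into the labels

lev
codes
```

Sorting the labels as strings would typically put "10" before "2", which is the ordering problem this variant avoids.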
Re: [Rd] unsplit list of data.frames with one column
Peter Dalgaard wrote:
> Will Gray wrote:
>>
>> Perhaps this is the intended behavior, but I discovered that unsplit
>> throws an error when it tries to set rownames of a variable that has
>> no dimension. This occurs when unsplit is passed a list of
>> data.frames that have only a single column.
>>
>> An example:
>>
>>   df <- data.frame(letters[seq(25)])
>>   fac <- rep(seq(5), 5)
>>   unsplit(split(df, fac), fac)
>>
>> For reference, I'm using R version 2.9.0 (2009-04-17), subversion
>> revision 48333, on Ubuntu 8.10.
>>
>
> That's a bug. The line
>
>   x <- value[[1L]][rep(NA, len), ]
>
> should be
>
>   x <- value[[1L]][rep(NA, len), , drop=FALSE]
>

looks like someone got caught by the drop=TRUE design...?

vQ
Re: [Rd] anyDuplicated(incomp=NA) fails
> William Dunlap on Fri, 8 May 2009 16:16:56 -0700 writes:

    > With today's R 2.10.0(devel) I get:
    >> anyDuplicated(c(1,NA,3,NA,5), incomp=NA) # expect 0
    > Warning: stack imbalance in 'anyDuplicated', 20 then 21
    > Warning: stack imbalance in '.Internal', 19 then 20
    > Warning: stack imbalance in '{', 17 then 18
    > [1] 0
    >> anyDuplicated(c(1,NA,3,NA,3), incomp=NA) # expect 5
    > Warning: stack imbalance in 'anyDuplicated', 20 then 21
    > Warning: stack imbalance in '.Internal', 19 then 20
    > Warning: stack imbalance in '{', 17 then 18
    > [1] 0
    >> anyDuplicated(c(1,NA,3,NA,3), incomp=3) # expect 4
    > Warning: stack imbalance in 'anyDuplicated', 20 then 21
    > Warning: stack imbalance in '.Internal', 19 then 20
    > Warning: stack imbalance in '{', 17 then 18
    > [1] 0
    >> anyDuplicated(c(1,NA,3,NA,3), incomp=c(3,NA)) # expect 0
    > Warning: stack imbalance in 'anyDuplicated', 20 then 21
    > Warning: stack imbalance in '.Internal', 19 then 20
    > Warning: stack imbalance in '{', 17 then 18
    > [1] 0
    >> version$svn
    > [1] "48493"

    > After applying the attached patch I get
    >> anyDuplicated(c(1,NA,3,NA,5), incomp=NA)
    > [1] 0
    >> anyDuplicated(c(1,NA,3,NA,3), incomp=NA)
    > [1] 5
    >> anyDuplicated(c(1,NA,3,NA,3), incomp=3)
    > [1] 4
    >> anyDuplicated(c(1,NA,3,NA,3), incomp=c(3,NA))
    > [1] 0

    > Calls to UNPROTECT() were missing and a macro definition
    > did nothing because there were no backslashes at the ends
    > of lines. I didn't check the results very carefully.

Thank you very much, Bill!
Somewhat embarrassing...

Note that the patch "in theory" needs to be modified to only UNPROTECT()
when PROTECT() was called, which "in practice" is always ;-), but in any
case, I've slightly modified your patch and also applied it to R-patched.

Thanks once more,
Martin

    > Bill Dunlap
    > TIBCO Software Inc - Spotfire Division
    > wdunlap tibco.com
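For readers who want to try the calls above, here is a small sketch that spells out the full argument name `incomparables` (the thread abbreviates it to `incomp` through partial matching). It assumes an R version that already contains the fix; the values mentioned in the comments for the `incomparables` calls are the ones reported after the patch in the message above.

```r
x <- c(1, NA, 3, NA, 3)

anyDuplicated(x)                             # the NA at position 4 repeats position 2
anyDuplicated(x, incomparables = NA)         # NAs never match: reported as 5
anyDuplicated(x, incomparables = 3)          # 3s never match: reported as 4
anyDuplicated(x, incomparables = c(3, NA))   # nothing can match: reported as 0
```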
Re: [Rd] suggestion for extending ?as.factor
At 14:18 08/05/2009, Martin Maechler wrote:
> "PS" == Petr Savicky on Fri, 8 May 2009 11:01:55 +0200 writes:

Somewhere below Martin asks for alternatives from list readers. I do not
have alternatives, but I do have two comments, one immediately below this,
the other embedded in-line.

This whole thread reminds me just why I have spent the best part of a
decade climbing the virtual Matterhorn called 'Learning R' and why it is
such a pleasure to use. It is the fact that somebody, somewhere cares
enough about consistency, usability and accuracy to devote hours to getting
even obscure details just right.

    PS> On Wed, May 06, 2009 at 10:41:58AM +0200, Martin Maechler wrote:
    PD> I think that the real issue is that we actually do want almost-equal
    PD> numbers to be folded together.
    >>
    >> yes, this now (revision 48469) will happen by default, using signif(x, 15)
    >> where '15' is the default for the new optional argument 'digitsLabels'
    >> {better argument name? (but must not start with 'label')}

    PS> Let me analyze the current behavior of factor(x) for numeric x with missing(levels)
    PS> and missing(labels). In this situation, levels are computed as sort(unique(x))
    PS> from possibly transformed x. Then, labels are constructed by a conversion of the
    PS> levels to strings.

    PS> I understand the current (R 2.10.0, 2009-05-07 r48492) behavior as follows.

    PS> If keepUnique is FALSE (the default), then
    PS> - values x are transformed by signif(x, digitsLabels)
    PS> - labels are computed using as.character(levels)
    PS> - digitsLabels defaults to 15, but may be set to any integer value

    PS> If keepUnique is TRUE, then
    PS> - values x are preserved
    PS> - labels are computed using sprintf("%.*g", digitsLabels, levels)
    PS> - digitsLabels defaults to 17, but may be set to any integer value

(in theory; in practice, I think I've suggested somewhere that it should
be >= 17; but see below.)

Your summary seems correct to me.

    PS> There are several situations, when this approach produces duplicated levels.
    PS> Besides the one described in my previous email, there are also others
    PS>   factor(c(0.3, 0.1+0.2), keepUnique=TRUE, digitsLabels=15)

yes, but this does not make much sense; I've already contemplated
producing a warning in such cases, something like

    if(keepUnique && digitsLabels < 17)
        warning(gettextf(
            "'digitsLabels = %d' is typically too small when 'keepUnique' is true",
            digitsLabels))

    PS>   factor(1 + 0:5 * 1e-16, digitsLabels=17)

again, this does not make much sense; but why disallow the useR to shoot
into his foot?

I agree. As a useR I do not want to be stopped from doing anything. I would
appreciate a warning just before I shoot myself in the foot and I
definitely want one if it looks like I am going to aim for my head.

    PS> I would like to suggest a modification. It eliminates most of the cases, where
    PS> we get duplicated levels. It would eliminate all such cases, if the function
    PS> signif() works as expected. Unfortunately, if signif() works as it does in the
    PS> current versions of R, we still get duplicated levels.

    PS> The suggested modification is as follows.

    PS> If keepUnique is FALSE (the default), then
    PS> - values x are transformed by signif(x, digitsLabels)
    PS> - labels are computed using sprintf("%.*g", digitsLabels, levels)
    PS> - digitsLabels defaults to 15, but may be set to any integer value

I tend to like this change, given -- as you found yesterday -- that
as.character() is not even preserving 15 digits. OTOH, as.character() has
been in use for a very long history of S (and R), whereas using sprintf()
is not back compatible with it and actually depends on the LIBC
implementation of the system-sprintf. For that reason as.character() would
be preferable. Hmm

    PS> If keepUnique is TRUE, then
    PS> - values x are preserved
    PS> - labels are computed using sprintf("%.*g", 17, levels)
    PS> - digitsLabels is ignored

I had originally planned to do exactly the above. However, e.g.,
digitsLabels = 18 may be desired in some cases, and that's why I also left
the possibility to apply it in the keepUnique case.

    PS> Arguments for the modification are the following.

    PS> 1. If keepUnique is FALSE, then computing labels using as.character() leads
    PS> to duplicated labels as demonstrated in my previous email. So, I suggest to
    PS> use sprintf("%.*g", digitsLabels, levels) instead of as.character().

{as said above, that seems sensible, though unfortunately quite a bit less
back-compatible!}

    PS> 2. If keepUnique is TRUE and we allow digitsLabels less than 17, then we get
    PS> duplicated labels. So, I suggest to force digitsLabels=17, if keepUnique=TRUE.

    PS> If signif(,digitsLabels) works as expected, then the above approach should not
    PS> produce duplicated labels
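The point about 15 versus 17 significant digits can be checked with a short sketch. It uses only base `sprintf()` on the two values from the thread's own example; the names `lab15` and `lab17` are assumptions made for illustration, and, as noted above, the exact strings depend on the platform's sprintf implementation, so only collisions are examined here rather than the strings themselves. The experimental `keepUnique` and `digitsLabels` arguments are deliberately not used.

```r
x <- c(0.3, 0.1 + 0.2)             # two distinct doubles that usually print alike

x[1] == x[2]                       # the values differ in their last bits

lab15 <- sprintf("%.*g", 15L, x)   # labels with 15 significant digits
lab17 <- sprintf("%.*g", 17L, x)   # labels with 17 significant digits

anyDuplicated(lab15) > 0           # do the 15-digit labels collide?
anyDuplicated(lab17) > 0           # do the 17-digit labels collide?
```

On a system with a correctly rounding sprintf, the 15-digit labels coincide while the 17-digit labels stay distinct, which is why distinct values paired with fewer than 17 digits lead to duplicated labels.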
Re: [Rd] Improve aggregate.default ...?
On Sat, 2009-05-09 at 08:23 -0400, Gabor Grothendieck wrote:
> Try this:
>
> > aggregate(dat["A"], dat["Group"], mean)
>   Group         A
> 1     1 0.4944810
> 2     2 0.4765412
> 3     3 0.4521068
> 4     4 0.4989000

Thanks Gabor.

Ideally, aggregate.default should "work" whatever indexing one uses - here
you are using the fact that a data.frame is a special case of a list, and
that is not the way most help resources introduce subsetting for data
frames.

For personal use, I can use my own version of aggregate.default and, as I
dislike using `$`, preferring with(), I don't run the risk of non-syntactic
names being produced. I was really looking for ideas for improving
aggregate.default in general. The solution I posted has its own
infelicities...

Cheers,

G

>
> On Sat, May 9, 2009 at 8:14 AM, Gavin Simpson wrote:
> > Hi,
> >
> > I find it a bit annoying that aggregate.default forces the returned
> > object to lose the 'name' of the variable aggregated, replacing it with
> > 'x'.
> >
> > A brief example:
> >
> >> dat <- data.frame(A = runif(100), B = rnorm(100),
> > +                   Group = gl(4, 25))
> >> with(dat, aggregate(A, by = list(Group = Group), FUN = mean))
> >   Group         x
> > 1     1 0.6523228
> > 2     2 0.4544317
> > 3     3 0.4619624
> > 4     4 0.4703156
> >
> > This arises because aggregate.default has:
> >
> > function (x, ...)
> > {
> >     if (is.ts(x))
> >         aggregate.ts(as.ts(x), ...)
> >     else aggregate.data.frame(as.data.frame(x), ...)
> > }
> >
> > which recasts x as a data frame, but doesn't make any effort to supply a
> > name. Can we do a better job of supplying a useful name?
> >
> > My first attempt is:
> >
> > aggregate.default <- function(x, ...) {
> >     if (is.ts(x))
> >         aggregate.ts(as.ts(x), ...)
> >     else {
> >         nam <- deparse(substitute(x))
> >         x <- as.data.frame(x)
> >         names(x) <- nam
> >         aggregate.data.frame(x, ...)
> >     }
> > }
> >
> > Which works for the brief example above:
> >
> >> with(dat, aggregate(A, by = list(Group = Group), FUN = mean))
> >   Group         A
> > 1     1 0.4269715
> > 2     2 0.5479352
> > 3     3 0.5091543
> > 4     4 0.4926412
> >
> > However, it fails make check-all because examples have relied on the
> > returned object having 'x'. I also note that this might have the
> > annoying side effect of producing odd names if we use the following
> > incantation:
> >
> >> res <- aggregate(dat$A, by = list(Group = dat$Group), FUN = mean)
> >> str(res)
> > 'data.frame':   4 obs. of  2 variables:
> >  $ Group: Factor w/ 4 levels "1","2","3","4": 1 2 3 4
> >  $ dat$A: num  0.427 0.548 0.509 0.493
> >> res$dat$A
> > Error in res$dat$A : $ operator is invalid for atomic vectors
> >> res$`dat$A`
> > [1] 0.4269715 0.5479352 0.5091543 0.4926412
> >
> > Is there a way of coming up with a better way to name the aggregated
> > variable? Would a change of this kind be something R Core would consider
> > making to aggregate.default if a good solution is found?
> >
> > Thanks in advance,
> >
> > G

--
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
 Dr. Gavin Simpson             [t] +44 (0)20 7679 0522
 ECRC, UCL Geography,          [f] +44 (0)20 7679 0565
 Pearson Building,             [e] gavin.simpsonATNOSPAMucl.ac.uk
 Gower Street, London          [w] http://www.ucl.ac.uk/~ucfagls/
 UK. WC1E 6BT.                 [w] http://www.freshwaters.org.uk
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
Re: [Rd] Improve aggregate.default ...?
Try this:

> aggregate(dat["A"], dat["Group"], mean)
  Group         A
1     1 0.4944810
2     2 0.4765412
3     3 0.4521068
4     4 0.4989000

On Sat, May 9, 2009 at 8:14 AM, Gavin Simpson wrote:
> Hi,
>
> I find it a bit annoying that aggregate.default forces the returned
> object to lose the 'name' of the variable aggregated, replacing it with
> 'x'.
>
> A brief example:
>
>> dat <- data.frame(A = runif(100), B = rnorm(100),
> +                   Group = gl(4, 25))
>> with(dat, aggregate(A, by = list(Group = Group), FUN = mean))
>   Group         x
> 1     1 0.6523228
> 2     2 0.4544317
> 3     3 0.4619624
> 4     4 0.4703156
>
> This arises because aggregate.default has:
>
> function (x, ...)
> {
>     if (is.ts(x))
>         aggregate.ts(as.ts(x), ...)
>     else aggregate.data.frame(as.data.frame(x), ...)
> }
>
> which recasts x as a data frame, but doesn't make any effort to supply a
> name. Can we do a better job of supplying a useful name?
>
> My first attempt is:
>
> aggregate.default <- function(x, ...) {
>     if (is.ts(x))
>         aggregate.ts(as.ts(x), ...)
>     else {
>         nam <- deparse(substitute(x))
>         x <- as.data.frame(x)
>         names(x) <- nam
>         aggregate.data.frame(x, ...)
>     }
> }
>
> Which works for the brief example above:
>
>> with(dat, aggregate(A, by = list(Group = Group), FUN = mean))
>   Group         A
> 1     1 0.4269715
> 2     2 0.5479352
> 3     3 0.5091543
> 4     4 0.4926412
>
> However, it fails make check-all because examples have relied on the
> returned object having 'x'. I also note that this might have the
> annoying side effect of producing odd names if we use the following
> incantation:
>
>> res <- aggregate(dat$A, by = list(Group = dat$Group), FUN = mean)
>> str(res)
> 'data.frame':   4 obs. of  2 variables:
>  $ Group: Factor w/ 4 levels "1","2","3","4": 1 2 3 4
>  $ dat$A: num  0.427 0.548 0.509 0.493
>> res$dat$A
> Error in res$dat$A : $ operator is invalid for atomic vectors
>> res$`dat$A`
> [1] 0.4269715 0.5479352 0.5091543 0.4926412
>
> Is there a way of coming up with a better way to name the aggregated
> variable? Would a change of this kind be something R Core would consider
> making to aggregate.default if a good solution is found?
>
> Thanks in advance,
>
> G
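The naming behaviour discussed in this exchange can be compared side by side with a small sketch. The seed and the result names `res_vec`, `res_df`, and `res_list` are assumptions added for illustration; the third call, which passes a named list, is an extra variant not proposed in the thread.

```r
set.seed(1)   # assumed seed, only to make the sketch reproducible
dat <- data.frame(A = runif(100), B = rnorm(100), Group = gl(4, 25))

# Bare vector: the aggregated column comes back under the generic name "x"
res_vec <- with(dat, aggregate(A, by = list(Group = Group), FUN = mean))

# One-column data frames via single-bracket indexing keep the name "A"
res_df <- aggregate(dat["A"], dat["Group"], mean)

# A named list carries the name along as well
res_list <- aggregate(list(A = dat$A), by = list(Group = dat$Group), FUN = mean)

names(res_vec)
names(res_df)
names(res_list)
```

All three calls compute the same group means; only the name of the aggregated column differs.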
[Rd] Improve aggregate.default ...?
Hi,

I find it a bit annoying that aggregate.default forces the returned
object to lose the 'name' of the variable aggregated, replacing it with
'x'.

A brief example:

> dat <- data.frame(A = runif(100), B = rnorm(100),
+                   Group = gl(4, 25))
> with(dat, aggregate(A, by = list(Group = Group), FUN = mean))
  Group         x
1     1 0.6523228
2     2 0.4544317
3     3 0.4619624
4     4 0.4703156

This arises because aggregate.default has:

function (x, ...)
{
    if (is.ts(x))
        aggregate.ts(as.ts(x), ...)
    else aggregate.data.frame(as.data.frame(x), ...)
}

which recasts x as a data frame, but doesn't make any effort to supply a
name. Can we do a better job of supplying a useful name?

My first attempt is:

aggregate.default <- function(x, ...) {
    if (is.ts(x))
        aggregate.ts(as.ts(x), ...)
    else {
        nam <- deparse(substitute(x))
        x <- as.data.frame(x)
        names(x) <- nam
        aggregate.data.frame(x, ...)
    }
}

Which works for the brief example above:

> with(dat, aggregate(A, by = list(Group = Group), FUN = mean))
  Group         A
1     1 0.4269715
2     2 0.5479352
3     3 0.5091543
4     4 0.4926412

However, it fails make check-all because examples have relied on the
returned object having 'x'. I also note that this might have the
annoying side effect of producing odd names if we use the following
incantation:

> res <- aggregate(dat$A, by = list(Group = dat$Group), FUN = mean)
> str(res)
'data.frame':   4 obs. of  2 variables:
 $ Group: Factor w/ 4 levels "1","2","3","4": 1 2 3 4
 $ dat$A: num  0.427 0.548 0.509 0.493
> res$dat$A
Error in res$dat$A : $ operator is invalid for atomic vectors
> res$`dat$A`
[1] 0.4269715 0.5479352 0.5091543 0.4926412

Is there a way of coming up with a better way to name the aggregated
variable? Would a change of this kind be something R Core would consider
making to aggregate.default if a good solution is found?

Thanks in advance,

G

--
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
 Dr. Gavin Simpson             [t] +44 (0)20 7679 0522
 ECRC, UCL Geography,          [f] +44 (0)20 7679 0565
 Pearson Building,             [e] gavin.simpsonATNOSPAMucl.ac.uk
 Gower Street, London          [w] http://www.ucl.ac.uk/~ucfagls/
 UK. WC1E 6BT.                 [w] http://www.freshwaters.org.uk
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
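The deparse(substitute(x)) idea from the post can be tried without masking aggregate.default by putting it in a separate wrapper. The function name `aggregate_named` is hypothetical and introduced only for this sketch, as are the seed and the result names; taking the first element of the deparsed expression is an added precaution for long expressions, not something the post proposes.

```r
# Hypothetical wrapper (not part of R): name the aggregated column after
# the expression that was passed in as `x`.
aggregate_named <- function(x, ...) {
  nam <- deparse(substitute(x))[1L]   # text of the caller's expression
  x <- as.data.frame(x)
  names(x) <- nam
  aggregate(x, ...)                   # dispatches to the data frame method
}

set.seed(1)   # assumed seed, for reproducibility only
dat <- data.frame(A = runif(100), B = rnorm(100), Group = gl(4, 25))

r1 <- with(dat, aggregate_named(A, by = list(Group = Group), FUN = mean))
r2 <- aggregate_named(dat$A, by = list(Group = dat$Group), FUN = mean)

names(r1)       # the column is named after the symbol A
names(r2)       # the column carries the non-syntactic name dat$A
r2[["dat$A"]]   # double brackets sidestep the backtick quoting
```

This reproduces both the benefit and the drawback described in the post: bare symbols give clean names, while `$` expressions give names that need quoting.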
Re: [Rd] unsplit list of data.frames with one column
Will Gray wrote:
> Perhaps this is the intended behavior, but I discovered that unsplit
> throws an error when it tries to set rownames of a variable that has
> no dimension. This occurs when unsplit is passed a list of
> data.frames that have only a single column.
>
> An example:
>
>   df <- data.frame(letters[seq(25)])
>   fac <- rep(seq(5), 5)
>   unsplit(split(df, fac), fac)
>
> For reference, I'm using R version 2.9.0 (2009-04-17), subversion
> revision 48333, on Ubuntu 8.10.

That's a bug. The line

  x <- value[[1L]][rep(NA, len), ]

should be

  x <- value[[1L]][rep(NA, len), , drop=FALSE]

--
   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark      Ph:  (+45) 35327918
~~~~~~~~~~ - (p.dalga...@biostat.ku.dk)              FAX: (+45) 35327907
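The effect of the missing drop=FALSE can be seen in isolation with a short sketch. The column name `v`, the value of `len`, and the names `a` and `b` are assumptions for illustration; the indexing expressions mirror the two lines quoted above.

```r
# A one-column data frame, as in the report (column name `v` is assumed)
df  <- data.frame(v = letters[1:2])
len <- 5

a <- df[rep(NA, len), ]                  # default drop: collapses to a plain vector
b <- df[rep(NA, len), , drop = FALSE]    # stays a data frame with `len` rows

is.null(dim(a))   # TRUE: there are no dimensions to attach row names to
dim(b)            # 5 1
```

With more than one column the default already returns a data frame, which is why the problem only shows up for single-column input.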