[R] tapply and error bars: Problem Fixed
Hi Jim,

This is great! It is also tricky! The problem lies in the choice of ylim. Looking at the data and choosing ylim based on the maximum and minimum values of y is a waste of time, and choosing it by other means was much more difficult. I had to plot part of the data in incremental steps of 80 data points, manually varying ylim, until I got to the last data point, 1136, where I finally used ylim=c(15,162000), which has nothing to do with the raw data.

Many, many thanks.

Best wishes
Ogbos

On Sun, Jun 24, 2018 at 9:51 PM, Jim Lemon wrote:
> Hi Ogbos,
> The problem is almost certainly with the data. I get the plot I expect
> with the sample data that you first posted, so I know that the code
> works.

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
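The trial-and-error search for ylim described above can usually be avoided by taking the limits from what is actually drawn, that is, the group means plus and minus their standard errors, rather than from the raw values. The following is only a sketch under stated assumptions: the data frame `demo` is randomly generated (it is not the poster's data), `se` is a helper name chosen here, and base R's arrows() stands in for plotrix's dispersion().

```r
# Hypothetical data: 16 epoch days (-5..10), 5 observations per day
set.seed(1)
demo <- data.frame(A = rep(-5:10, 5),
                   B = round(runif(80, 30000, 130000)))

# Standard error of the mean: sd / sqrt(n), ignoring NAs
se <- function(x) sd(x, na.rm = TRUE) / sqrt(sum(!is.na(x)))

m <- tapply(demo$B, demo$A, mean, na.rm = TRUE)  # group means
s <- tapply(demo$B, demo$A, se)                  # group standard errors

# Limits taken from what is plotted: mean +/- standard error
yl <- range(c(m - s, m + s))

plot(-5:10, m, type = "b", ylim = yl,
     xlab = "days (epoch is the day of Fd)", ylab = "mean of B")
# Error bars drawn as double-headed flat arrows
arrows(-5:10, m - s, -5:10, m + s, angle = 90, code = 3, length = 0.03)
```

Because `yl` is computed from `m` and `s`, the plotted points and bars always fall inside the plotting region, whatever the scale of B.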
Re: [R] tapply and error bars
Hi Ogbos,
The problem is almost certainly with the data. I get the plot I expect with the sample data that you first posted, so I know that the code works. If you try this, what do you get?

oodf<-read.table(text="S/N A B
1 -5 64833
2 -4 95864
3 -3 82322
4 -2 95591
5 -1 69378
6 0 74281
7 1 103261
8 2 92473
9 3 84344
10 4 127415
11 5 123826
12 6 100029
13 7 76205
14 8 105162
15 9 119533
16 10 106490
17 -5 82322
18 -4 95591
19 -3 69378
20 -2 74281
21 -1 103261
22 0 92473
23 1 84344
24 2 127415
25 3 123826
26 4 100029
27 5 76205
28 6 105162
29 7 119533
30 8 106490
31 9 114771
32 10 55593
33 -5 85694
34 -4 65205
35 -3 80995
36 -2 51723
37 -1 62310
38 0 53401
39 1 65677
40 2 76094
41 3 64035
42 4 68290
43 5 73306
44 6 82176
45 7 75566
46 8 89762
47 9 88063
48 10 94395
49 -5 80651
50 -4 81291
51 -3 63702
52 -2 70297
53 -1 64117
54 0 71219
55 1 57354
56 2 62111
57 3 42252
58 4 35454
59 5 33469
60 6 38899
61 7 64981
62 8 85694
63 9 79452
64 10 85216
65 -5 71219
66 -4 57354
67 -3 62111
68 -2 42252
69 -1 35454
70 0 33469
71 1 38899
72 2 64981
73 3 85694
74 4 79452
75 5 85216
76 6 81721
77 7 91231
78 8 107074
79 9 108103
80 10 7576",
 header=TRUE)
library(plotrix)
# standard error of the mean: sd / sqrt(number of non-missing values)
std.error<-function(x) return(sd(x)/sqrt(sum(!is.na(x))))
oomean<-as.vector(by(oodf$B,oodf$A,mean))
oose<-as.vector(by(oodf$B,oodf$A,std.error))
plot(-5:10,oomean,type="b",ylim=c(5,11),
 xlab="days (epoch is the day of Fd)",ylab="strikes/km2/day")
dispersion(-5:10,oomean,oose)

I get the attached plot.

Jim

Attachment: ooplot.pdf (Adobe PDF document)

On Mon, Jun 25, 2018 at 1:58 AM, Ogbos Okike wrote:
> When I used it, empty plot window with the correct axes were generated but
> no data was displayed. No error too.
Re: [R] tapply and error bars
Hi Jim,

Thanks again for returning to this. Please note that the line "oomean<-as.vector(by(oodf$B,oodf$A,mean))" was omitted (not sure whether deliberately) after you introduced the standard error function. When I ran the code without it, an empty plot window with the correct axes was generated, but no data were displayed. There was no error either.

library(plotrix)
std.error<-function(x) return(sd(x)/sqrt(sum(!is.na(x))))
oose<-as.vector(by(oodf$B,oodf$A,std.error))
plot(-5:10,oomean,type="b",ylim=c(5,11),
 xlab="days (epoch is the day of Fd)",ylab="strikes/km2/day")
dispersion(-5:10,oomean,oose)

When I included the line, the same empty graph window was generated, but with the former error "Error in FUN(X[[1L]], ...) : could not find function "FUN"":

library(plotrix)
std.error<-function(x) return(sd(x)/sqrt(sum(!is.na(x))))
oomean<-as.vector(by(oodf$B,oodf$A,mean))
oose<-as.vector(by(oodf$B,oodf$A,std.error))
plot(-5:10,oomean,type="b",ylim=c(5,11),
 xlab="days (epoch is the day of Fd)",ylab="strikes/km2/day")
dispersion(-5:10,oomean,oose)

I am sure I am missing something but can't place it. Please have a look again to track down my mistake.

Warmest regards
Ogbos

On Sun, Jun 24, 2018 at 11:24 AM, Jim Lemon wrote:
> The reason is that you have not defined std.error as a function, but
> as the result of a calculation.
Re: [R] tapply and error bars
Hi Ogbos,
If I use the example data that you sent, I get the error after this line:

oose<-as.vector(by(oodf$B,oodf$A,std.error))
Error in FUN(X[[i]], ...) : object 'std.error' not found

The reason is that you have not defined std.error as a function, but as the result of a calculation. When I rewrite it like this:

std.error<-function(x) return(sd(x)/sqrt(sum(!is.na(x))))
oose<-as.vector(by(oodf$B,oodf$A,std.error))
plot(-5:10,oomean,type="b",ylim=c(5,11),
 xlab="days (epoch is the day of Fd)",ylab="strikes/km2/day")
dispersion(-5:10,oomean,oose)

I get the expected plot.

Jim

On Sat, Jun 23, 2018 at 9:36 PM, Ogbos Okike wrote:
> Hi Jim,
>
> Thanks for assisting. Here is what I did:
>
> A<-matrix(rep(-5:10,71))
> B<-matrix(data)
> std.error = sd(B)/sqrt(sum(!is.na(B)))
> oodf<-data.frame(A,B)
>
> oomean<-as.vector(by(oodf$B,oodf$A,mean))
> oose<-as.vector(by(oodf$B,oodf$A,std.error))
> plot(-5:10,oomean,type="b",ylim=c(5,11),
> xlab="days (epoch is the day of Fd)",ylab="strikes/km2/day")
> dispersion(-5:10,oomean,oose)
>
> And the error says:
> Error in FUN(X[[1L]], ...) : could not find function "FUN"
>
> Please note that I use:
> std.error = sd(B)/sqrt(sum(!is.na(B)))
> to calculate the standard error as it requested for it.
>
> Thanks
> Ogbos
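The distinction Jim draws, a computed value versus a function, can be seen on a few made-up numbers. In this sketch the vectors `A` and `B` and the names `se_value` and `se_fun` are illustrative only; the point is that by() and tapply() need something they can call once per group.

```r
B <- c(3, 5, NA, 8, 10, 12)
A <- c(1, 1, 1, 2, 2, 2)

# A single number: computed once, from all of B at the time of assignment
se_value <- sd(B, na.rm = TRUE) / sqrt(sum(!is.na(B)))
is.function(se_value)   # FALSE, so it cannot be applied group by group

# A function: evaluated separately on each group's values
se_fun <- function(x) sd(x, na.rm = TRUE) / sqrt(sum(!is.na(x)))
is.function(se_fun)     # TRUE

# One standard error per level of A
group_se <- tapply(B, A, se_fun)
print(group_se)
```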
Re: [R] tapply and error bars
Hi Ogbos,
This may help:

# assume your data frame is named "oodf"
# dispersion() and std.error() come from the plotrix package
library(plotrix)
oomean<-as.vector(by(oodf$B,oodf$A,mean))
oose<-as.vector(by(oodf$B,oodf$A,std.error))
plot(-5:10,oomean,type="b",ylim=c(5,11),
 xlab="days (epoch is the day of Fd)",ylab="strikes/km2/day")
dispersion(-5:10,oomean,oose)

Jim

On Sat, Jun 23, 2018 at 4:35 PM, Ogbos Okike wrote:
> I wish to plot the error bars as well. I already plotted such means with
> error bars before.
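Jim's suggestion uses by() on a vector, whereas the original post used tapply(). For a single grouping variable the two should give the same group means in the same order; a quick check on invented numbers (the data frame `df` is not from the thread):

```r
df <- data.frame(A = rep(c(-1, 0, 1), each = 2),
                 B = c(10, 20, 30, 50, 70, 90))

# Group means of B for each value of A, computed two ways
m_by     <- as.vector(by(df$B, df$A, mean))
m_tapply <- as.vector(tapply(df$B, df$A, mean))

all.equal(m_by, m_tapply)
```

Either result can be passed to plot() and to an error-bar function as the vector of means.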
[R] tapply and error bars
Dear workers,
I have data of length 1136. Below is the code I use to get the means of B. It worked fine, and I had the means calculated and plotted.

I wish to plot the error bars as well. I have plotted such means with error bars before. Please see the attached example.

I tried to redo the same plot but unfortunately could not get it to work, as I lost the system containing the script. Among many attempts, I tried:

library(gplots)
plotmeans(errors~AB,xlab="Factor A",ylab="mean errors", p=.68,
 main="Main effect Plot",barcol="black")

Nothing worked.

I would be really thankful if somebody could put me back on track. Many, many thanks for your time.
Ogbos

A sample of the data is:

S/N A B
1 -5 64833
2 -4 95864
3 -3 82322
4 -2 95591
5 -1 69378
6 0 74281
7 1 103261
8 2 92473
9 3 84344
10 4 127415
11 5 123826
12 6 100029
13 7 76205
14 8 105162
15 9 119533
16 10 106490
17 -5 82322
18 -4 95591
19 -3 69378
20 -2 74281
21 -1 103261
22 0 92473
23 1 84344
24 2 127415
25 3 123826
26 4 100029
27 5 76205
28 6 105162
29 7 119533
30 8 106490
31 9 114771
32 10 55593
33 -5 85694
34 -4 65205
35 -3 80995
36 -2 51723
37 -1 62310
38 0 53401
39 1 65677
40 2 76094
41 3 64035
42 4 68290
43 5 73306
44 6 82176
45 7 75566
46 8 89762
47 9 88063
48 10 94395
49 -5 80651
50 -4 81291
51 -3 63702
52 -2 70297
53 -1 64117
54 0 71219
55 1 57354
56 2 62111
57 3 42252
58 4 35454
59 5 33469
60 6 38899
61 7 64981
62 8 85694
63 9 79452
64 10 85216
65 -5 71219
66 -4 57354
67 -3 62111
68 -2 42252
69 -1 35454
70 0 33469
71 1 38899
72 2 64981
73 3 85694
74 4 79452
75 5 85216
76 6 81721
77 7 91231
78 8 107074
79 9 108103
80 10 7576

A<-matrix(rep(-5:10,71))
B<-matrix(data)
AB<-data.frame(A,B)

x<-B

f<-factor(A)
AB<-tapply(x,f,mean)
x<- -5:10
plot(x,AB,type="l")
[R] tapply error svyby function survey package
Hi.

I'm trying to calculate the weighted mean score of a quality of life measure (ovt) in patients with irritable bowel syndrome by their marital status (d7).

This is a summary of the structure of the dataset:

str(sii.tesis)
'data.frame': 1063 obs. of 75 variables:
 $ id     : int 51 52 53 54 55 56 57 58 59 60 ...
 $ stratum: Factor w/ 6 levels "MEst","MAcad",..: 1 4 NA 4 4 1 6 NA 4 4 ...
 $ expfc  : num 22.8 17.1 NA 17.1 17.1 ...
 $ d6     : Factor w/ 3 levels "Estudiante","Profesor",..: 1 1 NA 1 1 1 3 NA 1 1 ...
 $ d7     : Factor w/ 6 levels "Soltero","Casado",..: 1 1 NA 1 1 1 1 NA 1 1 ...
 $ d7c    : Factor w/ 2 levels "No estable","Estable": 1 1 NA 1 1 1 1 NA 1 1 ...
 $ s1cm   : Factor w/ 2 levels "No","Si": 1 2 NA 1 1 1 2 NA 1 1 ...
 $ ovt    : num NA 93.4 NA NA NA ...

I declared the sampling design:

sii.design <- svydesign(
 id = ~1,
 strata = ~stratum,
 weights = ~expfc,
 data = subset(sii.tesis, !is.na(stratum)))

Then I tried to get the result:

svyby(~ovt, ~d7, sii.design, svymean, na.rm = TRUE, level = 0.95)

but I get the error:

Error in tapply(1:NROW(x), list(factor(strata)), function(index) { :
 arguments must have same length

The length of both variables is the same. Wherever the variable ovt exists, there is a matching d7 value in the data frame.

I tried the same thing using another variable instead, role (d6), and it works:

svyby(~ovt, ~d6, sii.design, svymean, na.rm = TRUE, level = 0.95)
                           d6      ovt       se
Estudiante         Estudiante 71.01805 1.370569
Profesor             Profesor 72.30923 6.518378
Administrativo Administrativo 75.69102 3.715050

If I use the recategorized d7 variable (d7c, two levels only) it works too:

svyby(~ovt, ~d7c, sii.design, svymean, na.rm = TRUE, level = 0.95)
                  d7c      ovt      se
No estable No estable 70.92344 1.37460
Estable       Estable 74.53719 4.16954

What could be the problem?

Regards.

Martin Canon
Colombia, South America

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
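For readers unfamiliar with the message itself: at the tapply() level, this error means that the values and the grouping factor handed to tapply() differ in length. The sketch below reproduces that situation with made-up vectors (`x`, `g` and `g_ok` are illustrative, not taken from the survey data). It shows only what the message means and makes no claim about why svyby() ends up in that state.

```r
x <- 1:5                           # five values
g <- factor(c("a", "b", "a"))      # only three group labels

# tapply() refuses mismatched lengths; capture the message instead of stopping
msg <- tryCatch(tapply(x, g, sum),
                error = function(e) conditionMessage(e))
print(msg)

# With matching lengths the same call succeeds
g_ok <- factor(c("a", "b", "a", "b", "a"))
totals <- tapply(x, g_ok, sum)
print(totals)
```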
Re: [R] tapply error svyby function survey package
try resetting your levels? if that doesn't work, please dput() an example data set that we can test with :) thanks!

sii.design <- update( sii.design , d6 = factor( d6 ) )

On Wed, Nov 12, 2014 at 7:59 AM, Martin Canon martin.ca...@gmail.com wrote:
> but i get the error:
>
> Error in tapply(1:NROW(x), list(factor(strata)), function(index) { :
> arguments must have same length
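"Resetting the levels" here presumably means re-applying factor() so that levels with no remaining observations are dropped. A small base-R sketch of that effect, using an invented factor (`marital` and its values are illustrative only):

```r
marital <- factor(c("Soltero", "Casado", NA, "Soltero"),
                  levels = c("Soltero", "Casado", "Viudo"))

table(marital)                 # "Viudo" is listed with a count of zero

# Re-applying factor() keeps only the levels that actually occur,
# preserving their original order
marital_reset <- factor(marital)
levels(marital_reset)
```

Inside a survey design object the same idea is applied through update(), as in the message above.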
Re: [R] tapply error svyby function survey package
hi martin, sending the first 25 rows does not help if it does not re-create the problem. when i run the data you have provided, i do not encounter your problem (see below). someone else may be able to guess the issue, but this would be a lot easier to solve if you can create a minimal reproducible example:

http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example

# svydesign(), svyby() and svymean() come from the survey package
library(survey)

sii.tesis <- structure(list(
 id = c(51L, 52L, 53L, 54L, 55L, 56L, 57L, 58L, 59L, 60L, 61L, 62L, 63L,
 64L, 65L, 66L, 67L, 68L, 69L, 70L, 71L, 73L, 74L, 75L, 76L),
 stratum = structure(c(1L, 4L, NA, 4L, 4L, 1L, 6L, NA, 4L, 4L, 1L, 1L, 1L,
 6L, 6L, 3L, 3L, 6L, NA, 1L, 1L, 6L, 4L, 3L, 6L),
 .Label = c("MEst", "MAcad", "MAdm", "FEst", "FAcad", "FAdm"),
 class = "factor"),
 expfc = c(22.8195266723633, 17.0644626617432, NA, 17.0644626617432,
 17.0644626617432, 22.8195266723633, 5.1702127456665, NA,
 17.0644626617432, 17.0644626617432, 22.8195266723633, 22.8195266723633,
 22.8195266723633, 5.1702127456665, 5.1702127456665, 6.24137926101685,
 6.24137926101685, 5.1702127456665, NA, 22.8195266723633,
 22.8195266723633, 5.1702127456665, 17.0644626617432, 6.24137926101685,
 5.1702127456665),
 d7 = structure(c(1L, 1L, NA, 1L, 1L, 1L, 1L, NA, 1L, 1L, 1L, 1L, 1L, 1L,
 1L, 1L, 1L, 2L, NA, 1L, 1L, 6L, 1L, 6L, 6L),
 .Label = c("Soltero", "Casado", "Separado", "Divorciado", "Viudo",
 "Union libre"), class = "factor"),
 ovt = c(NA, 93.3823547363281, NA, NA, NA, NA, 83.8235321044922, NA, NA,
 NA, NA, NA, NA, NA, 79.4117660522461, NA, NA, 19.1176471710205, NA, NA,
 NA, 85.2941207885742, NA, NA, NA)),
 .Names = c("id", "stratum", "expfc", "d7", "ovt"),
 row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11",
 "12", "13", "14", "15", "16", "17", "18", "19", "20", "21", "22", "23",
 "24", "25"), class = "data.frame")

sii.design <- svydesign(
 id = ~1,
 strata = ~stratum,
 weights = ~expfc,
 data = subset(sii.tesis, !is.na(stratum)))

# works fine
svyby(~ovt, ~d7, sii.design, svymean, na.rm = TRUE, level = 0.95)
                     d7      ovt       se
Soltero         Soltero 88.94329 3.333485
Casado           Casado 19.11765 0.00
Union libre Union libre 85.29412 0.00

On Wed, Nov 12, 2014 at 5:25 PM, Martin Canon martin.ca...@gmail.com wrote:
> Anthony, thanks for your reply.
>
> Resetting the levels didn't work.
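When the first rows of a data set do not reproduce an error, one option is to share only the columns the failing call uses, restricted to the rows that actually enter the analysis. The following is a hedged sketch with an invented data frame `toy`; its column names merely mimic those in this thread, and the values are made up.

```r
toy <- data.frame(
  id      = 1:6,
  stratum = factor(c("MEst", "FEst", NA, "FAdm", "MEst", "FAdm")),
  expfc   = c(22.8, 17.1, NA, 5.2, 22.8, 5.2),
  d7      = factor(c("Soltero", "Casado", NA, "Soltero", "Viudo", "Casado")),
  ovt     = c(NA, 93.4, NA, 83.8, 79.4, 19.1),
  extra   = letters[1:6]          # a column the analysis never touches
)

# Keep only rows with a stratum and only the variables used by the call
small <- subset(toy, !is.na(stratum),
                select = c(stratum, expfc, d7, ovt))
small <- droplevels(small)        # drop factor levels left empty by the subset

dput(small)                       # paste this output into the e-mail
```

If the reduced object still triggers the error when the design and the svyby() call are rebuilt from it, it is small enough to post in full.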
Re: [R] tapply error svyby function survey package
Anthony, thanks for your reply.

Resetting the levels didn't work.

These are the first 25 rows of the dataset:

structure(list(
 id = c(51L, 52L, 53L, 54L, 55L, 56L, 57L, 58L, 59L, 60L, 61L, 62L, 63L,
 64L, 65L, 66L, 67L, 68L, 69L, 70L, 71L, 73L, 74L, 75L, 76L),
 stratum = structure(c(1L, 4L, NA, 4L, 4L, 1L, 6L, NA, 4L, 4L, 1L, 1L, 1L,
 6L, 6L, 3L, 3L, 6L, NA, 1L, 1L, 6L, 4L, 3L, 6L),
 .Label = c("MEst", "MAcad", "MAdm", "FEst", "FAcad", "FAdm"),
 class = "factor"),
 expfc = c(22.8195266723633, 17.0644626617432, NA, 17.0644626617432,
 17.0644626617432, 22.8195266723633, 5.1702127456665, NA,
 17.0644626617432, 17.0644626617432, 22.8195266723633, 22.8195266723633,
 22.8195266723633, 5.1702127456665, 5.1702127456665, 6.24137926101685,
 6.24137926101685, 5.1702127456665, NA, 22.8195266723633,
 22.8195266723633, 5.1702127456665, 17.0644626617432, 6.24137926101685,
 5.1702127456665),
 d7 = structure(c(1L, 1L, NA, 1L, 1L, 1L, 1L, NA, 1L, 1L, 1L, 1L, 1L, 1L,
 1L, 1L, 1L, 2L, NA, 1L, 1L, 6L, 1L, 6L, 6L),
 .Label = c("Soltero", "Casado", "Separado", "Divorciado", "Viudo",
 "Union libre"), class = "factor"),
 ovt = c(NA, 93.3823547363281, NA, NA, NA, NA, 83.8235321044922, NA, NA,
 NA, NA, NA, NA, NA, 79.4117660522461, NA, NA, 19.1176471710205, NA, NA,
 NA, 85.2941207885742, NA, NA, NA)),
 .Names = c("id", "stratum", "expfc", "d7", "ovt"),
 row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11",
 "12", "13", "14", "15", "16", "17", "18", "19", "20", "21", "22", "23",
 "24", "25"), class = "data.frame")

Regards.

Martin

On Wed, Nov 12, 2014 at 1:39 PM, Anthony Damico ajdam...@gmail.com wrote:
> try resetting your levels? if that doesn't work, please dput() an example
> data set that we can test with :) thanks!
[R] tapply and functions with more than one objects
Hello,

How can I use a custom function in tapply that takes more than one variable? sum(x) only needs one object, but when I have a function(x, y) with more arguments, how do I indicate where the other variables are?

I hope someone can help me. Thank you!

Best regards,
Dominic
Re: [R] tapply and functions with more than one objects
On Jan 22, 2013, at 2:24 PM, Dominic Roye wrote:
> How can I use a custom function in tapply that takes more than one
> variable?

You can use:

    lapply(split(multi_col_object, category_vec), function(d) { sum(d$x, d$y) })

    aggregate(dat, list(category), FUN=sum)

Or:

    do.call(rbind, by(multi_col_object, category_vec, function(d) { }))

Here d is the sub-data-frame for one category, and x and y stand for whichever of its columns your function needs.

Sometimes `Reduce` is more compact. Other times `mapply` is needed.

--
David Winsemius
Alameda, CA, USA
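As a rough illustration of the split-and-apply idea above, here is a minimal sketch that computes a weighted mean per group, a function that needs two vectors at once. The data frame dat and its columns x, w and g are made up for the example and are not from the thread.

```r
# Hypothetical example data: values x, weights w, grouping variable g
dat <- data.frame(
  x = c(1, 2, 3, 4, 5, 6),
  w = c(1, 1, 2, 1, 3, 1),
  g = c("a", "a", "a", "b", "b", "b")
)

# Option 1: let tapply split the row indices, then look up both
# columns inside the function
by_index <- tapply(seq_len(nrow(dat)), dat$g,
                   function(i) weighted.mean(dat$x[i], dat$w[i]))

# Option 2: split the whole data frame and work on each piece
by_split <- sapply(split(dat, dat$g),
                   function(d) weighted.mean(d$x, d$w))

print(by_index)
print(by_split)
```

Both routes should give the same numbers; the index trick keeps tapply in play, while split() tends to read more naturally once more than two columns are involved.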
[R] tapply to data.frame or matrix
Dear R users,

imagine I have a data frame and an indexing vector whose length equals the number of columns of the data frame. Is there any convenient way to combine the columns of the data frame into vectors (or straight away apply functions to these subsets) according to the indexing vector, in a similar manner to the tapply function?

For example, in the following case, I would like to combine columns 1 and 2 into one vector, and columns 3-5 into another:

    test = as.data.frame(matrix(1:20, ncol = 5, nrow=4))
    test.ind = c(1,1,2,2,2)

Thanks a lot!

Jannis
Re: [R] tapply to data.frame or matrix
Hello,

Here's a way.

    test <- as.data.frame(matrix(1:20, ncol = 5, nrow=4))
    test.ind <- c(1,1,2,2,2)
    lapply(split(colnames(test), test.ind), function(x) unlist(test[, x]))

Hope this helps,

Rui Barradas

On 04-09-2012 15:40, Jannis wrote:
> Is there any convenient way to combine the columns of the data frame into
> vectors (or straight away apply functions to these subsets) according to
> the indexing vector, in a similar manner to the tapply function?
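The same split() on column names can also feed a summary function directly, which covers the "straight away apply functions" part of the question. A small sketch using the posted test data; the names cols_by_group, combined and group_sums are only illustrative.

```r
test <- as.data.frame(matrix(1:20, ncol = 5, nrow = 4))
test.ind <- c(1, 1, 2, 2, 2)

# Column names grouped by the indexing vector
cols_by_group <- split(colnames(test), test.ind)

# One combined vector per group of columns
combined <- lapply(cols_by_group,
                   function(cn) unlist(test[, cn], use.names = FALSE))

# Or apply a function to each group of columns in one step
group_sums <- sapply(cols_by_group,
                     function(cn) sum(unlist(test[, cn])))

print(combined)
print(group_sums)
```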
Re: [R] tapply to data.frame or matrix
Hi,

Here's another way:

    testagg <- aggregate(colnames(test), list(test.ind), function(x) test[,x])
    list(unlist(testagg[,2][1]), unlist(testagg[,2][2]))
    #[[1]]
    #0.V11 0.V12 0.V13 0.V14 0.V21 0.V22 0.V23 0.V24
    #    1     2     3     4     5     6     7     8
    #[[2]]
    #1.V31 1.V32 1.V33 1.V34 1.V41 1.V42 1.V43 1.V44 1.V51 1.V52 1.V53 1.V54
    #    9    10    11    12    13    14    15    16    17    18    19    20

A.K.

----- Original Message -----
From: Rui Barradas ruipbarra...@sapo.pt
To: Jannis bt_jan...@yahoo.de
Cc: r-help r-help@r-project.org
Sent: Tuesday, September 4, 2012 11:30 AM
Subject: Re: [R] tapply to data.frame or matrix

> Here's a way.
>
>     lapply(split(colnames(test), test.ind), function(x) unlist(test[, x]))
Re: [R] tapply confusion
Actually it's okay. I just created 16 subsets of the data frame using the different months and then ran the Kruskal test 16 times. I'm sure there is a nice way to code this to do it automatically and produce a nice table of the results, but I only started learning R two weeks ago!

Thanks for all the help

--
View this message in context: http://r.789695.n4.nabble.com/tapply-confusion-tp4641729p4641821.html
Re: [R] tapply confusion
Hello

Thank you for the help.

kruskal.test(Temp, Roof) is simple but just returns one result for the whole temperature dataset organised by roof. I want to compare the Temp data for each Roof in each Month. Because I have temperature data on the three roofs for 16 different months, I want 16 separate kruskal.test results.

How do I do this?

Thanks

--
View this message in context: http://r.789695.n4.nabble.com/tapply-confusion-tp4641729p4641820.html
Re: [R] tapply confusion
On Aug 30, 2012, at 4:02 AM, andyspeak wrote:
> I want to compare the Temp data for each Roof in each Month. Because I
> have temperature data on the three roofs for 16 different months, I want
> 16 separate kruskal.test results.

    lapply(split(dfrm, dfrm$Month), function(xfrm) {
      kruskal.test(xfrm[["Temp"]], xfrm[["Roof"]])
    })

Notice that I used an assumed name for the data frame. You have apparently been following unwise advice to use attach. You would be advised to disregard that advice.

--
David Winsemius, MD
Alameda, CA, USA
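To go from the list of test objects to a table of results, one possible sketch is below. The data frame dfrm is simulated here (three months rather than sixteen) purely so the example runs on its own; the column names follow the thread, and the layout of the results table is an assumption about what would be useful.

```r
set.seed(1)  # simulated stand-in for the poster's data
dfrm <- data.frame(
  Month = rep(c("Jan", "Feb", "Mar"), each = 30),
  Roof  = factor(rep(rep(c("R1", "R2", "R3"), each = 10), times = 3)),
  Temp  = rnorm(90, mean = 15, sd = 3)
)

# One Kruskal-Wallis test per month
tests <- lapply(split(dfrm, dfrm$Month),
                function(xfrm) kruskal.test(Temp ~ Roof, data = xfrm))

# Pull the pieces of each "htest" object into one row per month
results <- data.frame(
  Month     = names(tests),
  statistic = sapply(tests, function(tt) unname(tt$statistic)),
  df        = sapply(tests, function(tt) unname(tt$parameter)),
  p.value   = sapply(tests, function(tt) tt$p.value),
  row.names = NULL
)

print(results)
```

With 16 months the same code would return 16 rows; since that means 16 tests, a correction such as p.adjust() on the p.value column may be worth considering.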
[R] tapply confusion
Hello

I have a huge data frame with three columns, 'Roof', 'Month' and 'Temp'. I want to run analyses on the numerical Temp data by the factors Roof and Month, separately and together. For more than one factor I understand I should use aggregate, but I am struggling with tapply for single-factor analysis.

    tapply(Temp, INDEX = Roof, FUN = median)

This works fine. However, if I try to do anything a bit more complex, such as:

    tapply(Temp, INDEX = Roof, FUN = kruskal.test)

it gives the error:

    Error in length(g) : 'g' is missing

What could be the problem?

Thanks

--
View this message in context: http://r.789695.n4.nabble.com/tapply-confusion-tp4641729.html
Re: [R] tapply confusion
On Wednesday, 29 August 2012 at 07:37 -0700, andyspeak wrote:
>     tapply(Temp, INDEX = Roof, FUN = kruskal.test)
>
> it gives the error:
>
>     Error in length(g) : 'g' is missing
>
> What could be the problem?

If you read ?kruskal.test, you'll notice its default method takes (at least) two arguments, the second being g. Its description is:

    g: a vector or factor object giving the group for the corresponding
       elements of ‘x’. Ignored if ‘x’ is a list.

So you do not need tapply(): just call

    kruskal.test(Temp, Roof)

The theoretical reason you cannot use tapply() is that it calls FUN separately for each subset of the data. kruskal.test() would never be passed the whole data set, which is needed to test for differences between groups.

Regards
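The difference can be seen on a few made-up numbers. In this sketch the Temp and Roof vectors are invented for illustration; median works inside tapply() because it needs only one group at a time, whereas kruskal.test() needs all groups together.

```r
# Invented data: three roofs, three readings each
Temp <- c(20.1, 21.3, 19.8, 25.2, 26.0, 24.7, 18.0, 17.5, 18.9)
Roof <- factor(rep(c("green", "bare", "gravel"), each = 3))

# A per-group summary: FUN sees one group's values at a time
medians <- tapply(Temp, INDEX = Roof, FUN = median)

# A between-group test: needs the full vector plus the grouping
kw <- kruskal.test(Temp, Roof)

# The original call fails because each subset arrives without 'g'
failed <- inherits(try(tapply(Temp, Roof, kruskal.test), silent = TRUE),
                   "try-error")

print(medians)
print(kw)
```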
Re: [R] tapply confusion
On Aug 29, 2012, at 7:37 AM, andyspeak wrote:
>     tapply(Temp, INDEX = Roof, FUN = kruskal.test)
>
> it gives the error:
>
>     Error in length(g) : 'g' is missing

What is the sound of one hand clapping? You are sending a bunch of single vectors with no grouping variable to a function that is expecting two data columns.

Maybe you should explain in natural language what test you had in mind, and we could help get you there.

--
David Winsemius, MD
Alameda, CA, USA
Re: [R] tapply for enormous (2^31 row) matrices
On Thu, Feb 23, 2012 at 11:39 AM, Matthew Keller mckellerc...@gmail.com wrote:
> Benilton - I couldn't get sqldf to install on the server I'm using (error
> is: Error : package 'gsubfn' does not have a name space). I think this was
> a problem for R 2.13, and I'm trying to get the admins to install a more
> up-to-date version.

Right. See the troubleshooting section of the sqldf home page:

http://code.google.com/p/sqldf/#Troubleshooting

--
Statistics Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
Re: [R] tapply for enormous (2^31 row) matrices
Thank you all very much for your help (on both the r-help and the bioconductor listserves).

Benilton - I couldn't get sqldf to install on the server I'm using (error is: Error : package 'gsubfn' does not have a name space). I think this was a problem for R 2.13, and I'm trying to get the admins to install a more up-to-date version. I know that I probably need to learn a modicum of SQL given the sizes of datasets I'm using now.

I ended up using a modified version of Hervé Pagès' excellent code (thank you!). I got a huge (40-fold) speed bump by using the data.table package for the indexing/aggregate steps, making an hours-long job a minutes-long job. SO - data.table is hugely useful if you're dealing with indexing/apply-family functions on huge datasets. By the way, I'm not sure why, but read.table was a bit faster than scan for this problem...

Here is the code for others:

    require(data.table)

    computeAllPairSums <- function(filename, nbindiv, nrows.to.read) {
      con <- file(filename, open="r")
      on.exit(close(con))
      ans <- matrix(numeric(nbindiv * nbindiv), nrow=nbindiv)
      chunk <- 0L
      while (TRUE) {
        # read.table faster than scan; it raises an error once the
        # connection is exhausted, so treat that as end of input
        df0 <- tryCatch(
          read.table(con, col.names=c("ID1", "ID2", "ignored", "sharing"),
                     colClasses=c("integer", "integer", "NULL", "numeric"),
                     nrows=nrows.to.read, comment.char=""),
          error = function(e) NULL)
        if (is.null(df0) || nrow(df0) == 0L) break
        DT <- data.table(df0)
        setkey(DT, ID1, ID2)
        # sum the sharing column within each (ID1, ID2) pair of this chunk
        ss <- DT[, sum(sharing), by="ID1,ID2"]
        chunk <- chunk + 1L
        cat("Processing chunk", chunk, "... ")
        idd <- as.matrix(subset(ss, select=1:2))
        newvec <- as.vector(as.matrix(subset(ss, select=3)))
        # add this chunk's pair sums to the running totals
        ans[idd] <- ans[idd] + newvec
        cat("OK\n")
      }
      ans
    }

On Wed, Feb 22, 2012 at 3:20 PM, ilai ke...@math.montana.edu wrote:
> How about just using foreach earlier in the process? e.g. split file.loc.X
> to (80) sub files and then run read.big.matrix/bigsplit/sum inside %dopar%

--
Matthew C Keller
Asst. Professor of Psychology
University of Colorado at Boulder
www.matthewckeller.com
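For readers without data.table installed, the chunk-and-accumulate pattern can be sketched in base R. This is only an illustration on a tiny made-up file, not the code used above: the function name sum_pairs_chunked, the file contents and the named-vector accumulator are all assumptions, and a named vector would be far too slow for millions of pairs, where a matrix indexed by (ID1, ID2) as in the posted code is the more realistic choice.

```r
# Tiny made-up input file in the same four-column layout
tmp <- tempfile(fileext = ".txt")
writeLines(c("1 13 58 1.12", "6 142 56 1.11", "1 13 60 0.50",
             "18 307 64 3.13", "6 142 57 2.00", "1 13 61 0.25"), tmp)

sum_pairs_chunked <- function(filename, chunk_rows) {
  con <- file(filename, open = "r")
  on.exit(close(con))
  totals <- numeric(0)  # named vector: "ID1 ID2" -> running sum
  repeat {
    lines <- readLines(con, n = chunk_rows)
    if (length(lines) == 0L) break  # end of file reached
    chunk <- read.table(text = lines,
                        col.names = c("ID1", "ID2", "ignored", "sharing"))
    # Sum the sharing values within each pair for this chunk only
    key  <- paste(chunk$ID1, chunk$ID2)
    part <- sapply(split(chunk$sharing, key), sum)
    # Fold the chunk's sums into the running totals
    new_keys <- setdiff(names(part), names(totals))
    totals[new_keys] <- 0
    totals[names(part)] <- totals[names(part)] + part
  }
  totals
}

pair_sums <- sum_pairs_chunked(tmp, chunk_rows = 2)
print(pair_sums)
```

Because sums are additive, the per-chunk totals can be combined in any order, which is what makes reading the file in pieces safe here.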
Re: [R] tapply for enormous (2^31 row) matrices
On Tue, Feb 21, 2012 at 4:04 PM, Matthew Keller mckellerc...@gmail.com wrote:
>     X <- read.big.matrix(file.loc.X, sep=" ", type="double")
>     hap.indices <- bigsplit(X, 1:2)  # this runs for too long to be useful on these matrices
>     # I was then going to use foreach loop to sum across the splits identified by bigsplit

How about just using foreach earlier in the process? e.g. split file.loc.X to (80) sub files and then run read.big.matrix/bigsplit/sum inside %dopar%

If splitting X beforehand is a problem, you could also use ?scan to read in different chunks of the file, something like (untested obviously):

    # for X a matrix 800x4
    lineind <- seq(1, 800, 100)  # first line of each chunk to read
    ReducedX <- foreach(i = 1:8) %dopar% {
      x <- scan('file.loc.X', list(double(0), double(0), double(0), double(0)),
                skip = lineind[i] - 1,  # skip the lines before this chunk
                nlines = 100)
      # ... do your thing on x (aggregate/tapply etc.)
    }

Hope this helped

Elai.
[R] tapply for enormous (2^31 row) matrices
Hi all,

SETUP: I have pairwise data on 22 chromosomes. Data matrix X for a given chromosome looks like this:

     1   13  58  1.12
     6  142  56  1.11
    18  307  64  3.13
    22  320  58  0.72

Where column 1 is person ID 1, column 2 is person ID 2, column 3 can be ignored, and column 4 is how much chromosomal sharing those two individuals have in some small portion of the chromosome. There are 9000 individual people, and therefore ~ (9000^2)/2 pairwise matches at each small location on the chromosome, so across an entire chromosome these matrices are VERY large (e.g., 3 billion rows, which is beyond the 2^31 - 1, roughly 2.1 billion, vector length limit in R). I have access to a server with 64-bit R, 1TB RAM and 80 processors.

PROBLEM: A pair of individuals (e.g., person 1 and 13 from the first row above) will show up multiple times in a given file. I want to sum column 4 across each pair of individuals. If I could bring the matrix into R, I could use tapply() to accomplish this by indexing on paste(X[,1],X[,2]), but the matrix doesn't fit into R. I have been trying to use the bigmemory and bigtabulate packages in R, but when I try to use the bigsplit function, R never completes the operation (after a day, I killed the process). In particular, I did this:

    X <- read.big.matrix(file.loc.X, sep=" ", type="double")
    hap.indices <- bigsplit(X, 1:2)  # this runs for too long to be useful on these matrices
    # I was then going to use foreach loop to sum across the splits identified by bigsplit

SO - does anyone have ideas on how to deal with this problem, i.e., how to use a tapply()-like function on an enormous matrix? This isn't necessarily a bigtabulate question (although if I screwed up using bigsplit, let me know). If another package (e.g., an SQL package) can do something like this efficiently, I'd like to hear about it and your experiences using it.

Thank you in advance,

Matt

--
Matthew C Keller
Asst. Professor of Psychology
University of Colorado at Boulder
www.matthewckeller.com
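For reference, the in-memory version of the operation described above looks roughly like this on a matrix small enough to load. The first four rows are the sample rows from the post; the fifth row is invented so that one pair appears twice.

```r
X <- matrix(c( 1,  13, 58, 1.12,
               6, 142, 56, 1.11,
              18, 307, 64, 3.13,
              22, 320, 58, 0.72,
               1,  13, 61, 0.40),   # invented repeat of pair (1, 13)
            ncol = 4, byrow = TRUE)

# One key per pair of individuals, then sum column 4 within each key
pair_key  <- paste(X[, 1], X[, 2])
pair_sums <- tapply(X[, 4], pair_key, sum)

print(pair_sums)
```

The difficulty in the thread is not this step itself but that X, the key vector and the split all have to fit in memory at once.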
[R] tapply with specific quantile value
All -

I have an example data frame

              x l.c.1
    43.38812035   085
    47.55710661   085
    47.55710661   085
    51.99211429   085
    51.99211429   095
    54.78449958   095
    54.78449958   095
    56.70201864   095
    56.70201864   105
    59.66361903   105
    61.69573564   105
    61.69573564   105
    63.77469479   115
    64.83191994   115
    64.83191994   115
    66.98222118   115
    66.98222118   125
    66.98222118   125
    66.98222118   125
    66.98222118   125

and I'd like to get the 3rd quantile by l.c.1 so I use

    tapply(x, l.c.1, quantile)

and my output includes all quantiles (i.e., 0, 25%, 50%, 75%, 100%) but I'm only interested in the 75% quantile. Is there an additional statement or function I can use to get just the quantile that I want?

Thanks for your help -

SR
Steven H. Ranney
Re: [R] tapply with specific quantile value
Tena koe Steven

The ... argument of the apply series of functions allows one to pass arguments to the called function. So:

    tapply(x, l.c.1, quantile, probs=0.75)

should work (although I haven't tested it).

HTH .

Peter Alspach

-----Original Message-----
From: r-help-boun...@r-project.org [mailto:r-help-bounces@r-project.org] On Behalf Of Steven Ranney
Sent: Friday, 25 March 2011 12:18 p.m.
To: r-help@r-project.org
Subject: [R] tapply with specific quantile value

> and my output includes all quantiles (i.e., 0, 25%, 50%, 75%, 100%) but
> I'm only interested in the 75% quantile. Is there an additional statement
> or function I can use to get just the quantile that I want?
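A quick check of the suggestion on the first eight rows of the posted data. The class codes are read here as the numbers 85 and 95, which is an assumption about how the flattened table should be interpreted. The anonymous-function form is shown alongside because it does the same job when the extra argument cannot simply be passed through "...".

```r
x <- c(43.38812035, 47.55710661, 47.55710661, 51.99211429,
       51.99211429, 54.78449958, 54.78449958, 56.70201864)
l.c.1 <- c(85, 85, 85, 85, 95, 95, 95, 95)

# Extra arguments after FUN are handed on to quantile() for every group
q75_dots <- tapply(x, l.c.1, quantile, probs = 0.75)

# Equivalent: wrap the call in an anonymous function
q75_wrap <- tapply(x, l.c.1, function(v) quantile(v, probs = 0.75))

print(q75_dots)
```

Writing quantile(probs = 0.75) in the FUN position fails because R tries to evaluate that call immediately, without any data, instead of passing the function itself.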
Re: [R] tapply with specific quantile value
Just have a look at ?quantile and the probs argument.

    tapply(x, l.c.1, quantile, probs=0.75)

Anyway, quantiles and quartiles are not the same. I guess you meant the 3rd quartile.

> and I'd like to get the 3rd quantile by l.c.1 so I use
>
>     tapply(x, l.c.1, quantile)
>
> and my output includes all quantiles (i.e., 0, 25%, 50%, 75%, 100%) but
> I'm only interested in the 75% quantile.
Re: [R] tapply with specific quantile value
Hi Steven,

See the probs argument under ?quantile. The following should be what you want:

    tapply(x, l.c.1, quantile, probs = 0.75)

HTH,
Jorge

On Thu, Mar 24, 2011 at 7:18 PM, Steven Ranney wrote:
> Is there an additional statement or function I can use to get just the
> quantile that I want?
Re: [R] tapply with specific quantile value
Worked just fine. I had been incorrectly trying

    tapply(x, l.c.1, quantile(probs=0.75))

rather than

    tapply(x, l.c.1, quantile, probs=0.75)

Thanks for the help -

SR
Steven H. Ranney

On Thu, Mar 24, 2011 at 6:03 PM, Peter Alspach peter.alsp...@plantandfood.co.nz wrote:
> The ... argument of the apply series of functions allows one to pass
> arguments to the called function. So:
>
>     tapply(x, l.c.1, quantile, probs=0.75)
>
> should work (although I haven't tested it).
Re: [R] tapply output as a dataframe
On Mon, Apr 13, 2009 at 12:41 PM, Dan Dube ddube-at-advisen.com wrote:
> i use tapply and by often, but i always end up banging my head against the
> wall with the output.

The proposed solution to Dan's problem posted on R-help was:

    do.call(rbind, a)

When I use this 'solution' I get 'ERROR: second argument must be a list'. So head on wall continues.

My tapply output is generated as follows:

    a = tapply(value, list(sampling.date, station.code), mean)

which gives me this (in part):

                   A     B     C     D     E     F     G     H     I     J      K
    1/15/2008  0.004 0.027 0.019 0.015 0.035 0.022 0.007 0.038 0.042 0.045 0.0350
    1/15/2009  0.027 0.027 0.031 0.015 0.008 0.021 0.007 0.027 0.026 0.029 0.0210
    1/15/2010  0.016 0.020 0.015 0.022 0.015 0.013 0.007 0.014 0.019 0.019 0.0180
    10/15/2007 0.052 0.051 0.032 0.024 0.017 0.044 0.015 0.058 0.063 0.061 0.0640
    10/15/2008 0.042 0.054 0.030 0.017 0.024 0.030 0.019 0.044 0.047 0.051 0.0390
    10/15/2009 0.047 0.035 0.031 0.020 0.012 0.039 0.019 0.051 0.055 0.054 0.0350

The only way I can figure out how to resolve this, such that I can, for example, plot station A against date, is to export the tapply output as a csv and then reimport it. Suggestions? I couldn't find a solution to this likely SIMPLE problem in Crawley or in multiple searches of R help.

Gregory A. Graves, Lead Scientist
Everglades REstoration COordination and VERification (RECOVER)
Wetland Watershed Sciences / Restoration Sciences Department
South Florida Water Management District
Phones: DESK: 561 / 682 - 2429 CELL: 561 / 719 - 8157
Re: [R] tapply output as a dataframe
Try

    as.data.frame(as.table(a))

- Phil Spector
  Statistical Computing Facility
  Department of Statistics
  UC Berkeley
  spec...@stat.berkeley.edu

On Thu, 3 Feb 2011, Graves, Gregory wrote:
> My tapply output is generated as follows:
> a=tapply(value,list(sampling.date,station.code),mean)
> [...]
> The only way I can figure out how to resolve this, such that I can, for
> example, plot station A against date, is to export the tapply output as
> a csv, and then reimport. Suggestions?
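For anyone reading this in the archive, here is a minimal sketch of how that suggestion could be carried through to a plot of one station against date. The data below are a made-up stand-in for the poster's value, sampling.date and station.code vectors (the real ones were never posted), and the column names assigned to the result are my own choice, not something the thread specifies.

```r
# Hypothetical stand-in data: the real vectors were not posted to the list.
value         <- c(0.004, 0.006, 0.027, 0.052, 0.051)
sampling.date <- c("1/15/2008", "1/15/2008", "1/15/2008",
                   "10/15/2007", "10/15/2007")
station.code  <- c("A", "A", "B", "A", "B")

# Same call as in the question: a dates-by-stations matrix of means.
a <- tapply(value, list(sampling.date, station.code), mean)

# as.table() + as.data.frame() turns the matrix into long format;
# the default column names are Var1, Var2 and Freq, renamed here.
long <- as.data.frame(as.table(a), stringsAsFactors = FALSE)
names(long) <- c("sampling.date", "station.code", "value")

# The dates are now an ordinary column, so they can be converted.
long$rdate <- as.Date(long$sampling.date, format = "%m/%d/%Y")

# Pick one station and put it in time order before plotting.
stationA <- long[long$station.code == "A", ]
stationA <- stationA[order(stationA$rdate), ]
print(stationA)
# plot(value ~ rdate, data = stationA, type = "b")
```

The long layout is what makes the date accessible by name; the wide tapply matrix only carries it as row names.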
Re: [R] tapply output as a dataframe
On Feb 3, 2011, at 11:29 AM, Graves, Gregory wrote:
> On Mon, Apr 13, 2009 at 12:41 PM, Dan Dube ddube-at-advisen.com wrote:

That is pushing two years ago, so I doubt very many people still have that posting on their mail-clients. (When I did go to the archives, Dan Dube's problem was posed as how to bind a:

    dt = data.frame(bucket=rep(1:4,25),val=rnorm(100))
    fn = function(x) {
      ret = c(unname(quantile(x,probs=seq(.25,.75,.25),na.rm=T)),mean(x,na.rm=T))
    }
    a = tapply(dt$val,dt$bucket,fn)
)

> i use tapply and by often, but i always end up banging my head against
> the wall with the output.
>
> The proposed solution of Dan's problem posted on R-help was:
> do.call(rbind,a)
> When I use this 'solution' I get 'ERROR: second argument must be a
> list'. So head on wall continues. My tapply output is generated as
> follows:
> a=tapply(value,list(sampling.date,station.code),mean)

Why not give us sampling.date (which is probably NOT really a date but rather a character vector) and station.code so we can show you how to create a more appropriate structure?

> which gives me this (in part): [table snipped]
> The only way I can figure out how to resolve this, such that I can, for
> example, plot station A against date, is to export the tapply output as
> a csv, and then reimport. Suggestions? I couldn't find a solution to
> this likely SIMPLE problem

Perhaps, but we haven't really been told what the problem is, have we?

> in Crawley or multiple searches of R help.

David Winsemius, MD
West Hartford, CT
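To make the difference between the two cases concrete, here is a small sketch of why do.call(rbind, a) can work for Dan Dube's object and fail for a matrix of means. The seed and the second grouping variable are arbitrary choices of mine, used only to get a two-way table; they are not from either poster's data.

```r
set.seed(1)  # arbitrary seed so the sketch is repeatable
dt <- data.frame(bucket = rep(1:4, 25), val = rnorm(100))
fn <- function(x) {
  c(unname(quantile(x, probs = seq(.25, .75, .25), na.rm = TRUE)),
    mean(x, na.rm = TRUE))
}

# fn returns four numbers per group, so tapply cannot simplify and
# hands back a list-mode array: do.call(rbind, ...) accepts it.
a_list <- tapply(dt$val, dt$bucket, fn)
is.list(a_list)
bound <- do.call(rbind, a_list)   # one row per bucket
dim(bound)

# mean returns one number per cell, so tapply simplifies to a plain
# numeric matrix, which is not a list: do.call() refuses it.
a_mat <- tapply(dt$val, list(dt$bucket, dt$bucket > 2), mean)
is.list(a_mat)
res <- try(do.call(rbind, a_mat), silent = TRUE)
inherits(res, "try-error")
```

So the error message in the question is what one would expect when the summary function returns a single value per cell.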
Re: [R] tapply output as a dataframe
Yes, as far as I can tell, sampling.date is a character vector of the format 1/15/2008. It resides in the leftmost column of the tapply output. station.code are the A, B, C column headers, which refer to actual water quality station locations, and the values below those headers correspond to the sampling.date when samples were taken.

Actually what I have done is to take the mid-point of each month and calculate its mean to deal with multiple samples taken in one month, and to generate NAs where no sample was taken by purposefully not adding na.rm=T to the tapply command.

Normally I would do this:

    rdate <- as.POSIXct(strptime(date, format="%m/%d/%Y"))  # convert sampling.date to a date R can handle
    plot(A~rdate)

If I just submit station.code like A I get all the values for Station A. It is converting the sampling.date to an rdate that has me stumped. One reason is that in the tapply output the character vector representing date has no column name. I can't access that column.

Gregory A. Graves, Lead Scientist

-----Original Message-----
From: David Winsemius [mailto:dwinsem...@comcast.net]
Sent: Thursday, February 03, 2011 12:50 PM
To: Graves, Gregory
Cc: r-help@r-project.org; Goodman, Patricia; Gorman, Patricia
Subject: Re: [R] tapply output as a dataframe

> Why not give us sampling.date (which is probably NOT really a date but
> rather a character vector) and station.code so we can show you how to
> create a more appropriate structure?
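One possible way around the "no column name" problem, sketched under the assumption that the tapply result is an ordinary matrix whose row names are the date strings. The small matrix below is a hypothetical stand-in built from a few of the values shown in the thread.

```r
# Hypothetical stand-in for the tapply output (dates as row names).
a <- matrix(c(0.004, 0.052, 0.027, 0.051), nrow = 2,
            dimnames = list(c("1/15/2008", "10/15/2007"), c("A", "B")))

# The "leftmost column" is not a column at all: it is rownames(a).
rdate <- as.Date(rownames(a), format = "%m/%d/%Y")

# The rows are in character order, not time order, so sort first.
ord      <- order(rdate)
stationA <- a[ord, "A"]
rdate    <- rdate[ord]

data.frame(rdate, stationA, row.names = NULL)
# plot(rdate, stationA, type = "b")
```

This keeps the wide layout and avoids the round trip through a csv file.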
Re: [R] tapply output as a dataframe
On Feb 3, 2011, at 1:05 PM, Graves, Gregory wrote:
> It is converting the sampling.date to an rdate that has me stumped.
> One reason is that in the tapply output the character vector
> representing date has no column name. I can't access that column.

It looks like a zoo object. zoo objects hold their time values in the rownames attribute. But since it's not really ordered properly, it may just be a table with rownames. The str() function applied to the object from tapply would tell you the answer.

-- 
David Winsemius, MD
West Hartford, CT
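As a rough illustration of the str() suggestion, the sketch below builds a tiny tapply result from made-up values (hypothetical, since the real data were not posted) and inspects it. It only uses base R; inherits() does not need the zoo package to be installed.

```r
# Hypothetical miniature of the poster's call.
a <- tapply(c(0.004, 0.027, 0.052, 0.051),
            list(c("1/15/2008", "1/15/2008", "10/15/2007", "10/15/2007"),
                 c("A", "B", "A", "B")),
            mean)

str(a)                 # numeric matrix with a dimnames attribute
is.matrix(a)           # a plain matrix ...
inherits(a, "zoo")     # ... and not a zoo object
dimnames(a)[[1]]       # the date strings live in the row names
```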
Re: [R] tapply output as a dataframe
On Thu, Feb 3, 2011 at 1:11 PM, David Winsemius dwinsem...@comcast.net wrote:
> It looks like a zoo object. zoo objects hold their time values in the
> rownames attribute. But since its not really ordered properly, it may
> just be a table with rownames. The str() function applied to the object
> from tapply would tell you the answer.

Internally zoo objects hold their time index in the "index" attribute.

    library(zoo)
    dput(zoo(4:5))
    structure(4:5, index = 1:2, class = "zoo")

-- 
Statistics Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
Re: [R] tapply output as a dataframe
This works. Thanks.

Gregory A. Graves, Lead Scientist

-----Original Message-----
From: Phil Spector [mailto:spec...@stat.berkeley.edu]
Sent: Thursday, February 03, 2011 12:41 PM
To: Graves, Gregory
Cc: r-help@r-project.org; Goodman, Patricia; Gorman, Patricia
Subject: Re: [R] tapply output as a dataframe

> Try as.data.frame(as.table(a))
>
> - Phil Spector
Re: [R] tapply output
On 2010-10-06 13:24, Erik Iverson wrote:
> Hello,
>
> You can use ddply from the very useful plyr package to do this. There
> must be a way using base R functions, but plyr is worth looking into in
> my opinion.
>
> install.packages("plyr")
> library(plyr)
> ddply(myData, .(class, group, name), function(x) mean(x$height))
>
>   class group name   V1
> 1     0     A  Tom 62.5
> 2     0     B Jane 58.5
> 3     1     A Enzo 66.5
> 4     1     B Mary 70.5

Or use summarize:

    ddply(myData, .(class, group, name), summarize, mht = mean(height))

-Peter Ehlers
Re: [R] tapply output
You can also use sqldf:

    require(sqldf)
    sqldf("select class, `group`, name, avg(height)
           from myData
           group by class, `group`, name")

      class group name avg(height)
    1     0     B Jane        58.5
    2     0     A  Tom        62.5
    3     1     A Enzo        66.5
    4     1     B Mary        70.5

On Thu, Oct 7, 2010 at 4:49 AM, Peter Ehlers ehl...@ucalgary.ca wrote:
> Or use summarize:
> ddply(myData, .(class, group, name), summarize, mht = mean(height))

-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?
[R] tapply output
Hello,

I am having trouble getting the output from the tapply function formatted so that it can be made into a nice table. Below is my question written in R code. Does anyone have any suggestions? Thank you. Geoff

    #Input the data;
    name <- c('Tom', 'Tom', 'Jane', 'Jane', 'Enzo', 'Enzo', 'Mary', 'Mary');
    year <- c(2008, 2009, 2008, 2009, 2008, 2009, 2008, 2009);
    group <- c('A', 'A', 'B', 'B', 'A', 'A', 'B', 'B');
    class <- c(0, 0, 0, 0, 1, 1, 1, 1);
    height <- c(62, 63, 59, 58, 67, 66, 70, 71);

    #Combine the data into a data frame;
    myData <- data.frame(name, year, group, class, height);
    myData;

    #Calculate the mean of height by class, group, and name;
    tapply(myData$height, data.frame(myData$class, myData$group, myData$name), mean);

    #The raw output from the tapply function is fine, but I would;
    #really like the output to look like this;
    # class group name mean
    #     0     A  Tom 62.5
    #     0     B Jane 58.5
    #     1     A Enzo 66.5
    #     1     B Mary 70.5

-- 
Geoffrey Smith
Visiting Assistant Professor
Department of Finance
W. P. Carey School of Business
Arizona State University
Re: [R] tapply output
Try this:

    aggregate(height ~ class + group + name, data = myData, FUN = mean)

On Wed, Oct 6, 2010 at 4:13 PM, Geoffrey Smith g...@asu.edu wrote:
> I am having trouble getting the output from the tapply function
> formatted so that it can be made into a nice table. Below is my
> question written in R code. Does anyone have any suggestions?
> [...]

-- 
Henrique Dallazuanna
Curitiba-Paraná-Brasil
25° 25' 40" S 49° 16' 22" O
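For completeness, a self-contained sketch of that call run on the data from the original question. Renaming the result column to "mean" and sorting by class and group are my own additions, made only to match the layout the poster asked for.

```r
# Data as given in the original question.
myData <- data.frame(
  name   = c('Tom', 'Tom', 'Jane', 'Jane', 'Enzo', 'Enzo', 'Mary', 'Mary'),
  year   = c(2008, 2009, 2008, 2009, 2008, 2009, 2008, 2009),
  group  = c('A', 'A', 'B', 'B', 'A', 'A', 'B', 'B'),
  class  = c(0, 0, 0, 0, 1, 1, 1, 1),
  height = c(62, 63, 59, 58, 67, 66, 70, 71)
)

# The formula interface returns one row per observed combination,
# so there are no NA rows to filter out afterwards.
res <- aggregate(height ~ class + group + name, data = myData, FUN = mean)

# Optional tidying (not part of the posted answer).
names(res)[names(res) == "height"] <- "mean"
res <- res[order(res$class, res$group), ]
print(res, row.names = FALSE)
```

Unlike tapply, aggregate only reports combinations that actually occur in the data, which is why it gives four rows rather than a 2 x 2 x 4 array.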
Re: [R] tapply output
Hello,

You can use ddply from the very useful plyr package to do this. There must be a way using base R functions, but plyr is worth looking into in my opinion.

    install.packages("plyr")
    library(plyr)
    ddply(myData, .(class, group, name), function(x) mean(x$height))

      class group name   V1
    1     0     A  Tom 62.5
    2     0     B Jane 58.5
    3     1     A Enzo 66.5
    4     1     B Mary 70.5

Geoffrey Smith wrote:
> I am having trouble getting the output from the tapply function
> formatted so that it can be made into a nice table. Below is my
> question written in R code. Does anyone have any suggestions?
> [...]
Re: [R] tapply output
Geoffrey -
   The output you want is exactly what the aggregate() function provides:

    aggregate(myData$height, myData[c('class','group','name')],mean)
      class group name    x
    1     1     A Enzo 66.5
    2     0     B Jane 58.5
    3     1     B Mary 70.5
    4     0     A  Tom 62.5

It should be mentioned that converting tapply's output to this form isn't too difficult:

    tt = tapply(myData$height, data.frame(myData$class, myData$group, myData$name), mean)
    answer = as.data.frame(as.table(tt))
    subset(answer,!is.na(Freq))
       myData.class myData.group myData.name Freq
    2             1            A        Enzo 66.5
    7             0            B        Jane 58.5
    12            1            B        Mary 70.5
    13            0            A         Tom 62.5

- Phil Spector
  Statistical Computing Facility
  Department of Statistics
  UC Berkeley
  spec...@stat.berkeley.edu

On Wed, 6 Oct 2010, Geoffrey Smith wrote:
> I am having trouble getting the output from the tapply function
> formatted so that it can be made into a nice table. Below is my
> question written in R code. Does anyone have any suggestions?
> [...]
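A small variation on the tapply conversion above, offered as a sketch rather than as what was posted: naming the elements of the INDEX list gives cleaner column names, and the responseName argument of as.data.frame() for tables replaces the default "Freq". The data are those from the original question.

```r
myData <- data.frame(
  name   = c('Tom', 'Tom', 'Jane', 'Jane', 'Enzo', 'Enzo', 'Mary', 'Mary'),
  year   = c(2008, 2009, 2008, 2009, 2008, 2009, 2008, 2009),
  group  = c('A', 'A', 'B', 'B', 'A', 'A', 'B', 'B'),
  class  = c(0, 0, 0, 0, 1, 1, 1, 1),
  height = c(62, 63, 59, 58, 67, 66, 70, 71)
)

# A named INDEX list carries its names into the dimnames of the result.
tt <- with(myData,
           tapply(height, list(class = class, group = group, name = name), mean))

# responseName sets the name of the value column (default "Freq").
answer <- as.data.frame(as.table(tt), responseName = "mean",
                        stringsAsFactors = FALSE)

# Drop the combinations that never occur in the data.
answer <- answer[!is.na(answer$mean), ]
rownames(answer) <- NULL
print(answer)
```

Note that class comes back as character here, because it has passed through the dimnames of the array.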
Re: [R] tapply help
That was very clever. Worked perfectly, thanks! And thanks to everyone else who provided feedback.

On Jun 5, 2010, at 5:46 AM, jim holtman wrote:
> Is this what you are looking for:
> [split the data by class, then use sapply over the class names to
> compute the percentage within each named range; code snipped]
Re: [R] tapply help
Is this what you are looking for:

    set.seed(1)
    # create range for each possible class
    # 'name' the values so you can use them in the 'sapply' function
    lows <- c(a=1, b=2, c=3, d=4, e=5)
    highs <- c(a=5, b=6, c=7, d=8, e=9)
    # data values
    vals <- sample(1:10,100,replace=T)
    #classes
    classes <- sample(letters[1:5],100,replace=T)
    # split the data so that you retain the 'classes' name
    x.split <- split(vals, classes)
    percentage <- sapply(names(x.split), function(.class){
        # compute the percentage based on 'class'
        sum((x.split[[.class]] >= lows[.class]) &
            (x.split[[.class]] <= highs[.class])) / length(x.split[[.class]]) * 100
    })
    percentage
           a        b        c        d        e
        50.0     45.0     62.5 54.54545     55.6

On Fri, Jun 4, 2010 at 4:02 PM, Mark Ebbert mark.ebb...@hci.utah.edu wrote:
> My issue is trying to make use of two other vectors within the custom
> function I've written for tapply. The two other vectors are a high and
> low value for each subset I am breaking my data into, and I want to
> calculate the percentage of data points that fall into each respective
> range.
> [...]

-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?
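Since the original question asked for a tapply solution, here is one possible alternative, offered as a sketch and not as what was posted: index the named lows and highs by the class of each observation, so the range test is done once on the whole vector, and let tapply average the resulting logical vector. The names in_range and check are mine.

```r
set.seed(1)
lows    <- c(a = 1, b = 2, c = 3, d = 4, e = 5)
highs   <- c(a = 5, b = 6, c = 7, d = 8, e = 9)
vals    <- sample(1:10, 100, replace = TRUE)
classes <- sample(letters[1:5], 100, replace = TRUE)

# lows[classes] looks up each observation's own lower bound by name,
# so no index counter is needed inside the function.
in_range <- vals >= lows[classes] & vals <= highs[classes]

# The mean of a logical vector is the proportion of TRUEs.
percentages <- tapply(in_range, classes, mean) * 100
percentages

# Cross-check against the split/sapply approach from the reply above.
x.split <- split(vals, classes)
check <- sapply(names(x.split), function(.class) {
  mean(x.split[[.class]] >= lows[.class] &
       x.split[[.class]] <= highs[.class]) * 100
})
all.equal(as.vector(percentages), unname(check))
```

Both versions rely on lows and highs being named by class; with unnamed vectors the lookup by class name would return NA.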
[R] tapply help
Dear R gurus,

I am trying to perform what I believe will be a pretty simple task, but I'm struggling to figure out how to do it. I have two vectors of the same length: the first is numeric and the second is a factor. I understand that tapply is perfect for applying a function to the numeric vector by subsets of the factors in the second vector. My issue is trying to make use of two other vectors within the custom function I've written for tapply. The two other vectors are a high and low value for each subset I am breaking my data into, and I want to calculate the percentage of data points that fall into each respective range. I will attempt to provide a coherent example:

    # create range for each possible class
    lows <- c(1,2,3,4,5)
    highs <- c(5,6,7,8,9)
    # data values
    vals <- sample(1:10,100,replace=T)
    #classes
    classes <- sample(letters[1:5],100,replace=T)
    # Try to calculate percentage of values that fall
    # into the respective range for the given class.
    percentages <- tapply(vals,classes, function(i){
        length(i[i>=lows[index] & i<=highs[index]])/length(i)
        # I don't know how to actually keep an index count in tapply,
        # but I'm guessing there's a better way.
    })

I really appreciate any help.

ME
Re: [R] tapply function with NA
See ?colSums

On Mon, May 10, 2010 at 12:44 AM, vincent.deluard vincent.delu...@trimtabs.com wrote:
> Hi R users,
>
> I have a matrix m of the type:
>
> m
>        X4.20.2010 X4.19.2010   X4.16.2010
> [1,]  0.008319468 0.         -0.008250825
> [2,]  0.005574136 0.01816118  0.073081608
> [3,] -0.047830688 0.01612903 -0.030239833
> [4,]           NA         NA           NA
> [5,]  0.008746356 0.02848576 -0.025566107
> [6,] -0.007990868 0.         -0.02667
>
> I want to get the sum of each column. Normally I would do
> apply(m,2,sum), but I get:
>
> apply(m,2,sum)
> X4.20.2010 X4.19.2010 X4.16.2010
>         NA         NA         NA
>
> This is because of the presence of NA in m. How do you do the equivalent of
> sum(m[1:6,1],na.rm=TRUE) using apply?
>
> Many thanks!
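A minimal sketch of what the pointer to ?colSums amounts to. The matrix below uses made-up toy values rather than the poster's data; the names with_na and totals are illustrative only.

```r
# Toy matrix (made-up values), filled column by column, with some NAs.
m <- matrix(c(1, 2, NA, 4,
              5, NA, NA, 8,
              9, 10, NA, 12),
            nrow = 4,
            dimnames = list(NULL, c("X1", "X2", "X3")))

# Without na.rm, any NA in a column makes that column's sum NA.
with_na <- colSums(m)

# With na.rm = TRUE the NAs are dropped before summing.
totals <- colSums(m, na.rm = TRUE)
totals
```

colSums() is also typically faster than apply(m, 2, sum) on large matrices, which is one reason it is suggested for this task.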
[R] tapply function with NA
Hi R users,

I have a matrix m of the type:

m
       X4.20.2010 X4.19.2010   X4.16.2010
[1,]  0.008319468 0.         -0.008250825
[2,]  0.005574136 0.01816118  0.073081608
[3,] -0.047830688 0.01612903 -0.030239833
[4,]           NA         NA           NA
[5,]  0.008746356 0.02848576 -0.025566107
[6,] -0.007990868 0.         -0.02667

I want to get the sum of each column. Normally I would do apply(m,2,sum), but I get:

apply(m,2,sum)
X4.20.2010 X4.19.2010 X4.16.2010
        NA         NA         NA

This is because of the presence of NA in m. How do you do the equivalent of sum(m[1:6,1],na.rm=TRUE) using apply?

Many thanks!

--
View this message in context: http://r.789695.n4.nabble.com/tapply-function-with-NA-tp2164930p2164930.html
Sent from the R help mailing list archive at Nabble.com.
Re: [R] tapply function with NA
It is exactly the same:

    tmp <- matrix(1:24, 6, 4)
    tmp[4, ] <- NA
    tmp
    apply(tmp, 2, sum, na.rm=TRUE)
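The reason this works is that apply() forwards any arguments after the function to that function, so na.rm=TRUE ends up in sum(). The sketch below repeats the example and shows that tapply() forwards extra arguments the same way; the grouping vector g is a hypothetical addition for illustration.

```r
tmp <- matrix(1:24, 6, 4)
tmp[4, ] <- NA

# Arguments after FUN are passed on to FUN, so na.rm reaches sum().
col_sums <- apply(tmp, 2, sum, na.rm = TRUE)

# tapply() forwards extra arguments in the same way.
g <- rep(c("x", "y"), each = 3)          # hypothetical grouping of the 6 rows
by_group <- tapply(tmp[, 1], g, sum, na.rm = TRUE)

col_sums
by_group
```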
Re: [R] Tapply.
Hi

steven mosher mosherste...@gmail.com wrote on 27.04.2010 17:04:04:

> Thanks, I had been wondering what Drop did. That makes it more clear.
>
> While I have code that loops and does the problem correctly, I wanted to do
> things the R way and be fast and terse. hehe. So:
>
> ID dy jan ...
> 11264402000 1 1987 NA NA NA NA NA 218 NA NA 235 243 240 NA
> 11264402000 3 1987 NA NA NA NA NA 218 NA NA 235 243 240 NA
>
> In words: for each id, for each year, return
>   the max of jan, feb, ... over d
>   the min of jan, feb, ... over d
>   the mean of jan, feb, ... over d
>   the (max+min)/2 of jan, feb, ... over d
>   the count of d for jan, feb, ...
>   the results of a function called with all elements of this id

Something like

    aggregate(data[, months], list(id, d), my.summary)

where my.summary is a function computing all required values and returning them in an appropriate form.

In words: split the selected data into chunks according to the list of indices, apply the required function to each chunk and return the result.

Regards
Petr

> Anyway, your kind attention has been greatly appreciated.
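A minimal sketch of the aggregate() idea with a custom summary function. Everything here is an assumption made for illustration: the toy data are made up, the body of my.summary is one possible reading of the statistics requested in the thread, and the grouping is by id and year.

```r
# Toy data in the same shape as the thread's table (values are made up).
dat <- data.frame(
  id   = c(1, 1, 1, 1),
  year = c(1987, 1987, 1988, 1988),
  d    = c(1, 3, 0, 1),
  jan  = c(NA, 241, 238, 236),
  feb  = c(218, 220, NA, NA)
)

# Hypothetical summary: several statistics for one month column of one chunk.
my.summary <- function(x) {
  ok <- x[!is.na(x)]
  if (length(ok) == 0) {
    # No usable values in this chunk: report NA and a count of zero.
    return(c(max = NA, min = NA, mean = NA, mid = NA, n = 0))
  }
  c(max = max(ok), min = min(ok), mean = mean(ok),
    mid = (max(ok) + min(ok)) / 2, n = length(ok))
}

# One row per (id, year); each month becomes a matrix of statistics.
res <- aggregate(dat[, c("jan", "feb")],
                 by = list(id = dat$id, year = dat$year),
                 FUN = my.summary)
res
```

Because my.summary returns a named vector, res$jan and res$feb are matrices whose columns are the individual statistics, for example res$jan[, "max"].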
Re: [R] Tapply.
Thanks Dennis. Is there a book on R you could recommend?

On Mon, Apr 26, 2010 at 7:12 PM, Dennis Murphy djmu...@gmail.com wrote:
> Hi:
>
> On Mon, Apr 26, 2010 at 8:01 AM, steven mosher mosherste...@gmail.com wrote:
> > Thanks, I was trying to stick with the base package and figure out how the
> > base routines worked.
>
> If you want to use base functions, then here's a solution with aggregate
> (the Id column was removed first):
>
>     with(DF, aggregate(DF[, -2], list(Year = Year), FUN = mean, na.rm = TRUE))
>       Year    D Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
>     1 1980 1.00 NaN NaN NaN NaN NaN 212 203 209 228 237 NaN NaN
>     2 1981 0.50 NaN 251 243 246 241 NaN NaN NaN 230 NaN 231 245
>     3 1982 0.50 236 237 242 240 242 205 199 NaN NaN NaN NaN NaN
>     4 1983 0.50 NaN 247 NaN NaN NaN NaN NaN 205 NaN 225 NaN NaN
>     5 1986 0.00 NaN NaN NaN 240 NaN NaN NaN 213 NaN NaN NaN NaN
>     6 1987 1.33 241 NaN NaN NaN NaN 218 NaN NaN 235 243 240 NaN
>     7 1988 1.33 238 246 249 246 244 213 212 224 232 238 232 230
>     8 1989 1.33 232 233 238 239 231 NaN 215 NaN NaN NaN NaN 238
>
> The problem with tapply() is that the function has to be called separately
> on each column you want to summarize. You could do it in a loop:
>
>     res <- matrix(NA, 8, 14)
>     res[, 1] <- unique(DF$Year)
>     res[, 2] <- with(DF, tapply(D, Year, mean, na.rm = TRUE))
>     for(j in 3:14) res[, j] <- tapply(DF[, j], DF$Year, mean, na.rm = TRUE)
>     res
>          [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
>     [1,] 1980 1.00  NaN  NaN  NaN  NaN  NaN  212  203   209   228   237   NaN
>     [2,] 1981 0.50  NaN  251  243  246  241  NaN  NaN   NaN   230   NaN   231
>     [3,] 1982 0.50  236  237  242  240  242  205  199   NaN   NaN   NaN   NaN
>     [4,] 1983 0.50  NaN  247  NaN  NaN  NaN  NaN  NaN   205   NaN   225   NaN
>     [5,] 1986 0.00  NaN  NaN  NaN  240  NaN  NaN  NaN   213   NaN   NaN   NaN
>     [6,] 1987 1.33  241  NaN  NaN  NaN  NaN  218  NaN   NaN   235   243   240
>     [7,] 1988 1.33  238  246  249  246  244  213  212   224   232   238   232
>     [8,] 1989 1.33  232  233  238  239  231  NaN  215   NaN   NaN   NaN   NaN
>          [,14]
>     [1,]   NaN
>     [2,]   245
>     [3,]   NaN
>     [4,]   NaN
>     [5,]   NaN
>     [6,]   NaN
>     [7,]   230
>     [8,]   238
>
> but it's not the most efficient way to do things. Essentially, this approach
> conforms to the 'split-apply-combine' strategy, which is more efficiently
> implemented in functions like aggregate() or in packages such as doBy, plyr,
> reshape and data.table, some of which were mentioned earlier by Petr Pikal.
>
> HTH,
> Dennis
>
> On Mon, Apr 26, 2010 at 8:01 AM, steven mosher mosherste...@gmail.com wrote:
> > Thanks, I was trying to stick with the base package and figure out how the
> > base routines worked. I looked at plyr and it was very appealing. I guess
> > I'll give in and use it.
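The explicit loop over columns in Dennis's aggregate/tapply message can be condensed with sapply(), which calls tapply() once per column and binds the results. This is only a sketch on made-up toy data; DF here is a small stand-in for the thread's data frame and month_cols is a hypothetical helper name.

```r
# Toy data (made-up values) with the same layout as the thread's DF.
DF <- data.frame(
  Year = c(1980, 1981, 1981, 1982),
  Jan  = c(NA, NA, 243, 236),
  Feb  = c(212, 251, NA, 237)
)
month_cols <- c("Jan", "Feb")

# One tapply() call per column; sapply() binds the per-column results
# into a matrix with one row per year and one column per month.
res <- sapply(DF[month_cols], function(col) {
  tapply(col, DF$Year, mean, na.rm = TRUE)
})
res
```

The row names of res are the years, so there is no need to fill in a separate year column by hand as in the loop version.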
Re: [R] Tapply.
Hi

r-help-boun...@r-project.org wrote on 26.04.2010 17:05:54:

> I guess my problem was seeing a bunch of examples where they pulled a
> variable from a dataframe, e.g. tapply(df$data, index=list(..

df$data results in a vector, as does e.g. df[,5], unless you use the drop=FALSE option.

> and I assumed that the df$data was just generalizable to a collection of
> vectors, a vector of vectors being a vector.

df[,1:15] is not a vector of vectors. R can sometimes give you a nasty surprise with object types and modes, but changing the type of an object merely by selecting some part of it would be quite problematic. See

    str(df$data)
    str(df[, 1])
    str(df[, 1, drop=FALSE])
    str(df[, 1:15])

Regards
Petr

> Thanks.
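The str() comparison suggested in Petr's reply can be tried on a small data frame. The sketch below uses a made-up three-column df (not the thread's data) and logical checks in place of printed str() output.

```r
df <- data.frame(a = c(1, 2, 3, 4), b = c(2, 3, 4, 5), c = c(3, 4, 5, 6))

one_by_name  <- df$a                   # a single column via $
one_by_index <- df[, 1]                # a single column via [ , j]
one_kept     <- df[, 1, drop = FALSE]  # drop = FALSE keeps the data frame
several      <- df[, 1:2]              # several columns

# Single columns come back as plain vectors that tapply() accepts as X;
# the other two selections are still data frames (lists of columns).
is.atomic(one_by_name)
is.atomic(one_by_index)
is.data.frame(one_kept)
is.data.frame(several)
```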
Re: [R] Tapply.
Hi

If you are not satisfied with the R introductory documents that are distributed with the R installation, you can consider "Introductory Statistics with R" by P. Dalgaard for beginners, and maybe "Modern Applied Statistics with S" by W.N. Venables and B.D. Ripley, which is a bit outdated and applies perhaps a little more to S, but is still worth reading.

Regards
Petr

r-help-boun...@r-project.org wrote on 27.04.2010 10:05:25:

> Thanks dennis. Is there a book on R u could recommend.
Re: [R] Tapply.
Thanks, I had been wondering what Drop did. That makes it more clear.

While I have code that loops and does the problem correctly, I wanted to do things the R way and be fast and terse. hehe. So:

ID dy jan ...
11264402000 1 1987 NA NA NA NA NA 218 NA NA 235 243 240 NA
11264402000 3 1987 NA NA NA NA NA 218 NA NA 235 243 240 NA

In words: for each id, for each year, return
  the max of jan, feb, ... over d
  the min of jan, feb, ... over d
  the mean of jan, feb, ... over d
  the (max+min)/2 of jan, feb, ... over d
  the count of d for jan, feb, ...
  the results of a function called with all elements of this id

Anyway, your kind attention has been greatly appreciated.
Re: [R] Tapply.
I've tried both mean and colMeans. I did succeed with one attempt using mean; however, if a year has only one value and it is NA, then I get NaN (which I can replace). I'll keep trying.

On Mon, Apr 26, 2010 at 12:26 AM, Petr PIKAL petr.pi...@precheza.cz wrote:
> Hi
>
> r-help-boun...@r-project.org wrote on 26.04.2010 06:52:55:
>
> > Having some difficulties with understanding how tapply works and getting
> > return values I expect.
> >
> > Data: dataframe. DF DF$Id $D $Year...
> >
> > Id          D Year Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
> > 11264402000 1 1980  NA  NA  NA  NA  NA 212 203 209 228 237  NA  NA
> > 11264402000 0 1981  NA  NA 243 244  NA  NA  NA  NA 225  NA 231  NA
> > 11264402000 1 1981  NA 251  NA 248 241  NA  NA  NA 235  NA  NA 245
> > 11264402000 0 1982 236 237 242 240 242 205 199  NA  NA  NA  NA  NA
> > 11264402000 1 1982 236  NA  NA 240 242  NA  NA  NA  NA  NA  NA  NA
> > 11264402000 0 1983  NA 247  NA  NA  NA  NA  NA 205  NA  NA  NA  NA
> > 11264402000 1 1983  NA 247  NA  NA  NA  NA  NA  NA  NA 225  NA  NA
> > 11264402000 0 1986  NA  NA  NA 240  NA  NA  NA 213  NA  NA  NA  NA
> > 11264402000 0 1987 241  NA  NA  NA  NA 218  NA  NA 235 243 240  NA
> > 11264402000 1 1987  NA  NA  NA  NA  NA 218  NA  NA 235 243 240  NA
> > 11264402000 3 1987  NA  NA  NA  NA  NA 218  NA  NA 235 243 240  NA
> > 11264402000 0 1988 238 246 249  NA 244 213 212 224 232 238 232 230
> > 11264402000 1 1988 238 246 249 246 244 213 212 224 232  NA  NA 230
> > 11264402000 3 1988 238 246 249 246 244 213 212 224 232  NA  NA 230
> > 11264402000 0 1989 232 233 238 239 231  NA 215  NA  NA  NA  NA 238
> > 11264402000 1 1989 232 233 238 239 231  NA  NA  NA  NA  NA  NA 238
> > 11264402000 3 1989 232 233 238 239 231  NA  NA  NA  NA  NA  NA 238
> >
> > and the result should be a dataframe of column means by year with the
> > variable D dropped (or kept, doesn't matter):
> >
> > 11264402000 1    1980  NA  NA  NA  NA  NA 212 203 209 228 237  NA  NA
> > 11264402000 .5   1981  NA  NA 243 244  NA  NA  NA  NA 225  NA 231  NA
> > 11264402000 .5   1982 236 237 242 240 242 205 199  NA  NA  NA  NA  NA
> > 11264402000 .5   1983  NA 247  NA  NA  NA  NA  NA 205  NA 225  NA  NA
> > 11264402000 1    1986  NA  NA  NA 240  NA  NA  NA 213  NA  NA  NA  NA
> > 11264402000 2    1987 241  NA  NA  NA  NA 218  NA  NA 235 243 240  NA
> > 11264402000 1.33 1988 238 246 249 246 244 213 212 224 232 238 232 230
> > 11264402000 1.33 1989 232 233 238 239 231  NA 215  NA  NA  NA  NA 238
> >
> > It would seem that tapply should work:
> >
> >     result <- tapply(DF[,1:15], DF$Year, colMeans, na.rm=T)
>
> Why colMeans? It is a function used instead of apply(..., ..., mean).
> Maybe you want
>
>     result <- tapply(DF[,1:15], DF$Year, mean, na.rm=T)
>
> Regards
> Petr
>
> > but i get errors about the length of arguments, which
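The NaN mentioned at the top of this message appears whenever every value in a group is NA: with na.rm = TRUE nothing is left to average. A minimal sketch of the effect and of one way to turn NaN back into NA follows; the toy values and the names m, res and agg are assumptions for illustration.

```r
# All values missing: after na.rm = TRUE there is nothing left to average.
all_missing <- c(NA_real_, NA_real_)
m <- mean(all_missing, na.rm = TRUE)

# For a plain vector (or matrix), is.nan() can be used as an index directly.
res <- c(`1980` = m, `1981` = mean(c(243, NA), na.rm = TRUE))
res[is.nan(res)] <- NA

# is.nan() has no data frame method, so go column by column there.
agg <- data.frame(Year = c(1980, 1981), Jan = c(NaN, 243), Feb = c(212, NaN))
agg[] <- lapply(agg, function(col) replace(col, is.nan(col), NA))
agg
```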
Re: [R] Tapply.
That fails. The manual says:

    tapply(X, INDEX, FUN = NULL, ..., simplify = TRUE)

    Arguments
    X      an atomic object, typically a vector.
    INDEX  list of factors, each of same length as X. The elements are
           coerced to factors by as.factor.

My error says:

    Error in tapply(DF[, 1:15], DF$Year, mean, na.rm = T) :
      arguments must have same length

The issue that I have is that I don't understand what the requirements for the list of factors are. In my example DF$Year is a sequence of years (1979, 1980, 1982, 1983, 1987, ...) with missing years. So when the manual says "list of factors, each of same length as X", what does that mean? I could have a DF with 20 rows and only two different years, or 20 rows and 20 different years.

Suppose:

    a <- c(1,2,3,4)
    b <- c(2,3,4,5)
    df = data.frame(a,b)
    length(df)

The length of df is 2. Does that mean the list of factors, each of the same length as X, would have to be of length 2? That doesn't seem to make sense.
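The length mismatch behind the error message can be seen by comparing what length() reports for a data frame and for one of its columns. A minimal sketch, with a hypothetical grouping vector year added to the two-column example from the message:

```r
df <- data.frame(a = c(1, 2, 3, 4), b = c(2, 3, 4, 5))
year <- c(1980, 1980, 1981, 1981)   # hypothetical grouping vector, one per row

n_cols <- length(df)      # number of columns, because a data frame is a list
n_rows <- nrow(df)        # number of rows
n_vals <- length(df$a)    # a single column has one element per row

# INDEX must line up with X element by element, so X is a single column here.
means_a <- tapply(df$a, year, mean)
means_a
```

Here length(df) is 2 while the grouping vector has 4 elements, which is the mismatch the error message in the thread complains about; a single column has the same length as the grouping vector.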
Re: [R] Tapply.
Hi:

Use of ddply() in the plyr package appears to work.

    library(plyr)
    ddply(df[, -1], .(Year), colwise(mean), na.rm = TRUE)
         D Year Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
    1 1.00 1980 NaN NaN NaN NaN NaN 212 203 209 228 237 NaN NaN
    2 0.50 1981 NaN 251 243 246 241 NaN NaN NaN 230 NaN 231 245
    3 0.50 1982 236 237 242 240 242 205 199 NaN NaN NaN NaN NaN
    4 0.50 1983 NaN 247 NaN NaN NaN NaN NaN 205 NaN 225 NaN NaN
    5 0.00 1986 NaN NaN NaN 240 NaN NaN NaN 213 NaN NaN NaN NaN
    6 1.33 1987 241 NaN NaN NaN NaN 218 NaN NaN 235 243 240 NaN
    7 1.33 1988 238 246 249 246 244 213 212 224 232 238 232 230
    8 1.33 1989 232 233 238 239 231 NaN 215 NaN NaN NaN NaN 238

Replace the NaNs with NAs and that should do it.

HTH,
Dennis
Re: [R] Tapply.
Hi

steven mosher mosherste...@gmail.com wrote on 26.04.2010 10:21:37:

> That fails. The manual says:
>
>     tapply(X, INDEX, FUN = NULL, ..., simplify = TRUE)
>
>     Arguments
>     X      an atomic object, typically a vector.
>     INDEX  list of factors, each of same length as X. The elements
>            are coerced to factors by as.factor.
>
> My error says:
>
>     Error in tapply(DF[, 1:15], DF$Year, mean, na.rm = T) :
>       arguments must have same length
>
> The issue that I have is that I don't understand what the requirements
> for the list of factors are. In my example DF$Years is a sequence of
> years, 1979, 1980, 1982, 1983, 1987, like that, with missing years. So
> when the manual says "list of factors, each of same length as X", what
> does that mean? I could have a DF with 20 rows and only two different
> years, or 20 rows and 20 different years. Suppose:
>
>     a <- c(1,2,3,4)
>     b <- c(2,3,4,5)
>     df = data.frame(a,b)
>     length(df)

A data frame is neither a vector nor atomic but a list, hence length(df) gives you the number of columns. It is similar to the length of a list:

    lll <- list(a=1, b=2, c=3)
    length(lll)
    [1] 3

If you accept that the first argument of tapply has to be a vector, you cannot put a data frame there. The second argument has to be a list of factors, so you can put several factors there, each of the same length as the first argument (a vector).

If you want to perform an aggregating operation on a whole data frame, you should consider ?by or ?aggregate. Other options are the plyr or doBy packages. The syntax for aggregate is quite similar to tapply, only the first argument can be a data frame.

Regards
Petr

> The length of DF is 2. Does that mean the list of factors, each of
> same length as X, would have to be 2? That doesn't seem to make sense.
>
> On Mon, Apr 26, 2010 at 12:26 AM, Petr PIKAL petr.pi...@precheza.cz wrote:
>
> > Hi
> >
> > r-help-boun...@r-project.org wrote on 26.04.2010 06:52:55:
> >
> > > Having some difficulties with understanding how tapply works and
> > > getting return values I expect
> > > [...]
> > > It would seem that Tapply should work
> > >
> > >     result <- tapply( DF[,1:15], DF$Year, colMeans,na.rm=T)
> >
> > Why colMeans? It is a function used instead of apply(..., .., mean).
> > Maybe you want
> >
> >     result <- tapply( DF[,1:15], DF$Year, mean,na.rm=T)
> >
> > Regards
> > Petr
> >
> > > but i get errors about the length of arguments, which
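Petr's point about data frames being lists can be checked directly. The following is only a minimal sketch on made-up toy data (the data frame `df` and its values are assumptions for illustration, not the poster's DF): it shows that length() of a data frame counts columns, that tapply() is happy with a single column, and that aggregate() accepts the data frame itself.

```r
# Hypothetical toy data, not the poster's DF
df <- data.frame(Year = c(1980, 1980, 1981, 1981),
                 Jan  = c(1, 3, NA, 5),
                 Feb  = c(2, NA, 4, 6))

length(df)  # number of columns: a data frame is a list of columns
nrow(df)    # number of rows, which is what INDEX has to match

# tapply() wants a vector X and grouping factor(s) of the same length as X
jan_means <- tapply(df$Jan, df$Year, mean, na.rm = TRUE)

# aggregate() takes a data frame as its first argument and
# applies FUN to every column within each group
all_means <- aggregate(df[, c("Jan", "Feb")], by = list(Year = df$Year),
                       FUN = mean, na.rm = TRUE)
```

Here `df$Year` has one entry per row, so it has the same length as `df$Jan` but not the same length as `df`, which is why passing the whole data frame to tapply() triggers the "arguments must have same length" error.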
Re: [R] Tapply.
Thanks,

I was trying to stick with the base package and figure out how the base routines worked. I looked at plyr and it was very appealing. I guess I'll give in and use it.
Re: [R] Tapply.
I guess my problem was seeing a bunch of examples where they pulled a single variable from a data frame, e.g. tapply(df$data, index=list(.. and I assumed that df$data just generalized to a collection of vectors, a vector of vectors being a vector.

Thanks.
Re: [R] Tapply.
Hi:

On Mon, Apr 26, 2010 at 8:01 AM, steven mosher mosherste...@gmail.com wrote:

> Thanks, I was trying to stick with the base package and figure out how
> the base routines worked.

If you want to use base functions, then here's a solution with aggregate (the Id column was removed first):

    with(DF, aggregate(DF[, -2], list(Year = Year), FUN = mean, na.rm = TRUE))

      Year    D Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
    1 1980 1.00 NaN NaN NaN NaN NaN 212 203 209 228 237 NaN NaN
    2 1981 0.50 NaN 251 243 246 241 NaN NaN NaN 230 NaN 231 245
    3 1982 0.50 236 237 242 240 242 205 199 NaN NaN NaN NaN NaN
    4 1983 0.50 NaN 247 NaN NaN NaN NaN NaN 205 NaN 225 NaN NaN
    5 1986 0.00 NaN NaN NaN 240 NaN NaN NaN 213 NaN NaN NaN NaN
    6 1987 1.33 241 NaN NaN NaN NaN 218 NaN NaN 235 243 240 NaN
    7 1988 1.33 238 246 249 246 244 213 212 224 232 238 232 230
    8 1989 1.33 232 233 238 239 231 NaN 215 NaN NaN NaN NaN 238

The problem with tapply() is that the function has to be called separately on each column you want to summarize. You could do it in a loop:

    res <- matrix(NA, 8, 14)
    res[, 1] <- unique(DF$Year)
    res[, 2] <- with(DF, tapply(D, Year, mean, na.rm = TRUE))
    for(j in 3:14) res[, j] <- tapply(DF[, j], DF$Year, mean, na.rm = TRUE)
    res

         [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
    [1,] 1980 1.00  NaN  NaN  NaN  NaN  NaN  212  203   209   228   237   NaN
    [2,] 1981 0.50  NaN  251  243  246  241  NaN  NaN   NaN   230   NaN   231
    [3,] 1982 0.50  236  237  242  240  242  205  199   NaN   NaN   NaN   NaN
    [4,] 1983 0.50  NaN  247  NaN  NaN  NaN  NaN  NaN   205   NaN   225   NaN
    [5,] 1986 0.00  NaN  NaN  NaN  240  NaN  NaN  NaN   213   NaN   NaN   NaN
    [6,] 1987 1.33  241  NaN  NaN  NaN  NaN  218  NaN   NaN   235   243   240
    [7,] 1988 1.33  238  246  249  246  244  213  212   224   232   238   232
    [8,] 1989 1.33  232  233  238  239  231  NaN  215   NaN   NaN   NaN   NaN
         [,14]
    [1,]   NaN
    [2,]   245
    [3,]   NaN
    [4,]   NaN
    [5,]   NaN
    [6,]   NaN
    [7,]   230
    [8,]   238

but it's not the most efficient way to do things. Essentially, this approach conforms to the 'split-apply-combine' strategy, which is more efficiently implemented in functions like aggregate() or in packages such as doBy, plyr, reshape and data.table, some of which were mentioned earlier by Petr Pikal.

HTH,
Dennis
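For anyone who wants to stay in base R without writing the loop by hand, the per-column tapply() calls can be wrapped in sapply(). This is a minimal sketch on made-up toy data (the small `DF` and the `month_cols` vector below are assumptions for illustration, not the poster's data); it also turns the NaN results into NA, which is what the original poster wanted for groups with no observations.

```r
# Hypothetical toy data standing in for the poster's DF
DF <- data.frame(Year = c(1980, 1981, 1981, 1982),
                 Jan  = c(NA, 10, 20, 30),
                 Feb  = c(5, NA, NA, 7))

month_cols <- c("Jan", "Feb")

# One tapply() call per column; sapply() binds the results into a
# matrix with one row per year and one column per month
res <- sapply(DF[month_cols],
              function(col) tapply(col, DF$Year, mean, na.rm = TRUE))

# mean() of an all-NA group with na.rm = TRUE is NaN; report it as NA
res[is.nan(res)] <- NA
res
```

The row names of `res` are the years (as character strings), so a single cell can be read with something like `res["1981", "Jan"]`.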
[R] Tapply.
Having some difficulties with understanding how tapply works and getting return values I expect.

Data: a data frame DF with columns DF$Id, $D, $Year...

    Id          D Year Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
    11264402000 1 1980  NA  NA  NA  NA  NA 212 203 209 228 237  NA  NA
    11264402000 0 1981  NA  NA 243 244  NA  NA  NA  NA 225  NA 231  NA
    11264402000 1 1981  NA 251  NA 248 241  NA  NA  NA 235  NA  NA 245
    11264402000 0 1982 236 237 242 240 242 205 199  NA  NA  NA  NA  NA
    11264402000 1 1982 236  NA  NA 240 242  NA  NA  NA  NA  NA  NA  NA
    11264402000 0 1983  NA 247  NA  NA  NA  NA  NA 205  NA  NA  NA  NA
    11264402000 1 1983  NA 247  NA  NA  NA  NA  NA  NA  NA 225  NA  NA
    11264402000 0 1986  NA  NA  NA 240  NA  NA  NA 213  NA  NA  NA  NA
    11264402000 0 1987 241  NA  NA  NA  NA 218  NA  NA 235 243 240  NA
    11264402000 1 1987  NA  NA  NA  NA  NA 218  NA  NA 235 243 240  NA
    11264402000 3 1987  NA  NA  NA  NA  NA 218  NA  NA 235 243 240  NA
    11264402000 0 1988 238 246 249  NA 244 213 212 224 232 238 232 230
    11264402000 1 1988 238 246 249 246 244 213 212 224 232  NA  NA 230
    11264402000 3 1988 238 246 249 246 244 213 212 224 232  NA  NA 230
    11264402000 0 1989 232 233 238 239 231  NA 215  NA  NA  NA  NA 238
    11264402000 1 1989 232 233 238 239 231  NA  NA  NA  NA  NA  NA 238
    11264402000 3 1989 232 233 238 239 231  NA  NA  NA  NA  NA  NA 238

and the result should be a data frame of column means by year, with the variable D dropped (or kept, doesn't matter):

    Id             D Year Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
    11264402000    1 1980  NA  NA  NA  NA  NA 212 203 209 228 237  NA  NA
    11264402000   .5 1981  NA  NA 243 244  NA  NA  NA  NA 225  NA 231  NA
    11264402000   .5 1982 236 237 242 240 242 205 199  NA  NA  NA  NA  NA
    11264402000   .5 1983  NA 247  NA  NA  NA  NA  NA 205  NA 225  NA  NA
    11264402000    1 1986  NA  NA  NA 240  NA  NA  NA 213  NA  NA  NA  NA
    11264402000    2 1987 241  NA  NA  NA  NA 218  NA  NA 235 243 240  NA
    11264402000 1.33 1988 238 246 249 246 244 213 212 224 232 238 232 230
    11264402000 1.33 1989 232 233 238 239 231  NA 215  NA  NA  NA  NA 238

It would seem that tapply should work:

    result <- tapply( DF[,1:15], DF$Year, colMeans,na.rm=T)

but I get errors about the length of arguments, which
Re: [R] tapply syntax
What is the function set()? Is that a typo? When I type ?set I get nothing, and when I try to evaluate that code R tells me it can't find the function.
Re: [R] tapply syntax
On 29/03/2010, at 1:27 PM, Jeff Brown wrote:

> What is the function set()? Is that a typo? When I type ?set I get
> nothing, and when I try to evaluate that code R tells me it can't find
> the function.

Yeah, it's a typo. (S)he meant ``subset''.

cheers,
Rolf Turner
Re: [R] tapply syntax
The message tells you everything - there is no function 'set' in the workspace you are using. Did you forget to load a library? What is the context in which you are trying to use it?

On Sun, Mar 28, 2010 at 8:27 PM, Jeff Brown dopethatwantsc...@yahoo.com wrote:

> What is the function set()? Is that a typo? When I type ?set I get
> nothing, and when I try to evaluate that code R tells me it can't find
> the function.

--
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?
Re: [R] tapply syntax
sorry - I use many abbreviations and I try to remove them before I post questions/answers - 'set' is my abb. for subset

david

On 3/28/2010 8:27 PM, Jeff Brown [via R] wrote:

> What is the function set()? Is that a typo? When I type ?set I get
> nothing, and when I try to evaluate that code R tells me it can't find
> the function.
[R] tapply syntax
Dear R-help members,

Apologies for the trouble. I have a question:

Essentially, I have a dataset which stores genetic variations for individual patients. Each individual patient can have more than one variation, and each new record corresponds to a new variation (thus, both individual patients and variations are non-unique).

So the dataset looks something like this (letters = patients, numbers = variation type):

    Patient, Variation Type
    A, 1
    A, 2
    A, 3
    B, 1
    C, 2
    D, 2
    D, 3
    E, 2
    E, 4
    F, 4

My final desired output is a data.frame or a vector containing patients, each with the number of variations falling in a desired subset. For example, if I was only interested in variation types 2 and 3, my output would look like this:

    A, 2
    B, 0
    C, 1
    D, 2
    E, 1
    F, 0

I am trying to figure out how to use tapply to do this. It would be something like

    tapply (Variation Type, Patient, ??? )

I am not sure about the function syntax of ??? to subselect only 2,3, and have been looking at the r-help. Sorry! Essentially, I am trying to avoid awkward loops in this whole process.

Thanks very much for your advice!

Min-Han
Re: [R] tapply syntax
Hi,

I figured out a workaround to my problem, but if anyone has any advice on how to express a function in tapply to achieve the same outcome, that would be awesome and I'd learn something about functions!

The workaround was

    tapply ((data$Variation.Type %in% c(2,3)), data$Patient, sum)

Thanks.

Min-Han
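On the question of how to express the same thing as a function passed to tapply(), here is a minimal sketch. The data frame `d` is typed in from the example in the original post; the object names (`d`, `wanted`, `counts_sum`, `counts_fun`) are assumptions chosen for illustration.

```r
# Example data from the original post
d <- data.frame(
  Patient        = c("A", "A", "A", "B", "C", "D", "D", "E", "E", "F"),
  Variation.Type = c(1, 2, 3, 1, 2, 2, 3, 2, 4, 4)
)

wanted <- c(2, 3)

# The workaround: build the logical vector first, then sum it per patient
counts_sum <- tapply(d$Variation.Type %in% wanted, d$Patient, sum)

# The same thing with the test moved into an anonymous function:
# v is the vector of variation types belonging to one patient
counts_fun <- tapply(d$Variation.Type, d$Patient,
                     function(v) sum(v %in% wanted))

counts_fun
# A B C D E F
# 2 0 1 2 1 0
```

Both forms keep patients with no matching variation (B and F) in the result with a count of 0, because nothing is filtered out before grouping.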
Re: [R] tapply syntax
how about:

    d1=data.frame(pat=c(rep('a',3),'b','c',rep('d',2),rep('e',2),'f'),var=c(1,2,3,1,2,2,3,2,4,4))
    ds=set(d1,var %in% c(2,3))
    with(ds,tapply(var,pat,FUN=length))

hth,
David Freedman, CDC, Atlanta
Re: [R] tapply for function taking of 1 argument?
sjaffe wrote:

> I'm sure I can put this together from the various 'apply's and split,
> but I wonder if anyone has a quick incantation:
>
> E.g. I can do
>
>     tapply( data, groups, mean)
>
> but how can I do something like:
>
>     tapply( list(data,weights), groups, weighted.mean ) ?
>
> (or: mapply is to sapply as ? is to tapply )
>
> Thanks for your help.

    coef(lm(data ~ -1 + as.factor(groups), weights=weights))

Not the fastest, but IMO more comprehensible than the constructions involving anonymous functions.

J. R. M. Hosking
Re: [R] tapply for function taking of 1 argument?
On Feb 4, 2010, at 9:56 AM, J. R. M. Hosking wrote:

> sjaffe wrote:
>
> > [...] how can I do something like:
> >
> >     tapply( list(data,weights), groups, weighted.mean ) ?
>
>     coef(lm(data ~ -1 + as.factor(groups), weights=weights))
>
> Not the fastest, but IMO more comprehensible than the constructions
> involving anonymous functions.

Are you sure? (Am I sure?) Thomas Lumley has corrected my misinterpretations on this point (and I apologize to him for the fact that he has had to do it more than once.)

https://stat.ethz.ch/pipermail/r-help/2010-February/226536.html

I am guessing that the OP was using either sampling weights or replication weights (he did not say), so lm( , weights) might not be the appropriate tool.

--
David Winsemius, MD
Heritage Laboratories
West Hartford, CT
Re: [R] tapply for function taking of 1 argument?
Hi

r-help-boun...@r-project.org wrote on 02.02.2010 22:16:06:

> 'fraid not :-((
>
>     tapply( data, groups, weighted.mean, weights)

    tapply(seq(along=lll), rrr, function(i, x, w) weighted.mean(x[i], w[i]), x=lll, w=ttt)

If you want to subset more than one thing, subset the index vector.

I obtained the above help from Prof. Ripley several years ago, so (untested)

    tapply( seq(along=data), groups, function (i, x, w) weighted.mean(x[i], w[i]), x=data, w=weights)

I believe it should still work.

Regards
Petr

> won't work because the *entire* weights vector is passed as the 2nd arg
> to weighted.mean. But weighted.mean needs 'weights' to be split in the
> same way as 'data' -- the first and 2nd args need to correspond.
>
> Jorge Ivan Velez wrote:
>
> > Hi sjaffem,
> >
> > You were almost there:
> >
> >     tapply( yourdata, groups, weighted.mean, weights)
> >
> > See ?tapply for more information.
> >
> > HTH,
> > Jorge
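The "subset the index vector" idea can be tried on a small example. This is a minimal sketch with made-up numbers (the vectors `data`, `weights` and `groups` below are assumptions for illustration); it also shows a split() plus mapply() variant, which is one possible answer to the "mapply is to sapply as ? is to tapply" question.

```r
# Hypothetical toy data
data    <- c(1, 2, 3, 4, 5, 6)
weights <- c(1, 1, 2, 1, 3, 1)
groups  <- c("a", "b", "a", "b", "a", "b")

# Index trick: tapply() splits the positions 1..n by group, and the
# function uses those positions to subset BOTH data and weights
wm_index <- tapply(seq_along(data), groups,
                   function(i) weighted.mean(data[i], weights[i]))

# Alternative: split both vectors the same way and walk them in parallel
wm_split <- mapply(weighted.mean,
                   split(data, groups),
                   split(weights, groups))

wm_index  # group "a": (1*1 + 3*2 + 5*3) / 6 = 22/6, group "b": 4
wm_split
```

Both approaches keep each observation paired with its own weight, which is exactly what passing `weights` as an extra argument to tapply() fails to do.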
Re: [R] tapply for function taking of 1 argument?
Yes, this is clearly the key to working with subsets.

Thanks
Re: [R] tapply for function taking of >1 argument?
also,

  library(plyr)
  ddply(d,~grp,function(df) weighted.mean(df$x,df$w))
Re: [R] tapply for function taking of >1 argument?
On Wed, Feb 3, 2010 at 11:06 AM, David Freedman 3.14da...@gmail.com wrote:
> also,
>
>   library(plyr)
>   ddply(d,~grp,function(df) weighted.mean(df$x,df$w))

Or

  ddply(d, "grp", summarise, mean = weighted.mean(x, w))

which is convenient if you want more than one output.

Hadley

--
http://had.co.nz/
Re: [R] tapply for function taking of >1 argument?
Also try this:

  library(sqldf)
  DF <- data.frame(data = 1:10, groups = rep(1:2, 5), weights = 1)
  sqldf("select groups, sum(data * weights)/sum(weights) 'wtd mean' from DF group by groups")

    groups wtd mean
  1      1        5
  2      2        6

On Tue, Feb 2, 2010 at 5:06 PM, sjaffe sja...@riskspan.com wrote:
> I suppose it's obvious, but one will generally have to use a (anonymous)
> function to 'unpack' the data.frame into columns, unless the function
> already knows how to do this.
Re: [R] tapply for function taking of >1 argument?
Thanks, I'm actually more comfortable with vector-ish syntax than sql-ish but this is a good thing to keep in mind... I wonder how it compares in performance versus 'by' or 'tapply'

From: Gabor Grothendieck [via R] [mailto:ml-node+1461531-1948782...@n4.nabble.com]
Sent: Wednesday, February 03, 2010 1:19 PM
To: Steve Jaffe
Subject: Re: tapply for function taking of >1 argument?

> Also try this:
>
>   library(sqldf)
>   DF <- data.frame(data = 1:10, groups = rep(1:2, 5), weights = 1)
>   sqldf("select groups, sum(data * weights)/sum(weights) 'wtd mean' from DF group by groups")
Re: [R] tapply for function taking of >1 argument?
My editorial opinion only:

It will of necessity be slower (because there's more machinery underlying the sqldf package); but I doubt whether it would be noticeably slower than the native R solution in most practical situations. The same would be true for plyR's implementation (it relies on the proto package, which slows things down a bit).

The point is that the most important issue in almost all cases is the programmer's time to create and debug correct code, especially as the native machine speeds continue to increase. R gives you the option to choose whatever idiom you prefer to minimize this. The software implementation differences thereafter will rarely be important. In other words, pick your poison.

Cheers,
Bert

Bert Gunter
Genentech Nonclinical Biostatistics

-----Original Message-----
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of sjaffe
Sent: Wednesday, February 03, 2010 10:25 AM
To: r-help@r-project.org
Subject: Re: [R] tapply for function taking of >1 argument?

> Thanks, I'm actually more comfortable with vector-ish syntax than sql-ish
> but this is a good thing to keep in mind... I wonder how it compares in
> performance versus 'by' or 'tapply'
Re: [R] tapply for function taking of >1 argument?
> It will of necessity be slower (because there's more machinery underlying
> the sqldf package); but I doubt whether it would be noticeably slower than
> the native R solution in most practical situations. The same would be true
> for plyR's implementation (it relies on the proto package, which slows
> things down a bit).

Plyr doesn't use proto at all (that's ggplot2). Plyr is generally faster than split + lapply etc. for large datasets with many splits, but slower with smaller datasets/fewer splits.

> The point is that the most important issue in almost all cases is the
> programmer's time to create and debug correct code, especially as the
> native machine speeds continue to increase. R gives you the option to
> choose whatever idiom you prefer to minimize this. The software
> implementation differences thereafter will rarely be important.

Totally agreed! In my mind the advantage of learning plyr is that you learn one set of methods that work for lists, data frames and arrays. And because all of the functions are designed with consistency in mind, it hopefully takes less time to learn them all.

Hadley

--
http://had.co.nz/
[R] tapply for function taking of >1 argument?
I'm sure I can put this together from the various 'apply's and split, but I wonder if anyone has a quick incantation:

E.g. I can do

  tapply( data, groups, mean)

but how can I do something like:

  tapply( list(data,weights), groups, weighted.mean ) ?

(or: mapply is to sapply as ? is to tapply )

Thanks for your help.
Re: [R] tapply for function taking of >1 argument?
Hi sjaffem,

You were almost there:

  tapply( yourdata, groups, weighted.mean, weights)

See ?tapply for more information.

HTH,
Jorge

On Tue, Feb 2, 2010 at 3:58 PM, sjaffe wrote:
> E.g. I can do
>   tapply( data, groups, mean)
> but how can I do something like:
>   tapply( list(data,weights), groups, weighted.mean ) ?
Re: [R] tapply for function taking of >1 argument?
'fraid not :-((

  tapply( data, groups, weighted.mean, weights)

won't work because the *entire* weights vector is passed as the 2nd arg to weighted.mean. But weighted.mean needs 'weights' to be split in the same way as 'data' -- the first and 2nd args need to correspond.

Jorge Ivan Velez wrote:
> You were almost there:
>   tapply( yourdata, groups, weighted.mean, weights)
> See ?tapply for more information.
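For what it's worth, the failure described above can be reproduced in a few lines. This is only a sketch using the small example vectors that appear elsewhere in this thread; the wording of the error may differ between R versions, so it is captured with tryCatch() rather than assumed.

```r
data    <- 1:10
weights <- rep(1, 10)
groups  <- rep(c(1, 2), 5)

# Each group holds 5 values, but all 10 weights are passed along unsplit,
# so weighted.mean() receives an x and a w of different lengths.
res <- tryCatch(
  tapply(data, groups, weighted.mean, weights),
  error = function(e) conditionMessage(e)
)
print(res)   # in a current R this is an error message, not a table of means
```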
Re: [R] tapply for function taking of >1 argument?
?by

Bert Gunter
Genentech Nonclinical Statistics

-----Original Message-----
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of sjaffe
Sent: Tuesday, February 02, 2010 1:16 PM
To: r-help@r-project.org
Subject: Re: [R] tapply for function taking of >1 argument?

> 'fraid not :-((
>
>   tapply( data, groups, weighted.mean, weights)
>
> won't work because the *entire* weights vector is passed as the 2nd arg
> to weighted.mean. But weighted.mean needs 'weights' to be split in the
> same way as 'data' -- the first and 2nd args need to correspond.
Re: [R] tapply for function taking of >1 argument?
On Tue, 2 Feb 2010, sjaffe wrote:

> 'fraid not :-((
>
>   tapply( data, groups, weighted.mean, weights)
>
> won't work because the *entire* weights vector is passed as the 2nd arg
> to weighted.mean. But weighted.mean needs 'weights' to be split in the
> same way as 'data' -- the first and 2nd args need to correspond.

try

  sapply( split( data.frame(x,w), grp) , do.call, what=weighted.mean )

HTH,

Chuck

Charles C. Berry (858) 534-2098
Dept of Family/Preventive Medicine
E mailto:cbe...@tajo.ucsd.edu
UC San Diego http://famprevmed.ucsd.edu/faculty/cberry/
La Jolla, San Diego 92093-0901
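A small runnable version of the split/do.call suggestion, with made-up numbers. As far as I can tell it relies on the data frame columns being named x and w, because do.call() passes each column as a named argument and weighted.mean() happens to take arguments with those names; with other column names an unpacking function would be needed.

```r
# Made-up example data; the column names x and w matter here
x   <- c(1, 2, 3, 4)
w   <- c(1, 3, 1, 1)
grp <- c("a", "a", "b", "b")

# split() cuts the data frame into one piece per group; each piece is a
# list of columns, which do.call() hands to weighted.mean(x = ..., w = ...).
res <- sapply(split(data.frame(x, w), grp), do.call, what = weighted.mean)
print(res)
```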
Re: [R] tapply for function taking of >1 argument?
Excellent! I knew there would be a clever answer using 'do.call' :-)

-----Original Message-----
From: Charles C. Berry [mailto:cbe...@tajo.ucsd.edu]
Sent: Tuesday, February 02, 2010 4:25 PM
To: Steve Jaffe
Cc: r-help@r-project.org
Subject: Re: [R] tapply for function taking of >1 argument?

> try
>
>   sapply( split( data.frame(x,w), grp) , do.call, what=weighted.mean )
Re: [R] tapply for function taking of >1 argument?
Thanks! :-)

I suppose it's obvious, but one will generally have to use a (anonymous) function to 'unpack' the data.frame into columns, unless the function already knows how to do this. I mention this because when I tested the solution on my example I got an unexpected result -- apparently weighted.mean will operate on a 2-column dataframe but not in the way one would expect.

  data = 1:10
  weights = rep(1,10)
  groups = rep(c(1,2),5)
  by( data.frame(data,weights), groups, weighted.mean)

  groups: 1
  [1] 15
  groups: 2
  [1] 17.5

But

  by( data.frame(data,weights), groups, function(d) { weighted.mean(d[,1], d[,2]) } )

does the right thing

  groups: 1
  [1] 5
  groups: 2
  [1] 6

Bert Gunter wrote:
> ?by
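Two base-R ways of writing the "unpack the columns" step, sketched on the same example data. The first is just the by() call with columns referred to by name; the second splits both vectors and pairs them up with mapply(), which may be one answer to the earlier "mapply is to sapply as ? is to tapply" question. Neither is claimed to be the canonical approach.

```r
data    <- 1:10
weights <- rep(1, 10)
groups  <- rep(c(1, 2), 5)

# by() hands each group's rows to the function as a data frame,
# and the function picks out the two columns by name.
res_by <- by(data.frame(data, weights), groups,
             function(d) weighted.mean(d$data, d$weights))

# Alternatively, split both vectors the same way and walk them in parallel.
res_mapply <- mapply(weighted.mean, split(data, groups), split(weights, groups))

print(res_by)
print(res_mapply)
```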
Re: [R] tapply on multiple groups
On Jan 28, 2010, at 10:26 AM, GL wrote:

> Can you make tapply break down groups similar to bwplot or such?
> Example: Data frame has one measure (Days) and two Dimensions (MM and
> Place). All have the same length.
>
>   length(dbs.final$Days)
>   [1] 3306
>   length()
>   [1] 3306
>   length()
>   [1] 3306
>
> Doing the following makes a nice table for one dimension and one measure:
>
>   do.call(rbind,tapply(dbs.final$Days,dbs.final$Place, summary))
>
> But, what I really need to do is break it down on two dimensions and one
> measure - effectively equivalent to the following bwplot call:
>
>   bwplot( Days ~ MM | Place, data=dbs.final)
>
> Is there an equivalent to the | operation in tapply?

Please reread the help page for tapply. Perhaps?:

  tapply(dbs.final$Days, list(dbs.final$MM, dbs.final$Place), summary)

--
David

David Winsemius, MD
Heritage Laboratories
West Hartford, CT
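To illustrate the list() form of INDEX on something small: the data frame below is a made-up stand-in for dbs.final (the values are invented), and mean is used instead of summary so that the result is a plain numeric matrix. With summary as the function, tapply() would return a matrix-shaped list with one summary per cell.

```r
# Made-up stand-in for dbs.final; values are illustrative only
dbs <- data.frame(
  Days  = c(3, 5, 2, 8, 4, 6),
  MM    = c("Jan", "Jan", "Feb", "Feb", "Jan", "Feb"),
  Place = c("A", "B", "A", "B", "A", "B")
)

# A list of two grouping vectors gives a two-way table:
# rows are the levels of MM, columns the levels of Place.
m <- tapply(dbs$Days, list(dbs$MM, dbs$Place), mean)
print(m)
```

Using c() instead of list() would glue the two grouping vectors into one long vector, which no longer matches the length of Days.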
Re: [R] tapply on multiple groups
Thanks. My mistake was that I used c(dbs.final$Days,dbs.final$Place) instead of list(... when I tried to follow that part of the documentation.

David Winsemius dwinsem...@comcast.net 1/28/2010 11:49 AM

> Please reread the help page for tapply. Perhaps?:
>
>   tapply(dbs.final$Days, list(dbs.final$MM, dbs.final$Place), summary)
[R] tapply and more than one function, with different arguments
Dear R-users,

I am working with R version 2.10.1. Say I have a simple function like this:

  my.fun <- function(x, mult) mult*sum(x)

Now, I want to apply this function along with some other (say 'max') to a simple data.frame, like:

  dat <- data.frame(x = 1:4, grp = c("a","a","b","b"))

Ideally, the result would look something like this (if mult = 10):

    max my.fun
  a   2     30
  b   4     70

I have tried it that way:

  apply.more.functions <- function(dat, FUN = c("max", "my.fun"), ...) {
    res <- NULL
    for(f in FUN) res[[f]] <- tapply(dat$x, dat$grp, FUN = f, ...)
    data.frame(res)
  }

  # let's test it:
  apply.more.functions(dat, FUN = c("max", "min"))
    max min
  a   2   1
  b   4   3
  # perfect!

  # now, with an additional argument:
  apply.more.functions(dat, FUN = c("max", "my.fun"), mult = 10)
    max my.fun
  a  10     30
  b  10     70
  # uhuh!

Apparently, 'mult' has been used in the calculation of 'max' as well. How can I modify apply.more.functions in order to avoid this?

Your advice would be appreciated;
Kind regards
Heinrich.
Re: [R] tapply and more than one function, with different arguments
Try replacing 'max' with 'mean' and see what you get. Then have a look at ?max and see what max() does with extra arguments.

I'm not sure it's relevant, but it might be useful to check what Hmisc::summarize does.

-Peter Ehlers

RINNER Heinrich wrote:
> # now, with an additional argument:
> apply.more.functions(dat, FUN = c("max", "my.fun"), mult = 10)
>   max my.fun
> a  10     30
> b  10     70
> # uhuh!
>
> Apparently, 'mult' has been used in the calculation of 'max' as well.
> How can I modify apply.more.functions in order to avoid this?

--
Peter Ehlers
University of Calgary
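To spell out the hint about max(): it treats every extra argument, named or not, as more data, so mult = 10 simply joins the values being compared. One possible base-R workaround (a sketch, not the only way) is to bind the extra argument inside a small wrapper, so that each function only ever sees the group's values. The list name funs is made up for this example.

```r
my.fun <- function(x, mult) mult * sum(x)
dat <- data.frame(x = 1:4, grp = c("a", "a", "b", "b"))

# max() swallows the extra argument as data: this returns 10, not 2
print(max(1:2, mult = 10))

# Bind mult inside a wrapper instead of passing it through '...'
funs <- list(max    = max,
             my.fun = function(x) my.fun(x, mult = 10))

# One tapply() per function; sapply() collects the results as columns
res <- sapply(funs, function(f) tapply(dat$x, dat$grp, f))
print(res)
```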
Re: [R] tapply and more than one function, with different arguments
Hi:

Using the plyr package, we can get the result as follows:

  library(plyr)
  my.fun <- function(x, mult) mult*sum(x)
  dat <- data.frame(x = 1:4, grp = c("a","a","b","b"))
  ddply(dat, .(grp), summarize, max = max(x), myfun = my.fun(x, 10))
    grp max myfun
  1   a   2    30
  2   b   4    70

HTH,
Dennis

On Tue, Jan 26, 2010 at 8:26 AM, RINNER Heinrich heinrich.rin...@tirol.gv.at wrote:
> Apparently, 'mult' has been used in the calculation of 'max' as well.
> How can I modify apply.more.functions in order to avoid this?
Re: [R] tapply and more than one function, with different arguments
Hi Dennis,

now that's a very nice function, and this seems to be just what I need! Thanks a lot!

-Heinrich.

From: Dennis Murphy [djmu...@gmail.com]
Sent: Tuesday, 26 January 2010 19:44
To: RINNER Heinrich
Cc: r-help
Subject: Re: [R] tapply and more than one function, with different arguments

> Using the plyr package, we can get the result as follows:
>
>   ddply(dat, .(grp), summarize, max = max(x), myfun = my.fun(x, 10))
[R] tapply function
Hi,

I tried to use the tapply function to find the mean of the data in each group with the following command, but the results are NA, as there are several missing values in each group.

  tapply(data,group,mean)

Could someone please advise me how to ignore the missing data so that the function runs successfully?

Thanks
Fir
Re: [R] tapply function
You must have missing values in data. Try

  tapply(data, group, mean, na.rm = TRUE)

If that's not the case, read the bottom of this email about the posting guide.

HTH,

--sundar

On Tue, Nov 3, 2009 at 5:28 AM, FMH kagba2...@yahoo.com wrote:
> I tried to use tapply function to find the mean of the data in each group
> as the following command, but the result are NA, as there are several
> missing values in each group.
>
>   tapply(data,group,mean)

__ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
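A tiny made-up example of the difference, in case it is useful. The extra na.rm = TRUE is passed through tapply() to mean(), which then drops the missing values within each group before averaging.

```r
# Made-up data with one missing value per group
data  <- c(1, NA, 3, 4, NA, 6)
group <- c("a", "a", "a", "b", "b", "b")

with_na    <- tapply(data, group, mean)                 # NA in both groups
without_na <- tapply(data, group, mean, na.rm = TRUE)   # NAs dropped per group

print(with_na)
print(without_na)
```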
[R] tapply with multiple arguments that are not part of the same data frame
Hi all,

I would like to invoke a function that takes multiple arguments (some of which are specified columns in the data frame, and others that are independent of the data frame) on split parts of a data frame, how do I do this?

For example, let's say I have a data frame fitness_data

  name   height  weight  country
  rob    5.8     200     usa
  nancy  5.5     140     germany
  jen    5.6     150     usa
  clark  5.10    210     germany
  matt   5.9     280     canada
  ralph  6       270     canada
  ...    ...

Now let us say I have a function, my_func(h, w, noise, dir), which takes as input:
(1) a vector of heights
(2) a vector of weights
(3) a user-input numeric noise value
(4) a user-input string dir for the directory to output the end result of the function to

This function does some calculations on the input data and outputs a dataframe that is then written to a file in the dir directory.

If I want to apply this function to data grouped by each country in the fitness_data dataframe, how would I do this? I tried looking through the mailing archives, but couldn't nail down the solution. I tried something like

  split(mapply( function(a,b,c,d) my_func(fitness_data$h, fitness_data$w, 2.5, my_directory)), fitness_data$country)

but this considered fitness_data$h, and fitness_data$w in each single row for a country, rather than a vector of heights or weights across all rows corresponding to that country.

Thanks!
Re: [R] tapply with multiple arguments that are not part of the same data frame
I just realized my earlier post of my question below was not in Plain Text mode, hence the repeat post...apologies!

Kavitha

On Thu, Oct 22, 2009 at 4:19 PM, Kavitha Venkatesan kavitha.venkate...@gmail.com wrote:
> Hi all,
>
> I would like to invoke a function that takes multiple arguments (some of which are specified columns in the data frame, and others that are independent of the data frame) on split parts of a data frame, how do I do this?
>
> For example, let's say I have a data frame fitness_data
>
> name   height  weight  country
> rob    5.8     200     usa
> nancy  5.5     140     germany
> jen    5.6     150     usa
> clark  5.10    210     germany
> matt   5.9     280     canada
> ralph  6       270     canada
> ...    ...
>
> Now let us say I have a function, my_func(h, w, noise, dir), which takes as input:
> (1) a vector of heights
> (2) a vector of weights
> (3) a user-input numeric noise value
> (4) a user-input string dir for the directory to output the end result of the function to
>
> This function does some calculations on the input data and outputs a dataframe that is then written to a file in the dir directory.
>
> If I want to apply this function to data grouped by each country in the fitness_data dataframe, how would I do this? I tried looking through the mailing archives, but couldn't nail down the solution. I tried something like
>
> split(mapply( function(a,b,c,d) my_func(fitness_data$h, fitness_data$w, 2.5, my_directory)), fitness_data$country)
>
> but this considered fitness_data$h, and fitness_data$w in each single row for a country, rather than a vector of heights or weights across all rows corresponding to that country.
>
> Thanks!
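No reply to this question appears in this part of the archive. One common way to make this kind of grouped call is to split the data frame by country first and then call the function once per piece, sketched below. The body of my_func here is a made-up stand-in (the real function was not posted), and my_directory is assumed to be a temporary directory.

```r
# Data frame modelled on the example table in the post
fitness_data <- data.frame(
  name    = c("rob", "nancy", "jen", "clark", "matt", "ralph"),
  height  = c(5.8, 5.5, 5.6, 5.10, 5.9, 6),
  weight  = c(200, 140, 150, 210, 280, 270),
  country = c("usa", "germany", "usa", "germany", "canada", "canada"),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for my_func(h, w, noise, dir): summarises one group
# and writes the summary to a uniquely named CSV file inside dir
my_func <- function(h, w, noise, dir) {
  out <- data.frame(n = length(h), mean_h = mean(h), mean_w = mean(w) + noise)
  out_file <- tempfile(pattern = "out_", tmpdir = dir, fileext = ".csv")
  write.csv(out, out_file, row.names = FALSE)
  out
}

my_directory <- tempdir()  # assumed output directory

# split() yields one data frame per country; lapply() then passes each
# group's whole height and weight vectors, plus the fixed extra arguments
pieces  <- split(fitness_data, fitness_data$country)
results <- lapply(pieces, function(d) {
  my_func(d$height, d$weight, noise = 2.5, dir = my_directory)
})

print(results)
```

The point of splitting first is that each call sees all rows for one country at once, which is what the mapply attempt in the post did not achieve.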
Re: [R] tapply() and using factor() on a factor
Thank you Mohamed and Bill for your replies. (I did not send the data because it is unwieldy.)

Yes Bill, the issue arises directly from what you had guessed. I was working with a subset of the data (which implicitly had factor levels for the complete data set).

On this, what is the best way to take a subset of the data which ignores these extraneous factor levels?

log <- data.frame(Flag=1:2, RequestID=factor(letters[1:2],levels=letters[1:10]))
log2 <- subset(log, RequestID=="a")
levels(log2$RequestID)
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"

In other words, how do I take a subset which yields "a" as the only level for log2?

Alex

-----Original Message-----
From: William Dunlap [mailto:wdun...@tibco.com]
Sent: Thursday, October 15, 2009 11:59 PM
To: Alexander Peterhansl; r-help@r-project.org
Subject: RE: [R] tapply() and using factor() on a factor

> -----Original Message-----
> From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Alexander Peterhansl
> Sent: Thursday, October 15, 2009 2:50 PM
> To: r-help@r-project.org
> Subject: [R] tapply() and using factor() on a factor
>
> Dear List,
>
> Shouldn't result1 and result2 be equal in the following case?
>
> Note that log$RequestID is a factor. That is, is.factor(log$RequestID) yields TRUE.
>
> result1 <- tapply(log$Flag,factor(log$RequestID),sum)
> result2 <- tapply(log$Flag,log$RequestID,sum)

Showing us the output of dput(log) (or str(log) and summary(log)) would let people discover the problem more readily. Since you didn't I'll guess what the dataset may contain.

If log$RequestID is a factor with lots of unused levels tapply will output an NA for each unused level. factor(log$RequestID) will create a new set of levels, only those actually used, so tapply will not be forced to fill those spots with NA's. E.g.,

  log <- data.frame(Flag=1:2, RequestID=factor(letters[1:2], levels=letters[1:10]))
  tapply(log$Flag, log$RequestID, sum)
   a  b  c  d  e  f  g  h  i  j
   1  2 NA NA NA NA NA NA NA NA
  tapply(log$Flag, factor(log$RequestID), sum)
  a b
  1 2

I suppose tapply(X,INDEX,FUN) could call FUN(X[0]) to see how to fill the cells with no data behind them, but it doesn't.

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com

> Yet, when I summarize the output, I get the following:
>
> summary(result1)
>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
>   11.00   11.00   11.00   26.06   11.00  101.00
>
> summary(result2)
>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
>   11.00   11.00   11.00   26.06   11.00  101.00  978.00
>
> Why does result2 have 978 NA's?
>
> Any help on this would be appreciated.
>
> Alex
Re: [R] tapply() and using factor() on a factor
On Oct 16, 2009, at 11:33 AM, Alexander Peterhansl wrote:

> Thank you Mohamed and Bill for your replies. (I did not send the data because it is unwieldy.)
>
> Yes Bill, the issue arises directly from what you had guessed. I was working with a subset of the data (which implicitly had factor levels for the complete data set).
>
> On this, what is the best way to take a subset of the data which ignores these extraneous factor levels?
>
> log <- data.frame(Flag=1:2, RequestID=factor(letters[1:2],levels=letters[1:10]))
> log2 <- subset(log, RequestID=="a")
> levels(log2$RequestID)
> [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"

log2$RequestID <- factor(log2$RequestID)

You might think that

log2 <- subset(log, RequestID=="a", drop=TRUE)

might do that task, but it clearly doesn't.

-- DW

> In other words, how do I take a subset which yields "a" as the only level for log2?
>
> Alex
>
> -----Original Message-----
> From: William Dunlap [mailto:wdun...@tibco.com]
> Sent: Thursday, October 15, 2009 11:59 PM
> To: Alexander Peterhansl; r-help@r-project.org
> Subject: RE: [R] tapply() and using factor() on a factor
>
> > -----Original Message-----
> > From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Alexander Peterhansl
> > Sent: Thursday, October 15, 2009 2:50 PM
> > To: r-help@r-project.org
> > Subject: [R] tapply() and using factor() on a factor
> >
> > Dear List,
> >
> > Shouldn't result1 and result2 be equal in the following case?
> >
> > Note that log$RequestID is a factor. That is, is.factor(log$RequestID) yields TRUE.
> >
> > result1 <- tapply(log$Flag,factor(log$RequestID),sum)
> > result2 <- tapply(log$Flag,log$RequestID,sum)
>
> Showing us the output of dput(log) (or str(log) and summary(log)) would let people discover the problem more readily. Since you didn't I'll guess what the dataset may contain.
>
> If log$RequestID is a factor with lots of unused levels tapply will output an NA for each unused level. factor(log$RequestID) will create a new set of levels, only those actually used, so tapply will not be forced to fill those spots with NA's. E.g.,
>
>   log <- data.frame(Flag=1:2, RequestID=factor(letters[1:2], levels=letters[1:10]))
>   tapply(log$Flag, log$RequestID, sum)
>    a  b  c  d  e  f  g  h  i  j
>    1  2 NA NA NA NA NA NA NA NA
>   tapply(log$Flag, factor(log$RequestID), sum)
>   a b
>   1 2
>
> I suppose tapply(X,INDEX,FUN) could call FUN(X[0]) to see how to fill the cells with no data behind them, but it doesn't.
>
> Bill Dunlap
> Spotfire, TIBCO Software
> wdunlap tibco.com
>
> > Yet, when I summarize the output, I get the following:
> >
> > summary(result1)
> >    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
> >   11.00   11.00   11.00   26.06   11.00  101.00
> >
> > summary(result2)
> >    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
> >   11.00   11.00   11.00   26.06   11.00  101.00  978.00
> >
> > Why does result2 have 978 NA's?
> >
> > Any help on this would be appreciated.

David Winsemius, MD
Heritage Laboratories
West Hartford, CT
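The re-factoring fix described in this reply can be checked with the thread's own two-row example, as in the sketch below. The droplevels() call at the end is an addition not mentioned in the thread; it is available in later versions of R and is shown only as an alternative under that assumption.

```r
# The toy data frame from the thread: two rows, ten declared levels
log  <- data.frame(Flag = 1:2,
                   RequestID = factor(letters[1:2], levels = letters[1:10]))

# subset() keeps the full level set, even though only "a" remains
log2 <- subset(log, RequestID == "a")
levels_before <- levels(log2$RequestID)

# Re-creating the factor keeps only the levels that actually occur
log2$RequestID <- factor(log2$RequestID)
levels_after <- levels(log2$RequestID)

# Alternative (not from the thread): droplevels() drops unused levels
# from every factor column of a data frame in one call
log3 <- droplevels(subset(log, RequestID == "a"))

print(levels_before)
print(levels_after)
print(levels(log3$RequestID))
```

Either form leaves "a" as the only level, so a later tapply() on the subset has no empty cells to fill.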
[R] tapply() and using factor() on a factor
Dear List,

Shouldn't result1 and result2 be equal in the following case?

Note that log$RequestID is a factor. That is, is.factor(log$RequestID) yields TRUE.

result1 <- tapply(log$Flag,factor(log$RequestID),sum)
result2 <- tapply(log$Flag,log$RequestID,sum)

Yet, when I summarize the output, I get the following:

summary(result1)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  11.00   11.00   11.00   26.06   11.00  101.00

summary(result2)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
  11.00   11.00   11.00   26.06   11.00  101.00  978.00

Why does result2 have 978 NA's?

Any help on this would be appreciated.

Alex
Re: [R] tapply() and using factor() on a factor
> -----Original Message-----
> From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Alexander Peterhansl
> Sent: Thursday, October 15, 2009 2:50 PM
> To: r-help@r-project.org
> Subject: [R] tapply() and using factor() on a factor
>
> Dear List,
>
> Shouldn't result1 and result2 be equal in the following case?
>
> Note that log$RequestID is a factor. That is, is.factor(log$RequestID) yields TRUE.
>
> result1 <- tapply(log$Flag,factor(log$RequestID),sum)
> result2 <- tapply(log$Flag,log$RequestID,sum)

Showing us the output of dput(log) (or str(log) and summary(log)) would let people discover the problem more readily. Since you didn't I'll guess what the dataset may contain.

If log$RequestID is a factor with lots of unused levels tapply will output an NA for each unused level. factor(log$RequestID) will create a new set of levels, only those actually used, so tapply will not be forced to fill those spots with NA's. E.g.,

  log <- data.frame(Flag=1:2, RequestID=factor(letters[1:2], levels=letters[1:10]))
  tapply(log$Flag, log$RequestID, sum)
   a  b  c  d  e  f  g  h  i  j
   1  2 NA NA NA NA NA NA NA NA
  tapply(log$Flag, factor(log$RequestID), sum)
  a b
  1 2

I suppose tapply(X,INDEX,FUN) could call FUN(X[0]) to see how to fill the cells with no data behind them, but it doesn't.

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com

> Yet, when I summarize the output, I get the following:
>
> summary(result1)
>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
>   11.00   11.00   11.00   26.06   11.00  101.00
>
> summary(result2)
>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
>   11.00   11.00   11.00   26.06   11.00  101.00  978.00
>
> Why does result2 have 978 NA's?
>
> Any help on this would be appreciated.
>
> Alex
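The explanation above can be verified by counting empty levels directly. The sketch below reuses the two-row example from this reply; the variable names n_unused, res_all, res_used and empty_sum are introduced here for illustration only.

```r
# Toy data from the reply: levels a..j declared, only a and b used
log <- data.frame(Flag = 1:2,
                  RequestID = factor(letters[1:2], levels = letters[1:10]))

# table() counts rows per level, so zero counts mark the unused levels
n_unused <- sum(table(log$RequestID) == 0)

# tapply() returns one cell per level; cells with no data come back NA
res_all  <- tapply(log$Flag, log$RequestID, sum)
res_used <- tapply(log$Flag, factor(log$RequestID), sum)

# What FUN(X[0]) would give for sum: the sum of an empty vector
empty_sum <- sum(log$Flag[0])

print(n_unused)
print(sum(is.na(res_all)))
print(res_used)
print(empty_sum)
```

The number of NA cells equals the number of unused levels, which is consistent with the 978 NA's in the original summary coming from 978 levels with no rows in the subset.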