Hi:
Here's what I tried:
# data frame versions (aggregate, ddply):
aggregate(age ~ municipality + employed, data = data.test, FUN = mean)
  municipality employed      age
1            B       no 55.57407
2            C       no 44.67463
3            A      yes 41.58759
4            B      yes 43.59330
5            C      yes 43.82545
ddply(data.test, .(municipality, employed), summarise, mean = mean(age))
  municipality employed     mean
1            A      yes 41.58759
2            B       no 55.57407
3            B      yes 43.59330
4            C       no 44.67463
5            C      yes 43.82545
It appears that aggregate() silently drops groups that contain no
observations. ddply() does the same by default, but it has an option, .drop,
which when set to FALSE keeps the empty groups; here it returns NaN for the
not-employed group in municipality A:
ddply(data.test, .(municipality, employed), summarise, avgage = mean(age),
      .drop = FALSE)
  municipality employed   avgage
1            A       no      NaN
2            A      yes 41.58759
3            B       no 55.57407
4            B      yes 43.59330
5            C       no 44.67463
6            C      yes 43.82545
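To check which groups are actually empty before relying on either function's defaults, something like the following base-R sketch may help. It rebuilds the test data the same way as in the original post, but with an arbitrary seed of my own choosing, so the means will not match the ones shown above. The names counts, agg, grid and full are just illustrative.

```r
# Test data built as in the original post; the seed is an arbitrary assumption
set.seed(1)
data.test <- data.frame(
  municipality = rep(LETTERS[1:3], each = 10),
  employed = sample(c("yes", "no"), 30, replace = TRUE),
  age = runif(30, 20, 70),
  stringsAsFactors = TRUE
)
# Make sure everybody is employed in municipality A
data.test$employed[data.test$municipality == "A"] <- "yes"

# Count observations per cell; a zero marks an empty group
counts <- with(data.test, table(municipality, employed))
print(counts)

# aggregate() returns one row per non-empty group only
agg <- aggregate(age ~ municipality + employed, data = data.test, FUN = mean)

# Merge against the full grid of combinations so empty groups show up as NA
grid <- expand.grid(municipality = levels(data.test$municipality),
                    employed = levels(data.test$employed))
full <- merge(grid, agg, all.x = TRUE)
print(full)
```

The merge step is one way to get a row for every combination out of aggregate(), similar in spirit to what .drop = FALSE does in ddply(), except that the empty group shows NA rather than NaN.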
# tapply/daply
with(data.test, tapply(age, list(municipality, employed), mean))
        no      yes
A       NA 41.58759
B 55.57407 43.59330
C 44.67463 43.82545
daply(data.test, .(municipality, employed), function(d){mean(d$age)} )
            employed
municipality       no      yes
           A 41.58759 44.67463
           B 55.57407 43.82545
           C 43.59330       NA
The .drop argument has a different meaning in daply. Some R functions have
an na.last argument, and it may be that somewhere in daply there is a
function call that moves all NAs to the end. The five means appear in the
table in the same order as in the ddply() output above, filled in column by
column with no gap left for the empty (A, no) cell, so the values land in
the wrong cells and the NA ends up last. I've cc'ed Hadley on this.
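In the meantime, one possible workaround (a sketch only, not necessarily how plyr should do it internally) is to take the long-format ddply() result and place each mean into a pre-allocated matrix by name rather than by position. The values below are copied from the ddply() output above; res and out are hypothetical names.

```r
# Long-format group means, copied from the ddply() output above
res <- data.frame(
  municipality = c("A", "B", "B", "C", "C"),
  employed = c("yes", "no", "yes", "no", "yes"),
  mean = c(41.58759, 55.57407, 43.59330, 44.67463, 43.82545),
  stringsAsFactors = FALSE
)

# Pre-allocate the full municipality x employed table, filled with NA
out <- matrix(NA_real_, nrow = 3, ncol = 2,
              dimnames = list(municipality = c("A", "B", "C"),
                              employed = c("no", "yes")))

# Index with a two-column character matrix so that each value is matched
# to its cell by row and column name, never by position
idx <- cbind(as.character(res$municipality), as.character(res$employed))
out[idx] <- res$mean
print(out)
```

Because every value is matched to its dimnames, the empty (A, no) cell simply stays NA, which agrees with the tapply() result.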
HTH,
Dennis
On Thu, Sep 9, 2010 at 2:43 AM, Jan van der Laan rh...@eoos.dds.nl wrote:
Dear list,
I get some strange results with daply from the plyr package. In the example
below, the average age per municipality is calculated for employed and
unemployed people. If I do this using tapply (see code below) I get the
following result:
        no      yes
A       NA 36.94931
B 51.22505 34.24887
C 48.05759 51.00198
If I do this using daply:
municipality       no      yes
           A 36.94931 48.05759
           B 51.22505 51.00198
           C 34.24887       NA
daply generates the same numbers. However, these are not in the correct
cells. For example, in municipality A everybody is employed. Therefore, the
NA should be in the cell for unemployed in municipality A.
Am I using daply incorrectly or is there indeed something wrong with the
output of daply?
Regards,
Jan
I am using version 1.1 of the plyr-package.
library(plyr)  # provides daply() and ddply()
# Generate some test data
data.test <- data.frame(
  municipality=rep(LETTERS[1:3], each=10),
  employed=sample(c("yes", "no"), 30, replace=TRUE),
  age=runif(30,20,70))
# Make sure everybody is employed in municipality A
data.test$employed[data.test$municipality == "A"] <- "yes"
# Compare the output of tapply:
tapply(data.test$age, list(data.test$municipality, data.test$employed),
       mean)
# to that of daply:
daply(data.test, .(municipality, employed), function(d){mean(d$age)} )
# the results of ddply are the same as those of tapply
ddply(data.test, .(municipality, employed), function(d){mean(d$age)} )
__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.