[R] Strange output daply with empty strata

2010-09-09 Thread Jan van der Laan

Dear list,

I get some strange results with daply from the plyr package. In the  
example below, the average age per municipality for employed and  
unemployed is calculated. If I do this using tapply (see code below) I  
get the following result:


        no      yes
A       NA 36.94931
B 51.22505 34.24887
C 48.05759 51.00198

If I do this using daply:

municipality       no      yes
           A 36.94931 48.05759
           B 51.22505 51.00198
           C 34.24887       NA

daply generates the same numbers. However, these are not in the  
correct cells. For example, in municipality A everybody is employed.  
Therefore, the NA should be in the cell for unemployed in municipality  
A.


Am I using daply incorrectly or is there indeed something wrong with  
the output of daply?


Regards,

Jan


I am using version 1.1 of the plyr-package.


# Generate some test data
library(plyr)
data.test <- data.frame(
  municipality=rep(LETTERS[1:3], each=10),
  employed=sample(c("yes", "no"), 30, replace=TRUE),
  age=runif(30,20,70))
# Make sure everybody is employed in municipality A
data.test$employed[data.test$municipality == "A"] <- "yes"

# Compare the output of tapply:
tapply(data.test$age, list(data.test$municipality, data.test$employed),
       mean)
# to that of daply:
daply(data.test, .(municipality, employed), function(d){mean(d$age)} )
# results of ddply are the same as tapply
ddply(data.test, .(municipality, employed), function(d){mean(d$age)} )

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Strange output daply with empty strata

2010-09-09 Thread Dennis Murphy
Hi:

Here's what I tried:

# data frame versions (aggregate, ddply):

aggregate(age ~ municipality + employed, data = data.test, FUN = mean)
  municipality employed      age
1            B       no 55.57407
2            C       no 44.67463
3            A      yes 41.58759
4            B      yes 43.59330
5            C      yes 43.82545

ddply(data.test, .(municipality, employed), summarise, mean = mean(age))
  municipality employed     mean
1            A      yes 41.58759
2            B       no 55.57407
3            B      yes 43.59330
4            C       no 44.67463
5            C      yes 43.82545

It appears that aggregate() silently removes groups where no observations
are present, but ddply() has an option .drop, which when set to FALSE,
returns NaN for the not employed group in municipality A:

ddply(data.test, .(municipality, employed), summarise, avgage = mean(age),
      .drop = FALSE)
  municipality employed   avgage
1            A       no      NaN
2            A      yes 41.58759
3            B       no 55.57407
4            B      yes 43.59330
5            C       no 44.67463
6            C      yes 43.82545
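
For comparison, here is a minimal base-R sketch of the same idea: since aggregate() drops the empty stratum, its result can be merged onto the full grid of factor combinations. The names grid and full, the seed, and the explicit factor levels are illustrative assumptions and not part of the original code; note that the empty cell comes back as NA here rather than NaN.

```r
set.seed(1)  # arbitrary seed, only to make the sketch repeatable

# Same layout as the test data in the thread; levels are fixed explicitly
# so that both "no" and "yes" always exist as factor levels.
data.test <- data.frame(
  municipality = factor(rep(LETTERS[1:3], each = 10)),
  employed = factor(sample(c("yes", "no"), 30, replace = TRUE),
                    levels = c("no", "yes")),
  age = runif(30, 20, 70))
data.test$employed[data.test$municipality == "A"] <- "yes"

# aggregate() returns only the non-empty strata
agg <- aggregate(age ~ municipality + employed, data = data.test, FUN = mean)

# All combinations of the two factors, including the empty A/no cell
grid <- expand.grid(municipality = levels(data.test$municipality),
                    employed = levels(data.test$employed))

# Left join: strata missing from agg get NA for age
full <- merge(grid, agg, all.x = TRUE)
full
```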

#  tapply/daply

with(data.test, tapply(age, list(municipality, employed), mean))
        no      yes
A       NA 41.58759
B 55.57407 43.59330
C 44.67463 43.82545

daply(data.test, .(municipality, employed), function(d){mean(d$age)} )
            employed
municipality       no      yes
           A 41.58759 44.67463
           B 55.57407 43.82545
           C 43.59330       NA

The .drop argument has a different meaning in daply. Some R functions have
an na.last argument, and it may be that somewhere in daply, there is a
function call that moves all NAs to the end. The means are in the right
order except for the first, where the NA is supposed to be, so everything is
offset in the table by 1. I've cc'ed Hadley on this.
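
The printed daply table can be reproduced from the numbers above with a small base-R sketch: take the five non-empty means in the order ddply printed them, append one NA, and fill a 3 x 2 matrix column by column. This only mimics the observed pattern; it is an assumption for illustration, not a statement about what plyr does internally. The names means and shifted are made up for this example.

```r
# Non-empty group means in the row order of the ddply output above
# (A/yes, B/no, B/yes, C/no, C/yes); the empty A/no cell is absent.
means <- c(41.58759, 55.57407, 43.59330, 44.67463, 43.82545)

# Pad with one NA and fill column-wise (R's default for matrix()),
# ignoring which stratum each value actually belongs to.
shifted <- matrix(c(means, NA), nrow = 3,
                  dimnames = list(municipality = c("A", "B", "C"),
                                  employed = c("no", "yes")))
shifted
```

The result has the same cell contents as the daply output above, with the NA in the last cell (C/yes) instead of A/no.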

HTH,
Dennis




Re: [R] Strange output daply with empty strata

2010-09-09 Thread hadley wickham
> daply(data.test, .(municipality, employed), function(d){mean(d$age)} )
>             employed
> municipality       no      yes
>            A 41.58759 44.67463
>            B 55.57407 43.82545
>            C 43.59330       NA
>
> The .drop argument has a different meaning in daply. Some R functions have
> an na.last argument, and it may be that somewhere in daply, there is a
> function call that moves all NAs to the end. The means are in the right
> order except for the first, where the NA is supposed to be, so everything is
> offset in the table by 1. I've cc'ed Hadley on this.

This is a bug, which I've fixed in the development version (hopefully
to be released next week).
In plyr 1.2:

 daply(data.test, .(municipality, employed), function(d){mean(d$age)} )
            employed
municipality       no      yes
           A       NA 39.49980
           B 44.69291 51.63733
           C 57.38072 45.28978
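
For anyone still on plyr 1.1, one possible workaround is to build the table from the long-format ddply result and place each mean by its labels rather than by position. The sketch below uses base R only and the values from the ddply output earlier in the thread; the names res and out are hypothetical and chosen for this example.

```r
# Long-format means as printed by ddply() earlier in the thread
res <- data.frame(
  municipality = c("A", "B", "B", "C", "C"),
  employed     = c("yes", "no", "yes", "no", "yes"),
  mean         = c(41.58759, 55.57407, 43.59330, 44.67463, 43.82545),
  stringsAsFactors = FALSE)

# Start from an all-NA table covering every combination
out <- matrix(NA_real_, nrow = 3, ncol = 2,
              dimnames = list(municipality = c("A", "B", "C"),
                              employed = c("no", "yes")))

# A two-column character matrix indexes cells by (row name, column name),
# so each value lands in its own stratum and empty strata stay NA.
out[cbind(res$municipality, res$employed)] <- res$mean
out
```

This agrees with the tapply table for the same data: NA in A/no and every mean in its own cell.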

Hadley

-- 
Assistant Professor / Dobelman Family Junior Chair
Department of Statistics / Rice University
http://had.co.nz/


Re: [R] Strange output daply with empty strata

2010-09-09 Thread Jan van der Laan


> This is a bug, which I've fixed in the development version (hopefully
> to be released next week).
> In the plyr 1.2:

OK, thank you both for your answers. I'll wait for the next version.

Regards,
Jan
