I am using ddply on a data set that contains more than 2 million rows. I am 
trying to rank the values of a variable within groups, and then transform the 
ranks to (approximate) z-scores, i.e. generate quantiles on the normal scale.
Here is some sample data for one group:

x <- c(NA, 0.3640951, 0.1175880, 0.3453916, 0.4214050, 0.7469022, 0.1091423,
       0.6099482, NA, NA, 0.6786140, 0.1785854, 0.9750262, NA)

I have tried the following two alternatives: 
(1) Using the qnorm function from the stats package in conjunction with the 
percent_rank function from the dplyr package. For example:

y <- qnorm(percent_rank(x))

This produces -Inf and Inf for the extreme values in the sample data. This 
issue is resolved if I use the rank function from base R instead, for 
example:

y <- qnorm(rank(x, na.last = "keep", ties.method = "average") / length(x))

but if there are no NAs in a certain group, the largest rank equals 
length(x), so the upper extreme data point is still evaluated to Inf 
(qnorm(1) is Inf).
(2) Using the ztransform function from the GenABEL package. For example:

y <- ztransform(percent_rank(x))

This preserves the extreme values but produces one of the following types of 
errors when used on my full data set:

Error in ztransform(x) : trait is binary

or

Error in ztransform(x) : trait is monomorphic

I suspect these errors arise because certain groups have very few 
observations and/or several missing values (NAs), but I am not sure, since 
there are several hundred groups.
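
One way I could check this suspicion is to count, per group, the non-missing 
and the distinct non-missing values. The sketch below uses a made-up data 
frame `dat` with hypothetical columns `grp` and `val` (not my real data). My 
guess that "monomorphic" means one distinct value and "binary" means two is 
an assumption about GenABEL that I have not verified in its source.

```r
# Toy data standing in for the real data set (hypothetical names and values)
dat <- data.frame(
  grp = c("a", "a", "a", "a", "b", "b", "b", "c", "c", "c"),
  val = c(0.36, 0.12, NA, 0.42, 0.5, 0.5, NA, 0.1, 0.9, 0.1)
)

# Number of non-missing observations per group
n_obs <- tapply(dat$val, dat$grp, function(v) sum(!is.na(v)))

# Number of distinct non-missing values per group
n_uniq <- tapply(dat$val, dat$grp, function(v) length(unique(v[!is.na(v)])))

grp_info <- data.frame(
  grp    = names(n_obs),
  n_obs  = as.vector(n_obs),
  n_uniq = as.vector(n_uniq)
)

# Groups with at most two distinct values are the ones I would inspect first
# (assumption: these are the groups that trigger the ztransform errors)
suspect <- grp_info[grp_info$n_uniq <= 2, ]
print(suspect)
```

On the full data set the same tapply calls would run on the real grouping and 
value columns, and the flagged groups could be compared with those where 
ztransform stops.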
Is there a better way?
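
For reference, the behaviour described under alternative (1) can be 
reproduced with the sample data in base R alone. This is only a sketch: the 
percent_rank stand-in assumes dplyr computes (min rank - 1) / (n - 1) over 
the non-missing values, and the last line shows one offset I am considering 
(dividing by the non-missing count plus one), which may or may not be an 
appropriate convention for my purpose.

```r
x <- c(NA, 0.3640951, 0.1175880, 0.3453916, 0.4214050, 0.7469022, 0.1091423,
       0.6099482, NA, NA, 0.6786140, 0.1785854, 0.9750262, NA)

n_obs <- sum(!is.na(x))  # number of non-missing values in the group

# Base-R stand-in for dplyr::percent_rank (assumption about its definition):
# the smallest value maps to 0 and the largest to 1
pr   <- (rank(x, na.last = "keep", ties.method = "min") - 1) / (n_obs - 1)
y_pr <- qnorm(pr)        # qnorm(0) is -Inf and qnorm(1) is Inf

# rank / length(x): finite here only because the NAs inflate the denominator
y_rank <- qnorm(rank(x, na.last = "keep", ties.method = "average") / length(x))

# Same call on a group without NAs: the largest value gets probability 1
x_full <- x[!is.na(x)]
y_full <- qnorm(rank(x_full, ties.method = "average") / length(x_full))

# One possible offset (assumption, not an established choice for my data):
# dividing by n_obs + 1 keeps every probability strictly between 0 and 1
y_off <- qnorm(rank(x, na.last = "keep", ties.method = "average") / (n_obs + 1))
```

With this offset the NAs stay NA and the non-missing values all map to finite 
scores, whether or not the group contains NAs.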

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
