I am using ddply on a data set that contains more than 2 million rows. I am trying to rank the values of a variable within groups and then transform the ranks to (approximate) z-scores, i.e. generate quantiles on the normal scale. Here is some sample data for one group:

    x <- c(NA, 0.3640951, 0.1175880, 0.3453916, 0.4214050, 0.7469022, 0.1091423,
           0.6099482, NA, NA, 0.6786140, 0.1785854, 0.9750262, NA)
I have tried the following two alternatives:

(1) Using the qnorm function from the stats package in conjunction with the percent_rank function from the dplyr package. For example:

    y <- qnorm(percent_rank(x))

This produces -Inf and Inf for the extreme values in the sample data, because percent_rank rescales the ranks to the interval [0, 1] with both endpoints included. The issue is resolved for the sample data if I use the base R rank function instead, for example:

    y <- qnorm(rank(x, na.last = "keep", ties.method = "average")/length(x))

This works here only because length(x) counts the NAs, which keeps the largest ratio below 1. If there are no NAs in a certain group, the largest rank equals length(x), and the upper extreme data point is still evaluated to Inf.

(2) Using the ztransform function from the GenABEL package. For example:

    y <- ztransform(percent_rank(x))

This preserves the extreme values but produces one of the following errors when used on my full data set:

    Error in ztransform(x) : trait is binary

or

    Error in ztransform(x) : trait is monomorphic

I suspect these errors are due to very few observations and/or several missing values (NAs) within certain groups, but I am not sure, since there are several hundred groups.

Is there a better way?
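To make the Inf problem in (1) concrete, here is a minimal sketch of one commonly used adjustment: shift each rank by 0.5 and divide by the number of non-missing values, so that no probability passed to qnorm reaches 0 or 1. The helper name rank_to_z, the 0.5 offset, and the data frame columns g and value are assumptions made for illustration; they do not come from the post or from any package.

```r
# Sample values for one group, as given in the post.
x <- c(NA, 0.3640951, 0.1175880, 0.3453916, 0.4214050, 0.7469022, 0.1091423,
       0.6099482, NA, NA, 0.6786140, 0.1785854, 0.9750262, NA)

# Hypothetical helper (not from any package): rank-based normal scores.
# Ranks are shifted by `offset` and divided by the number of NON-missing
# values, so every probability lies strictly between 0 and 1.
rank_to_z <- function(v, offset = 0.5) {
  n <- sum(!is.na(v))
  if (n == 0L) return(rep(NA_real_, length(v)))  # group with no data
  r <- rank(v, na.last = "keep", ties.method = "average")
  qnorm((r - offset) / n)
}

z <- rank_to_z(x)

# The problem case from (1): with no NAs, max rank / length(x) is exactly 1.
x_complete <- x[!is.na(x)]
naive <- qnorm(rank(x_complete) / length(x_complete))
max(naive)                  # Inf
max(rank_to_z(x_complete))  # finite

# Group-wise use in base R; the data frame and its column names are made up.
df <- data.frame(g = rep(c("a", "b"), each = 7), value = x)
df$z <- ave(df$value, df$g, FUN = rank_to_z)
```

Because the helper only uses rank and qnorm, it should also drop into the ddply call in place of the expressions tried above. A group with a single observation gets a score of 0, and a group with no observations gets all NAs, rather than an error.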