Re: [Rd] quantile(), IQR() and median() for factors
I like the idea of median and friends working on ordered factors. Just a couple of thoughts on possible implementations.

Adding extra checks and functionality will slow down the function. For a single evaluation on a given dataset this slowdown will not be noticeable, but inside a simulation, bootstrap, or other high-iteration technique, it could matter. I would suggest creating a core function that does just the calculations (median, quantile, IQR), assuming that the data passed in are correct, without doing any checks or anything fancy. The user-callable function (median et al.) would then do the checks, dispatch to other functions for anything fancy, and finally call the core function with the clean data. The common user would not really notice a difference, but someone programming a high-iteration technique could clean the data themselves and then call the core function directly, bypassing the checks and branches.

Just out of curiosity (from someone who learned only from English (Americanized at that) and not Italian texts), what would the median of [Low, Low, Medium, High] be?

--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.s...@imail.org
801.408.8111

-----Original Message-----
From: r-devel-boun...@r-project.org [mailto:r-devel-boun...@r-project.org] On Behalf Of Simone Giannerini
Sent: Thursday, March 05, 2009 4:49 PM
To: R-devel
Subject: [Rd] quantile(), IQR() and median() for factors

[...]
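The split between an unchecked core routine and a checking wrapper suggested in this message could look roughly like the sketch below. The names median_core() and median_checked() are hypothetical and are not functions in base R or the stats package; the core routine assumes a non-empty numeric vector without missing values.

```r
# Hypothetical names throughout: median_core() and median_checked() are
# illustrative only and do not exist in base R or the stats package.

# Core routine: assumes x is a non-empty numeric vector with no NAs.
median_core <- function(x) {
  n <- length(x)
  half <- (n + 1L) %/% 2L
  if (n %% 2L == 1L) {
    sort(x, partial = half)[half]                    # single middle value
  } else {
    mean(sort(x, partial = half + 0:1)[half + 0:1])  # average of the middle two
  }
}

# User-facing wrapper: validate and clean, then delegate to the core routine.
median_checked <- function(x, na.rm = FALSE) {
  if (!is.numeric(x)) stop("need numeric data")
  if (na.rm) x <- x[!is.na(x)] else if (any(is.na(x))) return(NA_real_)
  if (length(x) == 0L) return(NA_real_)
  median_core(x)
}

# In a bootstrap loop the data are known to be clean, so the core routine
# can be called directly and the checks are skipped on every iteration.
set.seed(1)
x <- rnorm(101)
boot_medians <- replicate(200, median_core(sample(x, replace = TRUE)))
```

Whether the saving is measurable would have to be checked by timing both versions on realistic data; the sketch only shows the structure.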
Re: [Rd] quantile(), IQR() and median() for factors
On Fri, 6 Mar 2009, Greg Snow wrote:

> I like the idea of median and friends working on ordered factors. Just a
> couple of thoughts on possible implementations.
>
> Adding extra checks and functionality will slow down the function. [...]
> I would suggest creating a core function that does just the calculations
> (median, quantile, iqr) assuming that the data passed in is correct
> without doing any checks or anything fancy. [...] someone programming a
> high iteration technique could clean the data themselves, then call the
> core function directly bypassing the checks/branches.

Since median and quantile are already generic, adding an 'ordered' method would be zero cost to other uses. And the factor check at the head of median.default could be replaced by median.factor if someone could show a convincing performance difference.

> Just out of curiosity (from someone who only learned from English
> (Americanized at that) and not Italian texts), what would the median of
> [Low, Low, Medium, High] be?

I don't think it is 'the' median but 'a' median. (Even English Wikipedia says the median is not unique for even numbers of inputs.)

--
Brian D. Ripley, rip...@stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University
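The point about generics can be illustrated with a small S3 method. The sketch below is an assumption about what such a method might look like, not the method shipped with R: it returns the lower of the two middle values when n is even, which is only one of several possible conventions. Because dispatch happens on the class of x, numeric input never reaches this code.

```r
# A sketch only: this median.ordered() is an illustration, not R's own code.
# Convention assumed here: for even n, return the lower middle observation.
median.ordered <- function(x, na.rm = FALSE, ...) {
  if (na.rm) x <- x[!is.na(x)] else if (any(is.na(x))) return(x[NA_integer_][1L])
  n <- length(x)
  if (n == 0L) return(x[NA_integer_][1L])
  sort(x)[(n + 1L) %/% 2L]   # order statistic; no arithmetic on the levels
}

f <- factor(c("Low", "High", "Medium", "Low", "Medium"),
            levels = c("Low", "Medium", "High"), ordered = TRUE)

m_factor  <- median(f)            # class c("ordered", "factor"): uses the method above
m_numeric <- median(c(1, 5, 3))   # numeric input still goes to median.default
```

Here the sorted data are Low, Low, Medium, Medium, High, so the middle observation is Medium.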
Re: [Rd] quantile(), IQR() and median() for factors
Dear Greg,

thank you for your comments. As Prof. Ripley pointed out, when the sample size is even the median is not unique: it is formed by the two central observations or by a function of them, if that makes sense.

Dear Prof. Ripley,

thank you for your concern. May I point out that (in the case of non-negative data) one can get the median from mad() with center=0,constant=1:

    > mad(1:10,center=0,constant=1)
    [1] 5.5
    > mad(1:10,center=0,constant=1,high=TRUE)
    [1] 6
    > mad(1:10,center=0,constant=1,low=TRUE)
    [1] 5

so it seems that part of the code of mad() might be a starting point, at least for median(). I confirm my availability to work on the matter if requested.

Kind regards,

Simone

On Fri, Mar 6, 2009 at 6:36 PM, Prof Brian Ripley rip...@stats.ox.ac.uk wrote:

[...]
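The mad() trick from this message, and the restriction to non-negative data, can be checked with a few lines. The helper median_lohi() below is hypothetical; it is modelled on the documented behaviour of the low and high arguments of mad() and is meant only as a sketch of how the same selection could be written for median().

```r
x <- 1:10

# The three calls from the message: with center = 0 and constant = 1,
# mad() reduces to a median of abs(x).
via_mad  <- mad(x, center = 0, constant = 1)
via_high <- mad(x, center = 0, constant = 1, high = TRUE)
via_low  <- mad(x, center = 0, constant = 1, low = TRUE)

# Hypothetical helper: the same low/high selection written directly.
median_lohi <- function(x, low = FALSE, high = FALSE) {
  if (low && high) stop("'low' and 'high' cannot be both TRUE")
  n <- length(x)
  if ((low || high) && n %% 2L == 0L) {
    pos <- n %/% 2L + as.integer(high)   # lower or upper middle position
    sort(x, partial = pos)[pos]
  } else {
    median(x)                            # odd n, or the usual averaged median
  }
}

# mad() works on absolute deviations, so the trick breaks for negative data:
neg_via_mad <- mad(-x, center = 0, constant = 1)   # median of abs(-x)
neg_median  <- median(-x)
```

For negative data the two results differ in sign, which is why the message restricts the trick to non-negative input.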
Re: [Rd] quantile(), IQR() and median() for factors
Yes, I have discussed right-continuous, left-continuous, etc. definitions for the median in numeric data. I was just curious what the discussion was in texts that cover quantiles/medians of ordered categorical data in detail. I do not expect Low.5 as computer output for the median (but Low.Medium does make sense in a way).

Back in my theory classes, when we actually needed a firm definition, I remember using mainly the left-continuous version (Low for the example), but I don't remember why we chose that over the right-continuous version; probably it was just the teacher's or book's preference (I do remember it made things simpler than using the average of the middle two when n was even).

--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.s...@imail.org
801.408.8111

-----Original Message-----
From: Simone Giannerini [mailto:sgianner...@gmail.com]
Sent: Friday, March 06, 2009 2:08 PM
To: Prof Brian Ripley
Cc: Greg Snow; R-devel
Subject: Re: [Rd] quantile(), IQR() and median() for factors

[...]
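The left-continuous and right-continuous conventions mentioned in this message can be compared on the [Low, Low, Medium, High] example. The helper quantile_ordered() below is hypothetical: it selects order statistics through the inverse of the empirical distribution function, so no arithmetic on the levels is needed. It is a sketch under the stated assumptions (ordered factor, no missing values), not R's own quantile() code.

```r
# Hypothetical helper: quantiles of an ordered factor by picking order statistics.
#   left = TRUE  : smallest value whose cumulative proportion is >= p
#   left = FALSE : smallest value whose cumulative proportion is >  p
quantile_ordered <- function(x, probs = c(0.25, 0.5, 0.75), left = TRUE) {
  stopifnot(is.ordered(x), !any(is.na(x)), length(x) > 0L,
            all(probs >= 0 & probs <= 1))
  n <- length(x)
  s <- sort(x)
  idx <- if (left) ceiling(n * probs) else floor(n * probs) + 1
  idx <- pmin(pmax(idx, 1), n)   # clamp positions at p = 0 and p = 1
  s[idx]
}

x <- factor(c("Low", "Low", "Medium", "High"),
            levels = c("Low", "Medium", "High"), ordered = TRUE)

left_median  <- quantile_ordered(x, 0.5)                 # position 2 of the sorted data
right_median <- quantile_ordered(x, 0.5, left = FALSE)   # position 3 of the sorted data
```

With these four observations the left-continuous version gives Low, as recalled in the message, and the right-continuous version gives Medium; both are valid medians in the sense that the median is not unique for even n.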
[Rd] quantile(), IQR() and median() for factors
Dear all,

from the help page of quantile:

    x    numeric vectors whose sample quantiles are wanted. Missing values are ignored.

from the help page of IQR:

    x    a numeric vector.

As a matter of fact, it seems that neither quantile() nor IQR() checks for the presence of a numeric input. See the following:

    > set.seed(11)
    > x <- rbinom(n=11,size=2,prob=.5)
    > x <- factor(x,ordered=TRUE)
    > x
    [1] 1 0 1 0 0 2 0 1 2 0 0
    Levels: 0 < 1 < 2
    > quantile(x)
      0%  25%  50%  75% 100%
       0 <NA>    0 <NA>    2
    Levels: 0 < 1 < 2
    Warning messages:
    1: In Ops.ordered((1 - h), qs[i]) :
      '*' is not meaningful for ordered factors
    2: In Ops.ordered(h, x[hi[i]]) :
      '*' is not meaningful for ordered factors
    > IQR(x)
    [1] 1

whereas median has the check:

    > median(x)
    Error in median.default(x) : need numeric data

I also take the opportunity to ask for your comments on the following related subject:

In my opinion it would be convenient for median() and the like (quantile(), IQR()) to be implemented for ordered factors, for which they can in fact be well defined. For instance, in this way functions like apply(x,FUN=median,...) could be used without the need of further processing for data frames that contain both numeric variables and ordered factors.

To my limited knowledge, English introductory statistics textbooks mention only marginally the fact that the median is well defined for ordered categorical variables. In the Italian statistics literature, on the other hand, this is often discussed in detail, and this could mislead students and practitioners who might expect median() to work for ordered factors.

In this message

https://stat.ethz.ch/pipermail/r-help/2003-November/042684.html

Martin Maechler considers the possibility of doing such a job by allowing for extra arguments low and high, as is done for mad().

I am willing to contribute if requested, and comments are welcome.

Thank you for the attention,

kind regards,

Simone

    > R.version
                   _
    platform       i386-pc-mingw32
    arch           i386
    os             mingw32
    system         i386, mingw32
    status
    major          2
    minor          8.1
    year           2008
    month          12
    day            22
    svn rev        47281
    language       R
    version.string R version 2.8.1 (2008-12-22)

    LC_COLLATE=Italian_Italy.1252;LC_CTYPE=Italian_Italy.1252;LC_MONETARY=Italian_Italy.1252;LC_NUMERIC=C;LC_TIME=Italian_Italy.1252

--
______________________________________________________
Simone Giannerini
Dipartimento di Scienze Statistiche Paolo Fortunati
Universita' di Bologna
Via delle belle arti 41 - 40126 Bologna, ITALY
Tel: +39 051 2098262 Fax: +39 051 232153
http://www2.stat.unibo.it/giannerini/

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel