Re: [R] Linear regression with a rounded response variable
Hi Ravi,

And remember that the vanilla rounding procedure is biased upward. That is, an observation of 5 may actually have ranged from 4.5 to 5.4.

Jim

On Thu, Oct 22, 2015 at 7:15 AM, peter salzman wrote:
> if data is not fully observed this falls under the umbrella of censored
> data, in this case you have interval censoring. if you see 5 then the
> observation is in interval [4.5, 5.5]
> i'm not familiar with the field but i'd search for 'regression with
> interval censoring'

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
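A small sketch of the arithmetic behind the "4.5 to 5.4" remark, assuming readings are taken to one decimal place and rounded half up. The helper `round_half_up` is hypothetical (it is not a base R function), and whether the walking times were actually recorded this way is an assumption. Note that R's own round() does not use this rule: it rounds exactly representable halves to the even neighbour.

```r
# Readings to one decimal place that a round-half-up rule would record as 5
x <- seq(4.5, 5.4, by = 0.1)

# Hypothetical helper: round half up (not what base R's round() does)
round_half_up <- function(v) floor(v + 0.5)

# Average amount by which the recorded value exceeds the underlying reading
bias_half_up <- mean(round_half_up(x) - x)
print(bias_half_up)

# Base R sends exact halves to the even neighbour instead
halves <- c(0.5, 1.5, 2.5, 3.5)
r_round <- round(halves)
print(r_round)
```

Under these assumptions the upward bias is about 0.05 of a unit, which is tiny next to a one-second rounding grid.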
Re: [R] Linear regression with a rounded response variable
Dear Peter, Charles, Gabor, Jim, Mark, Victor, Peter, and Harold,

You have given me plenty of ammunition. Thank you very much for the useful answers.

Gratefully,
Ravi

From: peter dalgaard <pda...@gmail.com>
Sent: Wednesday, October 21, 2015 8:11 PM
To: Charles C. Berry
Cc: Ravi Varadhan; r-help@r-project.org
Subject: Re: [R] Linear regression with a rounded response variable
Re: [R] Linear regression with a rounded response variable
> Yes, and I think that the suggestion in another post to look at censored
> regression is more in the right direction.

I think this is right, and perhaps a better pathway to pursue than considering this within the framework of measurement error (ME). Of course there *is* ME in the observed walking time, since the observed value is only one draw from the distribution of potential times that could have been observed for each individual. But the typical econometric correction for ME requires that we have an observed value and an estimate of its variance. Theoretically, I would imagine this variance to be heteroscedastic and to vary by individual.

In Ravi's regression with the observed value on the LHS, there is no bias in the regression coefficients because the ME is not correlated with the error term, but the standard errors of the coefficients would be too large. If such a conditional variance did exist, you could treat the reciprocal of the variance as a weight in WLS, so that values with less ME have greater weight in the estimation, and there would also exist a closed-form way to correct the standard errors.

This, however, is not the problem as I understand it from Ravi. Instead, he observes x, which lies within a known interval x_l < x < x_u, where x_l and x_u denote the lower and upper limits for the observed value. At first this threw me for a loop, because censoring in my work is typically done at the extremes, with left/right censored data. But there is also a package in R for interval censoring (called interval), though I have not used it before. Some googling on this topic drew me to some good worked examples that I think fit within the framework Ravi is working within.

So perhaps Ravi's question really has two issues, one of which might be solvable. There is ME in the outcome value, y, but perhaps that is ignorable. The censoring is perhaps not ignorable, and even better yet solvable?
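For what it is worth, the interval-censoring idea can be sketched in base R without any extra package. This is only an illustration on simulated data: the predictor `age_c`, the coefficients and the error SD are all made up, and normal errors are assumed. Each rounded time contributes the probability that the unrounded time fell in [t_obs - 0.5, t_obs + 0.5].

```r
set.seed(1)
n <- 500
age_c  <- runif(n, -15, 15)                          # hypothetical centred predictor
t_true <- 5.75 + 0.05 * age_c + rnorm(n, sd = 0.8)   # unrounded time, unobserved in practice
t_obs  <- round(t_true)                              # what is actually recorded

# Negative log-likelihood for interval-censored normal regression
negll <- function(par) {
  mu    <- par[1] + par[2] * age_c
  sigma <- exp(par[3])                               # log scale keeps sigma positive
  p <- pnorm(t_obs + 0.5, mu, sigma) - pnorm(t_obs - 0.5, mu, sigma)
  -sum(log(pmax(p, 1e-300)))                         # guard against log(0)
}

# Ordinary least squares on the rounded response gives starting values
ols   <- lm(t_obs ~ age_c)
start <- c(unname(coef(ols)), log(summary(ols)$sigma))

fit      <- optim(start, negll, method = "BFGS")
ic_coef  <- fit$par[1:2]
ic_sigma <- exp(fit$par[3])
print(rbind(ols = unname(coef(ols)), interval = ic_coef))
```

The same kind of model can, as far as I know, also be fitted with survreg() in the survival package using Surv(lower, upper, type = "interval2") and dist = "gaussian"; the hand-written likelihood is used here only to keep the sketch self-contained.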
[R] Linear regression with a rounded response variable
Hi,

I am dealing with a regression problem where the response variable, time (seconds) to walk 15 ft, is rounded to the nearest integer. I do not care for the regression coefficients per se, but my main interest is in getting the prediction equation for walking speed, given the predictors (age, height, sex, etc.), where the predictions will be real numbers, and not integers. The hope is that these predictions should provide unbiased estimates of the "unrounded" walking speed. This sounds like a measurement error problem, where the measurement error is due to rounding and hence would be uniformly distributed (-0.5, 0.5).

Are there any canonical approaches for handling this type of problem? What is wrong with just doing the standard linear regression?

I googled and saw that this question was asked by someone else in a stackexchange post, but it was unanswered. Any suggestions?

Thank you,
Ravi

Ravi Varadhan, Ph.D. (Biostatistics), Ph.D. (Environmental Engg)
Associate Professor, Department of Oncology
Division of Biostatistics & Bioinformatics
Sidney Kimmel Comprehensive Cancer Center
Johns Hopkins University
550 N. Broadway, Suite -E
Baltimore, MD 21205
410-502-2619
Re: [R] Linear regression with a rounded response variable
On Wed, 21 Oct 2015, Ravi Varadhan wrote:

> Hi,
> I am dealing with a regression problem where the response variable, time
> (second) to walk 15 ft, is rounded to the nearest integer. I do not care
> for the regression coefficients per se, but my main interest is in getting
> the prediction equation for walking speed, given the predictors (age,
> height, sex, etc.), where the predictions will be real numbers, and not
> integers. The hope is that these predictions should provide unbiased
> estimates of the "unrounded" walking speed. These sounds like a measurement
> error problem, where the measurement error is due to rounding and hence
> would be uniformly distributed (-0.5, 0.5).

Not the usual "measurement error model" problem, though, where the errors are in X and not independent of XB.

Look back at the proof of the unbiasedness of least squares under the Gauss-Markov setup. The errors in Y need to have expectation zero.

From your description (but see caveat below) this is true of walking *time*, but not exactly true of walking *speed* (modulo the usual assumptions if they apply to time). In fact, if E(epsilon) = 0 were true of unrounded time, it would not be true of unrounded speed (and vice versa).

> Are there any canonical approaches for handling this type of a problem?

Work out the bias analytically? Parametric bootstrap? Data augmentation and friends?

> What is wrong with just doing the standard linear regression?

Well, what do the actual values look like?

If half the subjects have a value of 5 seconds and the rest are split between 4 and 6, your assertion that rounding induces an error of dunif(epsilon,-0.5,0.5) is surely wrong (more positive errors in the 6 second group and more negative errors in the 4 second group under any plausible model).

HTH,

Chuck
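To illustrate the time-versus-speed point with numbers, here is a minimal sketch. It takes the uniform(-0.5, 0.5) error on the time scale at face value (the original poster's assumption, which the reply above questions) and uses a made-up true time of 5 seconds. The error averages to zero for time, but the implied error in speed (15 ft divided by time) does not, because 1/time is a convex function.

```r
set.seed(2)
dist_ft   <- 15
time_true <- 5                              # hypothetical true walking time (seconds)
eps       <- runif(1e5, -0.5, 0.5)          # zero-mean error on the time scale
time_obs  <- time_true + eps

mean_time_err  <- mean(time_obs) - time_true
mean_speed_err <- mean(dist_ft / time_obs) - dist_ft / time_true

# Exact mean error in speed under the same assumption:
# E[15 / (5 + U)] - 15 / 5 with U ~ uniform(-0.5, 0.5)
analytic_speed_err <- dist_ft * log(5.5 / 4.5) - dist_ft / time_true

print(c(time = mean_time_err, speed = mean_speed_err, analytic = analytic_speed_err))
```

In this toy setup the bias in speed is around 0.01 ft/s, so it is small but systematically positive.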
Re: [R] Linear regression with a rounded response variable
Hi Ravi,

Thanks for this interesting question. My thoughts are given below.

If you believe the rounding error is indeed uniformly distributed, then the problem is equivalent to adding a uniform random error on (-0.5, 0.5) to every observation in addition to the usual normal error. The new error term is then the sum of a normal and a uniform variable, so its distribution is the convolution of the two rather than a pure normal. Intuitively, the impact of this newly added term depends on the relative scale of the original normal and the new uniform error terms.

To see the exact impact, you can simulate sets of new response variables by adding uniform errors from (-0.5, 0.5) to the original response variables and see the results.

I wish I could have more theoretical answers and hope this helps as well.

Best,
Xu

Xu Tian, Ph.D.
Senior Statistician
Validus Research
New York, NY 10005
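One way to look at the "relative scale" point is to simulate from a known model and fit it with and without rounding for several error SDs. The model below (intercept 5, slope 1, a single standard normal predictor) is invented purely for illustration.

```r
set.seed(3)
n <- 2000
x <- rnorm(n)                                   # hypothetical predictor

# Fit the same model to the unrounded and the rounded response
compare_fits <- function(sigma) {
  y0 <- 5 + x + rnorm(n, sd = sigma)            # unrounded response
  f0 <- lm(y0 ~ x)
  f1 <- lm(round(y0) ~ x)                       # rounded to the nearest integer
  c(slope_unrounded    = unname(coef(f0)[2]),
    slope_rounded      = unname(coef(f1)[2]),
    resid_sd_unrounded = summary(f0)$sigma,
    resid_sd_rounded   = summary(f1)$sigma)
}

sigmas <- c(0.1, 0.5, 2)
res <- sapply(sigmas, compare_fits)
colnames(res) <- paste0("sigma=", sigmas)
print(round(res, 3))
```

Comparing the columns gives a feel for how much the rounding matters when the normal error is small or large relative to the one-unit rounding grid.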
Re: [R] Linear regression with a rounded response variable
here is one thought:

if you plug your numbers into any kind of regression you will get predictions that are real numbers and not necessarily integers. it may be that your predictions are good enough with this approximate value of Y. you could test this by randomly perturbing your response values by up to +- 0.5 and comparing the results with the original result.

let me add another idea:

if data is not fully observed this falls under the umbrella of censored data. in this case you have interval censoring: if you see 5 then the observation is in the interval [4.5, 5.5]. i'm not familiar with the field but i'd search for 'regression with interval censoring'

peter

-- 
Peter Salzman, PhD
Department of Biostatistics and Computational Biology
University of Rochester
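The perturbation check suggested above might look roughly like this. The data are simulated stand-ins (the real walking-time data are not available here), and the number of repetitions is arbitrary.

```r
set.seed(4)
n <- 300
x <- rnorm(n)                                  # hypothetical predictor
y <- round(5 + x + rnorm(n))                   # stand-in for the observed, rounded response

base_coef <- coef(lm(y ~ x))                   # fit to the rounded data as recorded

# Refit after adding uniform(-0.5, 0.5) noise to every response value
refit_jittered <- function() {
  yj <- y + runif(n, -0.5, 0.5)
  coef(lm(yj ~ x))
}
jit <- replicate(200, refit_jittered())        # 2 x 200 matrix of coefficients

shift  <- rowMeans(jit) - base_coef            # average movement of each coefficient
spread <- apply(jit, 1, sd)                    # variability induced by the jitter
print(rbind(base = base_coef, shift = shift, spread = spread))
```

If the coefficients barely move relative to their standard errors, the plain regression is probably good enough for prediction; the check does not by itself say whether the uniform assumption is right.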
Re: [R] Linear regression with a rounded response variable
This could be modeled directly using Bayesian techniques. Consider the Bayesian version of the following model, where we only observe y and X; y0 is not observed.

y0 <- X b + error
y <- round(y0)

The following code is based on modifying the code in the README of the CRAN rcppbugs R package.

library(rcppbugs)
set.seed(123)

# set up the test data - y and X are observed but not y0
NR <- 1e2L   # not used below: X is built with 10 rows
NC <- 2L
X <- cbind(1, rnorm(10))
y0 <- X %*% 1:2   # true coefficients are 1 and 2; no error term is added here
y <- round(y0)

# for comparison run a normal linear model w/ lm.fit using X and y
lm.res <- lm.fit(X,y)
print(coef(lm.res))
##        x1        x2
## 0.9569366 1.9170808

# RCppBugs Model
b <- mcmc.normal(rnorm(NC),mu=0,tau=0.0001)
tau.y <- mcmc.gamma(sd(as.vector(y)),alpha=0.1,beta=0.1)
# the rounding is built into the model mean
y.hat <- deterministic(function(X,b) { round(X %*% b) }, X, b)
y.lik <- mcmc.normal(y,mu=y.hat,tau=tau.y,observed=TRUE)
m <- create.model(b, tau.y, y.hat, y.lik)

# run the Bayesian model based on y and X
cat("running model...\n")
runtime <- system.time(ans <- run.model(m, iterations=1e5L, burn=1e4L, adapt=1e3L, thin=10L))
print(apply(ans[["b"]],2,mean))
## [1] 0.9882485 2.0009989

-- 
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
Re: [R] Linear regression with a rounded response variable
> On 21 Oct 2015, at 19:57 , Charles C. Berry wrote:
>
> On Wed, 21 Oct 2015, Ravi Varadhan wrote:
>
>> [snippage]
>
> If half the subjects have a value of 5 seconds and the rest are split between
> 4 and 6, your assertion that rounding induces an error of
> dunif(epsilon,-0.5,0.5) is surely wrong (more positive errors in the 6 second
> group and more negative errors in the 4 second group under any plausible
> model).

Yes, and I think that the suggestion in another post to look at censored regression is more in the right direction.

In general, I'd expect the bias caused by rounding the response to be quite small, except at very high granularity. I did a few small experiments with the simplest possible linear model: estimating a mean based on highly rounded data,

> y <- round(rnorm(1e2,pi,.5))
> mean(y)
[1] 3.12
> table(y)
y
 2  3  4  5 
13 63 23  1 

Or, using a bigger sample:

> mean(round(rnorm(1e8,pi,.5)))
[1] 3.139843

in which there is a visible bias, but quite a small one:

> pi - 3.139843
[1] 0.001749654

At lower granularity (sd=1 instead of .5), the bias has almost disappeared.

> mean(round(rnorm(1e8,pi,1)))
[1] 3.141577

If the granularity is increased sufficiently, you _will_ see a sizeable bias (because almost all observations will be round(pi)==3):

> mean(round(rnorm(1e8,pi,.1)))
[1] 3.00017

A full ML fit (with known sigma=.5, as used to generate y) is pretty easily done:

> library(stats4)
> mll <- function(mu)-sum(log(pnorm(y+.5,mu, .5)-pnorm(y-.5, mu, .5)))
> mle(mll,start=list(mu=3))

Call:
mle(minuslogl = mll, start = list(mu = 3))

Coefficients:
      mu 
3.122069 
> mean(y)
[1] 3.12

As you see, the difference is only 0.002.

A small simulation (1000 repl.) gave (r[1,]==MLE ; r[2,]==mean)

> summary(r[1,]-r[2,])
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
-0.004155  0.000702  0.001495  0.001671  0.002554  0.006860 

so the corrections relative to the crude mean stay within one unit in the 2nd place. Notice that the corrections are pretty darn close to cancelling out the bias.

-pd

-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd@cbs.dk  Priv: pda...@gmail.com
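The code for the small simulation was not posted, so the following is only a guess at its general shape, with far fewer replications. It swaps in optimize() for mle() to keep the one-parameter fit simple; the names `one_rep` and `correction` are made up, and sigma is treated as known (.5), as in the ML fit above.

```r
set.seed(5)

# One replication: rounded normal sample, interval-likelihood MLE of the mean,
# and the crude mean of the rounded values
one_rep <- function(n = 100, mu = pi, sigma = 0.5) {
  y <- round(rnorm(n, mu, sigma))
  mll <- function(m) -sum(log(pnorm(y + 0.5, m, sigma) - pnorm(y - 0.5, m, sigma)))
  mle_hat <- optimize(mll, interval = range(y) + c(-1, 1), tol = 1e-10)$minimum
  c(mle = mle_hat, mean = mean(y))
}

r <- replicate(50, one_rep())                  # rows "mle" and "mean"
correction <- r["mle", ] - r["mean", ]
print(summary(correction))
```

The summary of `correction` plays the role of summary(r[1,]-r[2,]) above; with only 50 replications the numbers will differ from the ones quoted in the message.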