Re: [R] Linear regression with a rounded response variable

2015-10-22 Thread Jim Lemon
Hi Ravi,
And remember that the vanilla rounding procedure (round half up) is biased
upward. That is, an observation of 5 may actually have ranged from 4.5 up to
just under 5.5.

Jim

On Thu, Oct 22, 2015 at 7:15 AM, peter salzman 
wrote:

> here is one thought:
>
> if you plug your numbers into any kind of regression you will get
> predictions that are real numbers and not necessarily integers. it may be
> that your predictions are good enough with this approximate value of Y. you
> could test this by randomly perturbing your data by +- 0.5 and comparing the
> results with the original result.
>
> let me add another idea:
>
> if data is not fully observed this falls under the umbrella of censored
> data. in this case you have interval censoring: if you see 5 then the
> observation is in the interval [4.5, 5.5]
> i'm not familiar with the field but i'd search for 'regression with
> interval censoring'
>
>
> peter

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Linear regression with a rounded response variable

2015-10-22 Thread Ravi Varadhan
Dear Peter, Charles, Gabor, Jim, Mark, Victor, Peter, and Harold,
You have given me plenty of ammunition.  Thank you very much for the useful 
answers. 
Gratefully,
Ravi



Re: [R] Linear regression with a rounded response variable

2015-10-22 Thread Doran, Harold

> Yes, and I think that the suggestion in another post to look at censored 
> regression is more in the right direction.

I think this is right, and it is perhaps a better pathway to pursue than
considering this within the framework of measurement error (ME). Of course
there *is* ME in the observed walking time, since the observed value is only
one draw from the distribution of potential times that could have been
observed for each individual.

But the typical econometric correction for ME requires that we have an
observed value and then an estimate of its variance. Theoretically, I would
imagine this variance to be heteroscedastic and to vary by individual. In
Ravi's regression with the observed value on the LHS, there is no bias in the
regression coefficients because the ME is not correlated with the error term,
but the standard errors of the coefficients would be too large. If such a
conditional variance did exist, you could treat the reciprocal of the variance
as a weight in WLS, so that values with less ME have greater weight in the
estimation, and there would also exist a closed-form way to correct the
standard errors.

This, however, is not the problem as I understand it from Ravi. Instead, he
observes a rounded value, so the underlying x is only known to lie within an
interval, x_l < x < x_u, where x_l and x_u denote the lower and upper limits
for the observed value.

At first this threw me for a loop because censoring in my work is typically 
done at the extremes with left/right censored data. But, there is also a 
package in R for interval censoring (called interval), though I have not used 
it before. Some googling on this topic drew me to some good worked examples 
that I think fit within the framework Ravi is working within.

So, perhaps Ravi's question really has two issues, one of which might be 
solvable: there is ME in the outcome value, y. But, perhaps that is ignorable. 
The censoring is perhaps not ignorable, and even better yet solvable?



[R] Linear regression with a rounded response variable

2015-10-21 Thread Ravi Varadhan
Hi,
I am dealing with a regression problem where the response variable, time
(seconds) to walk 15 ft, is rounded to the nearest integer.  I do not care for
the regression coefficients per se, but my main interest is in getting the
prediction equation for walking speed, given the predictors (age, height, sex,
etc.), where the predictions will be real numbers, and not integers.  The hope
is that these predictions should provide unbiased estimates of the "unrounded"
walking speed. This sounds like a measurement error problem, where the
measurement error is due to rounding and hence would be uniformly distributed
on (-0.5, 0.5).

Are there any canonical approaches for handling this type of a problem? What is 
wrong with just doing the standard linear regression?

I googled and saw that this question was asked by someone else in a 
stackexchange post, but it was unanswered.  Any suggestions?

Thank you,
Ravi

Ravi Varadhan, Ph.D. (Biostatistics), Ph.D. (Environmental Engg)
Associate Professor,  Department of Oncology
Division of Biostatistics & Bioinformatics
Sidney Kimmel Comprehensive Cancer Center
Johns Hopkins University
550 N. Broadway, Suite -E
Baltimore, MD 21205
410-502-2619




Re: [R] Linear regression with a rounded response variable

2015-10-21 Thread Charles C. Berry

On Wed, 21 Oct 2015, Ravi Varadhan wrote:

> Hi, I am dealing with a regression problem where the response variable,
> time (seconds) to walk 15 ft, is rounded to the nearest integer.  I do
> not care for the regression coefficients per se, but my main interest is
> in getting the prediction equation for walking speed, given the
> predictors (age, height, sex, etc.), where the predictions will be real
> numbers, and not integers.  The hope is that these predictions should
> provide unbiased estimates of the "unrounded" walking speed. This
> sounds like a measurement error problem, where the measurement error is
> due to rounding and hence would be uniformly distributed on (-0.5, 0.5).

Not the usual "measurement error model" problem, though, where the errors 
are in X and not independent of XB.


Look back at the proof of the unbiasedness of least squares under the 
Gauss-Markov setup. The errors in Y need to have expectation zero.


From your description (but see caveat below) this is true of walking 
*time*, but not exactly true of walking *speed* (modulo the usual 
assumptions if they apply to time). In fact if E(epsilon) = 0 were true of 
unrounded time, it would not be true of unrounded speed (and vice versa).




> Are there any canonical approaches for handling this type of a problem?


Work out the bias analytically? Parametric bootstrap? Data augmentation 
and friends?



> What is wrong with just doing the standard linear regression?



Well, what do the actual values look like?

If half the subjects have a value of 5 seconds and the rest are split 
between 4 and 6, your assertion that rounding induces an error of 
dunif(epsilon,-0.5,0.5) is surely wrong (more positive errors in the 6 
second group and more negative errors in the 4 second group under any 
plausible model).
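
A minimal simulation sketch of this point. The normal distribution of true
walking times (mean 5, sd 0.6) is an assumption chosen only for illustration;
the error is taken as observed minus true. It tabulates the mean rounding
error within each observed value, which would be close to zero everywhere if
the error really were independent U(-0.5, 0.5).

```r
# Illustration only: the N(5, 0.6) distribution of true times is an assumption.
set.seed(1)
true_time <- rnorm(1e5, mean = 5, sd = 0.6)
observed  <- round(true_time)
err       <- observed - true_time      # rounding error: observed minus true

# Mean rounding error within each observed (rounded) value
err_by_obs <- tapply(err, observed, mean)
print(round(err_by_obs, 3))

# Marginal mean, for comparison with the per-group means above
print(mean(err))
```

In this setup the per-group means differ in sign on either side of the modal
value even though the overall mean error is close to zero.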



HTH,

Chuck



Re: [R] Linear regression with a rounded response variable

2015-10-21 Thread Victor Tian
Hi Ravi,

Thanks for this interesting question. My thoughts are given below.

If you believe the rounding error is indeed uniformly distributed, then the
problem is equivalent to adding a uniform random error on (-0.5, 0.5) to
every observation in addition to the usual normal error, so the new error
term is the sum of a normal and a uniform variable (its distribution is the
convolution of the two, and is no longer normal).

Intuitively, the impact of this newly added term depends on the relative
scale of the original normal and the new uniform error terms. To see the
exact impact, you can simulate sets of new response variables by adding
uniform errors from (-0.5, 0.5) to the original response variables and see
the results.
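
A rough sketch of such a simulation. The data-generating model, the predictor
`age`, and all numeric values are hypothetical; with real data only the
rounded response would be available. Note that this only checks how sensitive
the fit is to perturbations the size of the rounding step; it does not recover
the unrounded values.

```r
# Hypothetical data: 'age' and the coefficients below are assumptions.
set.seed(42)
n         <- 200
age       <- runif(n, 60, 90)
time_true <- 2 + 0.05 * age + rnorm(n, sd = 0.8)   # not observed in practice
time_obs  <- round(time_true)                      # what is actually recorded

fit_obs <- lm(time_obs ~ age)

# Refit after adding U(-0.5, 0.5) noise to the rounded response
refit <- function() {
  y_jit <- time_obs + runif(n, -0.5, 0.5)
  coef(lm(y_jit ~ age))
}
jit <- replicate(500, refit())     # 2 x 500 matrix of coefficients

print(coef(fit_obs))
print(rowMeans(jit))               # average coefficients over the jittered fits
print(apply(jit, 1, sd))           # spread induced by the added uniform noise
```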

I wish I could offer a more theoretical answer, but I hope this helps.

Best,
Xu

Xu Tian, Ph.D.
Senior Statistician
Validus Research
New York, NY 10005



Re: [R] Linear regression with a rounded response variable

2015-10-21 Thread peter salzman
here is one thought:

if you plug your numbers into any kind of regression you will get
predictions that are real numbers and not necessarily integers. it may be
that your predictions are good enough with this approximate value of Y. you
could test this by randomly perturbing your data by +- 0.5 and comparing the
results with the original result.

let me add another idea:

if data is not fully observed this falls under the umbrella of censored
data. in this case you have interval censoring: if you see 5 then the
observation is in the interval [4.5, 5.5]
i'm not familiar with the field but i'd search for 'regression with
interval censoring'
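
One way the interval-censoring idea could be tried in R is sketched below,
using survreg() from the survival package with a Gaussian error distribution.
This is an illustration under assumptions, not something proposed in the
thread: the simulated data, the predictor `age`, and the coefficient values
are all hypothetical.

```r
library(survival)   # recommended package, shipped with standard R installs

# Hypothetical data: only t_obs and age would be available in practice.
set.seed(7)
n      <- 500
age    <- runif(n, 60, 90)
t_true <- 2 + 0.05 * age + rnorm(n, sd = 0.8)   # unobserved exact time
t_obs  <- round(t_true)                          # recorded, rounded time

# Each rounded value k stands for the interval [k - 0.5, k + 0.5]
lower <- t_obs - 0.5
upper <- t_obs + 0.5

# Interval-censored normal regression
fit_ic <- survreg(Surv(lower, upper, type = "interval2") ~ age,
                  dist = "gaussian")

# Ordinary least squares on the rounded values, for comparison
fit_ls <- lm(t_obs ~ age)

print(rbind(interval = coef(fit_ic), ols = coef(fit_ls)))
print(fit_ic$scale)   # estimated residual sd from the censored fit
```

predict(fit_ic, newdata = ...) then gives real-valued predictions on the time
scale.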


peter





-- 
Peter Salzman, PhD
Department of Biostatistics and Computational Biology
University of Rochester



Re: [R] Linear regression with a rounded response variable

2015-10-21 Thread Gabor Grothendieck
This could be modeled directly using Bayesian techniques. Consider the
Bayesian version of the following model where we only observe y and X.  y0
is not observed.

   y0 <- X b + error
   y <- round(y0)

The following code is based on modifying the code in the README of the CRAN
rcppbugs R package.


library(rcppbugs)
set.seed(123)

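# set up the test data - y and X are observed but not y0
NR <- 1e2L
NC <- 2L
X <- cbind(1, rnorm(10))  # note: 10 rows are simulated here; NR is not used
y0 <- X %*% 1:2           # true coefficients b = (1, 2); no error term added
y <- round(y0)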

# for comparison run a normal linear model w/ lm.fit using X and y
lm.res <- lm.fit(X,y)
print(coef(lm.res))
##        x1        x2
## 0.9569366 1.9170808

# RCppBugs Model
b <- mcmc.normal(rnorm(NC),mu=0,tau=0.0001)
tau.y <- mcmc.gamma(sd(as.vector(y)),alpha=0.1,beta=0.1)
y.hat <- deterministic(function(X,b) { round(X %*% b) }, X, b)
y.lik <- mcmc.normal(y,mu=y.hat,tau=tau.y,observed=TRUE)
m <- create.model(b, tau.y, y.hat, y.lik)

# run the Bayesian model based on y and X
cat("running model...\n")
runtime <- system.time(ans <- run.model(m, iterations=1e5L, burn=1e4L,
adapt=1e3L, thin=10L))
print(apply(ans[["b"]],2,mean))
## [1] 0.9882485 2.0009989





-- 
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com



Re: [R] Linear regression with a rounded response variable

2015-10-21 Thread peter dalgaard

> On 21 Oct 2015, at 19:57 , Charles C. Berry  wrote:
> 
> On Wed, 21 Oct 2015, Ravi Varadhan wrote:
> 
>> [snippage]
> 
> If half the subjects have a value of 5 seconds and the rest are split between 
> 4 and 6, your assertion that rounding induces an error of 
> dunif(epsilon,-0.5,0.5) is surely wrong (more positive errors in the 6 second 
> group and more negative errors in the 4 second group under any plausible 
> model).

Yes, and I think that the suggestion in another post to look at censored 
regression is more in the right direction. 

In general, I'd expect the bias caused by rounding the response to be quite
small, except at very high granularity. I did a few small experiments with the 
simplest possible linear model: estimating a mean based on highly rounded data:

> y <- round(rnorm(1e2,pi,.5))
> mean(y)
[1] 3.12
> table(y)
y
 2  3  4  5 
13 63 23  1 

Or, using a bigger sample:

> mean(round(rnorm(1e8,pi,.5)))
[1] 3.139843

in which there is a visible bias, but quite a small one: 

> pi - 3.139843
[1] 0.001749654

At lower granularity (sd=1 instead of .5), the bias has almost disappeared.

> mean(round(rnorm(1e8,pi,1)))
[1] 3.141577

If the granularity is increased sufficiently, you _will_ see a sizeable bias 
(because almost all observations will be round(pi)==3):

> mean(round(rnorm(1e8,pi,.1)))
[1] 3.00017
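
The simulated means above can be cross-checked without simulation, since
round(Y) = k exactly when Y falls in [k - 0.5, k + 0.5). A small sketch (the
helper name expected_rounded and the summation range -50:50 are choices made
here for illustration):

```r
# Exact E[round(Y)] for Y ~ N(mu, sigma^2), summing over integer outcomes k
expected_rounded <- function(mu, sigma, k = -50:50) {
  # P(round(Y) = k) = P(k - 0.5 <= Y < k + 0.5)
  p <- pnorm(k + 0.5, mu, sigma) - pnorm(k - 0.5, mu, sigma)
  sum(k * p)
}

# Bias of the rounded mean for the three sd values used above
bias <- sapply(c(0.1, 0.5, 1), function(s) expected_rounded(pi, s) - pi)
names(bias) <- c("sd=0.1", "sd=0.5", "sd=1")
print(signif(bias, 3))
```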


A full ML fit (with known sigma=0.5, the value used to generate y) is pretty easily done:

> library(stats4)
> mll <- function(mu)-sum(log(pnorm(y+.5,mu, .5)-pnorm(y-.5, mu, .5)))
> mle(mll,start=list(mu=3))

Call:
mle(minuslogl = mll, start = list(mu = 3))

Coefficients:
  mu 
3.122069 
> mean(y)
[1] 3.12

As you see, the difference is only 0.002. 

A small simulation (1000 repl.) gave (r[1,]==MLE ; r[2,]==mean)

> summary(r[1,]-r[2,])
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
-0.004155  0.000702  0.001495  0.001671  0.002554  0.006860 

so the corrections relative to the crude mean stay within one unit in the 2nd 
place. Notice  that the corrections are pretty darn close to cancelling out the 
bias.
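
The simulation code itself is not shown in the post. A sketch of one way it
could be set up follows; it assumes n = 100 per replicate and sigma = 0.5 as
in the experiments above, and it uses optimize() in place of mle() purely for
speed, so it is not necessarily what was actually run.

```r
set.seed(2015)
sigma <- 0.5        # treated as known, matching the data-generating sd

one_rep <- function(n = 100) {
  y <- round(rnorm(n, pi, sigma))
  # Negative log-likelihood of the rounded data (interval probabilities)
  nll <- function(mu) {
    -sum(log(pnorm(y + 0.5, mu, sigma) - pnorm(y - 0.5, mu, sigma)))
  }
  mle <- optimize(nll, interval = c(2, 4), tol = 1e-8)$minimum
  c(mle = mle, mean = mean(y))
}

r <- replicate(1000, one_rep())    # row 1 = MLE, row 2 = crude mean
print(summary(r[1, ] - r[2, ]))
```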

-pd


-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd@cbs.dk  Priv: pda...@gmail.com
