Re: [R] Problem with a regression - Dataset Workinghours

2012-07-29 Thread peter dalgaard

On Jul 28, 2012, at 17:37 , Giorgio Monti wrote:

> I'm a student working on a research project using the statistical program "R
> 2.15.1".
> Here's my problem: how can I do a regression considering only values over a
> certain limit?
> For example, considering the dataset "Workinghours" of the "Ecdat" package,
> is it possible to build a predictive model that expresses the probability that a
> wife works more than 8 hours per day?
> The dataset includes 3382 observations on the number of hours per year spent
> working by wives in the USA.
> 
> hoursday=hours/240
> index<-which(hoursday>=8)
> hoursday[index]
> 
> As you see, I'm able to extract the values in 'hoursday' (which is
> hours/240 working days in one year) that are >= 8.0, but obviously I can't do a
> regression because the extracted data are a subset of the entire dataset (955
> observations), while the other variables, like age, occupation, income,
> etc. are still complete (3382).
> 
> So I can't do:
> lm = lm(hoursday[index] ~
> income+age+education+unemp+child5+child13+child17+nonwhite+owned+mortgage+occupation)
> In fact "R" gives me: Error in model.frame.default(formula =
> hoursday[index] ~ income, drop.unused.levels = TRUE) : variable lengths
> differ (found for 'income').
> 
> Can you help me?
> 

Yes: don't do that. You are not going to "build a predictive model that expresses 
the probability that a wife works more than 8 hours per day" from data where 
everyone works more than 8 hours per day!

You can either fit the model to all data and work out the probabilistic 
consequences, or if you don't quite believe the normality assumption of linear 
models, perhaps reduce the outcome to 0/1 and turn to logit or probit 
regression.
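
A minimal sketch of both options follows. It uses simulated data as a stand-in for the Ecdat dataset, so the coefficients, the reduced set of covariates, and the names dat, long, fit_lm, fit_logit, p_lm and p_logit are illustrative assumptions, not results from Workinghours.

```r
# Simulated stand-in for the real data (hypothetical values, for illustration only).
set.seed(1)
n <- 3382
dat <- data.frame(
  age       = round(runif(n, 20, 60)),
  education = round(runif(n, 8, 18)),
  child5    = rpois(n, 0.3)
)
# Annual hours, floored at zero since some wives do not work at all.
dat$hours <- pmax(0, 600 + 90 * dat$education - 5 * dat$age -
                     400 * dat$child5 + rnorm(n, sd = 800))
dat$hoursday <- dat$hours / 240
dat$long <- as.integer(dat$hoursday >= 8)   # 0/1 outcome: works 8+ hours per day

# Option 1: linear model on ALL rows, then P(hoursday >= 8 | x),
# taking the normal-error assumption at face value.
fit_lm <- lm(hoursday ~ age + education + child5, data = dat)
p_lm <- pnorm(8, mean = fitted(fit_lm), sd = summary(fit_lm)$sigma,
              lower.tail = FALSE)

# Option 2: logistic regression on the 0/1 outcome
# (use link = "probit" for a probit model instead).
fit_logit <- glm(long ~ age + education + child5,
                 family = binomial(link = "logit"), data = dat)
p_logit <- predict(fit_logit, type = "response")
```

Both p_lm and p_logit hold one estimated probability per row. With the real data, the same calls would take data = Workinghours and the full list of covariates from the original formula.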

It is not technically hard to fit a model to a subset of the data, but it is a 
big no-no to subset on the dependent variable. Well, you can, and people 
actually do subsample on the response variable, but then the standard methods 
of analysis do not apply.
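
To illustrate the point, here is a small sketch on simulated data (the variables x and y and the true slope of 2 are assumptions made up for the example). The subset argument of lm() keeps the response and the predictors aligned, so the "variable lengths differ" error goes away, but selecting rows on the response distorts the fit.

```r
# Simulated data with a known slope of 2 (hypothetical, for illustration only).
set.seed(2)
n <- 2000
d <- data.frame(x = rnorm(n))
d$y <- 1 + 2 * d$x + rnorm(n)

# Fit on all rows: the slope estimate should be close to 2.
fit_all <- lm(y ~ x, data = d)

# Fit only on rows with a large response. This runs without error,
# because subset= drops the same rows from every variable.
fit_sub <- lm(y ~ x, data = d, subset = y >= 2)

coef(fit_all)["x"]
coef(fit_sub)["x"]   # noticeably smaller than the full-data slope
```

The second fit is mechanically fine, yet its slope is pulled toward zero, which is the kind of problem that makes subsetting on the dependent variable unsuitable for ordinary regression.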


-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd@cbs.dk  Priv: pda...@gmail.com

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] Problem with a regression - Dataset Workinghours

2012-07-28 Thread Giorgio Monti
I'm a student working on a research project using the statistical program "R
2.15.1".
Here's my problem: how can I do a regression considering only values over a
certain limit?
For example, considering the dataset "Workinghours" of the "Ecdat" package,
is it possible to build a predictive model that expresses the probability that a
wife works more than 8 hours per day?
The dataset includes 3382 observations on the number of hours per year spent
working by wives in the USA.

hoursday=hours/240
index<-which(hoursday>=8)
hoursday[index]

As you see, I'm able to extract the values in 'hoursday' (which is
hours/240 working days in one year) that are >= 8.0, but obviously I can't do a
regression because the extracted data are a subset of the entire dataset (955
observations), while the other variables, like age, occupation, income,
etc. are still complete (3382).

So I can't do:
lm = lm(hoursday[index] ~
income+age+education+unemp+child5+child13+child17+nonwhite+owned+mortgage+occupation)
In fact "R" gives me: Error in model.frame.default(formula =
hoursday[index] ~ income, drop.unused.levels = TRUE) : variable lengths
differ (found for 'income').

Can you help me?

Thank you.

Giorgio
