Re: [R] two cols in a data frame are the same factor

2008-03-21 Thread Andres Legarra
Looks like it works, albeit the first level is automatically dropped
out by lm(). I'll manage to do something with that.
The second option looks good too.
Thanks

Andres

On Thu, Mar 20, 2008 at 6:21 PM, Greg Snow [EMAIL PROTECTED] wrote:
 Here is one approach:

  First run a regular lm command without the restrictions, but specify
  y=TRUE, x=TRUE.

  This will do the unconstrained regression, but part of the return object
  will be the y variable after subsetting, NA removal, etc. and the x
  matrix that was used. This x matrix will have your 2 factors converted
  into indicator/dummy variables (along with any other covariates
  mentioned). Take the x and y components of that return object and put
  them into a new data frame.

  Now do a regression using the new data frame as your data and include
  I(f1.1+f2.1) terms just like you would with numeric predictors to force
  the coefficients to be equal.
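
A minimal sketch of the three steps above. The data frame d, its values,
and the names fit0, d2 and fit1 are made up for illustration; the column
names facB, facC, fac1B and fac1C are the ones lm() generates for factors
called fac and fac1 with levels A, B, C.

```r
# Hypothetical data: two factor columns that share the levels A, B, C
set.seed(1234)
L3 <- LETTERS[1:3]
d <- data.frame(
  y    = rnorm(10),
  fac  = factor(c("A","B","C","A","B","C","A","B","C","A"), levels = L3),
  fac1 = factor(c("B","C","A","A","C","B","C","A","B","B"), levels = L3)
)

# Step 1: unconstrained fit, keeping the response and the model matrix
fit0 <- lm(y ~ fac + fac1, data = d, x = TRUE, y = TRUE)

# Step 2: new data frame built from y and x (intercept column dropped)
d2 <- data.frame(y = fit0$y, fit0$x[, -1, drop = FALSE])
names(d2)  # "y" "facB" "facC" "fac1B" "fac1C"

# Step 3: summing the dummies forces facB = fac1B and facC = fac1C
fit1 <- lm(y ~ I(facB + fac1B) + I(facC + fac1C), data = d2)
coef(fit1)
```

The constrained fit has three coefficients (intercept, B, C) instead of
the five in the unconstrained one.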

  You could also accomplish the same idea in the original regression using
  a formula like:

  Y ~ I( (fac1=='A') + (fac2=='A') ) + I( (fac1=='B') + (fac2=='B') ) + ...

  Do this for each level of fac1 and fac2 (other than the baseline level,
  or including it if you leave out the intercept). Both approaches do
  essentially the same thing: you create your own set of indicator
  variables rather than depending on R to do it.
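
The same constraint written directly in the formula, as a sketch on
made-up data (d and its values are hypothetical). Each comparison is
wrapped in parentheses because + binds more tightly than == in R, so
fac == "B" + fac1 == "B" would not be read as a sum of two indicators.
Level A is the baseline absorbed by the intercept.

```r
# Hypothetical data: two factor columns that share the levels A, B, C
set.seed(1234)
L3 <- LETTERS[1:3]
d <- data.frame(
  y    = rnorm(10),
  fac  = factor(c("A","B","C","A","B","C","A","B","C","A"), levels = L3),
  fac1 = factor(c("B","C","A","A","C","B","C","A","B","B"), levels = L3)
)

# Each I() term counts how many of the two columns equal that level (0, 1 or 2)
fit <- lm(y ~ I((fac == "B") + (fac1 == "B")) +
              I((fac == "C") + (fac1 == "C")), data = d)
coef(fit)  # intercept plus one shared coefficient per non-baseline level
```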

  Hope this helps,

  --
  Gregory (Greg) L. Snow Ph.D.
  Statistical Data Center
  Intermountain Healthcare
  [EMAIL PROTECTED]
  (801) 408-8111





   -----Original Message-----
   From: [EMAIL PROTECTED]
   [mailto:[EMAIL PROTECTED] On Behalf Of Andres Legarra
   Sent: Thursday, March 20, 2008 2:25 AM
   To: Michael Dewey
   Cc: R-help@r-project.org
   Subject: Re: [R] two cols in a data frame are the same factor
  
   Hi,
   I am afraid you misunderstood it. I do not have repeated
   records, but for every record I have two, possibly different,
   simultaneously present, instantiations of an explanatory variable.
  
   My data is as follows :
  
   yield haplo1 haplo2
   100  A B
   151  B A
   212  A A
  
   So I have one effect (haplo), but two copies of it affect the yield
   of each record.
   If I use lm() I get:
   
   a=data.frame(yield=c(100,151,212),haplo1=c("A","B","A"),
                haplo2=c("B","A","A"))
   lm(yield ~ -1 + haplo1 + haplo2, data=a)

   Call:
   lm(formula = yield ~ -1 + haplo1 + haplo2, data = a)

   Coefficients:
   haplo1A  haplo1B  haplo2B
       212      151     -112
  
  
   But I get different coefficients for the two As (in fact one
   was set to 0) and the two Bs. That is, the model has four
   unknowns but in my example I have just two!
  
   A least-squares solution is simple to do by hand:
  
   X=matrix(c(1,1,1,1,2,0),ncol=2) #the incidence matrix
   X
        [,1] [,2]
   [1,]    1    1
   [2,]    1    2
   [3,]    1    0
   solve(crossprod(X,X),crossprod(X,a$yield))
            [,1]
   [1,] 184.8333
   [2,] -30.5000
  
   where [1,] is the solution for A and [2,] is the solution for B
  
   This is not difficult to do by hand, but it is for a simple
   case and I miss all the machinery in lm()
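
For what it is worth, a by-hand solution of this kind can be reproduced
with lm() once the incidence matrix is stored as columns of the data
frame. A sketch, assuming each row of X holds the number of A and B
copies carried by that record (rows 1 1, 1 1 and 2 0 for the three
records above). Note that matrix(c(1,1,1,1,2,0),ncol=2) fills
column-wise and gives different rows, so the estimates here need not
match the ones printed above. The names nA, nB, b_hand and fit are made
up for illustration.

```r
a <- data.frame(yield  = c(100, 151, 212),
                haplo1 = c("A", "B", "A"),
                haplo2 = c("B", "A", "A"))

# Number of copies of each haplotype per record (assumed incidence coding)
a$nA <- (a$haplo1 == "A") + (a$haplo2 == "A")  # 1 1 2
a$nB <- (a$haplo1 == "B") + (a$haplo2 == "B")  # 1 1 0

# Least squares by hand: solve (X'X) b = X'y
X <- cbind(nA = a$nA, nB = a$nB)
b_hand <- solve(crossprod(X), crossprod(X, a$yield))

# Same model through lm(), without an intercept
fit <- lm(yield ~ -1 + nA + nB, data = a)
coef(fit)  # nA = 106, nB = 19.5, identical to b_hand
```

Going through lm() gives access to summary(), anova(), residuals() and
the rest of the usual machinery.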
  
   Thank you
   Andres
  
   On Wed, Mar 19, 2008 at 6:57 PM, Michael Dewey
   [EMAIL PROTECTED] wrote:
    At 09:11 18/03/2008, Andres Legarra wrote:
     Dear all,
     I have a data set (QTL detection) where I have two cols of factors
     in the data frame that correspond logically (in my model) to the
     same factor. In fact these are haplotype classes.
     Another real-life example would be family gas consumption as a
     function of car company (e.g. Ford, GM, and Honda) (assuming 2 cars
     by family).
   
    Unless I completely misunderstand this it looks like you have the
    dataset in wide format when you really wanted it in long format (to
    use the terminology of ?reshape). Then you would fit a model allowing
    for the clustering by family.
   
   
   
   
     An artificial example follows:
     set.seed(1234)
     L3 <- LETTERS[1:3]
     (d <- data.frame(y=rnorm(10), fac=sample(L3, 10, replace=TRUE),
                      fac1=sample(L3, 10, replace=TRUE)))

     lm(y ~ fac+fac1, data=d)

     and I get:

     Coefficients:
     (Intercept)         facB         facC        fac1B        fac1C
          0.3612      -0.9359      -0.2004      -2.1376      -0.5438

     However, to respect my model, I need to constrain effects in fac and
     fac1 to be the same, i.e. facB=fac1B and facC=fac1C. There are
     logically just 4 unknowns (average,A,B,C).
     With continuous covariates one might do y ~ I(cov1+cov2), but this
     is not the case.
 
 Is there any trick to do that?
 Thanks,
 
 Andres Legarra
 INRA-SAGA
 Toulouse, France
   
 Michael Dewey
 http://www.aghmed.fsnet.co.uk
   
   
  
   __
   R-help@r-project.org mailing list
   https://stat.ethz.ch/mailman/listinfo/r-help
   PLEASE do read the posting guide
   http://www.R-project.org/posting-guide.html
   and provide commented, minimal, self-contained, reproducible code.
  




Re: [R] two cols in a data frame are the same factor

2008-03-19 Thread Michael Dewey
At 09:11 18/03/2008, Andres Legarra wrote:
Dear all,
I have a data set (QTL detection) where I have two cols of factors in
the data frame that correspond logically (in my model) to the same
factor. In fact these are haplotype classes.
Another real-life example would be family gas consumption as a
function of car company (e.g. Ford, GM, and Honda) (assuming 2 cars by
family).

Unless I completely misunderstand this it looks like you have the 
dataset in wide format when you really wanted it in long format (to 
use the terminology of ?reshape). Then you would fit a model allowing 
for the clustering by family.


An artificial example follows:
set.seed(1234)
L3 <- LETTERS[1:3]
(d <- data.frame(y=rnorm(10), fac=sample(L3, 10, replace=TRUE),
                 fac1=sample(L3, 10, replace=TRUE)))

lm(y ~ fac+fac1, data=d)

and I get:

Coefficients:
(Intercept)         facB         facC        fac1B        fac1C
     0.3612      -0.9359      -0.2004      -2.1376      -0.5438

However, to respect my model, I need to constrain effects in fac and
fac1 to be the same, i.e. facB=fac1B and facC=fac1C. There are
logically just 4 unknowns (average,A,B,C).
With continuous covariates one might do y ~ I(cov1+cov2), but this is
not the case.

Is there any trick to do that?
Thanks,

Andres Legarra
INRA-SAGA
Toulouse, France

Michael Dewey
http://www.aghmed.fsnet.co.uk



[R] two cols in a data frame are the same factor

2008-03-18 Thread Andres Legarra
Dear all,
I have a data set (QTL detection) where I have two cols of factors in
the data frame that correspond logically (in my model) to the same
factor. In fact these are haplotype classes.
Another real-life example would be family gas consumption as a
function of car company (e.g. Ford, GM, and Honda) (assuming 2 cars by
family).

An artificial example follows:
set.seed(1234)
L3 <- LETTERS[1:3]
(d <- data.frame(y=rnorm(10), fac=sample(L3, 10, replace=TRUE),
                 fac1=sample(L3, 10, replace=TRUE)))

lm(y ~ fac+fac1, data=d)

and I get:

Coefficients:
(Intercept)         facB         facC        fac1B        fac1C
     0.3612      -0.9359      -0.2004      -2.1376      -0.5438

However, to respect my model, I need to constrain effects in fac and
fac1 to be the same, i.e. facB=fac1B and facC=fac1C. There are
logically just 4 unknowns (average,A,B,C).
With continuous covariates one might do y ~ I(cov1+cov2), but this is
not the case.

Is there any trick to do that?
Thanks,

Andres Legarra
INRA-SAGA
Toulouse, France


