Re: [R] logistic regression - glm() - example in Dalgaard's book ISwR

2010-07-03 Thread Juliet Hannah
You may find both of Alan Agresti's books on categorical data analysis
useful. Try googling both books and then
searching for the word "grouped" within each. Agresti refers to the
difference you describe as
grouped versus ungrouped data. The likelihoods differ, and all
summaries based on the likelihood will
also differ.
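
A minimal sketch of the distinction, with toy data rather than the book's
example (the names below are made up): the coefficient estimates agree,
but the deviance and AIC come from different likelihoods.

## grouped (binomial counts) vs. ungrouped (Bernoulli 0/1) coding of
## the same data: 3/10 successes in group a, 7/10 in group b
x.grp <- factor(c("a", "b"))
succ  <- c(3, 7)
tot   <- c(10, 10)
fit.grouped <- glm(cbind(succ, tot - succ) ~ x.grp, family = binomial)

y.case <- c(rep(1, 3), rep(0, 7), rep(1, 7), rep(0, 3))
x.case <- factor(rep(c("a", "b"), each = 10))
fit.ungrouped <- glm(y.case ~ x.case, family = binomial)

cbind(coef(fit.grouped), coef(fit.ungrouped))  # same estimates
c(AIC(fit.grouped), AIC(fit.ungrouped))        # different AICs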


Re: [R] logistic regression - glm() - example in Dalgaard's book ISwR

2010-07-03 Thread David Winsemius


On Jul 2, 2010, at 11:33 PM, Paulo Barata wrote:



Dear R-list members,

I would like to pose a question about the use and results
of the glm() function for logistic regression calculations.

The question is based on an example provided on p. 229
in P. Dalgaard, Introductory Statistics with R, 2nd. edition,
Springer, 2008. By means of this example, I was trying to
practice the different ways of entering data in glm().

In his book, Dalgaard provides data from a case-based study
about hypertension summarized in the form of a table. He shows
two ways of entering the response (dependent) variable data
in glm(): (1) as a matrix of successes/failures (diseased/
healthy); (2) as the proportion of people diseased for each
combination of the independent variables' categories.

I tried to enter the response variable in glm() in a third
way: by reconstructing, from the table, the original data
in a case-based format, that is, a data frame in which
each row shows the data for one person. In this situation,
the response variable would be coded as a numeric 0/1 vector,
0=failure, 1=success, and so it would be entered in glm() as
a numeric 0/1 vector.

The program below presents the calculations for each of the
three ways of entering data - the first and second methods
were just copied from Dalgaard's book.

The three methods produce the same results with regard
to the estimated coefficients, when the output is seen
with five decimals (although regression 3 seems to have
produced slightly different std.errors).

My main question is: Why are the residual deviance, its
degrees of freedom and the AIC produced by regression 3
completely different when compared to those produced by
regressions 1 and 2 (which seem to produce equal results)?
It seems that the degrees of freedom in regressions 1
and 2 are based on the size (number of rows) of table d
(see the output of the program below), but this table is
just a way of summarizing the data. The degrees of
freedom in regressions 1 and 2 are not based on the
actual number of cases (people) examined, which is n=433.


I first encountered this phenomenon 25 years ago when using GLIM. The
answer from my statistical betters was that the deviance is only defined
up to a constant, and that it is only differences in deviance that can be
properly interpreted. The same situation exists with indefinite integrals
in calculus.


I understand that no matter the way of entering the data
in glm(), we are always analyzing the same data, which
are those presented in table format on Dalgaard's page
229 (these data are in data.frame d in the program below).
So I understand that the three ways of entering data
in glm() should produce the same results.


The differences between equivalent nested models should remain the  
same (up to numerical accuracy).


   411.42  on 432 degrees of freedom   (null deviance, ungrouped fit)
 - 398.92  on 429                      (residual deviance, ungrouped fit)
 ---------
    12.50  on   3

  14.1259  on   7 degrees of freedom   (null deviance, grouped fit)
 -  1.6184 on   4                      (residual deviance, grouped fit)
 ---------
  12.5075  on   3
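
In R this can be checked directly; an untested sketch, assuming
regression3 is the ungrouped fit described further down, i.e.
glm(rhyp ~ rsmoking + robesity + rsnoring, family = binomial):

## likelihood-ratio test of each fit against its own null model;
## both comparisons should show a deviance difference of about 12.5 on 3 df
anova(update(regression1, . ~ 1), regression1, test = "Chisq")
anova(update(regression3, . ~ 1), regression3, test = "Chisq")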



Secondarily, why are the std.errors in regression 3 slightly
different from those calculated in regressions 1 and 2?


You mean the differences 4 places to the right of the decimal???
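
One plausible source of those last-decimal differences is the IRLS stopping
rule: glm()'s relative convergence criterion is applied to deviances that sit
on different scales in the grouped and ungrouped fits, so the iterations stop
at slightly different points. Refitting with a tighter tolerance (sketch, same
assumption about regression3 as above) should make the standard errors agree
to more decimal places:

## tighten the convergence tolerance (the default epsilon is 1e-8)
reg1.tight <- update(regression1, control = glm.control(epsilon = 1e-12))
reg3.tight <- update(regression3, control = glm.control(epsilon = 1e-12))
summary(reg1.tight)$coefficients[, "Std. Error"]
summary(reg3.tight)$coefficients[, "Std. Error"]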



I am using R version 2.11.1 on Windows XP.

Thank you very much.

Paulo Barata

##== begin =

## data in: P. Dalgaard, Introductory Statistics with R,
## 2nd. edition, Springer, 2008
## logistic regression - example in Dalgaard's Section 13.2,
## page 229

rm(list=ls())


Personally, I am rather annoyed when people post this particular line
without commenting it out. It is basically saying that your code is
so much more important than whatever else might be in a user's
workspace.
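
A hedged alternative for posted examples: leave that line commented out,
or wrap the whole example in local() so it never touches the reader's
workspace, e.g.:

## runs the example in its own environment; nothing in the
## reader's workspace is created or overwritten
local({
  no.yes  <- c("No", "Yes")
  smoking <- gl(2, 1, 8, labels = no.yes)
  ## ... rest of the script ...
})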





Re: [R] logistic regression - glm() - example in Dalgaard's book ISwR

2010-07-03 Thread Marc Schwartz

[R] logistic regression - glm() - example in Dalgaard's book ISwR

2010-07-02 Thread Paulo Barata


Dear R-list members,

I would like to pose a question about the use and results
of the glm() function for logistic regression calculations.

The question is based on an example provided on p. 229
in P. Dalgaard, Introductory Statistics with R, 2nd. edition,
Springer, 2008. By means of this example, I was trying to
practice the different ways of entering data in glm().

In his book, Dalgaard provides data from a case-based study
about hypertension summarized in the form of a table. He shows
two ways of entering the response (dependent) variable data
in glm(): (1) as a matrix of successes/failures (diseased/
healthy); (2) as the proportion of people diseased for each
combination of the independent variables' categories.

I tried to enter the response variable in glm() in a third
way: by reconstructing, from the table, the original data
in a case-based format, that is, a data frame in which
each row shows the data for one person. In this situation,
the response variable would be coded as a numeric 0/1 vector,
0=failure, 1=success, and so it would be entered in glm() as
a numeric 0/1 vector.

The program below presents the calculations for each of the
three ways of entering data - the first and second methods
were just copied from Dalgaard's book.

The three methods produce the same results with regard
to the estimated coefficients, when the output is seen
with five decimals (although regression 3 seems to have
produced slightly different std.errors).

My main question is: Why are the residual deviance, its
degrees of freedom and the AIC produced by regression 3
completely different when compared to those produced by
regressions 1 and 2 (which seem to produce equal results)?
It seems that the degrees of freedom in regressions 1
and 2 are based on the size (number of rows) of table d
(see the output of the program below), but this table is
just a way of summarizing the data. The degrees of
freedom in regressions 1 and 2 are not based on the
actual number of cases (people) examined, which is n=433.
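
(As a quick check of that arithmetic, with four estimated parameters,
the intercept plus one per factor; this snippet assumes d and n.tot as
defined in the code below:)

nrow(d)    - 4   # =   4, the residual df reported by regressions 1 and 2
sum(n.tot) - 4   # = 429, the residual df reported by regression 3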

I understand that no matter the way of entering the data
in glm(), we are always analyzing the same data, which
are those presented in table format on Dalgaard's page
229 (these data are in data.frame d in the program below).
So I understand that the three ways of entering data
in glm() should produce the same results.

Secondarily, why are the std.errors in regression 3 slightly
different from those calculated in regressions 1 and 2?

I am using R version 2.11.1 on Windows XP.

Thank you very much.

Paulo Barata

##== begin =

## data in: P. Dalgaard, Introductory Statistics with R,
## 2nd. edition, Springer, 2008
## logistic regression - example in Dalgaard's Section 13.2,
## page 229

rm(list=ls())

## data provided on Dalgaard's page 229:
no.yes  <- c("No", "Yes")
smoking <- gl(2, 1, 8, labels = no.yes)
obesity <- gl(2, 2, 8, labels = no.yes)
snoring <- gl(2, 4, 8, labels = no.yes)
n.tot <- c(60, 17, 8, 2, 187, 85, 51, 23)
n.hyp <- c(5, 2, 1, 0, 35, 13, 15, 8)

d <- data.frame(smoking, obesity, snoring, n.tot, n.hyp)
## d is the data to be analyzed, in table format
## d is the first table on Dalgaard's page 229
## n.tot = total number of cases
## n.hyp = people with hypertension
d

## regression 1: Dalgaard's page 230
## response variable entered in glm() as a matrix of
## successes/failures
hyp.tbl <- cbind(n.hyp, n.tot - n.hyp)
regression1 <- glm(hyp.tbl ~ smoking + obesity + snoring,
                   family = binomial("logit"))

## regression 2: Dalgaard's page 230
## response variable entered in glm() as proportions
prop.hyp <- n.hyp / n.tot
regression2 <- glm(prop.hyp ~ smoking + obesity + snoring,
                   weights = n.tot, family = binomial("logit"))
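
## added check, not part of Dalgaard's example: regressions 1 and 2
## should give essentially identical coefficient estimates
all.equal(coef(regression1), coef(regression2))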

## regression 3 (well below): data entered in glm()
## by means of 'reconstructed' variables
## variables with names beginning with 'r' are
## 'reconstructed' from data in data.frame d.
## The objective is to reconstruct the original
## data from which the table on Dalgaard's page 229
## has been produced

rsmoking <- c(rep('No',  d[1,4]), rep('Yes', d[2,4]),
              rep('No',  d[3,4]), rep('Yes', d[4,4]),
              rep('No',  d[5,4]), rep('Yes', d[6,4]),
              rep('No',  d[7,4]), rep('Yes', d[8,4]))
rsmoking <- factor(rsmoking)
length(rsmoking)  # just a check

robesity <- c(rep('No',  d[1,4]), rep('No',  d[2,4]),
              rep('Yes', d[3,4]), rep('Yes', d[4,4]),
              rep('No',  d[5,4]), rep('No',  d[6,4]),
              rep('Yes', d[7,4]), rep('Yes', d[8,4]))
robesity <- factor(robesity)
length(robesity)  # just a check

rsnoring <- c(rep('No',  d[1,4]), rep('No',  d[2,4]),
              rep('No',  d[3,4]), rep('No',  d[4,4]),
              rep('Yes', d[5,4]), rep('Yes', d[6,4]),
              rep('Yes', d[7,4]), rep('Yes', d[8,4]))
rsnoring <- factor(rsnoring)
length(rsnoring)  # just a check

rhyp <- c(rep(1, d[1,5]), rep(0, d[1,4] - d[1,5]),
          rep(1, d[2,5]), rep(0, d[2,4] - d[2,5]),
          rep(1, d[3,5]), rep(0, d[3,4] - d[3,5]),
          rep(1, d[4,5]), rep(0, d[4,4] - d[4,5]),