[R] Cohen's d on Adjusted Means (UNCLASSIFIED)
I have a large group-randomized trial (pre-post design) in which the randomization was only marginally successful. Given the pre-existing differences among groups, it makes sense to report adjusted means (a.k.a. least-squares means, though I estimated them via predictions from lme rather than lm). I would like to report effect sizes (Cohen's d) calculated as:

(Mean1 - Mean2) / sqrt(((N1*VAR1) + (N2*VAR2)) / (N1 + N2))

Can anyone advise on a reasonable value to use for the variance estimate in this equation? Any other thoughts on how to calculate effect sizes on adjusted mean values?

Thanks,
Paul

LTC Paul D. Bliese
Commander, US Army Medical Research Unit - Europe
Heidelberg, Germany
DSN: 314-371-2626
COMM: +49-6221-172626
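One convention often recommended for covariate-adjusted comparisons is to take the numerator from the adjusted means but keep the denominator on the raw, unadjusted within-group variances, so that d stays in the original metric of the outcome. A minimal sketch of the formula above as an R function -- all input values here are hypothetical:

## Cohen's d from adjusted means with the poster's pooled-variance formula.
## adj1/adj2 are the adjusted (least-squares) means; var1/var2 are the raw,
## unadjusted within-group variances; n1/n2 are the group sizes.
cohens_d_adj <- function(adj1, adj2, var1, var2, n1, n2) {
  pooled_var <- (n1 * var1 + n2 * var2) / (n1 + n2)
  (adj1 - adj2) / sqrt(pooled_var)
}

## Example with made-up numbers
cohens_d_adj(adj1 = 10.2, adj2 = 9.4, var1 = 4.1, var2 = 3.8,
             n1 = 120, n2 = 115)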
[R] Why are lagged correlations typically negative?
Recently, I was working with some lagged designs where a vector of observations at one time was used to predict a vector of observations at another time using a lag-1 design. In the work, I noticed a lot of negative correlations, so I ran a simple simulation with 2 matched points. The crude simulation example below shows that the correlation can be -1 or +1, but interestingly, if you run this basic simulation thousands of times, you get negative correlations 66 to 67% of the time. If you simulate three matched observations instead of two, you get negative correlations about 74% of the time, and as you simulate 4 or more observations the proportion of negative correlations asymptotically approaches 50% (though even with 100 observations one still gets 54% negative correlations). Creating T1 and T2 so they are related (and not correlated 1 as in the crude simulation) attenuates the effect. A more advanced simulation is provided below for those interested.

Can anyone explain why this occurs in a way a non-mathematician is likely to understand?

Thanks,
Paul

#
# Crude simulation
#
(T1 <- rnorm(3))
[1] -0.1594703 -1.3340677  0.2924988
(T2 <- c(T1[2:3], NA))
[1] -1.3340677  0.2924988         NA
cor(T1, T2, use = "complete")
[1] -1

(T1 <- rnorm(3))
[1] -0.84258593 -0.49161602  0.03805543
(T2 <- c(T1[2:3], NA))
[1] -0.49161602  0.03805543          NA
cor(T1, T2, use = "complete")
[1] 1

###
# More advanced simulation example
###
lags <- function(nobs, nreps, rho = 1) {
  OUT <- data.frame(NEG = rep(NA, nreps), COR = rep(NA, nreps))
  nran <- nobs + 1  # need to generate 1 more random number than there are observations
  for (i in 1:nreps) {
    V1 <- rnorm(nran)
    V2 <- sqrt(1 - rho^2) * rnorm(nran) + rho * V1
    # print(cor(V1, V2))
    V1 <- V1[1:(nran - 1)]
    V2 <- V2[2:nran]
    OUT[i, 1] <- ifelse(cor(V1, V2) <= 0, 1, 0)
    OUT[i, 2] <- cor(V1, V2)
  }
  return(OUT)  # NEG is 1 if the correlation is negative or 0; 0 if positive
}

LAGS.2 <- lags(2, 10000)  # number of observations matched = 2
colMeans(LAGS.2)
   NEG    COR
0.6682 -0.3364
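A likely explanation: this is the well-known negative bias of the lag-1 sample autocorrelation for white noise, where the expected value is roughly -1/n rather than 0 (a classic result going back to Anderson, 1942). Intuitively, cor() measures deviations around the sample means of the two vectors, and in a short series a value above the mean must be balanced by values below it, so lagged products tend to come out negative. A minimal sketch, assuming white noise, that reproduces the pattern described above:

## Average lag-1 sample correlation of white noise is negative (~ -1/n),
## and the bias shrinks as the number of matched pairs n grows.
set.seed(42)
lag1_cor <- function(n) {
  x <- rnorm(n + 1)
  cor(x[1:n], x[2:(n + 1)])
}
for (n in c(2, 3, 10, 100)) {
  r <- replicate(10000, lag1_cor(n))
  cat("n =", n, " mean r =", round(mean(r), 3),
      " approx -1/n =", round(-1/n, 3),
      " prop negative =", mean(r < 0), "\n")
}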
[R] Simulate dichotomous correlation matrix
Newsgroup members,

Does anyone have a clever way to simulate a correlation matrix such that each column contains dichotomous variables (0,1) and each column has a different prevalence rate? For instance, I would like to simulate the following correlation matrix:

CORMAT[1:4, 1:4]
             PUREPT    PTCUT2  PHQCUT2T  ALCCUTT2
PUREPT    1.0000000 0.5141552 0.1913139 0.1917923
PTCUT2    0.5141552 1.0000000 0.2913552 0.2204097
PHQCUT2T  0.1913139 0.2913552 1.0000000 0.1803987
ALCCUTT2  0.1917923 0.2204097 0.1803987 1.0000000

where the prevalence for each variable is:

prevvals <- c(0.26, 0.10, 0.09, 0.10)

I can use the mvrnorm function in MASS to create a matrix of random normal variables and dichotomize them into 0,1; however, this is a less-than-ideal solution, as my observed correlation matrix is downwardly biased, and the amount of bias is related to the prevalence of each variable.

Thanks,
Paul D. Bliese
Heidelberg, Germany
COMM: +49-6221-172626
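The bias arises because thresholding a bivariate normal with latent correlation rho yields a phi coefficient smaller than rho. One workaround is to calibrate each latent correlation upward so that, after dichotomization, the observed phi hits the target. A sketch for a single pair, assuming the MASS and mvtnorm packages; the helper names are my own:

library(MASS)     # mvrnorm
library(mvtnorm)  # pmvnorm

## phi coefficient implied by dichotomizing a bivariate normal with
## latent correlation rho at thresholds giving prevalences p1, p2
phi_from_rho <- function(rho, p1, p2) {
  t1 <- qnorm(1 - p1); t2 <- qnorm(1 - p2)
  p11 <- as.numeric(pmvnorm(lower = c(t1, t2), upper = c(Inf, Inf),
                            corr = matrix(c(1, rho, rho, 1), 2)))
  (p11 - p1 * p2) / sqrt(p1 * (1 - p1) * p2 * (1 - p2))
}

## latent rho needed to hit a target phi (may not exist for extreme targets)
rho_for_phi <- function(phi, p1, p2) {
  uniroot(function(r) phi_from_rho(r, p1, p2) - phi,
          interval = c(-0.99, 0.99))$root
}

## Example for the first pair: target phi = 0.514, prevalences 0.26 and 0.10
rho <- rho_for_phi(0.5141552, 0.26, 0.10)
Z <- mvrnorm(1e5, mu = c(0, 0), Sigma = matrix(c(1, rho, rho, 1), 2))
X <- cbind(Z[, 1] > qnorm(1 - 0.26), Z[, 2] > qnorm(1 - 0.10)) * 1
cor(X)  # off-diagonal should now be near 0.514

For a full matrix, calibrating each pair this way can occasionally produce a latent correlation matrix that is not positive definite, so the result should be checked. If memory serves, the bindata package automates this kind of calibration for multivariate binary data.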
[R] Weighted Sensitivity, PPV etc.
All,

I would appreciate any leads on the following. In a recent blind-validation study of a depression screening instrument, we used a two-stage sampling design. In stage 1, we used a broad paper-and-pencil screen to identify likely positives (say 30% of the entire sample). In stage 2, we conducted in-depth interviews with the 30% of likely positives plus another 20% of the negatives as controls. We simply did not have the resources to interview all negatives.

A reviewer pointed out that measures such as sensitivity, specificity, PPV, NPV, AUC, etc. are potentially biased because they are based on a sample that is over-weighted with positives. That is, we let 50% of our original sample (many negatives) get away. The editor suggested weighting to provide unbiased estimates. Perhaps we are not looking in the right places, but we have not found good examples of how to do this weighting.

Has anyone in the R world encountered a similar problem, and if so, can they direct me to R packages that might help?

Paul
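This is the classic verification-bias setting, and one standard fix is inverse-probability weighting: each verified subject is weighted by the inverse of their stage-2 sampling fraction, in the spirit of the Begg-Greene correction. A minimal sketch on simulated data, assuming (from my reading of the design above) that all screen-positives were verified and that the 20% of the original sample drawn from the 70% of screen-negatives implies a sampling fraction of 0.2/0.7 for negatives:

## IPW-corrected sensitivity/specificity under two-stage verification.
set.seed(1)
n   <- 1000
scr <- rbinom(n, 1, 0.30)                          # stage-1 screen result
ver <- ifelse(scr == 1, 1, rbinom(n, 1, 0.2/0.7))  # verified at stage 2?
dis <- rbinom(n, 1, plogis(-3 + 2.5 * scr))        # true status (hypothetical)

w <- ifelse(scr == 1, 1, 0.7/0.2)  # inverse sampling-probability weights
v <- ver == 1                      # analyze verified subjects only

sens <- sum(w[v] * (scr[v] == 1 & dis[v] == 1)) / sum(w[v] * (dis[v] == 1))
spec <- sum(w[v] * (scr[v] == 0 & dis[v] == 0)) / sum(w[v] * (dis[v] == 0))
c(sensitivity = sens, specificity = spec)

For design-based standard errors rather than point estimates alone, the survey package (svydesign with these sampling weights) would be the more principled route.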
[R] gam y-axis interpretation
Sorry if this is an obvious question...

I'm estimating a simple binomial generalized additive model using the gam function in the mgcv package. The model makes sense given my data, and the predicted values also make sense given what I know about the data.

However, I'm having trouble interpreting the y-axis of the plot of the gam object. The y-axis is labeled s(x, 2.52), which I understand to mean a smooth term with approximately 2.52 degrees of freedom. The y-axis in my case ranges from -2 to 6, and I thought it would be possible to convert the y-axis estimate to a probability via exp(Y)/(1+exp(Y)). So, for instance, my lowest y-axis estimate is -2, for a probability of:

exp(-2)/(1+exp(-2))
[1] 0.1192029

However, if I use the predict function, my lowest estimate is -3.53862893, for a probability of 2.8%. The 2.8% estimate is a much better estimate than 11.9% given my specific data, so I'm clearly not interpreting the plot correctly.

The help files say plot.gam provides "the component smooth functions that make it up, on the scale of the linear predictor." I'm just not sure what that description means. Does someone have another description that might help me grasp the plot? Similar plots are on page 286 of Venables and Ripley (3rd edition)...

Thanks,
Paul
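The key point is that plot.gam draws the centered smooth only: s(x) is constrained to sum to zero over the data and excludes the model intercept. The full linear predictor is intercept + s(x), and only that sum should be pushed through the inverse logit. A minimal sketch on simulated data illustrating the relationship:

## The y-axis of plot(m) is the centered smooth on the link scale.
library(mgcv)
set.seed(1)
x <- runif(500)
y <- rbinom(500, 1, plogis(-3 + 4 * x))
m <- gam(y ~ s(x), family = binomial)

sm  <- predict(m, type = "terms")[, "s(x)"]  # what plot(m) draws (centered)
eta <- coef(m)[1] + sm                       # intercept + smooth = linear predictor
all.equal(unname(eta), unname(predict(m, type = "link")))  # TRUE

p <- plogis(eta)  # probabilities come from the full linear predictor, not s(x) alone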
[R] ylim problem in barplot
R Version 2.2.0
Platform: Windows

When I use barplot with a ylim lower limit greater than zero, the graph is distorted: the bars extend below the bottom of the plot region. For instance, the following command produces a problematic graph:

barplot(c(200,300,250,350), ylim=c(150,400))

Any help would be appreciated.

Paul
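barplot draws bars from zero and, by default, does not clip them to the plot region (its xpd argument defaults to TRUE). Passing xpd = FALSE clips the bars at the axis limits:

## Clip the bars to the plot region instead of letting them spill below it
barplot(c(200, 300, 250, 350), ylim = c(150, 400), xpd = FALSE)

One caveat worth noting: bars truncated this way can be visually misleading, since bar length no longer reflects the plotted values.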
Re: [R] Simulate phi-coefficient (correlation between dichotomous vars)
Newsgroup members,

I appreciate the help on this topic. David Duffy provided a solution (below) that was quite helpful and came close to what I needed. It did a great job of creating two vectors of dichotomous variables with a known correlation (what I referred to as a phi-coefficient).

My situation is a bit more complicated, and I'm not sure it is easily solved. The problem is that I must treat one of the vectors as fixed and generate one or more vectors that covary with the fixed vector. In a continuous setting I could use the following code, which I got from the S-PLUS newsgroup years ago:

sample.cor <- function(x, rho) {
  y <- (rho * (x - mean(x))) / sqrt(var(x)) + sqrt(1 - rho^2) * rnorm(length(x))
  cat("Sample corr = ", cor(x, y), "\n")
  return(y)
}

X  <- rnorm(100)          # a fixed vector
Y1 <- sample.cor(X, .30)  # a new vector that correlates .30 with X
Y2 <- sample.cor(X, .45)  # a second vector that correlates .45 with X

I can, of course, make X a vector of zeros and ones, and I can dichotomize Y1 and Y2, but the program always returns a phi-coefficient lower than the continuous correlation. Mathematically, I suppose this is expected, because the phi-coefficient is partially a function of the percentage of positive responses. This, in turn, explains Pearson's (1900) interest in tetrachoric correlations -- a tetrachoric correlation being the Pearson product-moment correlation that would have been observed had two dichotomously scored variables been measured on a continuous scale (Pearson, 1900).

I would appreciate any additional input or possible solutions.

Paul

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of David Duffy
Sent: Monday, September 12, 2005 1:34 AM
To: r-help@stat.math.ethz.ch
Subject: [R] Simulate phi-coefficient

> From: Bliese, Paul D LTC USAMH [EMAIL PROTECTED]
>
> Given a sample of zeros and ones, for example:
>
> VECTOR1 <- rep(c(1,0), c(15,10))
>
> How would I create a new sample (VECTOR2), also containing zeros and
> ones, in which the phi-coefficient between the two sample vectors was
> drawn from a population with a known phi-coefficient value? I know
> there are ways to do this with normally distributed numbers (for
> example the mvrnorm function in MASS), but am stumped when dealing
> with dichotomous variables.
>
> Paul

One way is to sample from the 2x2 table with the specified margins and Pearson correlation (phi). For a fourfold table

  a b
  c d

with marginal proportions p1 and p2:

cov <- phi * sqrt(p1*(1-p1)*p2*(1-p2))
a <- p1*p2 + cov
b <- p1*(1-p2) - cov
c <- (1-p1)*p2 - cov
d <- (1-p1)*(1-p2) + cov
expand.grid(0:1, 0:1)[sample(1:4, size=25, replace=TRUE, prob=c(a,b,c,d)), ]

David.

David Duffy (MBBS PhD)
email: [EMAIL PROTECTED]  ph: INT+61+7+3362-0217  fax: -0101
Epidemiology Unit, Queensland Institute of Medical Research
300 Herston Rd, Brisbane, Queensland 4029, Australia  GPG 4D0B994A
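Building on David Duffy's table, the fixed-vector case can be handled by drawing Y conditionally on each value of X: the target 2x2 cell probabilities imply P(Y=1 | X=1) and P(Y=1 | X=0), and rbinom does the rest. A sketch (the function name is my own):

## Simulate Y with a target population phi against a FIXED 0/1 vector x,
## where p2 is the desired prevalence of Y.
sim_phi_given_x <- function(x, phi, p2) {
  p1  <- mean(x)                           # prevalence of the fixed vector
  cov <- phi * sqrt(p1*(1-p1) * p2*(1-p2))
  p11 <- p1 * p2 + cov                     # P(X=1, Y=1)
  py1 <- p11 / p1                          # P(Y=1 | X=1)
  py0 <- (p2 - p11) / (1 - p1)             # P(Y=1 | X=0)
  stopifnot(py1 >= 0, py1 <= 1, py0 >= 0, py0 <= 1)  # phi must be feasible
  rbinom(length(x), 1, ifelse(x == 1, py1, py0))
}

x <- rep(c(1, 0), c(15, 10))
y <- sim_phi_given_x(x, phi = 0.4, p2 = 0.4)
cor(x, y)  # sample phi; fluctuates around 0.4 over repeated draws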
[R] Simulate phi-coefficient
Looking for help with the following problem.

Given a sample of zeros and ones, for example:

VECTOR1 <- rep(c(1,0), c(15,10))
VECTOR1
 [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0

how would I create a new sample (VECTOR2), also containing zeros and ones, in which the phi-coefficient between the two sample vectors was drawn from a population with a known phi-coefficient value? Basically, I have a vector of zeros and ones and want to simulate another vector such that the two vectors have a known phi-coefficient.

I know there are ways to do this with normally distributed numbers (for example, the mvrnorm function in MASS), but am stumped when dealing with dichotomous variables.

I'd appreciate any thoughts.

Paul
[R] apply the function factor to multiple columns
I have a case where I would like to convert multiple columns containing numbers to factors. I can change each column one at a time, as in:

TEMP.FACT$EXPOS01 <- factor(TEMP.FACT$EXPOS01, levels=c(1,2,3),
                            labels=c("None","Low Impact","MedHigh Imp"))
TEMP.FACT$EXPOS02 <- factor(TEMP.FACT$EXPOS02, levels=c(1,2,3),
                            labels=c("None","Low Impact","MedHigh Imp"))
TEMP.FACT$EXPOS03 <- factor(TEMP.FACT$EXPOS03, levels=c(1,2,3),
                            labels=c("None","Low Impact","MedHigh Imp"))

summary(TEMP.FACT[,1:3])
        EXPOS01           EXPOS02           EXPOS03
 None       :219   None       :432   None       :377
 Low Impact :428   Low Impact :248   Low Impact :297
 MedHigh Imp:108   MedHigh Imp: 77   MedHigh Imp: 83
 NA's       : 25   NA's       : 23   NA's       : 23

It would be much easier, however, to use apply, as in:

TEMP.FACT[,1:3] <- apply(TEMP.FACT[,1:3], 2, factor,
                         labels=c("None","Low Impact","MedHigh Imp"))

This appears to work (no error messages); however, it does not actually change the variables to factors. That is, they are still treated as numbers:

summary(TEMP.FACT[,1:3])
    EXPOS01          EXPOS02          EXPOS03
 Min.   : 1.000   Min.   : 1.000   Min.   : 1.000
 1st Qu.: 1.000   1st Qu.: 1.000   1st Qu.: 1.000
 Median : 2.000   Median : 1.000   Median : 2.000
 Mean   : 1.853   Mean   : 1.531   Mean   : 1.612
 3rd Qu.: 2.000   3rd Qu.: 2.000   3rd Qu.: 2.000
 Max.   : 3.000   Max.   : 3.000   Max.   : 3.000
 NA's   :25.000   NA's   :23.000   NA's   :23.000

Any ideas on how I could efficiently change a lot of columns to factors?

Thanks,
PB
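The problem is that apply coerces the data frame to a matrix, so the factors produced by FUN are reduced back to plain numbers when the result is assigned into the data frame. lapply operates column-by-column and keeps each result as a column, which is what is wanted here (using the poster's TEMP.FACT):

## lapply preserves the factor class; apply's matrix round trip does not.
TEMP.FACT[, 1:3] <- lapply(TEMP.FACT[, 1:3], factor, levels = c(1, 2, 3),
                           labels = c("None", "Low Impact", "MedHigh Imp"))
str(TEMP.FACT[, 1:3])  # the three columns are now factors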
[R] lars / lasso with glm
We have been using least angle regression (LARS) to help identify predictors in models where the outcome is continuous. To do so, we have been relying on the lars package. Theoretically, it should be possible to use the lars procedure within a generalized linear model (GLM) framework -- we are particularly interested in logistic regression. Does anyone have examples of using lars with logistic regression?

Thanks,
PB
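The lars package itself is tied to squared-error loss, but L1-penalized (lasso) logistic regression is available elsewhere: the glmpath package was written specifically for L1-regularized GLM paths, and glmnet (a later arrival than this post) fits the full lasso path for binomial models. A sketch with glmnet on simulated data:

## Lasso-penalized logistic regression with glmnet
library(glmnet)
set.seed(1)
x <- matrix(rnorm(200 * 10), 200, 10)
y <- rbinom(200, 1, plogis(x[, 1] - 0.5 * x[, 2]))

fit <- glmnet(x, y, family = "binomial", alpha = 1)  # alpha = 1 -> lasso
plot(fit, xvar = "lambda")                           # coefficient paths

cv <- cv.glmnet(x, y, family = "binomial")  # cross-validate lambda
coef(cv, s = "lambda.1se")                  # sparse coefficient vector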
[R] read.spss in R 2.1.0 make basic dataframe
Recent changes to read.spss() in the foreign package return a dataframe containing additional attributes. For example:

TEMP <- read.spss(choose.files(), to.data.frame=TRUE, use.value.labels=FALSE)
str(TEMP)
`data.frame':   780 obs. of  8 variables:
 $ EXPOS01: atomic  1 1 2 1 2 3 2 4 2 1 ...
  ..- attr(*, "value.labels")= Named num  5 4 3 2 1
  .. ..- attr(*, "names")= chr  "Yes, experienced it with Extreme Impact" "Yes, experienced it with Moderate Impact" "Yes, experienced it with A Little Impact" "Yes, experienced it with No Impact" ...
 $ EXPOS02: atomic  1 1 1 1 1 1 1 1 1 1 ...
  ..- attr(*, "value.labels")= Named num  5 4 3 2 1
  .. ..- attr(*, "names")= chr  "Yes, experienced it with Extreme Impact" "Yes, experienced it with Moderate Impact" "Yes, experienced it with A Little Impact" "Yes, experienced it with No Impact" ...

Unfortunately, these changes may be ahead of their time (certainly ahead of several functions). For instance, edit balks at them:

edit(TEMP)
Error in edit.data.frame(TEMP) : can only handle vector and factor elements

It used to be that as.data.frame or data.frame would return a fairly basic data.frame and fix the problem. However, this no longer works (obviously, because TEMP is already a data.frame):

TEMP <- as.data.frame(TEMP)
edit(TEMP)
Error in edit.data.frame(TEMP) : can only handle vector and factor elements

It is possible to use as.matrix and then data.frame the result, but this gets a bit cumbersome.

The question is: is there a simple command to strip the additional attributes from a data.frame and get a simple, easy-to-use, uncomplicated data.frame?

On a related note, do other users routinely use read.spss with the defaults of to.data.frame=FALSE and use.value.labels=TRUE? My experience is that I am always using the non-default values, in which case it would be helpful to change the defaults to to.data.frame=TRUE and use.value.labels=FALSE. It would also probably make sense to change the default to trim.factor.names=TRUE. I'd be interested in others' perspectives.

I appreciate all the great work Saikat DebRoy has done... just trying to improve an already useful function.

Paul
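One simple route is to clear the offending per-column attribute directly, leaving everything else intact. A sketch; the helper name is my own:

## Drop the "value.labels" attributes that read.spss() attaches,
## leaving a plain data.frame that edit() can handle.
strip_value_labels <- function(df) {
  df[] <- lapply(df, function(x) { attr(x, "value.labels") <- NULL; x })
  df
}
TEMP <- strip_value_labels(TEMP)
edit(TEMP)  # should now work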
RE: [R] Using R to illustrate the Central Limit Theorem
Interesting thread. The graphics are great; the only thing that might be worth doing for teaching purposes would be to illustrate the original distribution that is being averaged 1000 times. Below is one option based on Bill Venables' code. Note that to do this I had to start with a k of 2.

N <- 1000
for (k in 2:20) {
  graphics.off()
  par(mfrow = c(2,2), pty = "s")
  hist(((runif(k))-0.5)*sqrt(12*k), main = "Example Distribution 1")
  hist(((runif(k))-0.5)*sqrt(12*k), main = "Example Distribution 2")
  m <- replicate(N, (mean(runif(k))-0.5)*sqrt(12*k))
  hist(m, breaks = "FD", xlim = c(-4,4), main = k, prob = TRUE,
       ylim = c(0,0.5), col = "lemonchiffon")
  pu <- par("usr")[1:2]
  x <- seq(pu[1], pu[2], len = 500)
  lines(x, dnorm(x), col = "red")
  qqnorm(m, ylim = c(-4,4), xlim = c(-4,4), pch = ".", col = "blue")
  abline(0, 1, col = "red")
  Sys.sleep(3)
}

By the way, I should probably know this, but what is the logic of the sqrt(12*k) part of the example? Obviously, as k increases, the mean will approach .5 in a uniform distribution, so runif(k)-.5 will be close to zero, and sqrt(12*k) increases as k increases. Why 12, though? (See the note after Kevin's code below.)

PB

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Kevin E. Thorpe
Sent: Friday, May 13, 2005 12:03 AM
To: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]; r-help@stat.math.ethz.ch
Subject: Re: [R] Using R to illustrate the Central Limit Theorem

This thread was very timely for me, since I will be teaching an introductory biostats course in the fall, including a session on the CLT. I have shamelessly taken Dr. Venables' slick piece of code and wrapped it in a function so that I can vary the maximum sample size at will. I then created functions based on the first to sample from the exponential and chi-square distributions. Lastly, I created a function to sample from a pdf with a parabolic shape (confirming in the process just how rusty my calculus is :-) ). My code is below for all to do with as they please.

Now, at the risk of asking a really obvious question... In my final function, I used the inverse probability integral transformation to sample from my parabolic distribution. I created a local function rparab, shown here:

rparab <- function(nn) {
  u <- 2*runif(nn) - 1
  ifelse(u < 0, -(abs(u)^(1/3)), u^(1/3))
}

It seems that in my version of R (2.0.1) on Linux, calculating the cube root of a negative number using ^(1/3) returns NaN. I looked at the help for the arithmetic operators and did help.search("cube root"), help.search("root"), and help.search("cube"), and recognized no alternatives. So I used an ifelse() to deal with the negatives. Have I missed something really elementary?
Thanks.

clt.unif <- function(n=20) {
  N <- 10000
  graphics.off()
  par(mfrow = c(1,2), pty = "s")
  for (k in 1:n) {
    m <- (rowMeans(matrix(runif(N*k), N, k)) - 0.5)*sqrt(12*k)
    hist(m, breaks = "FD", xlim = c(-4,4),
         main = paste("Uniform(0,1), n = ", k, sep=""),
         prob = TRUE, ylim = c(0,0.5), col = "lemonchiffon")
    pu <- par("usr")[1:2]
    x <- seq(pu[1], pu[2], len = 500)
    lines(x, dnorm(x), col = "red")
    qqnorm(m, ylim = c(-4,4), xlim = c(-4,4), pch = ".", col = "blue")
    abline(0, 1, col = "red")
    Sys.sleep(1)
  }
}

clt.exp <- function(n=20, theta=1) {
  N <- 10000
  graphics.off()
  par(mfrow = c(1,2), pty = "s")
  for (k in 1:n) {
    m <- (rowMeans(matrix(rexp(N*k, 1/theta), N, k)) - theta)/sqrt(theta^2/k)
    hist(m, breaks = "FD", xlim = c(-4,4),
         main = paste("exp(", theta, "), n = ", k, sep=""),
         prob = TRUE, ylim = c(0,0.5), col = "lemonchiffon")
    pu <- par("usr")[1:2]
    x <- seq(pu[1], pu[2], len = 500)
    lines(x, dnorm(x), col = "red")
    qqnorm(m, ylim = c(-4,4), xlim = c(-4,4), pch = ".", col = "blue")
    abline(0, 1, col = "red")
    Sys.sleep(1)
  }
}

clt.chisq <- function(n=20, nu=1) {
  N <- 10000
  graphics.off()
  par(mfrow = c(1,2), pty = "s")
  for (k in 1:n) {
    m <- (rowMeans(matrix(rchisq(N*k, nu), N, k)) - nu)/sqrt(2*nu/k)
    hist(m, breaks = "FD", xlim = c(-4,4),
         main = paste("Chi-Square(", nu, "), n = ", k, sep=""),
         prob = TRUE, ylim = c(0,0.5), col = "lemonchiffon")
    pu <- par("usr")[1:2]
    x <- seq(pu[1], pu[2], len = 500)
    lines(x, dnorm(x), col = "red")
    qqnorm(m, ylim = c(-4,4), xlim = c(-4,4), pch = ".", col = "blue")
    abline(0, 1, col = "red")
    Sys.sleep(1)
  }
}

clt.parab <- function(n=20) {
  rparab <- function(nn) {
    u <- 2*runif(nn) - 1
    ifelse(u < 0, -(abs(u)^(1/3)), u^(1/3))
  }
  N <- 10000
  graphics.off()
  par(mfrow = c(1,2), pty = "s")
  for (k in 1:n) {
    m <- (rowMeans(matrix(rparab(N*k), N, k)))/sqrt(3/(5*k))
    hist(m, breaks = "FD", xlim = c(-4,4),
         main = paste("n = ", k, sep=""),
         prob = TRUE, ylim = c(0,0.5), col = "lemonchiffon")
    pu <- par("usr")[1:2]
    x <- seq(pu[1], pu[2], len = 500)
    lines(x, dnorm(x), col = "red")
    qqnorm(m, ylim = c(-4,4), xlim = c(-4,4), pch = ".", col = "blue")
    abline(0, 1, col = "red")
    Sys.sleep(1)
  }
}

[EMAIL PROTECTED] wrote:
> Here's a bit of a refinement on Ted's first suggestion.
>
> N <- 10000
> graphics.off()
> par(mfrow = c(1,2), pty = "s")
> for (k in 1:20) {
>   m <- (rowMeans(matrix(runif(N*k), N, k)) - 0.5)*sqrt(12*k)
>   hist(m, breaks = "FD", xlim = c(-4,4), main = k, prob = TRUE, ylim =
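On the sqrt(12*k) question above: the variance of a Uniform(0,1) variable is 1/12, so the mean of k draws has standard deviation 1/sqrt(12*k); multiplying the centered mean by sqrt(12*k) therefore standardizes it to unit variance for comparison with the standard normal. A quick numerical check:

## Var(Uniform(0,1)) = 1/12, so (mean(runif(k)) - 0.5) * sqrt(12*k)
## is the sample mean standardized to sd 1 -- hence the 12.
var(runif(1e6))                                            # ~ 0.0833 = 1/12
sd(replicate(1e4, (mean(runif(5)) - 0.5) * sqrt(12 * 5)))  # ~ 1

On the cube-root question: the NaN is expected behavior, since ^ computes non-integer powers via exp(y * log(x)), which is undefined for negative x. The usual idiom takes the root of the magnitude and restores the sign, avoiding the ifelse(); a sketch (rparab2 is a hypothetical variant of Kevin's rparab):

rparab2 <- function(nn) {
  u <- 2 * runif(nn) - 1
  sign(u) * abs(u)^(1/3)  # cube root that handles negatives
}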
[R] mcnemar.test odds ratios, CI, etc.
Does anyone know of another version of the McNemar test that provides:

1. Odds ratios
2. 95% confidence intervals for the odds ratios
3. Sample probability
4. 95% confidence intervals for the sample probability

Obviously, the odds ratios and sample probabilities are easy to calculate from the contingency table, but I would appreciate any help on how to calculate the confidence intervals. Below is a simple example of the test and the corresponding output from the function mcnemar.test.

xtabs(~PLC50.T1 + PLC50.T2, data = LANCET.DAT)
        PLC50.T2
PLC50.T1   0   1
       0 464  22
       1   6   1

mcnemar.test(xtabs(~PLC50.T1 + PLC50.T2, data = LANCET.DAT))

        McNemar's Chi-squared test with continuity correction

data:  xtabs(~PLC50.T1 + PLC50.T2, data = LANCET.DAT)
McNemar's chi-squared = 8.0357, df = 1, p-value = 0.004586

Paul
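For matched pairs, the conditional odds ratio depends only on the discordant cells (here 22 and 6), and an exact confidence interval follows from binom.test on the discordant counts: with p = b/(b+c), the OR is p/(1-p), and the exact binomial CI for p transforms directly because the map is monotone. A minimal sketch using the counts in the table above:

## Matched-pairs (conditional) odds ratio and exact CI from the
## discordant cells of the McNemar table.
n01 <- 22; n10 <- 6                 # discordant counts
bt  <- binom.test(n01, n01 + n10)   # exact inference on p = n01/(n01+n10)
p_hat <- bt$estimate
p_ci  <- bt$conf.int

or    <- p_hat / (1 - p_hat)        # OR = 22/6 = 3.67
or_ci <- p_ci / (1 - p_ci)          # transform the CI endpoints
round(c(OR = unname(or), lower = or_ci[1], upper = or_ci[2]), 2)
round(c(p = unname(p_hat), lower = p_ci[1], upper = p_ci[2]), 3)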