Re: [R] Correlation question
As Kehl pointed out, any linear function of the independent variable (speed) will have the same squared correlation with the dependent variable (dist), but only one linear function minimizes the squared deviations between the fitted values and the original values. The equation you are using is only applicable to that function, not to any of the others. In fact, for some linear functions it will produce negative values:

    > fitted.new <- 6*cars$speed
    > cor(cbind(fitted.new, fitted.right, fitted.wrong, cars$dist))
                 fitted.new fitted.right fitted.wrong
    fitted.new    1.0000000    1.0000000    1.0000000 0.8068949
    fitted.right  1.0000000    1.0000000    1.0000000 0.8068949
    fitted.wrong  1.0000000    1.0000000    1.0000000 0.8068949
                  0.8068949    0.8068949    0.8068949 1.0000000
    > 1-sum((cars$dist-fitted.new)^2)/sum((cars$dist-mean(cars$dist))^2)
    [1] -3.281849

David L. Carlson
Department of Anthropology
Texas A&M University

-----Original Message-----
From: R-help [mailto:r-help-boun...@r-project.org] On Behalf Of Jonathan Thayn
Sent: Sunday, February 22, 2015 12:01 AM
To: Kehl Dániel
Cc: r-help@r-project.org
Subject: Re: [R] Correlation question

Of course! Thank you, I knew I was missing something painfully obvious. [...]

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
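The distinction drawn in this reply can be checked directly on the cars data. The following is only a minimal sketch, not code from the thread: the helper name r2_resid is made up for illustration, and the two sets of fitted values are the OLS fit and the hand-picked line -17 + 5*speed from the original question.

```r
# Residual-based statistic 1 - SSE/SST (the name r2_resid is hypothetical)
r2_resid <- function(y, yhat) {
  1 - sum((y - yhat)^2) / sum((y - mean(y))^2)
}

data(cars)
model <- lm(dist ~ speed, data = cars)
fitted.right <- fitted(model)          # least-squares fitted values
fitted.wrong <- -17 + 5 * cars$speed   # arbitrary line from the question

# Squared correlation: the same for both, since both are linear in speed
cor.right <- cor(cars$dist, fitted.right)^2
cor.wrong <- cor(cars$dist, fitted.wrong)^2

# Residual-based statistic: matches the squared correlation only for OLS
res.right <- r2_resid(cars$dist, fitted.right)
res.wrong <- r2_resid(cars$dist, fitted.wrong)

print(c(cor.right = cor.right, cor.wrong = cor.wrong,
        res.right = res.right, res.wrong = res.wrong))
```

The squared correlation ignores the intercept and slope of the line, while the residual-based statistic penalises every departure from the least-squares line.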
[R] Correlation question
I recently compared two different approaches to calculating the correlation of two variables, and I cannot explain the different results:

    data(cars)
    model <- lm(dist~speed,data=cars)
    coef(model)
    fitted.right <- model$fitted
    fitted.wrong <- -17+5*cars$speed

When using the OLS fitted values, the lines below all return the same R^2 value:

    1-sum((cars$dist-fitted.right)^2)/sum((cars$dist-mean(cars$dist))^2)
    cor(cars$dist,fitted.right)^2
    (sum((cars$dist-mean(cars$dist))*(fitted.right-mean(fitted.right)))/(49*sd(cars$dist)*sd(fitted.right)))^2

However, when I use my estimated parameters to find the fitted values, fitted.wrong, the first equation returns a much lower R^2 value, which I would expect since the fit is worse, but the other lines return the same R^2 that I get when using the OLS fitted values.

    1-sum((cars$dist-fitted.wrong)^2)/sum((cars$dist-mean(cars$dist))^2)
    cor(x=cars$dist,y=fitted.wrong)^2
    (sum((cars$dist-mean(cars$dist))*(fitted.wrong-mean(fitted.wrong)))/(49*sd(cars$dist)*sd(fitted.wrong)))^2

I'm sure I'm missing something simple, but can someone explain the difference between these two methods of finding R^2?

Thanks.
Jon
Re: [R] Correlation question
Hi,

try

    cor(fitted.right,fitted.wrong)

It should give 1, as both are linear functions of speed! Hence

    cor(cars$dist,fitted.right)^2

and

    cor(x=cars$dist,y=fitted.wrong)^2

must be the same.

HTH
d

________________________________________
From: R-help [r-help-boun...@r-project.org] on behalf of Jonathan Thayn [jth...@ilstu.edu]
Sent: 21 February 2015, 22:42
To: r-help@r-project.org
Subject: [R] Correlation question

I recently compared two different approaches to calculating the correlation of two variables, and I cannot explain the different results: [...]
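The invariance this reply relies on can be illustrated with a short sketch. The coefficients below are arbitrary values chosen for illustration and do not come from the thread; the only assumption is that each new variable is an exact linear function of speed.

```r
data(cars)
x <- cars$speed
y <- cars$dist

# Three arbitrary linear functions of x (illustrative coefficients)
f1 <- 3 + 2 * x
f2 <- -17 + 5 * x
f3 <- 10 - 4 * x           # negative slope

r.f1f2 <- cor(f1, f2)      # two increasing linear functions of x
r.y    <- cor(y, x)
r.yf1  <- cor(y, f1)       # unchanged by shifting and positive rescaling
r.yf3  <- cor(y, f3)       # a negative slope flips the sign only

print(c(r.f1f2 = r.f1f2, r.y = r.y, r.yf1 = r.yf1, r.yf3 = r.yf3))
```

Because only the sign can change, the squared correlation with dist is the same for every non-constant linear function of speed.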
Re: [R] Correlation question
Of course! Thank you, I knew I was missing something painfully obvious. It seems, then, that this line

    1-sum((cars$dist-fitted.wrong)^2)/sum((cars$dist-mean(cars$dist))^2)

is finding something other than the traditional correlation. I found this in a lecture introducing correlation but, now, I'm not sure what it is. It does do a better job of showing that the fitted.wrong variable is not a good prediction of the distance.

On Feb 21, 2015, at 4:36 PM, Kehl Dániel wrote:

> Hi, try cor(fitted.right,fitted.wrong) should give 1 as both are a linear function of speed! [...]
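For what it is worth, the quantity 1 - SSE/SST is usually called the coefficient of determination. The sketch below, which is not from the thread, compares it with the value reported by summary() for the least-squares fit, and then applies it to the line 6*speed used elsewhere in this thread, for which a value of -3.281849 was reported. A negative value simply means that the predictions have a larger squared error than predicting mean(dist) for every car.

```r
data(cars)
model <- lm(dist ~ speed, data = cars)

sst <- sum((cars$dist - mean(cars$dist))^2)   # total sum of squares
sse <- sum(residuals(model)^2)                # residual sum of squares (OLS)

r2.manual  <- 1 - sse / sst
r2.summary <- summary(model)$r.squared        # value reported by lm's summary
r2.cor     <- cor(cars$dist, cars$speed)^2    # squared Pearson correlation

# The same formula applied to a line that is not the least-squares line
fitted.new <- 6 * cars$speed
sse.new <- sum((cars$dist - fitted.new)^2)
r2.new  <- 1 - sse.new / sst                  # negative because sse.new > sst

print(c(manual = r2.manual, summary = r2.summary, cor = r2.cor, new = r2.new))
```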
Re: [R] Correlation question
Did you try taking out P7, which is text? Moreover, if you get a message saying 'the standard deviation is zero', it means that an entire column is constant. By definition, the covariance of a constant with a random variable is 0, and the correlation divides that by a standard deviation of zero, so cor() understandably throws a warning that one or more of your columns are constant.

Applying the following to your data (which I named expd instead), we get

    > sapply(expd[, -12], var)
              P1           P2           P3           P4           P5           P6
        5.43e-01     1.08e+00     5.77e-01     1.08e+00     6.43e-01     5.57e-01
              P8           P9          P10          P11          P12         SITE
        5.73e-01     3.19e+00     5.07e-01     2.50e-01     5.50e+00     2.49e+00
          Errors     warnings       Manual        Total        H_tot        HP1.1
    9.072840e+03 2.081334e+04     7.43e-01 3.823500e+04 3.880250e+03 2.676667e+00
           HP1.2        HP1.3        HP1.4       HP_tot        HO1.1        HO1.2
        0.00e+00 2.008440e+03 3.057067e+02 3.827250e+03     8.40e-01     0.00e+00
           HO1.3        HO1.4       HO_tot        HU1.1        HU1.2        HU1.3
        0.00e+00     0.00e+00     8.40e-01     0.00e+00     2.10e-01     2.27e-01
          HU_tot           HR        L_tot        LP1.1        LP1.2        LP1.3
        6.23e-01     7.43e-01 3.754610e+03 3.209333e+01     0.00e+00 2.065010e+03
           LP1.4       LP_tot        LO1.1        LO1.2        LO1.3        LO1.4
    2.246233e+02 3.590040e+03 3.684000e+01     0.00e+00     0.00e+00     2.84e+00
          LO_tot        LU1.1        LU1.2        LU1.3       LU_tot       LR_tot
        6.00e+01     0.00e+00     1.44e+00 3.626667e+00     8.37e+00     4.94e+00
          SP_tot        SP1.1        SP1.2        SP1.3        SP1.4     SP_tot.1
    6.911067e+02 4.225000e+01     0.00e+00 1.009600e+02 4.161600e+02 3.071600e+02
           SO1.1        SO1.2        SO1.3        SO1.4       SO_tot        SU1.1
        4.54e+00     2.50e-01     0.00e+00     2.10e-01     5.25e+00     0.00e+00
           SU1.2        SU1.3       SU_tot           SR
    1.556667e+00 4.225000e+01 3.504000e+01 4.225000e+01

Which columns are constant?

    > which(sapply(expd[, -12], var) < .Machine$double.eps)
    HP1.2 HO1.2 HO1.3 HO1.4 HU1.1 LP1.2 LO1.2 LO1.3 LU1.1 SP1.2 SO1.3 SU1.1
       19    24    25    26    28    35    40    41    44    51    57    60

I suspect that in your real data set, there aren't so many constant columns, but this is one way to check.

HTH,
Dennis

On Wed, Sep 8, 2010 at 12:35 PM, Stephane Vaucher <vauch...@iro.umontreal.ca> wrote:

> Hi everyone, I'm observing what I believe is weird behaviour when attempting to do something very simple. I want a correlation matrix, but my matrix seems to contain correlation values that are not found when executed on pairs: [...]
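The check for constant columns can be tried on a toy data frame. This is a minimal sketch under stated assumptions: the data and the column names a to d are invented for illustration and are unrelated to the poster's data, and the constant columns are dropped by logical indexing so that nothing is lost when no column is constant.

```r
# Toy data frame; values and column names are made up for illustration
expd <- data.frame(a = c(1, 2, 3, 4),
                   b = c(5, 5, 5, 5),       # constant column
                   c = c(2.5, 0.1, 7, 3),
                   d = c(0, 0, 0, 0))       # constant column

vars <- sapply(expd, var)
constant.cols <- which(vars < .Machine$double.eps)
print(constant.cols)

# Keep only the columns with non-zero variance before calling cor()
keep <- vars >= .Machine$double.eps
cor.mat <- cor(expd[, keep, drop = FALSE], method = "spearman")
print(cor.mat)
```

Using a logical index rather than negative positions avoids the trap where `-integer(0)` selects no columns at all.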
Re: [R] Correlation question
Thank you Dennis,

You identified a factor (text column) that I was concerned with. I simplified my example to try and factor out possible causes. I eliminated the recurring values in columns (which were not the columns that caused problems). I produced three examples with simple data sets.

1. Correct output, 2 columns only:

    > test.notext = read.csv('test-notext.csv')
    > cor(test.notext, method='spearman')
                   P3     HP_tot
    P3      1.0000000 -0.2182876
    HP_tot -0.2182876  1.0000000
    > dput(test.notext)
    structure(list(P3 = c(2L, 2L, 2L, 4L, 2L, 3L, 2L, 1L, 3L, 2L,
    2L, 2L, 3L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 2L, 2L, 1L, 2L),
    HP_tot = c(10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 136L, 136L,
    136L, 136L, 136L, 136L, 136L, 136L, 136L, 136L, 15L, 15L, 15L,
    15L, 15L, 15L, 15L)), .Names = c("P3", "HP_tot"),
    class = "data.frame", row.names = c(NA, -25L))

2. Incorrect output where I introduced my P7 column, containing text (only the 'a' character):

    > test = read.csv('test.csv')
    > cor(test, method='spearman')
                   P3 P7     HP_tot
    P3      1.0000000 NA -0.2502878
    P7             NA  1         NA
    HP_tot -0.2502878 NA  1.0000000
    Warning message:
    In cor(test, method = "spearman") : the standard deviation is zero
    > dput(test)
    structure(list(P3 = c(2L, 2L, 2L, 4L, 2L, 3L, 2L, 1L, 3L, 2L,
    2L, 2L, 3L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 2L, 2L, 1L, 2L),
    P7 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L),
    .Label = "a", class = "factor"), HP_tot = c(10L, 10L, 10L, 10L,
    10L, 10L, 10L, 10L, 136L, 136L, 136L, 136L, 136L, 136L, 136L,
    136L, 136L, 136L, 15L, 15L, 15L, 15L, 15L, 15L, 15L)),
    .Names = c("P3", "P7", "HP_tot"), class = "data.frame",
    row.names = c(NA, -25L))

3. Incorrect output with P7 containing a variety of alphanumeric characters (ASCII), to rule out the equal-valued column issue. Notice that the text column is interpreted as a numeric value.

    > test.number = read.csv('test-alpha.csv')
    > cor(test.number, method='spearman')
                   P3         P7     HP_tot
    P3      1.0000000  0.4093108 -0.2502878
    P7      0.4093108  1.0000000 -0.3807193
    HP_tot -0.2502878 -0.3807193  1.0000000
    > dput(test.number)
    structure(list(P3 = c(2L, 2L, 2L, 4L, 2L, 3L, 2L, 1L, 3L, 2L,
    2L, 2L, 3L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 2L, 2L, 1L, 2L),
    P7 = structure(c(11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 19L,
    20L, 21L, 22L, 23L, 24L, 25L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L,
    9L, 10L), .Label = c("0", "1", "2", "3", "4", "5", "6", "7",
    "8", "9", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j",
    "k", "l", "m", "n", "o"), class = "factor"), HP_tot = c(10L,
    10L, 10L, 10L, 10L, 10L, 10L, 10L, 136L, 136L, 136L, 136L,
    136L, 136L, 136L, 136L, 136L, 136L, 15L, 15L, 15L, 15L, 15L,
    15L, 15L)), .Names = c("P3", "P7", "HP_tot"),
    class = "data.frame", row.names = c(NA, -25L))

Correct output is obtained by avoiding matrix computation of correlation:

    > cor(test.number$P3, test.number$HP_tot, method='spearman')
    [1] -0.2182876

It seems that a text column corrupts my correlation calculation (only in a matrix calculation). I assumed that text columns would not influence the result of the calculations.

Is this correct behaviour? If not, should I submit a bug report? If it is, is there a known workaround?

cheers,
Stephane Vaucher

On Thu, 9 Sep 2010, Dennis Murphy wrote:

> Did you try taking out P7, which is text? Moreover, if you get a message saying ' the standard deviation is zero', it means that the entire column is constant. [...]
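The two numbers reported above can both be reproduced from the posted data. The sketch below rebuilds the two numeric columns from the dput() output and is offered as a possible reading only: the second calculation assumes that the values were ranked as character strings once a text column was present, so that "136" sorts before "15". That mechanism is an inference from the arithmetic and is not confirmed anywhere in the thread.

```r
# Columns rebuilt from the dput() output posted in the thread
P3 <- c(2L, 2L, 2L, 4L, 2L, 3L, 2L, 1L, 3L, 2L, 2L, 2L, 3L,
        1L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 2L, 2L, 1L, 2L)
HP_tot <- rep(c(10L, 136L, 15L), times = c(8, 10, 7))
test.notext <- data.frame(P3 = P3, HP_tot = HP_tot)

# Numeric data: the pairwise call and the matrix call agree
pairwise    <- cor(P3, HP_tot, method = "spearman")
from.matrix <- cor(test.notext, method = "spearman")["P3", "HP_tot"]

# Assumption: ranks taken on the character representation of the values
as.strings <- cor(rank(as.character(P3)), rank(as.character(HP_tot)))

print(c(pairwise = pairwise, from.matrix = from.matrix,
        as.strings = as.strings))
```

If that reading is right, the discrepancy comes from sort order rather than from the correlation formula itself.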
Re: [R] Correlation question
Hi Stephane,

When I use your sample data (e.g., test, test.number), cor() throws an error that x must be numeric (because of the factor or character data). Are you not getting any errors when trying to calculate the correlation on these data? If you are not, I wonder what version of R you are using. The quickest way to find out is sessionInfo().

As far as a workaround goes, it would be relatively simple to find out which columns of your data frame are not numeric or integer and exclude those (I'm happy to provide that code if you want).

Best regards,
Josh

On Thu, Sep 9, 2010 at 7:50 AM, Stephane Vaucher <vauch...@iro.umontreal.ca> wrote:

> Thank you Dennis, You identified a factor (text column) that I was concerned with. I simplified my example to try and factor out possible causes. [...]
Re: [R] Correlation question
Hi Josh,

Initially, I was expecting R to simply ignore non-numeric data. I guess I was wrong... I copy-pasted what I observe, and I do not get an error when calculating correlations with text data. I can also do cor(test.n$P3, test$P7) without an error.

If you have a function to select only numeric columns that you can share with me (and the list), that would be great.

Of course, I'm wondering why your version of R produces different results from mine. I don't know if I should open a bug report. It would be good if someone (other than me) observed this problem in their environment. Here is what I am currently using:

    R version 2.10.1 (2009-12-14)
    x86_64-pc-linux-gnu

    locale:
     [1] LC_CTYPE=en_CA.UTF-8       LC_NUMERIC=C
     [3] LC_TIME=en_CA.UTF-8        LC_COLLATE=en_CA.UTF-8
     [5] LC_MONETARY=C              LC_MESSAGES=en_CA.UTF-8
     [7] LC_PAPER=en_CA.UTF-8       LC_NAME=C
     [9] LC_ADDRESS=C               LC_TELEPHONE=C
    [11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C

    attached base packages:
    [1] stats     graphics  grDevices utils     datasets  methods   base

The behaviour has been observed on:

    > sessionInfo()
    Version 2.3.1 (2006-06-01)
    x86_64-redhat-linux-gnu

    attached base packages:
    [1] methods   stats     graphics  grDevices utils     datasets
    [7] base

as well as on a 32-bit Linux arch, v2.9.0.

Sincere regards,
sv

On Thu, 9 Sep 2010, Joshua Wiley wrote:

> Hi Stephane, When I use your sample data (e.g., test, test.number), cor() throws an error that x must be numeric (because of the factor or character data). [...]
Re: [R] Correlation question
Hi Stephane,

According to the NEWS file, as of 2.11.0: "cor() and cov() now test for misuse with non-numeric arguments, such as the non-bug report PR#14207", so there is no need for a new bug report.

Here is a simple way to select only numeric columns:

    # Sample data
    dat <- data.frame(a = 1:10L, b = runif(10), c = paste(1:10),
      d = rep(TRUE, 10), e = factor(rep("a", 10)), stringsAsFactors = FALSE)

    # (this includes numeric and integer, btw)
    dat[, sapply(dat, is.numeric)]

    # if you wanted to include logicals (which cor() will work with)
    class.test <- function(x) {
      output <- FALSE
      if(is.numeric(x) || is.logical(x)) {
        output <- TRUE}
      return(output)
    }

    # Columns that are numeric or logical
    dat[, sapply(dat, class.test)]

HTH,
Josh

On Thu, Sep 9, 2010 at 10:53 AM, Stephane Vaucher <vauch...@iro.umontreal.ca> wrote:

> Hi Josh, Initially, I was expecting R to simply ignore non-numeric data. I guess I was wrong... [...]
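The column selection can then feed straight into cor(). The following is a minimal usage sketch, not part of the original reply: the sample data are invented, and drop = FALSE is added as a precaution so that the result stays a data frame even when only one numeric column remains.

```r
set.seed(1)
# Invented sample data with a character and a factor column
dat <- data.frame(a = 1:10,
                  b = runif(10),
                  c = paste(1:10),
                  e = factor(rep("a", 10)),
                  stringsAsFactors = FALSE)

# TRUE for numeric and integer columns, FALSE for character and factor
is.num <- sapply(dat, is.numeric)

# drop = FALSE keeps a data frame even if a single column is selected
num.dat <- dat[, is.num, drop = FALSE]

cor.mat <- cor(num.dat, method = "spearman")
print(cor.mat)
```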
Re: [R] Correlation question
On 2010-09-09 11:53, Stephane Vaucher wrote:

> Hi Josh,
>
> Initially, I was expecting R to simply ignore non-numeric data. I guess I
> was wrong... I copy-pasted what I observe, and I do not get an error when
> calculating correlations with text data. I can also do cor(test.n$P3,
> test$P7) without an error.

The first thing to do when you get results that you don't expect is to check the help page. The page for cor clearly states that its input must be a *numeric* vector, matrix or data frame (my emphasis).

I would not be happy if R simply ignored non-numeric data. After all, it's trivial to ensure that you feed only numeric data to cor().

Having said that, I guess others have found cor() problematic when invalid input is supplied, and so R now (as of 2.11.0) issues an error message that 'x' must be numeric. You should always check the latest released version to see if changes have been made. The NEWS file for 2.11.0 contains this:

    cor() and cov() now test for misuse with non-numeric arguments,
    such as the non-bug report PR#14207.

> If you have a function to select only numeric columns that you can share
> with me (and the list), that would be great.
>
> Of course, I'm wondering why your version of R produces different results
> from mine. I don't know if I should open a bug report. It would be good
> if someone

You're doing the right thing by asking here first before reporting. It would definitely not be a good idea to report a (non-)bug in an outdated version of R.

-Peter Ehlers

[rest snipped; not relevant to my comments.]
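The behaviour described for 2.11.0 and later can be checked with a few lines. This is a small sketch, assuming a reasonably current R installation; the data frame is invented for illustration, and the exact wording of the error message may vary with locale, so the message is only printed rather than compared.

```r
# Invented data frame with one numeric column and one factor column
df <- data.frame(x = c(1, 2, 3, 4),
                 f = factor(c("a", "b", "c", "d")))

# Version of the running R, for comparison with the 2.11.0 NEWS entry
has.check <- getRversion() >= "2.11.0"

# Capture the error instead of letting it stop the script;
# msg stays NA if cor() unexpectedly succeeds
msg <- tryCatch({
  cor(df)
  NA_character_
}, error = function(e) conditionMessage(e))

print(has.check)
print(msg)
```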
Re: [R] Correlation question
Hi everyone,

Thanks for the help.

On Thu, 9 Sep 2010, Peter Ehlers wrote:
> The first thing to do when you get results that you don't expect is to
> check the help page. The page for cor clearly states that its input is to
> be a *numeric* vector, matrix or data frame (my emphasis). I would not be
> happy if R simply ignored non-numeric data. After all, it's trivial to
> ensure that you feed only numeric data to cor().

Indeed, the documentation states that it takes a numeric input. It does not state how it reacts to an inappropriate input type. That's why I expected it to produce either an error message or accurate results. I did not expect an incorrect result. I should not have assumed that my expectations would be correct.

> Having said that, I guess others have found cor() problematic when
> non-valid input is supplied and so R now (as of 2.11.0) issues an error
> message that 'x' must be numeric. You should always check the latest
> released version to see if changes have been made. The NEWS file for
> 2.11.0 contains this:
>
>     cor() and cov() now test for misuse with non-numeric arguments, such
>     as the non-bug report PR#14207.
>
> You're doing the right thing by asking here first before reporting. It
> would definitely not be a good idea to report a (non-)bug in an outdated
> version of R.

Since my manipulations were simple, I assumed that others would have observed the same behaviour. In any case, I'm happy that the function now checks that its preconditions are respected. Otherwise, it would have been good to add a note to the documentation stating that cor() can compute garbage when given non-numeric data.

cheers,
Stephane

> -Peter Ehlers
>
> [rest snipped; not relevant to my comments.]
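To illustrate the "text column is interpreted as a numeric value" point discussed in this thread, the sketch below shows how a factor is stored as integer level codes, which is what a numeric conversion such as data.matrix() picks up. The wrapper `cor_checked` is a hypothetical name used only here; it mimics, in user code, the kind of precondition check that the NEWS entry describes, and is not the actual implementation inside cor().

```r
# A factor stores its values as integer codes into the sorted level set.
p7 <- factor(c("b", "a", "c", "a"))
codes <- as.integer(p7)          # level codes, not meaningful numbers
print(codes)

# data.matrix() replaces factors by those internal codes.
m <- data.matrix(data.frame(p7 = p7, x = c(10, 20, 30, 40)))
print(m)

# Hypothetical guard (illustration only): refuse non-numeric columns
# instead of silently correlating level codes.
cor_checked <- function(df, ...) {
  is_num <- vapply(df, is.numeric, logical(1))
  if (!all(is_num)) {
    stop("non-numeric column(s): ", paste(names(df)[!is_num], collapse = ", "))
  }
  cor(df, ...)
}

# try() captures the error so the script keeps running.
res <- try(cor_checked(data.frame(p7 = p7, x = c(10, 20, 30, 40))), silent = TRUE)
```

The level codes depend only on the alphabetical order of the labels, so any correlation computed from them is an artefact of the labelling.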
[R] Correlation question
Hi everyone,

I'm observing what I believe is weird behaviour when attempting to do something very simple. I want a correlation matrix, but my matrix seems to contain correlation values that are not found when executed on pairs:

    test2$P2
     [1] 2 2 4 4 1 3 2 4 3 3 2 3 4 1 2 2 4 3 4 1 2 3 2 1 3
    test2$HP_tot
     [1]  10  10  10  10  10  10  10  10 136 136 136 136 136 136 136 136 136 136  15
    [20]  15  15  15  15  15  15
    c=cor(test2$P3,test2$HP_tot,method='spearman')
    c
    [1] -0.2182876
    c=cor(test2,method='spearman')
    Warning message:
    In cor(test2, method = "spearman") : the standard deviation is zero
    write(c,file='out.csv')

From my spreadsheet: -0.25028783918741

Most cells are correct, but not that one.

If this is expected behaviour, I apologise for bothering you. I read the documentation, but I do not know if the calculation of matrices and pairs is done using the same function (e.g., with respect to equal-value observations).

If this is not desired behaviour: I noticed that it only occurs with a relatively large matrix (I couldn't reproduce it on a simple 2-column data set). There might be a naming error.

    names(test2)
     [1] ID                   NOMBRE               MAIL
     [4] Age                  SEXO                 Studies
     [7] Hours_Internet       Vision.Disabilities  Other.disabilities
    [10] Technology_Knowledge Start_Time           End_Time
    [13] Duration             P1                   P1Book
    [16] P1DVD                P2                   P3
    [19] P4                   P5                   P6
    [22] P8                   P9                   P10
    [25] P11                  P12                  P7
    [28] SITE                 Errors               warnings
    [31] Manual               Total                H_tot
    [34] HP1.1                HP1.2                HP1.3
    [37] HP1.4                HP_tot               HO1.1
    [40] HO1.2                HO1.3                HO1.4
    [43] HO_tot               HU1.1                HU1.2
    [46] HU1.3                HU_tot               HR
    [49] L_tot                LP1.1                LP1.2
    [52] LP1.3                LP1.4                LP_tot
    [55] LO1.1                LO1.2                LO1.3
    [58] LO1.4                LO_tot               LU1.1
    [61] LU1.2                LU1.3                LU_tot
    [64] LR_tot               SP_tot               SP1.1
    [67] SP1.2                SP1.3                SP1.4
    [70] SP_tot.1             SO1.1                SO1.2
    [73] SO1.3                SO1.4                SO_tot
    [76] SU1.1                SU1.2                SU1.3
    [79] SU_tot               SR

Thank you in advance,
Stephane Vaucher
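One way to test the question raised above (whether a cell of the correlation matrix matches the pairwise call) is to compare every cell programmatically. The following is only a sketch: it uses three all-numeric columns (P3, H_tot, HP_tot) taken from the data posted later in this thread, and the variable names `d`, `m` and `max_diff` are chosen for illustration.

```r
# Three numeric columns taken from the data posted in this thread.
d <- data.frame(
  P3 = c(2, 2, 2, 4, 2, 3, 2, 1, 3, 2, 2, 2, 3,
         1, 2, 1, 1, 1, 2, 1, 2, 2, 2, 1, 2),
  H_tot  = rep(c(11, 140, 21), times = c(8, 10, 7)),
  HP_tot = rep(c(10, 136, 15), times = c(8, 10, 7))
)

# Full Spearman correlation matrix.
m <- cor(d, method = "spearman")

# Recompute every off-diagonal cell with a pairwise call and record
# the largest absolute difference between the two approaches.
pairs <- combn(names(d), 2)
diffs <- apply(pairs, 2, function(p) {
  abs(m[p[1], p[2]] - cor(d[[p[1]]], d[[p[2]]], method = "spearman"))
})
max_diff <- max(diffs)
print(max_diff)
```

When all columns are numeric and complete, the two approaches should agree up to floating-point error, so a large `max_diff` points to a problem with the input (for example a non-numeric column) rather than with the pairwise formula.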
Re: [R] Correlation question
Hi,

Does your data have missing values? I am not sure it would change anything, but perhaps try adding:

    cor(test2, method = "spearman", use = "pairwise.complete.obs")

or something of the like. I am not sure what R does by default. My reasoning stems from this particular passage in the documentation:

    If 'use' is '"everything"', 'NA's will propagate conceptually, i.e., a resulting value will be 'NA' whenever one of its contributing observations is 'NA'.

I do not think the names should make a difference (unless you're talking about human error).

Best regards,
Josh

On Wed, Sep 8, 2010 at 12:35 PM, Stephane Vaucher vauch...@iro.umontreal.ca wrote:
> Hi everyone,
>
> I'm observing what I believe is weird behaviour when attempting to do
> something very simple. I want a correlation matrix, but my matrix seems to
> contain correlation values that are not found when executed on pairs:
>
>     c=cor(test2$P3,test2$HP_tot,method='spearman')
>     c
>     [1] -0.2182876
>     c=cor(test2,method='spearman')
>     Warning message:
>     In cor(test2, method = "spearman") : the standard deviation is zero
>     write(c,file='out.csv')
>
> From my spreadsheet: -0.25028783918741
>
> Most cells are correct, but not that one.
>
> [...]

--
Joshua Wiley
Ph.D. Student, Health Psychology
University of California, Los Angeles
http://www.joshuawiley.com/
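The effect of the `use` argument quoted from the documentation can be seen on a toy example. This is a sketch with made-up vectors `x`, `y` and `z` (not data from the thread); it contrasts the default `use = "everything"` with `use = "pairwise.complete.obs"`.

```r
# Toy data (invented for illustration): x has one missing value.
x <- c(1, 2, 3, 4, NA)
y <- c(2, 1, 4, 3, 5)
z <- c(5, 3, 4, 1, 2)
d <- data.frame(x, y, z)

# Default use = "everything": any pair involving x yields NA.
m_all <- cor(d, method = "spearman")

# Pairwise deletion: each cell uses the rows complete for that pair.
m_pair <- cor(d, method = "spearman", use = "pairwise.complete.obs")

print(m_all)
print(m_pair)
```

Cells that do not involve the column with the missing value are the same in both matrices, which is consistent with the report that adding `use = "pairwise.complete.obs"` made no difference for columns without missing values.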
Re: [R] Correlation question
Hi everyone,

First of all, thanks for the quick responses. I appreciate the help. Before answering questions, I wanted to mention that I tested this behaviour on 2.3.1 and 2.10.1 on an x86_64 Linux architecture, and on version 2.9.0 on a 32-bit architecture.

Now for the answers (batch version):

1/ I received another message stating that I mislabeled the data in my previous message. That was a transcription error. I have included a sample of the data causing the problem.

2/ On Wed, 8 Sep 2010, Joshua Wiley wrote:
> Does your data have missing values? I am not sure it would change
> anything, but perhaps try adding:
>
>     cor(test2, method = "spearman", use = "pairwise.complete.obs")

Tried, no difference. The specific pairwise comparisons do not contain missing values (other parts of my data do).

3/ From: Kjetil Halvorsen kjetilbrinchmannhalvor...@gmail.com
> Do dput(test2) and copy-paste the output into the email message.

I had to clean up my data before sending it (had to remove names/emails). The problems are visible in the (Spearman) correlations of P3 with H_tot and HP_tot. In my correlation matrix, these are both -0.25028783918741 instead of -0.2182876. Most other correlations are, however, accurate.
Here is the data:

    dput(exp)
    names(exp)
     [1] P1       P2       P3       P4       P5       P6
     [7] P8       P9       P10      P11      P12      P7
    [13] SITE     Errors   warnings Manual   Total    H_tot
    [19] HP1.1    HP1.2    HP1.3    HP1.4    HP_tot   HO1.1
    [25] HO1.2    HO1.3    HO1.4    HO_tot   HU1.1    HU1.2
    [31] HU1.3    HU_tot   HR       L_tot    LP1.1    LP1.2
    [37] LP1.3    LP1.4    LP_tot   LO1.1    LO1.2    LO1.3
    [43] LO1.4    LO_tot   LU1.1    LU1.2    LU1.3    LU_tot
    [49] LR_tot   SP_tot   SP1.1    SP1.2    SP1.3    SP1.4
    [55] SP_tot.1 SO1.1    SO1.2    SO1.3    SO1.4    SO_tot
    [61] SU1.1    SU1.2    SU1.3    SU_tot   SR

    dput(exp)
    structure(list(
    P1 = c(2L, 1L, 3L, 3L, 2L, 3L, 2L, 3L, 3L, 2L, 2L, 2L, 4L, 1L, 3L, 2L, 3L, 2L, 3L, 2L, 2L, 2L, 2L, 1L, 2L),
    P2 = c(2L, 2L, 4L, 4L, 1L, 3L, 2L, 4L, 3L, 3L, 2L, 3L, 4L, 1L, 2L, 2L, 4L, 3L, 4L, 1L, 2L, 3L, 2L, 1L, 3L),
    P3 = c(2L, 2L, 2L, 4L, 2L, 3L, 2L, 1L, 3L, 2L, 2L, 2L, 3L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 2L, 2L, 1L, 2L),
    P4 = c(1L, 3L, 3L, 4L, 2L, 3L, 1L, 4L, 3L, 3L, 2L, 3L, 4L, 1L, 3L, 2L, 2L, 1L, 2L, 1L, 1L, 2L, 1L, 1L, 2L),
    P5 = c(2L, 1L, 4L, 1L, 2L, 2L, 2L, 3L, 3L, 2L, 2L, 3L, 4L, 1L, 2L, 2L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 3L),
    P6 = c(2L, 2L, 4L, 1L, 2L, 2L, 2L, 2L, 3L, 2L, 2L, 2L, 4L, 1L, 2L, 2L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 3L),
    P8 = c(2L, 2L, 4L, 2L, 2L, 2L, 2L, 4L, 3L, 2L, 2L, 2L, 4L, 1L, 3L, 2L, 3L, 2L, 3L, 2L, 2L, 2L, 2L, 2L, 2L),
    P9 = c(4L, 0L, 4L, 0L, 0L, 2L, 3L, 0L, 0L, 4L, 0L, 4L, 3L, 2L, 4L, 0L, 0L, 0L, 3L, 4L, 3L, 0L, 4L, 0L, 3L),
    P10 = c(3L, 3L, 2L, 2L, 3L, 3L, 0L, 2L, 2L, 2L, 3L, 2L, 2L, 3L, 3L, 3L, 2L, 2L, 2L, 3L, 3L, 2L, 3L, 3L, 3L),
    P11 = c(1L, 1L, 2L, 2L, 1L, 1L, 0L, 2L, 2L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L),
    P12 = c(9L, 10L, 6L, 5L, 9L, 8L, 0L, 5L, 4L, 6L, 8L, 7L, 3L, 10L, 7L, 9L, 7L, 6L, 7L, 10L, 8L, 7L, 8L, 9L, 7L),
    P7 = structure(c(1L, 9L, 7L, 8L, 1L, 3L, 1L, 1L, 5L, 4L, 1L, 1L, 1L, 1L, 1L, 6L, 1L, 1L, 1L, 11L, 10L, 1L, 2L, 1L, 1L), .Label = c( ,  al inicio de la págia conexión.,  al principio no sabes muy bien por donde empezar pero una vez aclarado es facil,  busqueda diferente alas conocidas,  CUando el título contiene dos puntos,no responde.,  Lass,suelen dar facilidades para buscar lo que se necesita.,  La verdad es que está un poco confusa la web y, si la persona que tiene que acceder a ella no es experta, creo que lo tiene difícil,  muy complicado, el poder encontrar los citados documentos, en la UNIVERSIDAD citada,,  Ninguna,  Que no hay pestañas q de pelicula,  yo ninguna), class = factor),
    SITE = c(5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 1L),
    Errors = c(201L, 201L, 201L, 201L, 201L, 201L, 201L, 201L, 369L, 369L, 369L, 369L, 369L, 369L, 369L, 369L, 369L, 369L, 159L, 159L, 159L, 159L, 159L, 159L, 159L),
    warnings = c(164L, 164L, 164L, 164L, 164L, 164L, 164L, 164L, 447L, 447L, 447L, 447L, 447L, 447L, 447L, 447L, 447L, 447L, 490L, 490L, 490L, 490L, 490L, 490L, 490L),
    Manual = c(44L, 44L, 44L, 44L, 44L, 44L, 44L, 44L, 46L, 46L, 46L, 46L, 46L, 46L, 46L, 46L, 46L, 46L, 45L, 45L, 45L, 45L, 45L, 45L, 45L),
    Total = c(409L, 409L, 409L, 409L, 409L, 409L, 409L, 409L, 862L, 862L, 862L, 862L, 862L, 862L, 862L, 862L, 862L, 862L, 694L, 694L, 694L, 694L, 694L, 694L, 694L),
    H_tot = c(11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 140L, 140L, 140L, 140L, 140L, 140L, 140L, 140L, 140L, 140L, 21L, 21L, 21L, 21L, 21L, 21L, 21L),
    HP1.1 = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 4L, 4L, 4L, 4L, 4L, 4L, 4L),
    HP1.2 = c(0L, 0L,
Re: [R] Correlation question (from a newbie)
Dear R-users,

Please ignore my most recent posting.. Found the solution.. Thanks to David Winsemius..

Thanks,
Santosh

On Tue, Jul 14, 2009 at 9:14 PM, Santosh santosh2...@gmail.com wrote:
> Dear R-users..
>
> I hope the following scenario is more explanatory of my question..
>
> Continuous variables: AGE, WEIGHT, HEIGHT
> Categorical variables: Group, Sex, Race
>
> I would like to find a correlation between WEIGHT and AGE, grouped by
> Group, Sex, and Race.
>
> Is the following formula correct?
>
>     tapply(dat$WEIGHT, by=list(dat$AGE,as.factor(dat$Group),as.factor(dat$EX),as.factor(dat$RACE)),cor)
>
> Thanks,
> Santosh
>
> On Tue, Jul 14, 2009 at 7:34 PM, Santosh santosh2...@gmail.com wrote:
>> Hi R-users,
>>
>> Was wondering if there is a way to quickly compute correlations between
>> continuous variables grouped by some categorical variables? What function
>> do I use?
>>
>> Thanks much in advance.
>>
>> Regards,
>> Santosh
[R] Correlation question (from a newbie)
Hi R-users,

Was wondering if there is a way to quickly compute correlations between continuous variables grouped by some categorical variables? What function do I use?

Thanks much in advance.

Regards,
Santosh
Re: [R] Correlation question (from a newbie)
On Jul 14, 2009, at 10:34 PM, Santosh wrote:
> Hi R-users,
>
> Was wondering if there is a way to quickly compute correlations between
> continuous variables grouped by some categorical variables? What function
> do I use?

    ?tapply
    ?by

> Thanks much in advance.
>
> Regards,
> Santosh

David Winsemius, MD
Heritage Laboratories
West Hartford, CT
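As an illustration of the by() pointer above, here is a minimal sketch on invented data. The data frame `dat` and its values are assumptions made for this example; only the column names AGE, WEIGHT and Group echo the thread. by() hands each group's rows to the function as a data frame, so cor() can see both columns at once.

```r
# Invented example data: two groups with opposite AGE/WEIGHT trends.
dat <- data.frame(
  Group  = rep(c("A", "B"), each = 4),
  AGE    = c(20, 30, 40, 50, 20, 30, 40, 50),
  WEIGHT = c(60, 65, 70, 75, 80, 75, 70, 65)
)

# One correlation per group; each call receives that group's sub-data-frame.
res <- by(dat, dat$Group, function(d) cor(d$AGE, d$WEIGHT))
print(res)

# Plain numeric vector of the per-group correlations (order: A, B).
r_by_group <- as.vector(res)
```

In this toy data the relationship is exactly linear within each group, increasing in A and decreasing in B, so the two correlations have opposite signs.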
Re: [R] Correlation question (from a newbie)
Dear R-users..

I hope the following scenario is more explanatory of my question..

Continuous variables: AGE, WEIGHT, HEIGHT
Categorical variables: Group, Sex, Race

I would like to find a correlation between WEIGHT and AGE, grouped by Group, Sex, and Race.

Is the following formula correct?

    tapply(dat$WEIGHT, by=list(dat$AGE,as.factor(dat$Group),as.factor(dat$EX),as.factor(dat$RACE)),cor)

Thanks,
Santosh

On Tue, Jul 14, 2009 at 7:34 PM, Santosh santosh2...@gmail.com wrote:
> Hi R-users,
>
> Was wondering if there is a way to quickly compute correlations between
> continuous variables grouped by some categorical variables? What function
> do I use?
>
> Thanks much in advance.
>
> Regards,
> Santosh
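A note on the formula asked about above, offered as a sketch rather than as the solution the poster eventually found (which the thread does not show): tapply() takes its grouping through the `INDEX` argument, not `by`, and it passes only the pieces of a single vector to the function, so cor() would receive one vector and no second variable. One alternative is to split the whole data frame by the combination of factors and compute the correlation inside each piece. The data below are invented, and only two grouping factors (Group, SEX) are used to keep the example short; a third such as RACE could be added to the list in the same way.

```r
# Invented example data with two grouping factors.
dat <- data.frame(
  Group  = rep(c("G1", "G2"), each = 6),
  SEX    = rep(rep(c("F", "M"), each = 3), times = 2),
  AGE    = rep(c(20, 30, 40), times = 4),
  WEIGHT = c(50, 55, 60,   70, 80, 90,
             65, 60, 55,   85, 85.5, 90)
)

# Split the rows by every Group/SEX combination; drop = TRUE discards
# combinations that do not occur in the data.
pieces <- split(dat, list(dat$Group, dat$SEX), drop = TRUE)

# One AGE/WEIGHT correlation per piece; names look like "G1.F".
r <- sapply(pieces, function(d) cor(d$AGE, d$WEIGHT))
print(r)
```

Each piece needs at least two rows with some variation in both variables, otherwise cor() returns NA for that combination.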