Re: [R] Correlation question

2015-02-22 Thread David L Carlson

As Kehl pointed out, any linear function of the independent variable (speed) 
will have the same squared correlation with the dependent variable (dist), but 
only one linear function minimizes the squared deviations between the fitted 
values and the original values. The equation you are using is only applicable 
to that function, not to any of the others. In fact, some linear functions will 
produce negative values:

> fitted.new <- 6*cars$speed
> cor(cbind(fitted.new, fitted.right, fitted.wrong, cars$dist))
             fitted.new fitted.right fitted.wrong
fitted.new    1.0000000    1.0000000    1.0000000 0.8068949
fitted.right  1.0000000    1.0000000    1.0000000 0.8068949
fitted.wrong  1.0000000    1.0000000    1.0000000 0.8068949
              0.8068949    0.8068949    0.8068949 1.0000000
> 1-sum((cars$dist-fitted.new)^2)/sum((cars$dist-mean(cars$dist))^2)
[1] -3.281849
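
The comparison above can be reproduced in one self-contained script. This is only a sketch: the helper name r2_from_sse and the list name preds are made up for illustration and do not appear in the thread.

```r
# Built-in data set: stopping distance (dist) versus speed.
data(cars)

# 1 - SSE/SST for an arbitrary vector of predictions (hypothetical helper).
r2_from_sse <- function(y, pred) {
  1 - sum((y - pred)^2) / sum((y - mean(y))^2)
}

fitted.right <- fitted(lm(dist ~ speed, data = cars))  # OLS fit
fitted.wrong <- -17 + 5 * cars$speed                   # hand-picked line
fitted.new   <- 6 * cars$speed                         # another hand-picked line

preds <- list(right = fitted.right, wrong = fitted.wrong, new = fitted.new)

# Squared correlation with dist: identical for all three lines.
cor.sq <- sapply(preds, function(p) cor(cars$dist, p)^2)

# 1 - SSE/SST: agrees with cor.sq only for the OLS fit.
sse.r2 <- sapply(preds, function(p) r2_from_sse(cars$dist, p))

print(round(cbind(cor.sq, sse.r2), 4))
```

The two columns agree only in the row for the OLS fit; for the other lines the second column is smaller and can drop below zero.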

David L. Carlson
Department of Anthropology
Texas A&M University

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] Correlation question

2015-02-21 Thread Jonathan Thayn

I recently compared two different approaches to calculating the correlation of 
two variables, and I cannot explain the different results: 

data(cars)
model <- lm(dist~speed,data=cars)
coef(model)
fitted.right <- model$fitted
fitted.wrong <- -17+5*cars$speed


When using the OLS fitted values, the lines below all return the same R2 value:

1-sum((cars$dist-fitted.right)^2)/sum((cars$dist-mean(cars$dist))^2)
cor(cars$dist,fitted.right)^2
(sum((cars$dist-mean(cars$dist))*(fitted.right-mean(fitted.right)))/(49*sd(cars$dist)*sd(fitted.right)))^2


However, when I use my estimated parameters to find the fitted values, 
fitted.wrong, the first equation returns a much lower R2 value, which I would 
expect since the fit is worse, but the other lines return the same R2 that I 
get when using the OLS fitted values.

1-sum((cars$dist-fitted.wrong)^2)/sum((cars$dist-mean(cars$dist))^2)
cor(x=cars$dist,y=fitted.wrong)^2
(sum((cars$dist-mean(cars$dist))*(fitted.wrong-mean(fitted.wrong)))/(49*sd(cars$dist)*sd(fitted.wrong)))^2


I'm sure I'm missing something simple, but can someone explain the difference 
between these two methods of finding R2? Thanks.

Jon

Re: [R] Correlation question

2015-02-21 Thread Kehl Dániel

Hi,

try

cor(fitted.right,fitted.wrong)

This should give 1, since both are linear functions of speed with positive 
slopes. Hence cor(cars$dist,fitted.right)^2 and 
cor(x=cars$dist,y=fitted.wrong)^2 must be the same.

HTH
d
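
A minimal sketch of why this holds, assuming nothing beyond the built-in cars data: correlation is unchanged by adding a constant or multiplying by a positive constant. The intercepts and slopes below are arbitrary illustration values, not taken from the thread (apart from -17 and 5).

```r
data(cars)
x <- cars$speed
y <- cars$dist

# Hypothetical intercepts and slopes, chosen only for illustration.
a <- c(-17, 0, 100)
b <- c(5, 6, 0.01)

# cor(a + b*x, y) for each pair; all equal cor(x, y) because every b > 0.
r <- mapply(function(a, b) cor(a + b * x, y), a, b)
print(r)
print(cor(x, y))

# A negative slope flips the sign of r but leaves r^2 unchanged.
r.neg <- cor(3 - 2 * x, y)
print(c(r.neg, r.neg^2, cor(x, y)^2))
```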


Re: [R] Correlation question

2015-02-21 Thread Jonathan Thayn

Of course! Thank you, I knew I was missing something painfully obvious. It 
seems, then, that this line

1-sum((cars$dist-fitted.wrong)^2)/sum((cars$dist-mean(cars$dist))^2)

is finding something other than the traditional correlation. I found this in a 
lecture introducing correlation, but now I'm not sure what it is. It does do 
a better job of showing that the fitted.wrong variable is not a good prediction 
of the distance. 




Re: [R] Correlation question

2010-09-09 Thread Dennis Murphy

Did you try taking out P7, which is text? Moreover, if you get a message
saying 'the standard deviation is zero', it means that at least one entire
column is constant. The covariance of a constant with any variable is 0, and
its standard deviation is also 0, so the correlation is undefined (0/0);
cor() understandably throws a warning that one or more of your columns are
constant. Applying the following to your data (which I named expd instead),
we get

sapply(expd[, -12], var)
          P1           P2           P3           P4           P5           P6
    5.43e-01     1.08e+00     5.77e-01     1.08e+00     6.43e-01     5.57e-01
          P8           P9          P10          P11          P12         SITE
    5.73e-01     3.19e+00     5.07e-01     2.50e-01     5.50e+00     2.49e+00
      Errors     warnings       Manual        Total        H_tot        HP1.1
9.072840e+03 2.081334e+04     7.43e-01 3.823500e+04 3.880250e+03 2.676667e+00
       HP1.2        HP1.3        HP1.4       HP_tot        HO1.1        HO1.2
    0.00e+00 2.008440e+03 3.057067e+02 3.827250e+03     8.40e-01     0.00e+00
       HO1.3        HO1.4       HO_tot        HU1.1        HU1.2        HU1.3
    0.00e+00     0.00e+00     8.40e-01     0.00e+00     2.10e-01     2.27e-01
      HU_tot           HR        L_tot        LP1.1        LP1.2        LP1.3
    6.23e-01     7.43e-01 3.754610e+03 3.209333e+01     0.00e+00 2.065010e+03
       LP1.4       LP_tot        LO1.1        LO1.2        LO1.3        LO1.4
2.246233e+02 3.590040e+03 3.684000e+01     0.00e+00     0.00e+00     2.84e+00
      LO_tot        LU1.1        LU1.2        LU1.3       LU_tot       LR_tot
    6.00e+01     0.00e+00     1.44e+00 3.626667e+00     8.37e+00     4.94e+00
      SP_tot        SP1.1        SP1.2        SP1.3        SP1.4     SP_tot.1
6.911067e+02 4.225000e+01     0.00e+00 1.009600e+02 4.161600e+02 3.071600e+02
       SO1.1        SO1.2        SO1.3        SO1.4       SO_tot        SU1.1
    4.54e+00     2.50e-01     0.00e+00     2.10e-01     5.25e+00     0.00e+00
       SU1.2        SU1.3       SU_tot           SR
1.556667e+00 4.225000e+01 3.504000e+01 4.225000e+01

Which columns are constant?
which(sapply(expd[, -12], var) < .Machine$double.eps)
HP1.2 HO1.2 HO1.3 HO1.4 HU1.1 LP1.2 LO1.2 LO1.3 LU1.1 SP1.2 SO1.3 SU1.1
   19    24    25    26    28    35    40    41    44    51    57    60

I suspect that in your real data set, there aren't so many constant columns,
but this is one way to check.
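
The same check can be sketched on a small stand-in data frame. The data frame and its values are invented for illustration (only the column names echo the thread), and the numeric-only filter is an added assumption so that var() is never applied to a text column.

```r
# Toy data frame standing in for the poster's data (values are made up).
expd <- data.frame(
  P1    = c(1, 2, 3, 4),
  HP1.2 = c(0, 0, 0, 0),          # constant numeric column
  P7    = c("a", "a", "a", "a")   # text column, skipped below
)

# Restrict to numeric columns first, so var() is never applied to text.
num <- expd[sapply(expd, is.numeric)]

# A column is treated as constant when its variance is numerically zero.
is.const <- sapply(num, function(col) var(col) < .Machine$double.eps)
const.cols <- names(num)[is.const]
print(const.cols)

# Only the remaining columns are usable in a correlation matrix.
usable <- num[!is.const]
print(names(usable))
```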

HTH,
Dennis

On Wed, Sep 8, 2010 at 12:35 PM, Stephane Vaucher <vauch...@iro.umontreal.ca>
wrote:

 Hi everyone,

 I'm observing what I believe is weird behaviour when attempting to do
 something very simple. I want a correlation matrix, but my matrix seems to
 contain correlation values that are not found when executed on pairs:

  test2$P2

  [1] 2 2 4 4 1 3 2 4 3 3 2 3 4 1 2 2 4 3 4 1 2 3 2 1 3

 test2$HP_tot

  [1]  10  10  10  10  10  10  10  10 136 136 136 136 136 136 136 136 136 136  15
 [20]  15  15  15  15  15  15

 c=cor(test2$P3,test2$HP_tot,method='spearman')

 c

 [1] -0.2182876

 c=cor(test2,method='spearman')

 Warning message:
 In cor(test2, method = "spearman") : the standard deviation is zero

 write(c,file='out.csv')


 from my spreadsheet
 -0.25028783918741

 Most cells are correct, but not that one.

 If this is expected behaviour, I apologise for bothering you, I read the
 documentation, but I do not know if the calculation of matrices and pairs is
 done using the same function (eg, with respect to equal value observations).

 If this is not a desired behaviour, I noticed that it only occurs with a
 relatively large matrix (I couldn't reproduce on a simple 2 column data
 set). There might be a naming error.

  names(test2)

  [1] "ID"                   "NOMBRE"              "MAIL"
  [4] "Age"                  "SEXO"                "Studies"
  [7] "Hours_Internet"       "Vision.Disabilities" "Other.disabilities"
 [10] "Technology_Knowledge" "Start_Time"          "End_Time"
 [13] "Duration"             "P1"                  "P1Book"
 [16] "P1DVD"                "P2"                  "P3"
 [19] "P4"                   "P5"                  "P6"
 [22] "P8"                   "P9"                  "P10"
 [25] "P11"                  "P12"                 "P7"
 [28] "SITE"                 "Errors"              "warnings"
 [31] "Manual"               "Total"               "H_tot"
 [34] "HP1.1"                "HP1.2"               "HP1.3"
 [37] "HP1.4"                "HP_tot"              "HO1.1"
 [40] "HO1.2"                "HO1.3"               "HO1.4"
 [43] "HO_tot"               "HU1.1"               "HU1.2"
 [46] "HU1.3"                "HU_tot"              "HR"
 [49] "L_tot"                "LP1.1"               "LP1.2"
 [52] "LP1.3"                "LP1.4"               "LP_tot"
 [55] "LO1.1"                "LO1.2"               "LO1.3"
 [58] "LO1.4"                "LO_tot"              "LU1.1"
 [61] "LU1.2"                "LU1.3"               "LU_tot"
 [64] "LR_tot"               "SP_tot"              "SP1.1"
 [67] "SP1.2"                "SP1.3"               "SP1.4"
 [70] "SP_tot.1"             "SO1.1"               "SO1.2"
 [73] "SO1.3"                "SO1.4"

Re: [R] Correlation question

2010-09-09 Thread Stephane Vaucher


Thank you Dennis,

You identified a factor (text column) that I was concerned with. 
I simplified my example to try and factor out possible causes. I 
eliminated the recurring values in columns (which were not the columns 
that caused problems). I produced three examples with simple data sets.


1. Correct output, 2 columns only:


test.notext = read.csv('test-notext.csv')
cor(test.notext, method='spearman')

               P3     HP_tot
P3      1.0000000 -0.2182876
HP_tot -0.2182876  1.0000000

dput(test.notext)

structure(list(P3 = c(2L, 2L, 2L, 4L, 2L, 3L, 2L, 1L, 3L, 2L,
2L, 2L, 3L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 2L, 2L, 1L, 2L),
HP_tot = c(10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 136L,
136L, 136L, 136L, 136L, 136L, 136L, 136L, 136L, 136L, 15L,
15L, 15L, 15L, 15L, 15L, 15L)), .Names = c("P3", "HP_tot"
), class = "data.frame", row.names = c(NA, -25L))

2. Incorrect output after I introduced my P7 column, which contains only 
the text character 'a':



test = read.csv('test.csv')
cor(test, method='spearman')

               P3 P7     HP_tot
P3      1.0000000 NA -0.2502878
P7             NA  1         NA
HP_tot -0.2502878 NA  1.0000000
Warning message:
In cor(test, method = "spearman") : the standard deviation is zero

dput(test)

structure(list(P3 = c(2L, 2L, 2L, 4L, 2L, 3L, 2L, 1L, 3L, 2L,
2L, 2L, 3L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 2L, 2L, 1L, 2L),
P7 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L
), .Label = "a", class = "factor"), HP_tot = c(10L, 10L,
10L, 10L, 10L, 10L, 10L, 10L, 136L, 136L, 136L, 136L, 136L,
136L, 136L, 136L, 136L, 136L, 15L, 15L, 15L, 15L, 15L, 15L,
15L)), .Names = c("P3", "P7", "HP_tot"), class = "data.frame",
row.names = c(NA, -25L))

3. Incorrect output with P7 containing a variety of alphanumeric (ASCII) 
characters, to rule out the constant-column issue. Notice that the text 
column is interpreted as a numeric value.



test.number = read.csv('test-alpha.csv')
cor(test.number, method='spearman')

               P3         P7     HP_tot
P3      1.0000000  0.4093108 -0.2502878
P7      0.4093108  1.0000000 -0.3807193
HP_tot -0.2502878 -0.3807193  1.0000000

dput(test.number)

structure(list(P3 = c(2L, 2L, 2L, 4L, 2L, 3L, 2L, 1L, 3L, 2L,
2L, 2L, 3L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 2L, 2L, 1L, 2L),
P7 = structure(c(11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L,
19L, 20L, 21L, 22L, 23L, 24L, 25L, 1L, 2L, 3L, 4L, 5L, 6L,
7L, 8L, 9L, 10L), .Label = c("0", "1", "2", "3", "4", "5",
"6", "7", "8", "9", "a", "b", "c", "d", "e", "f", "g", "h",
"i", "j", "k", "l", "m", "n", "o"), class = "factor"), HP_tot = c(10L,
10L, 10L, 10L, 10L, 10L, 10L, 10L, 136L, 136L, 136L, 136L,
136L, 136L, 136L, 136L, 136L, 136L, 15L, 15L, 15L, 15L, 15L,
15L, 15L)), .Names = c("P3", "P7", "HP_tot"), class = "data.frame",
row.names = c(NA, -25L))

Correct output is obtained by avoiding matrix computation of correlation:

cor(test.number$P3, test.number$HP_tot, method='spearman')

[1] -0.2182876

It seems that a text column corrupts my correlation calculation (only in a 
matrix calculation). I assumed that text columns would not influence the 
result of the calculations.
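
One likely reason a text column can show up as numbers: a factor is stored as integer codes plus labels, and data.matrix() replaces a factor by those codes. The sketch below only demonstrates that storage; whether the cor() of these older R versions converted the data frame in exactly this way is an assumption, not something the thread confirms. The toy values are made up.

```r
# A factor stores integer codes plus a set of labels.
P7 <- factor(c("a", "b", "0", "1"))
print(levels(P7))          # the labels
codes <- as.integer(P7)    # the underlying integer codes
print(codes)

# data.matrix() replaces factors by their codes when building a matrix.
df <- data.frame(P3 = c(2L, 4L, 1L, 3L), P7 = P7)
m <- data.matrix(df)
print(m)
print(is.numeric(m))       # the text column is now indistinguishable from numbers
```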


Is this correct behaviour? If not, should I submit a bug report? If it is, 
is there a known workaround?


cheers,
Stephane Vaucher


Re: [R] Correlation question

2010-09-09 Thread Joshua Wiley

Hi Stephane,

When I use your sample data (e.g., test, test.number), cor() throws an
error that x must be numeric (because of the factor or character
data).  Are you not getting any errors when trying to calculate the
correlation on these data?  If you are not, I wonder what version of R
you are using.  The quickest way to find out is sessionInfo().

As far as a workaround goes, it would be relatively simple to find out which
columns of your data frame are not numeric or integer and exclude
those (I'm happy to provide that code if you want).

Best regards,

Josh


Re: [R] Correlation question

2010-09-09 Thread Stephane Vaucher


Hi Josh,

Initially, I was expecting R to simply ignore non-numeric data. I guess I 
was wrong... I copy-pasted what I observe, and I do not get an error when 
calculating correlations with text data. I can also do cor(test.n$P3, 
test$P7) without an error.


If you have a function to select only numeric columns that 
you can share with me (and the list), that would be great. Of course, I'm 
wondering why your version of R produces different results from mine. I 
don't know if I should open a bug report. It would be good if someone 
(other than me) observed this problem in their environment.


Here is what I am currently using:

R version 2.10.1 (2009-12-14)
x86_64-pc-linux-gnu

locale:
 [1] LC_CTYPE=en_CA.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_CA.UTF-8        LC_COLLATE=en_CA.UTF-8
 [5] LC_MONETARY=C              LC_MESSAGES=en_CA.UTF-8
 [7] LC_PAPER=en_CA.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics  grDevices utils datasets  methods   base

The behaviour has been observed on:

sessionInfo()

Version 2.3.1 (2006-06-01)
x86_64-redhat-linux-gnu

attached base packages:
[1] methods   stats graphics  grDevices utils datasets
[7] base

As well as on a 32 bit linux arch v2.9.0.

Sincere regards,
sv


Re: [R] Correlation question

2010-09-09 Thread Joshua Wiley

Hi Stephane,

According to the NEWS file, as of 2.11.0: "cor() and cov() now test
for misuse with non-numeric arguments, such as the non-bug report
PR#14207", so there is no need for a new bug report.

Here is a simple way to select only numeric columns:

# Sample data
dat <- data.frame(a = 1:10L, b = runif(10), c = paste(1:10),
  d = rep(TRUE, 10), e = factor(rep("a", 10)),
  stringsAsFactors = FALSE)

# (this includes numeric and integer, btw)
dat[, sapply(dat, is.numeric)]

# if you wanted to include logicals (which cor() will work with)

class.test <- function(x) {
  output <- FALSE
  # || is the scalar "or", which is what an if() condition needs
  if(is.numeric(x) || is.logical(x)) {
    output <- TRUE}
  return(output)
}

# Columns that are numeric or logical
dat[, sapply(dat, class.test)]
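
As a usage sketch, the numeric-only selection can be passed straight to cor(). The toy data below are arbitrary (column b is deliberately a monotone function of a), and drop = FALSE is an added safeguard so the result stays a data frame even if only one numeric column remains.

```r
# Toy data mirroring the sample above (values are arbitrary).
dat <- data.frame(a = 1:10, b = (1:10)^2, c = paste(1:10),
                  e = factor(rep("a", 10)),
                  stringsAsFactors = FALSE)

# Keep numeric/integer columns; drop = FALSE guards the one-column case.
num <- dat[, sapply(dat, is.numeric), drop = FALSE]

# Spearman correlation matrix on the numeric columns only.
cm <- cor(num, method = "spearman")
print(cm)
```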

HTH,


Josh

On Thu, Sep 9, 2010 at 10:53 AM, Stephane Vaucher
vauch...@iro.umontreal.ca wrote:
 Hi Josh,

 Initially, I was expecting R to simply ignore non-numeric data. I guess I
 was wrong... I copy-pasted what I observe, and I do not get an error when
 calculating correlations with text data. I can also do cor(test.n$P3,
 test$P7) without an error.

 If you have a function to select only numeric columns that you can share
 with me (and the list), that would be great. Of course, I'm wondering why
 your version of R produces different results from mine. I don't know if I
 should open a bug report. It would be good if someone (other than me)
 observed this problem in their environment.

 Here is what I am currently using:

 R version 2.10.1 (2009-12-14)
 x86_64-pc-linux-gnu

 locale:
  [1] LC_CTYPE=en_CA.UTF-8       LC_NUMERIC=C
  [3] LC_TIME=en_CA.UTF-8        LC_COLLATE=en_CA.UTF-8
  [5] LC_MONETARY=C              LC_MESSAGES=en_CA.UTF-8
  [7] LC_PAPER=en_CA.UTF-8       LC_NAME=C
  [9] LC_ADDRESS=C               LC_TELEPHONE=C
 [11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C

 attached base packages:
 [1] stats     graphics  grDevices utils     datasets  methods   base

 The behaviour has been observed on:

 sessionInfo()

 Version 2.3.1 (2006-06-01)
 x86_64-redhat-linux-gnu

 attached base packages:
 [1] methods   stats     graphics  grDevices utils     datasets
 [7] base

 As well as on a 32 bit linux arch v2.9.0.

 Sincere regards,
 sv

 On Thu, 9 Sep 2010, Joshua Wiley wrote:

 Hi Stephane,

 When I use your sample data (e.g., test, test.number), cor() throws an
 error that x must be numeric (because of the factor or character
 data).  Are you not getting any errors when trying to calculate the
 correlation on these data?  If you are not, I wonder what version of R
 are you using?  The quickest way to find out is sessionInfo().

 As far as a work around, it would be relative simple to find out which
 columns of your data frame were not numeric or integer and exclude
 those (I'm happy to provide that code if you want).

 Best regards,

 Josh

 On Thu, Sep 9, 2010 at 7:50 AM, Stephane Vaucher
 vauch...@iro.umontreal.ca wrote:

 Thank you Dennis,

 You identified a factor (text column) that I was concerned with. I
 simplified my example to try and factor out possible causes. I eliminated
 the recurring values in columns (which were not the columns that caused
 problems). I produced three examples with simple data sets.

 1. Correct output, 2 columns only:

 test.notext = read.csv('test-notext.csv')
 cor(test.notext, method='spearman')

                P3     HP_tot
 P3      1.0000000 -0.2182876
 HP_tot -0.2182876  1.0000000

 dput(test.notext)

 structure(list(P3 = c(2L, 2L, 2L, 4L, 2L, 3L, 2L, 1L, 3L, 2L,
 2L, 2L, 3L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 2L, 2L, 1L, 2L),
    HP_tot = c(10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 136L,
    136L, 136L, 136L, 136L, 136L, 136L, 136L, 136L, 136L, 15L,
     15L, 15L, 15L, 15L, 15L, 15L)), .Names = c("P3", "HP_tot"
 ), class = "data.frame", row.names = c(NA, -25L))

 2. Incorrect output where I introduced my P7 column, which contains text
 (only the 'a' character):

 test = read.csv('test.csv')
 cor(test, method='spearman')

                P3 P7     HP_tot
 P3      1.0000000 NA -0.2502878
 P7             NA  1         NA
 HP_tot -0.2502878 NA  1.0000000
 Warning message:
 In cor(test, method = "spearman") : the standard deviation is zero

 dput(test)

 structure(list(P3 = c(2L, 2L, 2L, 4L, 2L, 3L, 2L, 1L, 3L, 2L,
 2L, 2L, 3L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 2L, 2L, 1L, 2L),
    P7 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L
     ), .Label = "a", class = "factor"), HP_tot = c(10L, 10L,
     10L, 10L, 10L, 10L, 10L, 10L, 136L, 136L, 136L, 136L, 136L,
     136L, 136L, 136L, 136L, 136L, 15L, 15L, 15L, 15L, 15L, 15L,
     15L)), .Names = c("P3", "P7", "HP_tot"), class = "data.frame",
 row.names = c(NA, -25L))

 3. Incorrect output with P7 containing a variety of alphanumeric (ASCII)
 characters, to rule out the equal-valued column issue. Notice that the text
 column is interpreted as a numeric value.

 test.number = read.csv('test-alpha.csv')
 cor(test.number,

Re: [R] Correlation question

2010-09-09 Thread Peter Ehlers


On 2010-09-09 11:53, Stephane Vaucher wrote:

Hi Josh,

Initially, I was expecting R to simply ignore non-numeric data. I guess I
was wrong... I copy-pasted what I observe, and I do not get an error when


The first thing to do when you get results that you don't expect is
to check the help page. The page for cor clearly states that its
input is to be a *numeric* vector, matrix or data frame (my emphasis).
I would not be happy if R simply ignored non-numeric data. After all,
it's trivial to ensure that you feed only numeric data to cor().
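
A minimal sketch of one way to feed only numeric columns to cor(), for
readers following along. The data frame `df` and the helper name
`numeric_only` are made up for illustration and are not from the thread.

```r
# Hypothetical example data: two numeric columns and one factor column.
df <- data.frame(
  P3     = c(2L, 2L, 4L, 1L, 3L, 1L),
  P7     = factor(c("a", "b", "a", "c", "a", "b")),
  HP_tot = c(10L, 10L, 136L, 136L, 15L, 15L)
)

# Keep only the columns for which is.numeric() is TRUE
# (integer and double columns qualify; factors and characters do not).
numeric_only <- function(d) d[, sapply(d, is.numeric), drop = FALSE]

num_df <- numeric_only(df)
names(num_df)                      # "P3" "HP_tot"
cor(num_df, method = "spearman")   # 2 x 2 correlation matrix
```

The `drop = FALSE` keeps the result a data frame even if only one numeric
column remains.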

Having said that, I guess others have found cor() problematic when
non-valid input is supplied and so R now (as of 2.11.0) issues an
error message that 'x' must be numeric. You should always check the
latest released version to see if changes have been made. The NEWS
file for 2.11.0 contains this:

  cor() and cov() now test for misuse with non-numeric
  arguments, such as the non-bug report PR#14207.
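
A small sketch of what that check looks like from the user's side on a
current R version. The data frame is invented for illustration; only the
error behaviour is the point.

```r
# Hypothetical data frame with one numeric and one factor column.
df <- data.frame(x = c(1, 2, 3, 4), g = factor(c("a", "b", "a", "b")))

# In R >= 2.11.0, cor() stops with an error for non-numeric input.
# tryCatch() turns that error into a value that can be inspected.
res <- tryCatch(cor(df), error = function(e) e)

inherits(res, "error")   # TRUE on current R versions
conditionMessage(res)    # message saying that 'x' must be numeric
```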


calculating correlations with text data. I can also do cor(test.n$P3,
test$P7) without an error.

If you have a function to select only numeric columns that
you can share with me (and the list), that would be great. Of course, I'm
wondering why your version of R produces different results from mine. I
don't know if I should open a bug report. It would be good if someone


You're doing the right thing by asking here first before reporting.
It would definitely not be a good idea to report a (non-)bug
in an outdated version of R.

  -Peter Ehlers

[rest snipped; not relevant to my comments.]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: [R] Correlation question

2010-09-09 Thread Stephane Vaucher


Hi everyone,

Thanks for the help.

On Thu, 9 Sep 2010, Peter Ehlers wrote:


The first thing to do when you get results that you don't expect is
to check the help page. The page for cor clearly states that its
input is to be a *numeric* vector, matrix or data frame (my emphasis).
I would not be happy if R simply ignored non-numeric data. After all,
it's trivial to ensure that you feed only numeric data to cor().


Indeed, the documentation states that it takes a numeric input. It 
does not state how it would react to an inappropriate input type. That's 
why I expected either to produce an error message or accurate results. I did 
not expect an incorrect result. I should not have assumed that my 
expectations would be correct.



Having said that, I guess others have found cor() problematic when
non-valid input is supplied and so R now (as of 2.11.0) issues an
error message that 'x' must be numeric. You should always check the
latest released version to see if changes have been made. The NEWS
file for 2.11.0 contains this:
 cor() and cov() now test for misuse with non-numeric
 arguments, such as the non-bug report PR#14207.
You're doing the right thing by asking here first before reporting.
It would definitely not be a good idea to report a (non-)bug
in an outdated version of R.


Since my manipulations were simple, I assumed that others would have 
observed the same behaviour. In any case, I'm happy that the function 
now checks that its preconditions are respected. Otherwise, it would 
have been good for the documentation to state that cor() can compute 
garbage when given non-numeric data.
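
As background on how a text column can end up looking numeric at all, here
is a short sketch of how R stores factors. It is an illustration only and
does not claim to reproduce what older versions of cor() did internally;
the example values are invented.

```r
# A factor stores its values as integer codes that index into its levels.
f <- factor(c("b", "a", "c", "a"))
levels(f)        # "a" "b" "c"
as.integer(f)    # 2 1 3 1  (codes, not meaningful measurements)

# data.matrix() converts a data frame to a numeric matrix and
# replaces factors by these internal codes.
df <- data.frame(x = c(10, 136, 15, 10), f = f)
m <- data.matrix(df)
m[, "f"]         # 2 1 3 1
```

Any statistic computed on such codes depends on the alphabetical order of
the labels, which is rarely what is wanted.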


cheers,
Stephane


 -Peter Ehlers

[rest snipped; not relevant to my comments.]




[R] Correlation question

2010-09-08 Thread Stephane Vaucher


Hi everyone,

I'm observing what I believe is weird behaviour when attempting to do 
something very simple. I want a correlation matrix, but my matrix seems to 
contain correlation values that are not found when executed on pairs:



test2$P2

 [1] 2 2 4 4 1 3 2 4 3 3 2 3 4 1 2 2 4 3 4 1 2 3 2 1 3

test2$HP_tot
 [1]  10  10  10  10  10  10  10  10 136 136 136 136 136 136 136 136 136 136  15
[20]  15  15  15  15  15  15

c=cor(test2$P3,test2$HP_tot,method='spearman')

c

[1] -0.2182876

c=cor(test2,method='spearman')

Warning message:
In cor(test2, method = "spearman") : the standard deviation is zero

write(c,file='out.csv')


from my spreadsheet
-0.25028783918741

Most cells are correct, but not that one.

If this is expected behaviour, I apologise for bothering you. I read the 
documentation, but I do not know if the calculation of matrices and pairs 
is done using the same function (e.g., with respect to equal-value 
observations).
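
For all-numeric input, the matrix form and the pairwise form of cor() can
be compared directly. The sketch below uses the two-column example data
posted elsewhere in this thread (P3 and HP_tot, for which the thread
reports a Spearman correlation of -0.2182876); the variable names `pair`
and `mat` are chosen here for illustration.

```r
# The two-column example (P3, HP_tot) posted elsewhere in this thread.
test.notext <- data.frame(
  P3 = c(2L, 2L, 2L, 4L, 2L, 3L, 2L, 1L, 3L, 2L, 2L, 2L, 3L,
         1L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 2L, 2L, 1L, 2L),
  HP_tot = c(rep(10L, 8), rep(136L, 10), rep(15L, 7))
)

# Pairwise call on two vectors versus the matrix call on the data frame.
pair <- cor(test.notext$P3, test.notext$HP_tot, method = "spearman")
mat  <- cor(test.notext, method = "spearman")

# With purely numeric columns the two results agree.
all.equal(pair, mat["P3", "HP_tot"])   # TRUE
```

A mismatch between the two is therefore a hint that some column in the
data frame is not numeric.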


If this is not a desired behaviour, I noticed that it only occurs with a 
relatively large matrix (I couldn't reproduce on a simple 2 column data 
set). There might be a naming error.



names(test2)

 [1] ID                   NOMBRE               MAIL
 [4] Age                  SEXO                 Studies
 [7] Hours_Internet       Vision.Disabilities  Other.disabilities
[10] Technology_Knowledge Start_Time           End_Time
[13] Duration             P1                   P1Book
[16] P1DVD                P2                   P3
[19] P4                   P5                   P6
[22] P8                   P9                   P10
[25] P11                  P12                  P7
[28] SITE                 Errors               warnings
[31] Manual               Total                H_tot
[34] HP1.1                HP1.2                HP1.3
[37] HP1.4                HP_tot               HO1.1
[40] HO1.2                HO1.3                HO1.4
[43] HO_tot               HU1.1                HU1.2
[46] HU1.3                HU_tot               HR
[49] L_tot                LP1.1                LP1.2
[52] LP1.3                LP1.4                LP_tot
[55] LO1.1                LO1.2                LO1.3
[58] LO1.4                LO_tot               LU1.1
[61] LU1.2                LU1.3                LU_tot
[64] LR_tot               SP_tot               SP1.1
[67] SP1.2                SP1.3                SP1.4
[70] SP_tot.1             SO1.1                SO1.2
[73] SO1.3                SO1.4                SO_tot
[76] SU1.1                SU1.2                SU1.3
[79] SU_tot               SR

Thank you in advance,
Stephane Vaucher


Re: [R] Correlation question

2010-09-08 Thread Joshua Wiley

Hi,

Does your data have missing values?  I am not sure it would change
anything, but perhaps try adding:

cor(test2, method = "spearman", use = "pairwise.complete.obs")

or something of the like.  I am not sure what R does by default.  My
reasoning stems from this particular passage in the documentation:

 If ‘use’ is ‘everything’, ‘NA’s will propagate conceptually,
 i.e., a resulting value will be ‘NA’ whenever one of its
 contributing observations is ‘NA’.
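
A tiny illustration of the difference between the default
`use = "everything"` and `use = "pairwise.complete.obs"`. The vectors `x`
and `y` are invented for the example.

```r
x <- c(1, 2, 3, 4, NA)
y <- c(2, 1, 4, 3, 5)

# Default use = "everything": the NA in x propagates to the result.
cor(x, y, method = "spearman")                                 # NA

# Pairwise deletion drops the incomplete pair before ranking.
cor(x, y, method = "spearman", use = "pairwise.complete.obs")  # 0.6
```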

I do not think the names should make a difference (unless you're
talking about human error).

Best regards,

Josh

On Wed, Sep 8, 2010 at 12:35 PM, Stephane Vaucher
vauch...@iro.umontreal.ca wrote:
 Hi everyone,

 I'm observing what I believe is weird behaviour when attempting to do
 something very simple. I want a correlation matrix, but my matrix seems to
 contain correlation values that are not found when executed on pairs:

 test2$P2

  [1] 2 2 4 4 1 3 2 4 3 3 2 3 4 1 2 2 4 3 4 1 2 3 2 1 3

 test2$HP_tot

   [1]  10  10  10  10  10  10  10  10 136 136 136 136 136 136 136 136 136 136  15
  [20]  15  15  15  15  15  15

 c=cor(test2$P3,test2$HP_tot,method='spearman')

 c

 [1] -0.2182876

 c=cor(test2,method='spearman')

 Warning message:
 In cor(test2, method = "spearman") : the standard deviation is zero

 write(c,file='out.csv')

 from my spreadsheet
 -0.25028783918741

 Most cells are correct, but not that one.

 If this is expected behaviour, I apologise for bothering you, I read the
 documentation, but I do not know if the calculation of matrices and pairs is
 done using the same function (eg, with respect to equal value observations).

 If this is not a desired behaviour, I noticed that it only occurs with a
 relatively large matrix (I couldn't reproduce on a simple 2 column data
 set). There might be a naming error.

 names(test2)

  [1] ID                   NOMBRE               MAIL
  [4] Age                  SEXO                 Studies
  [7] Hours_Internet       Vision.Disabilities  Other.disabilities
 [10] Technology_Knowledge Start_Time           End_Time
 [13] Duration             P1                   P1Book
 [16] P1DVD                P2                   P3
 [19] P4                   P5                   P6
 [22] P8                   P9                   P10
 [25] P11                  P12                  P7
 [28] SITE                 Errors               warnings
 [31] Manual               Total                H_tot
 [34] HP1.1                HP1.2                HP1.3
 [37] HP1.4                HP_tot               HO1.1
 [40] HO1.2                HO1.3                HO1.4
 [43] HO_tot               HU1.1                HU1.2
 [46] HU1.3                HU_tot               HR
 [49] L_tot                LP1.1                LP1.2
 [52] LP1.3                LP1.4                LP_tot
 [55] LO1.1                LO1.2                LO1.3
 [58] LO1.4                LO_tot               LU1.1
 [61] LU1.2                LU1.3                LU_tot
 [64] LR_tot               SP_tot               SP1.1
 [67] SP1.2                SP1.3                SP1.4
 [70] SP_tot.1             SO1.1                SO1.2
 [73] SO1.3                SO1.4                SO_tot
 [76] SU1.1                SU1.2                SU1.3
 [79] SU_tot               SR

 Thank you in advance,
 Stephane Vaucher





-- 
Joshua Wiley
Ph.D. Student, Health Psychology
University of California, Los Angeles
http://www.joshuawiley.com/

Re: [R] Correlation question

2010-09-08 Thread Stephane Vaucher


Hi everyone,

First of all, thanks for the quick responses. I appreciate the help.

Before answering questions, I wanted to mention that I tested this 
behaviour on 2.3.1 and 2.10.1 on a x86_64 linux arch, and on version 2.9.0 
on a 32 bit arch.


Now for the answers (batch version):

1/ I received another message stating that I mislabeled the data in my 
previous message. That was a retranscription error. I have included a 
sample of the data causing the problem


2/  On Wed, 8 Sep 2010, Joshua Wiley wrote:


Does your data have missing values?  I am not sure it would change
anything, but perhaps try adding:



cor(test2, method = "spearman", use = "pairwise.complete.obs")


Tried, no difference. The specific pairwise comparisons do not contain 
missing values (other parts of my data, yes)



3/ From: Kjetil Halvorsen kjetilbrinchmannhalvor...@gmail.com

Do dput(test2) and copypaste the output into the email message.


I had to clean up my data before sending it (had to remove names/emails).
Problems are visible in the (Spearman) correlations of P3 with H_tot and 
HP_tot. In my correlation matrix, both are -0.25028783918741 
instead of -0.2182876. Most other correlations, however, are accurate.


Here is the data:


dput(exp)
names(exp)

 [1] P1       P2       P3       P4       P5       P6
 [7] P8       P9       P10      P11      P12      P7
[13] SITE     Errors   warnings Manual   Total    H_tot
[19] HP1.1    HP1.2    HP1.3    HP1.4    HP_tot   HO1.1
[25] HO1.2    HO1.3    HO1.4    HO_tot   HU1.1    HU1.2
[31] HU1.3    HU_tot   HR       L_tot    LP1.1    LP1.2
[37] LP1.3    LP1.4    LP_tot   LO1.1    LO1.2    LO1.3
[43] LO1.4    LO_tot   LU1.1    LU1.2    LU1.3    LU_tot
[49] LR_tot   SP_tot   SP1.1    SP1.2    SP1.3    SP1.4
[55] SP_tot.1 SO1.1    SO1.2    SO1.3    SO1.4    SO_tot
[61] SU1.1    SU1.2    SU1.3    SU_tot   SR

dput(exp)

structure(list(P1 = c(2L, 1L, 3L, 3L, 2L, 3L, 2L, 3L, 3L, 2L,
2L, 2L, 4L, 1L, 3L, 2L, 3L, 2L, 3L, 2L, 2L, 2L, 2L, 1L, 2L),
P2 = c(2L, 2L, 4L, 4L, 1L, 3L, 2L, 4L, 3L, 3L, 2L, 3L, 4L,
1L, 2L, 2L, 4L, 3L, 4L, 1L, 2L, 3L, 2L, 1L, 3L), P3 = c(2L,
2L, 2L, 4L, 2L, 3L, 2L, 1L, 3L, 2L, 2L, 2L, 3L, 1L, 2L, 1L,
1L, 1L, 2L, 1L, 2L, 2L, 2L, 1L, 2L), P4 = c(1L, 3L, 3L, 4L,
2L, 3L, 1L, 4L, 3L, 3L, 2L, 3L, 4L, 1L, 3L, 2L, 2L, 1L, 2L,
1L, 1L, 2L, 1L, 1L, 2L), P5 = c(2L, 1L, 4L, 1L, 2L, 2L, 2L,
3L, 3L, 2L, 2L, 3L, 4L, 1L, 2L, 2L, 3L, 3L, 3L, 2L, 2L, 2L,
2L, 2L, 3L), P6 = c(2L, 2L, 4L, 1L, 2L, 2L, 2L, 2L, 3L, 2L,
2L, 2L, 4L, 1L, 2L, 2L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 3L
), P8 = c(2L, 2L, 4L, 2L, 2L, 2L, 2L, 4L, 3L, 2L, 2L, 2L,
4L, 1L, 3L, 2L, 3L, 2L, 3L, 2L, 2L, 2L, 2L, 2L, 2L), P9 = c(4L,
0L, 4L, 0L, 0L, 2L, 3L, 0L, 0L, 4L, 0L, 4L, 3L, 2L, 4L, 0L,
0L, 0L, 3L, 4L, 3L, 0L, 4L, 0L, 3L), P10 = c(3L, 3L, 2L,
2L, 3L, 3L, 0L, 2L, 2L, 2L, 3L, 2L, 2L, 3L, 3L, 3L, 2L, 2L,
2L, 3L, 3L, 2L, 3L, 3L, 3L), P11 = c(1L, 1L, 2L, 2L, 1L,
1L, 0L, 2L, 2L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 1L,
1L, 1L, 1L, 1L, 1L), P12 = c(9L, 10L, 6L, 5L, 9L, 8L, 0L,
5L, 4L, 6L, 8L, 7L, 3L, 10L, 7L, 9L, 7L, 6L, 7L, 10L, 8L,
7L, 8L, 9L, 7L), P7 = structure(c(1L, 9L, 7L, 8L, 1L, 3L,
1L, 1L, 5L, 4L, 1L, 1L, 1L, 1L, 1L, 6L, 1L, 1L, 1L, 11L,
10L, 1L, 2L, 1L, 1L), .Label = c(" ", " al inicio de la págia

conexión.",
" al principio no sabes muy bien por donde empezar pero una vez aclarado es facil",
" busqueda diferente alas conocidas", " CUando el título contiene dos puntos,no responde.",
" Lass,suelen dar facilidades para buscar lo que se necesita.",
" La verdad es que está un poco confusa la web y, si la persona que tiene que acceder a ella no es experta, creo que lo tiene difícil",
" muy complicado, el poder encontrar los citados documentos, en la UNIVERSIDAD citada,",
" Ninguna", " Que no hay pestañas q
de pelicula",
" yo ninguna"), class = "factor"), SITE = c(5L, 5L, 5L,
5L, 5L, 5L, 5L, 5L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
1L, 1L, 1L, 1L, 1L, 1L, 1L), Errors = c(201L, 201L, 201L,
201L, 201L, 201L, 201L, 201L, 369L, 369L, 369L, 369L, 369L,
369L, 369L, 369L, 369L, 369L, 159L, 159L, 159L, 159L, 159L,
159L, 159L), warnings = c(164L, 164L, 164L, 164L, 164L, 164L,
164L, 164L, 447L, 447L, 447L, 447L, 447L, 447L, 447L, 447L,
447L, 447L, 490L, 490L, 490L, 490L, 490L, 490L, 490L), Manual = c(44L,
44L, 44L, 44L, 44L, 44L, 44L, 44L, 46L, 46L, 46L, 46L, 46L,
46L, 46L, 46L, 46L, 46L, 45L, 45L, 45L, 45L, 45L, 45L, 45L
), Total = c(409L, 409L, 409L, 409L, 409L, 409L, 409L, 409L,
862L, 862L, 862L, 862L, 862L, 862L, 862L, 862L, 862L, 862L,
694L, 694L, 694L, 694L, 694L, 694L, 694L), H_tot = c(11L,
11L, 11L, 11L, 11L, 11L, 11L, 11L, 140L, 140L, 140L, 140L,
140L, 140L, 140L, 140L, 140L, 140L, 21L, 21L, 21L, 21L, 21L,
21L, 21L), HP1.1 = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 4L, 4L, 4L, 4L, 4L, 4L,
4L), HP1.2 = c(0L, 0L,

Re: [R] Correlation question (from a newbie)

2009-07-15 Thread Santosh

Dear R-users please ignore my most recent posting..

Found the solution.. Thanks to David Winsemius..


Thanks,
Santosh

On Tue, Jul 14, 2009 at 9:14 PM, Santosh santosh2...@gmail.com wrote:

 Dear R-users..

 I hope the following scenario is more explanatory of my question..

 Continuous variables: AGE, WEIGHT, HEIGHT
 Categorical variables: Group, Sex, Race
 I would like to find a correlation between WEIGHT and AGE, grouped by
 Group,Sex, and Race.

 Is the following formula correct?
 tapply(dat$WEIGHT,
 by=list(dat$AGE,as.factor(dat$Group),as.factor(dat$EX),as.factor(dat$RACE)),cor)


 Thanks,
 Santosh


 On Tue, Jul 14, 2009 at 7:34 PM, Santosh santosh2...@gmail.com wrote:

 Hi R-users,

 Was wondering if there is a way to quickly compute correlations between
 continuous variables grouped by some categorical variables?

 What function do I use?

 Thanks much in advance.

 Regards,
 Santosh





[R] Correlation question (from a newbie)

2009-07-14 Thread Santosh

Hi R-users,

Was wondering if there is a way to quickly compute correlations between
continuous variables grouped by some categorical variables?

What function do I use?

Thanks much in advance.

Regards,
Santosh


Re: [R] Correlation question (from a newbie)

2009-07-14 Thread David Winsemius



On Jul 14, 2009, at 10:34 PM, Santosh wrote:


Hi R-users,

Was wondering if there is a way to quickly compute correlations between
continuous variables grouped by some categorical variables?

What function do I use?



?tapply
?by
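
A hedged sketch of how those two pointers might be applied to the question.
Note that tapply() hands the function one vector at a time, whereas cor()
needs two, so by() (or split() with sapply()) is the more natural fit. The
data below are simulated; the column names follow the question, and the
name `grouped` is chosen here for illustration.

```r
# Hypothetical data; column names follow the question in this thread.
set.seed(1)
dat <- data.frame(
  AGE    = runif(40, 20, 60),
  WEIGHT = rnorm(40, 70, 10),
  Group  = rep(c("A", "B"), each = 20),
  Sex    = rep(c("F", "M"), times = 20)
)

# by() splits the data frame by the grouping factors and passes each
# piece to the function, so both variables are available to cor().
by(dat, list(Group = dat$Group, Sex = dat$Sex),
   function(d) cor(d$WEIGHT, d$AGE))

# The same idea with split() and sapply(), returning a named vector
# with one correlation per Group/Sex combination.
grouped <- sapply(split(dat, list(dat$Group, dat$Sex), drop = TRUE),
                  function(d) cor(d$WEIGHT, d$AGE))
grouped
```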


Thanks much in advance.

Regards,
Santosh




David Winsemius, MD
Heritage Laboratories
West Hartford, CT


Re: [R] Correlation question (from a newbie)

2009-07-14 Thread Santosh

Dear R-users..

I hope the following scenario is more explanatory of my question..

Continuous variables: AGE, WEIGHT, HEIGHT
Categorical variables: Group, Sex, Race
I would like to find a correlation between WEIGHT and AGE, grouped by
Group,Sex, and Race.

Is the following formula correct?
tapply(dat$WEIGHT,
by=list(dat$AGE,as.factor(dat$Group),as.factor(dat$EX),as.factor(dat$RACE)),cor)


Thanks,
Santosh

On Tue, Jul 14, 2009 at 7:34 PM, Santosh santosh2...@gmail.com wrote:

 Hi R-users,

 Was wondering if there is a way to quickly compute correlations between
 continuous variables grouped by some categorical variables?

 What function do I use?

 Thanks much in advance.

 Regards,
 Santosh


