Re: [R] two-sample KS test: data becomes significantly different after normalization

2015-01-15 Thread Monnand
Thank you, Chris and Martin!

Re: [R] two-sample KS test: data becomes significantly different after normalization

2015-01-14 Thread Martin Maechler
 Monnand monn...@gmail.com
     on Wed, 14 Jan 2015 07:17:02 +0000 writes:

 I know this must be a wrong method, but I cannot help asking: can I use
 only the p-value from the KS test, saying that if the p-value is greater
 than \beta, then the two samples are from the same distribution? If the
 definition of a p-value is the probability that the null hypothesis is
 true,

Ouch, ouch, ouch, ouch ...

The worst misuse/misunderstanding of statistics, now even on R-help ...

--- please get help from a statistician!!

-- and erase that sentence from your mind (unless you are a pro
and want to keep it for anecdotal or didactical purposes...)


Re: [R] two-sample KS test: data becomes significantly different after normalization

2015-01-14 Thread Andrews, Chris
Your definition of p-value is not correct.  See, for example, 
http://en.wikipedia.org/wiki/P-value#Misunderstandings 
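
To see why that reading cannot be right, here is an illustrative simulation
sketch (not part of the original message): when the null hypothesis is true
by construction, the p-value is roughly uniform on [0,1], so it cannot be
"the probability that H0 is true", which is 1 in every replication here.

## Both samples really are N(0,1), so H0 holds every time.
set.seed(42)
pvals <- replicate(2000, ks.test(rnorm(100), rnorm(100))$p.value)
hist(pvals)          # roughly flat over [0,1]
mean(pvals < 0.05)   # ~0.05: the type I error rate, by construction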

Re: [R] two-sample KS test: data becomes significantly different after normalization

2015-01-13 Thread Monnand
I know this must be a wrong method, but I cannot help asking: can I use only
the p-value from the KS test, saying that if the p-value is greater than
\beta, then the two samples are from the same distribution? If the definition
of a p-value is the probability that the null hypothesis is true, then why do
so few people use p-values as true probabilities? For example, people normally
will not multiply or add p-values to get the probability that two independent
null hypotheses are both true, or that at least one of them is true. I have
had this question for a very long time.
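
A small simulation sketch (illustrative, not part of the original mail) shows
what goes wrong with the naive product, and what a calibrated combination of
independent p-values looks like instead:

## Two independent true nulls: each p-value is roughly uniform on [0,1].
set.seed(7)
p1 <- replicate(5000, ks.test(rnorm(50), rnorm(50))$p.value)
p2 <- replicate(5000, ks.test(rnorm(50), rnorm(50))$p.value)

mean(p1 * p2 < 0.05)   # ~0.20, not 0.05: the product is not itself a p-value

## Fisher's method is a calibrated way to combine independent p-values:
## -2*(log(p1) + log(p2)) ~ chi-squared with 4 df when both nulls hold.
p_fisher <- pchisq(-2 * (log(p1) + log(p2)), df = 4, lower.tail = FALSE)
mean(p_fisher < 0.05)  # ~0.05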

-Monnand

Re: [R] two-sample KS test: data becomes significantly different after normalization

2015-01-13 Thread Andrews, Chris
This sounds more like quality control than hypothesis testing.  Rather than 
statistical significance, you want to determine what is an acceptable 
difference (an 'equivalence margin', if you will).  And that is a question 
about the application, not a statistical one.
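
One illustrative sketch of such a check (the margin, the bootstrap scheme, and
the simulated samples are all assumptions made here for illustration, not a
prescribed procedure): instead of testing whether D = 0, bootstrap an upper
confidence bound for the KS distance D and declare the samples equivalent only
if that bound falls below the pre-chosen margin.

## x and y stand in for the two (normalized) samples.
set.seed(2)
x <- rnorm(1000)
y <- rnorm(1000)

delta <- 0.10   # acceptable difference: a domain decision, not a statistical one
d_boot <- replicate(1000, {
  xs <- sample(x, replace = TRUE)
  ys <- sample(y, replace = TRUE)
  suppressWarnings(ks.test(xs, ys))$statistic
})
quantile(d_boot, 0.95)           # bootstrap upper bound for D
quantile(d_boot, 0.95) < delta   # TRUE -> within the acceptable margin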


Re: [R] two-sample KS test: data becomes significantly different after normalization

2015-01-12 Thread Andrews, Chris

The main issue is that the original distributions are the same, but
normalizing each sample by its own mean and SD shifts the two samples *by
different amounts* (about 0.01 SD), and you have a large (n=1000) sample
size.  Thus the new distributions are not the same.

This is a problem with testing for equality of distributions: with large
samples, even a small deviation is significant.
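
To make this concrete, here is a minimal simulation sketch (made-up discrete
data, not the csv from the thread; the heavy ties are an assumption, matching
the poster's remark that the data contain ties). The raw samples share their
tied values exactly, but scale() centers and rescales each sample by its *own*
mean and SD, so after scaling the atoms sit at slightly different positions,
the eCDF steps no longer line up, and D jumps.

## Two samples of heavily tied integer values from the same distribution.
set.seed(1)
x <- sample(0:9, 1000, replace = TRUE)
y <- sample(0:9, 1000, replace = TRUE)

suppressWarnings(ks.test(x, y))               # raw: tied values coincide, D small
suppressWarnings(ks.test(scale(x), scale(y))) # scaled: atoms misaligned, D large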

Chris



Re: [R] two-sample KS test: data becomes significantly different after normalization

2015-01-12 Thread Monnand
Thank you, Chris!

I think it is exactly the problem you mentioned. I did not think of
1000-point data as a large sample at first.

I down-sampled the data from 1000 points to 100 points and ran the KS test
again. It worked as expected. Is there any typical method for comparing
two large samples? I also tried the KL divergence, but it only gives me a
number and does not tell me how large the distance must be before the
samples should be considered significantly different.

Regards,
-Monnand



[R] two-sample KS test: data becomes significantly different after normalization

2015-01-11 Thread Monnand
Hi all,

This question is sort of related to R (I'm not sure if I used an R function
correctly), but it is also related to stats in general. I'm sorry if this is
considered off-topic.

I'm currently working on a data set with two sets of samples. The csv file
of the data can be found here: http://pastebin.com/200v10py

I would like to use the KS test to see whether these two sets of samples are
from different distributions.

I ran the following R script:

# read data from the file
> data = read.csv('data.csv')
> ks.test(data[[1]], data[[2]])

        Two-sample Kolmogorov-Smirnov test

data:  data[[1]] and data[[2]]
D = 0.025, p-value = 0.9132
alternative hypothesis: two-sided

The KS test shows that these two samples are very similar. (In fact, they
should come from the same distribution.)

However, for various reasons, the actual data I will get will be normalized
(zero mean, unit variance) instead of the raw values. So I tried to normalize
the raw data I have and ran the KS test again:

> ks.test(scale(data[[1]]), scale(data[[2]]))

        Two-sample Kolmogorov-Smirnov test

data:  scale(data[[1]]) and scale(data[[2]])
D = 0.3273, p-value < 2.2e-16
alternative hypothesis: two-sided

The p-value becomes almost zero after normalization, indicating that these two
samples are significantly different (from different distributions).

My question is: how could normalization make two similar samples become
different from each other? I can see that if two samples are different,
normalization could make them similar. However, if two sets of data are
similar, then intuitively, applying the same operation to both should leave
them similar, or at least not too different from each other.

I did some further analysis of the data. I also tried to normalize the data
into the [0,1] range (using the formula (x-min(x))/(max(x)-min(x))), but the
same thing happened. At first, I thought outliers might have caused the
problem (I can see that an outlier could cause this if I normalize the data
into the [0,1] range), so I deleted all data whose absolute value is larger
than 4 standard deviations. But it still didn't help.

Plus, I even plotted the eCDFs; they *really* look the same to me even
after normalization. Is anything wrong with my usage of the R function?

Since the data contains ties, I also tried ks.boot (
http://sekhon.berkeley.edu/matching/ks.boot.html ), but I got the same
result.
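
For reference, the call looks something like this (assuming the Matching
package from the link above; 'data' is the data frame read in earlier):

library(Matching)   # install.packages("Matching") if needed
kb <- ks.boot(as.numeric(scale(data[[1]])), as.numeric(scale(data[[2]])),
              nboots = 1000)
kb$ks.boot.pvalue   # bootstrap p-value, valid in the presence of ties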

Could anyone help me explain why this happened? Also, do you have any
suggestions about hypothesis testing on normalized data? (The data I have
right now are simulated. In the real world, I cannot get the raw data, only
the normalized data.)

Regards,
-Monnand
