Re: [R] two-sample KS test: data becomes significantly different after normalization
Thank you, Chris and Martin!

On Wed, Jan 14, 2015 at 7:31:12 AM, Andrews, Chris <chri...@med.umich.edu> wrote:

> Your definition of p-value is not correct. See, for example,
> http://en.wikipedia.org/wiki/P-value#Misunderstandings
Re: [R] two-sample KS test: data becomes significantly different after normalization
Monnand <monn...@gmail.com> on Wed, 14 Jan 2015 07:17:02 + writes:

> I know this must be a wrong method, but I cannot help asking: can I use
> only the p-value from the KS test, saying that if the p-value is greater
> than \beta, then the two samples are from the same distribution? If the
> definition of p-value is the probability that the null hypothesis is
> true,

Ouch, ouch, ouch, ouch ........

The worst misuse/misunderstanding of statistics, now even on R-help ...
--- please get help from a statistician !! --
and erase that sentence from your mind (unless you are a pro and want to
keep it for anecdotal or didactical purposes...)

> then why do so few people use the p-value as a true probability? For
> example, people do not normally multiply or add p-values to get the
> probability that two independent null hypotheses are both true, or that
> one of them is true. I have had this question for a very long time.
>
> -Monnand
Re: [R] two-sample KS test: data becomes significantly different after normalization
Your definition of p-value is not correct. See, for example,
http://en.wikipedia.org/wiki/P-value#Misunderstandings

-----Original Message-----
From: Monnand [mailto:monn...@gmail.com]
Sent: Wednesday, January 14, 2015 2:17 AM
To: Andrews, Chris
Cc: r-help@r-project.org
Subject: Re: [R] two-sample KS test: data becomes significantly different after normalization

> I know this must be a wrong method, but I cannot help asking: can I use
> only the p-value from the KS test, saying that if the p-value is greater
> than \beta, then the two samples are from the same distribution? If the
> definition of p-value is the probability that the null hypothesis is
> true, then why do so few people use the p-value as a true probability?
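To make the point behind that Wikipedia link concrete, here is a small simulation of my own (not from the thread, using made-up normal samples rather than the poster's data): when two samples really are drawn from the same distribution, the two-sample KS p-value is approximately uniform on [0, 1]. A large p-value is therefore not "the probability that the null is true"; under the null it occurs about as often as any other value.

set.seed(1)
pvals <- replicate(2000, {
  x <- rnorm(100)                # both samples drawn from the *same* distribution
  y <- rnorm(100)
  ks.test(x, y)$p.value
})
hist(pvals, breaks = 20,
     main = "Two-sample KS p-values under the null", xlab = "p-value")
mean(pvals > 0.05)               # about 0.95: the test keeps its nominal error rate

The histogram is roughly flat rather than piled up near 1, which is exactly why the p-value cannot be read as the probability that the two samples share a distribution.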
Re: [R] two-sample KS test: data becomes significantly different after normalization
I know this must be a wrong method, but I cannot help asking: can I use
only the p-value from the KS test, saying that if the p-value is greater
than \beta, then the two samples are from the same distribution?

If the definition of p-value is the probability that the null hypothesis is
true, then why do so few people use the p-value as a true probability? For
example, people do not normally multiply or add p-values to get the
probability that two independent null hypotheses are both true, or that one
of them is true. I have had this question for a very long time.

-Monnand

On Tue, Jan 13, 2015 at 2:47 PM, Andrews, Chris <chri...@med.umich.edu> wrote:

> This sounds more like quality control than hypothesis testing. Rather
> than statistical significance, you want to determine what is an
> acceptable difference (an 'equivalence margin', if you will). And that is
> a question about the application, not a statistical one.
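As a quick side illustration of that last point (my own example, not part of the thread): the product of two independent null p-values is not itself uniformly distributed, so multiplying p-values does not give a quantity that can be treated as a p-value, let alone as a probability that both nulls hold.

set.seed(2)
p1 <- runif(10000)               # p-values from two independent, true null hypotheses
p2 <- runif(10000)
mean(p1 * p2 < 0.05)             # roughly 0.20, not 0.05: the product "rejects" far too often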
Re: [R] two-sample KS test: data becomes significantly different after normalization
This sounds more like quality control than hypothesis testing. Rather than
statistical significance, you want to determine what is an acceptable
difference (an 'equivalence margin', if you will). And that is a question
about the application, not a statistical one.

From: Monnand [monn...@gmail.com]
Sent: Monday, January 12, 2015 10:14 PM
To: Andrews, Chris
Cc: r-help@r-project.org
Subject: Re: [R] two-sample KS test: data becomes significantly different after normalization

> Thank you, Chris! I think it is exactly the problem you mentioned. I did
> consider at first whether 1000 data points count as a large sample. I
> down-sampled the data from 1000 points to 100 points and ran the KS test
> again; it worked as expected. Is there any typical method for comparing
> two large samples? I also tried the KL divergence, but it only gives me a
> number and does not tell me how large the distance must be to count as a
> significant difference.
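One way to act on Chris's suggestion, sketched below under assumptions of my own (the margin delta is an invented placeholder, and data refers to the two columns read in the original post), is to look at the KS distance D itself rather than at its p-value, and ask whether D is below whatever difference the application can tolerate.

data  <- read.csv('data.csv')                 # the file from the original post
delta <- 0.05                                 # hypothetical equivalence margin; must come from the application
x <- as.numeric(scale(data[[1]]))
y <- as.numeric(scale(data[[2]]))
D_obs <- unname(ks.test(x, y)$statistic)      # ties only trigger a warning about the p-value; D is still computed
D_obs
D_obs < delta                                 # TRUE would mean the difference is within the chosen margin

Where to set delta is exactly the applied, non-statistical question raised above: the statistic measures how far apart the empirical CDFs are, but it cannot say how much distance matters.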
Re: [R] two-sample KS test: data becomes significantly different after normalization
The main issue is that the original distributions are the same, you shift
the two samples *by different amounts* (about 0.01 SD), and you have a
large (n=1000) sample size. Thus the new distributions are not the same.

This is a problem with testing for equality of distributions: with large
samples, even a small deviation is significant.

Chris

-----Original Message-----
From: Monnand [mailto:monn...@gmail.com]
Sent: Sunday, January 11, 2015 10:13 PM
To: r-help@r-project.org
Subject: [R] two-sample KS test: data becomes significantly different after normalization

> My question is: how could normalization make two similar samples become
> different from each other? I can see that if two samples are different,
> normalization could make them similar. However, if two sets of data are
> similar, then intuitively, applying the same operation to both should
> keep them similar, or at least not make them differ too much.
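A generic illustration of that last sentence (simulated normal data of my own, not the poster's file): the same small shift that the KS test cannot detect at n = 100 gives an essentially zero p-value at n = 100,000.

set.seed(42)
for (n in c(100, 100000)) {
  x <- rnorm(n)
  y <- rnorm(n, mean = 0.1)      # a fixed, modest shift of 0.1 SD
  cat("n =", n, "  KS p-value =", format.pval(ks.test(x, y)$p.value), "\n")
}

The shift is identical in both runs; only the sample size changes, and with it the verdict of the test.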
Re: [R] two-sample KS test: data becomes significantly different after normalization
Thank you, Chris! I think it is exactly the problem you mentioned. I did
consider at first whether 1000 data points count as a large sample. I
down-sampled the data from 1000 points to 100 points and ran the KS test
again; it worked as expected.

Is there any typical method for comparing two large samples? I also tried
the KL divergence, but it only gives me a number and does not tell me how
large the distance must be to count as a significant difference.

Regards,
-Monnand

On Mon, Jan 12, 2015 at 9:32 AM, Andrews, Chris <chri...@med.umich.edu> wrote:

> The main issue is that the original distributions are the same, you shift
> the two samples *by different amounts* (about 0.01 SD), and you have a
> large (n=1000) sample size. Thus the new distributions are not the same.
> With large samples, even a small deviation is significant.
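For reference, the down-sampling check described above might look like the following. This is my reconstruction (the subsample size of 100 and the use of scale() come from the thread; everything else, including whether to scale before or after subsampling, is assumed), not the poster's actual code.

set.seed(7)
data <- read.csv('data.csv')                   # the file from the original post
i <- sample(nrow(data), 100)                   # keep 100 of the 1000 points per column
j <- sample(nrow(data), 100)
ks.test(scale(data[[1]])[i], scale(data[[2]])[j])

With only 100 points per sample, the test no longer has the power to flag the small shift introduced by normalization, which matches the behaviour reported above.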
[R] two-sample KS test: data becomes significantly different after normalization
Hi all,

This question is sort of related to R (I'm not sure if I used an R function
correctly), but also related to stats in general. I'm sorry if this is
considered off-topic.

I'm currently working on a data set with two sets of samples. The csv file
of the data can be found here: http://pastebin.com/200v10py

I would like to use the KS test to see if these two sets of samples are
from different distributions. I ran the following R script:

# read data from the file
data = read.csv('data.csv')
ks.test(data[[1]], data[[2]])

        Two-sample Kolmogorov-Smirnov test

data:  data[[1]] and data[[2]]
D = 0.025, p-value = 0.9132
alternative hypothesis: two-sided

The KS test shows that these two samples are very similar. (In fact, they
should come from the same distribution.) However, for various reasons,
instead of the raw values, the actual data that I will get will be
normalized (zero mean, unit variance). So I tried to normalize the raw data
I have and run the KS test again:

ks.test(scale(data[[1]]), scale(data[[2]]))

        Two-sample Kolmogorov-Smirnov test

data:  scale(data[[1]]) and scale(data[[2]])
D = 0.3273, p-value < 2.2e-16
alternative hypothesis: two-sided

The p-value becomes almost zero after normalization, indicating that these
two samples are significantly different (from different distributions).

My question is: how could normalization make two similar samples become
different from each other? I can see that if two samples are different,
normalization could make them similar. However, if two sets of data are
similar, then intuitively, applying the same operation to both should keep
them similar, or at least not make them differ too much.

I did some further analysis of the data. I also tried to normalize the data
into the [0,1] range (using the formula (x-min(x))/(max(x)-min(x))), but the
same thing happened. At first, I thought outliers might have caused this
problem (I can see that an outlier could cause this if I normalize the data
into the [0,1] range), so I deleted all data whose absolute value is larger
than 4 standard deviations, but it still didn't help. I even plotted the
eCDFs, and they *really* look the same to me, even after normalization.

Is anything wrong with my usage of the R function? Since the data contains
ties, I also tried ks.boot (http://sekhon.berkeley.edu/matching/ks.boot.html),
but I got the same result.

Could anyone help me explain why this happened? Also, any suggestions about
hypothesis testing on normalized data? (The data I have right now is
simulated data. In the real world, I cannot get the raw data, only the
normalized data.)

Regards,
-Monnand
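For completeness, the two extra checks described in the post, rescaling to [0,1] with (x-min(x))/(max(x)-min(x)) and overlaying the empirical CDFs, might be written as follows. This is my reconstruction rather than the original script, and minmax is a helper name I made up.

data <- read.csv('data.csv')                   # the file linked above
minmax <- function(x) (x - min(x)) / (max(x) - min(x))
ks.test(minmax(data[[1]]), minmax(data[[2]]))  # same pattern reported in the post: tiny p-value

plot(ecdf(as.numeric(scale(data[[1]]))),
     main = "eCDFs of the two samples after scale()")
lines(ecdf(as.numeric(scale(data[[2]]))), col = "red")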