Hi, I have a question regarding the impact of a pilot experiment on the overall statistical interpretation of the subsequent work.
The context is as follows. A lab has one professor and three graduate students, A, B, and C. They are analyzing a disease to discover genes that differentiate two categories, (+) and (-), e.g. disease and non-disease. There are about m samples of (+) and n samples of (-). Both m and n are sufficiently large, e.g. bigger than 100.

Pilot experiment: To save effort and resources, the professor decided to pool the samples within each category in equal amounts, so that he had only two pooled samples, one (+) and one (-). His argument was that if the pooled samples showed no difference, he would abandon the project. Graduate student A did a microarray analysis of the pooled (+) and (-) samples and found that genes X, Y, and Z had fold changes bigger than 100. The professor thought this was interesting and encouraging, based on his biological insight into genes X, Y, and Z and their potential link to the disease.

(1) He asked graduate student B to do a protein analysis of all the original samples (m, n) using a different technique (western blot), and B found that gene Y is truly differential. Based on the protein analysis data, graduate student A calculated a P value using a t test to describe the statistical significance of gene Y in differentiating the (+) and (-) categories.

(2) Simultaneously, he asked graduate student C to do a full-scale microarray experiment using all m and n samples individually. It was very laborious work, but graduate student C finished everything. Using some off-the-shelf microarray statistical packages, he found genes Y, Z, and another, previously unidentified gene W to be statistically significant. He calculated the false discovery rate and P values for these genes in differentiating the (+) and (-) categories.

The professor presented the work of students A, B, and C, including the calculated statistics, at a conference.
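To make the two-stage design above concrete (screen on the pools, then test the same samples individually), here is a minimal simulation sketch. It is only an illustration under assumptions of my own, not a model of the actual experiments: no gene truly differs between (+) and (-), a pooled measurement is idealized as the exact average of the individual samples with no extra technical noise, the follow-up uses the same measurements rather than a different technique such as western blot, and the t test is replaced by a normal approximation (reasonable for m, n around 100). The names `welch_p`, `simulate`, `n_genes`, and `top_k` are hypothetical.

```python
import random
from statistics import NormalDist, fmean, variance


def welch_p(x, y):
    """Two-sided p-value for a difference in means.

    Uses a normal approximation to Welch's t test, which is an
    assumption made here to stay within the standard library.
    """
    se = (variance(x) / len(x) + variance(y) / len(y)) ** 0.5
    z = (fmean(x) - fmean(y)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate(n_genes=2000, m=100, n=100, top_k=20, alpha=0.05, seed=1):
    """Simulate pooled screening followed by per-sample testing.

    Every gene is generated under the null (no true difference).
    Returns (fraction of pilot-selected genes with p < alpha,
             fraction of all genes with p < alpha).
    """
    rng = random.Random(seed)
    results = []
    for _ in range(n_genes):
        pos = [rng.gauss(0.0, 1.0) for _ in range(m)]  # (+) samples
        neg = [rng.gauss(0.0, 1.0) for _ in range(n)]  # (-) samples
        # Idealized pilot: the pool measures the average of its samples.
        pooled_diff = abs(fmean(pos) - fmean(neg))
        results.append((pooled_diff, welch_p(pos, neg)))

    # Pilot step: keep the genes with the largest pooled difference.
    results.sort(key=lambda r: r[0], reverse=True)
    selected = results[:top_k]

    rate_selected = sum(p < alpha for _, p in selected) / top_k
    rate_all = sum(p < alpha for _, p in results) / n_genes
    return rate_selected, rate_all


rate_selected, rate_all = simulate()
print("significant among pilot-selected genes:", rate_selected)
print("significant among all genes:", rate_all)
```

Under these assumptions, the first number shows how often a per-gene test on the same samples comes out "significant" for genes picked by the pooled screen even though nothing is truly different, while the second is the usual unconditional rate over all genes. In a real pilot the pooled measurement has its own noise, and a different assay is only partly correlated with the microarray, so the effect would presumably be weaker than in this worst-case sketch.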
In the audience, statistician D commented that the professor had made a mistake: because he used the SAME samples, whether pooled or individual, in both the pilot and the subsequent experiments, statistically the professor is "cheating" and his students' calculated statistics are no longer valid.

Can statisticians on this mailing list comment on this story? One thing I want to emphasize is that nobody disputes that it is critical to use a different set of samples to validate the discoveries. The question here is whether, contingent on the pilot experiment with the pooled samples, the subsequent full-scale experiments using the SAME samples can yield meaningful statistics to describe the differences in the discovered features.

Thanks.