[R] Question of the impact of the pilot experiment on the overall statistical interpretation of the subsequent work

2007-04-17 Thread Bruce Ling
Hi,

I have a question regarding the impact of a pilot experiment on the
overall statistical interpretation of the subsequent work.

The context is as follows:

In a lab there are one Professor and three graduate students, A, B, and
C.  They are analyzing a disease to discover genes that differentiate
the (+) and (-) categories, i.e. disease and non-disease.  The number of
(+) samples is about m and the number of (-) samples is about n.  Both
m and n are sufficiently large, e.g. greater than 100.

Pilot experiment:
To save effort and resources, the Professor decided to pool the samples
in each category in equal amounts, so that he had only two pooled
samples, (+) and (-).  His argument was that if the pooled samples
showed no difference, he would abandon the project.  Graduate student A
did a microarray analysis of the pooled (+) and (-) samples and found
that genes X, Y, and Z had fold changes greater than 100.

The Professor found this interesting and encouraging, given his
biological insight into genes X, Y, and Z and their potential link to
the disease.
(1) He asked graduate student B to do a protein analysis of all the
original samples (m, n), using a different technique (western blot); B
found that gene Y is truly differential.  Based on the protein analysis
data, graduate student A calculated a P value using a t test to describe
the statistical significance of gene Y in differentiating the (+) and
(-) categories.
(2) Simultaneously, he asked graduate student C to do a full-scale
microarray experiment using all m, n samples individually.  This was
very laborious, but graduate student C finished everything and, using
some off-the-shelf microarray statistical packages, found genes Y, Z,
and another unidentified gene, W, to be statistically significant.  He
calculated the false discovery rate and P values for these genes in
differentiating the (+) and (-) categories.
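
For readers less familiar with the false discovery rate mentioned in
(2), here is a minimal sketch of one common way to compute it. It
assumes the Benjamini-Hochberg procedure; the story does not say which
method or package student C used, and the P values below are made up
purely for illustration.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P values, in the input order."""
    n = len(pvalues)
    # Gene indices ordered from smallest to largest P value.
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw P values for five genes (not data from the story).
raw_p = [0.001, 0.008, 0.039, 0.041, 0.60]
adj_p = benjamini_hochberg(raw_p)
print(adj_p)
```

A gene would be reported at, say, a 5% FDR if its adjusted value is at
most 0.05. Note that the adjustment depends on how many genes were
tested in total, which is why the set of genes entering the analysis
matters.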

The Professor presented the work of students A, B, and C, including the
calculated statistics, at a conference.  In the audience, statistician D
commented that the Professor had made a mistake: because he used the
SAME samples, whether pooled or individual, in both the pilot and the
subsequent experiments, the Professor is statistically cheating and his
students' calculated statistics are no longer valid.

Can statisticians on this mailing list comment on this story?  One thing
I want to emphasize is that nobody disputes that it is critical to use a
different set of samples to validate the discoveries.  The question is
whether, conditional on the pilot experiment with the pooled samples,
the subsequent full-scale experiments using the SAME samples can yield
meaningful statistics to describe the differences in the discovered
features.
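
To make the question concrete, here is a small simulation sketch of the
scenario statistician D describes. It is not the lab's analysis and
rests on strong simplifying assumptions: every gene is null (no true
difference), expression values are standard normal, the pooled
measurement equals the mean of the individual measurements, and the
follow-up uses the same assay as the pilot. The gene count, the number
picked, and the cutoff of 1.97 (roughly the two-sided 5% point of a t
distribution with about 200 degrees of freedom) are all illustrative
choices.

```python
import math
import random
import statistics


def welch_t(a, b):
    """Welch's two-sample t statistic for two lists of numbers."""
    va, vb = statistics.variance(a), statistics.variance(b)
    diff = statistics.fmean(a) - statistics.fmean(b)
    return diff / math.sqrt(va / len(a) + vb / len(b))


def simulate(n_genes=300, m=100, n=100, top_k=3, reps=10,
             t_crit=1.97, seed=1):
    """Compare t tests on the pilot samples vs. on fresh samples."""
    rng = random.Random(seed)
    same_hits = fresh_hits = 0
    for _ in range(reps):
        # Individual samples for every gene; all genes are null here.
        pos = [[rng.gauss(0, 1) for _ in range(m)] for _ in range(n_genes)]
        neg = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(n_genes)]
        # Pilot: assume the pooled sample behaves like the group mean.
        pooled = [statistics.fmean(p) - statistics.fmean(q)
                  for p, q in zip(pos, neg)]
        picked = sorted(range(n_genes), key=lambda g: abs(pooled[g]),
                        reverse=True)[:top_k]
        for g in picked:
            # Follow-up on the SAME samples that drove the selection.
            if abs(welch_t(pos[g], neg[g])) > t_crit:
                same_hits += 1
            # Follow-up on fresh, independent samples of the same gene.
            fresh_pos = [rng.gauss(0, 1) for _ in range(m)]
            fresh_neg = [rng.gauss(0, 1) for _ in range(n)]
            if abs(welch_t(fresh_pos, fresh_neg)) > t_crit:
                fresh_hits += 1
    total = reps * top_k
    return same_hits / total, fresh_hits / total


same_rate, fresh_rate = simulate()
print("rejection rate, same samples: ", same_rate)
print("rejection rate, fresh samples:", fresh_rate)
```

Under these assumptions, the genes picked from the pooled comparison
tend to look significant when retested on the same samples even though
nothing is truly different, while on fresh samples they reject at
roughly the nominal rate. Whether this carries over to the western blot
in (1), or to a genome-wide analysis like (2) that does not restrict
itself to the pilot's genes, is exactly the point in question.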

Thanks.

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Question of the impact of the pilot experiment on the overall statistical interpretation of the subsequent work

2007-04-17 Thread Charles C. Berry

Bruce,

You ask for comment on an issue of statistical interpretation of a
collection of biological experiments.

Your university (judging from your email handle) has one of the best 
statistics departments in the world and some of the best biostatisticians 
in the galaxy.

You would do far better (than posting your query here) to take your issue 
up with one of them.

In a face-to-face meeting with one of them you will get a much better 
analysis and discussion of the issues than you could hope for from a list 
like this, notwithstanding that some statisticians sometimes provide 
thoughtful answers to posts asking for statistical help.

Chuck

Charles C. Berry                    (858) 534-2098
                                    Dept of Family/Preventive Medicine
E mailto:[EMAIL PROTECTED]          UC San Diego
http://biostat.ucsd.edu/~cberry/    La Jolla, San Diego 92093-0901
