Re: How to select a distribution?

2000-10-21 Thread Herman Rubin

In article [EMAIL PROTECTED],
Robert J. MacG. Dawson [EMAIL PROTECTED] wrote:


Herman Rubin wrote:

 In article 8smcpv$41r$[EMAIL PROTECTED],
 Choi, Young Sung [EMAIL PROTECTED] wrote:
 I am a statistically poor researcher and have a statistical problem.

 I have two candidate distributions, A(theta1) and B(theta1, theta2), to model
 my data.
 Then how should I determine the best distribution for my data?
 Could you suggest an easy book that explains how to select a distribution when
 making a probability model, and how to test the goodness of the selected
 distribution against other ones?

 The decision as to what probability models are appropriate
 must come from understanding your subject, not from any
 use of simple distributions from probability or statistics
 textbooks.  Above all, do not use what you know or do not
 know about statistical methods to influence this stage; a
 good statistician might be able to tell you that certain
 assumptions are NOT important, but as a statistician he must
 not suggest a model.  However, he may be able to ask you
 the questions which must be answered to produce a good model.

   Herman's advice may be good in "mature" disciplines in which the
processes introducing randomness are truly and completely understood. 
Thermodynamics, for instance, or... I'm sure there was another one
somewhere?

   But what if one wants to model (say) rainfall, human heights, or the
number of ticks on a sheep? By the time one has a complete enough
understanding of meteorology, human growth processes, or tick ecology to
come up with an _a_priori_ model that one trusts at least as well as
one trusts the data, one doesn't really need to do statistics
any more, just probability theory. (As in thermodynamics...)

There are always constants to be estimated.  Also, that
great a trust of the data is rare; the only accurate
characterization of outliers is that of observations which
are not covered by the model.  If the observations were
fully trusted, there would be no such thing as an outlier.

   Suppose somebody *does* come up with a theoretical argument that shows
that (say) birth weights ought to be normally distributed. And suppose
the data disagree?

I would be very surprised if someone came up with a good
theoretical argument to say that ANYTHING is normally
distributed; at best it could be approximately normal.
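
To make that concrete, here is a minimal sketch in Python (the numbers
are simulated stand-ins for birth weights, not real data) of checking a
proposed normal model against observations.  With a large sample, a
formal goodness-of-fit test will reject exact normality almost
automatically; the interesting question is how far the data sit from
the best-fitting normal curve.

# Minimal sketch with made-up data: exact normality will be rejected for
# a large sample, so look at the size of the discrepancy, not just the p-value.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Stand-in for birth weights (grams): a normal bulk plus a slight skew.
weights = rng.normal(3400, 500, size=5000) + rng.gamma(2.0, 50.0, size=5000)

stat, p = stats.normaltest(weights)            # D'Agostino-Pearson test
print(f"normality test: stat={stat:.1f}, p={p:.3g}")

# A more useful summary: the maximum distance between the empirical CDF and
# the CDF of the best-fitting normal (how non-normal, not merely whether).
mu, sigma = weights.mean(), weights.std(ddof=1)
d, _ = stats.kstest(weights, 'norm', args=(mu, sigma))
print(f"max CDF discrepancy from fitted normal: {d:.3f}")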

What should one do? It would seem as if Herman's
advice would lead one to say either "Then so much the worse for the
data", or "That is what comes of trying to do statistics when one is not
yet infallible", or at most "As our theoretical model does not fit the
data, we cannot proceed and will go out to the pub instead." 

One can have lots of theoretical models, including
approximate normality.  But beware of making too many
assumptions.  Sometimes it matters, and sometimes it
does not.  The early scientific investigators looked
for mathematically simple relations, but they had those
few "laws" in mind, and that is still a theoretical
construction of the laws.  Planck's law of radiation,
a much better fit than either of the two laws, valid at
high and low frequencies respectively, that he was
interpolating between, was not obtained from the data,
and neither were the two laws that gave the approximate
fits.  They were obtained from theoretical arguments,
of course informed by previous studies.

Poor fits send people to reexamine their theories.  
In the case of the laws for imperfect gases, simple
theories gave fair fits, but while the data were
adequate to show that the theories were not quite
right, they were not adequate to come up with better
ones.  Quantitative fits required the use of better
nuclear theory.

   I would argue that in _most_ areas where statistics is needed, there
are not theories capable of justifying a certain model _a_priori_ and
there will never be. (There may be theories capable of justifying an
approximate model, but as argued above such a model must still be tested
to see if it works!)  Thus, in reality, the "understanding of your
subject" will reduce to using the distribution that your colleagues used
last year. And why did _they_ use it? Eventually, either because it fit
some related data set or for some worse reason.

We will never have exact theories; this went out in
physics with relativity and quantum mechanics.  However,
it seems that the social scientists believe that they
can do it, based on the normal distribution.  

   I would certainly agree that one must not choose models in the teeth of
the data _because_ they are simple, and one must not accept models
merely because one has a small and toothless data set that has not got
the power to defend itself against baseless allegations.   However, if
one has a large enough data set that one can say that any model that
fits it must be very _close_ to a certain simple model, I do not see the
harm (and I do see the utility) of using that simple model.

What about Ptolemaic astronomy?  It depends on what one
means by 

Re: How to select a distribution?

2000-10-21 Thread Eric Bohlman

Herman Rubin [EMAIL PROTECTED] wrote:
 As we get more complex situations, like those happening
 in biology, and especially in the social sciences, it is
 necessary to consider that models may have substantial
 errors and still be "accepted", as one can only get some
 understanding by using models.

"All models are wrong.  Some models are useful."
  -- George Box

I think what a lot of people forget (or never realized in the first place)
is that a model is by definition an oversimplification of the state of
nature.  A model that fit perfectly would be of no use, as it would be
just as complicated as the state of nature itself.  As Stephen Jay Gould
pointed out in his discussion of factor analysis in _The Mismeasure of
Man_, when we build models we are *deliberately* throwing out
*information* (not just "noise") in the hopes that we can deal
conceptually with what remains.  We really can't do otherwise simply
because our brains aren't infinitely powerful.  But we have to remember
that that's what we're doing, and (again a major point of Gould's)
disabuse ourselves of the notion that we're discovering something that's
more real than the real world.  Models are not Platonic ideals.  They are
conceptual shortcuts, heuristics if you will.  They help us cope with
uncertainty, but do not make it magically disappear.

(I find phraseology like "this data was generated by that model" extremely
offensive, as it subtly plays into both the Platonic-ideal notion and the
postmodern notion that reality is purely a social or linguistic
construct.)






Re: How to select a distribution?

2000-10-20 Thread Robert J. MacG. Dawson



Herman Rubin wrote:
 
 In article 8smcpv$41r$[EMAIL PROTECTED],
 Choi, Young Sung [EMAIL PROTECTED] wrote:
 I am a statistically poor researcher and have a statistical problem.
 
 I have two candidate distributions, A(theta1) and B(theta1, theta2), to model
 my data.
 Then how should I determine the best distribution for my data?
 Could you suggest an easy book that explains how to select a distribution when
 making a probability model, and how to test the goodness of the selected
 distribution against other ones?
 
 The decision as to what probability models are appropriate
 must come from understanding your subject, not from any
 use of simple distributions from probability or statistics
 textbooks.  Above all, do not use what you know or do not
 know about statistical methods to influence this stage; a
 good statistician might be able to tell you that certain
 assumptions are NOT important, but as a statistician he must
 not suggest a model.  However, he may be able to ask you
 the questions which must be answered to produce a good model.

Herman's advice may be good in "mature" disciplines in which the
processes introducing randomness are truly and completely understood. 
Thermodynamics, for instance, or... I'm sure there was another one
somewhere?

But what if one wants to model (say) rainfall, human heights, or the
number of ticks on a sheep? By the time one has a complete enough
understanding of meteorology, human growth processes, or tick ecology to
come up with an _a_priori_ model that one trusts at least as well as
one trusts the data, one doesn't really need to do statistics
any more, just probability theory. (As in thermodynamics...)

Suppose somebody *does* come up with a theoretical argument that shows
that (say) birth weights ought to be normally distributed. And suppose
the data disagree? What should one do? It would seem as if Herman's
advice would lead one to say either "Then so much the worse for the
data", or "That is what comes of trying to do statistics when one is not
yet infallible", or at most "As our theoretical model does not fit the
data, we cannot proceed and will go out to the pub instead." 

I would argue that in _most_ areas where statistics is needed, there
are not theories capable of justifying a certain model _a_priori_ and
there will never be. (There may be theories capable of justifying an
approximate model, but as argued above such a model must still be tested
to see if it works!)  Thus, in reality, the "understanding of your
subject" will reduce to using the distribution that your colleagues used
last year. And why did _they_ use it? Eventually, either because it fit
some related data set or for some worse reason.

I would certainly agree that one must not choose models in the teeth of
the data _because_ they are simple, and one must not accept models
merely because one has a small and toothless data set that has not got
the power to defend itself against baseless allegations.   However, if
one has a large enough data set that one can say that any model that
fits it must be very _close_ to a certain simple model, I do not see the
harm (and I do see the utility) of using that simple model.

With small data sets, unless one has a model justified by a larger and
closely related data set, nonparametric or robust techniques are safer.
For very small data sets, in many cases, you cannot proceed and should
go off to the pub...
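
For instance, here is a minimal sketch in Python (made-up numbers; the
hypothesized value of 4.0 is arbitrary) of what "nonparametric or
robust" can mean in practice for a small sample with one suspicious
observation:

# Minimal sketch, invented data: with one suspicious value in a small sample,
# compare a parametric test with a distribution-free alternative.
import numpy as np
from scipy import stats

x = np.array([4.1, 3.8, 5.0, 4.4, 12.6, 4.0, 4.7, 3.9])   # note the 12.6

# Parametric route: one-sample t-test of the mean against 4.0.
t_stat, t_p = stats.ttest_1samp(x, popmean=4.0)

# Distribution-free route: Wilcoxon signed-rank test on the differences.
w_stat, w_p = stats.wilcoxon(x - 4.0)

print(f"t-test   p = {t_p:.3f}  (uses mean and SD, so the 12.6 matters a lot)")
print(f"Wilcoxon p = {w_p:.3f}  (uses only signs and ranks of the differences)")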

-Robert Dawson





Re: How to select a distribution?

2000-10-19 Thread dennis roberts

as a general strategy ... you apply both models to the observed data ... 
look at the (squared) residuals of the fits to the real data points ... and 
see which model produces the smaller amount of squared error ...

sometimes this is rather obvious if you look at the data ... for example, 
what if you have a relationship graph ... X on the baseline and Y on the 
vertical ... and it has a curvilinear look to it ... kind of like a banana 
plot ...

you could try fitting a straight line to the data ... find the squared 
residuals ... then go to a fancier exponential equation ... find the squared 
residuals ... and we would see in this case that the fancier model produces 
smaller errors, on average ...

now, this does not give you the BEST model perhaps, but it is the strategy 
one uses (iterating) to converge on what seems to be the best (model) you 
can do
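
as a rough sketch of that compare-the-residuals idea (python, with invented 
data ... none of these numbers come from the original question) ... fit both 
curves by least squares and see which leaves the smaller sum of squared 
errors ...

# rough sketch with invented data: fit a straight line and an exponential
# curve to the same points and compare the sums of squared residuals
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(1)
x = np.linspace(0, 5, 40)
y = 2.0 * np.exp(0.5 * x) + rng.normal(0, 1.0, size=x.size)   # banana-ish data

# model 1: straight line  y = a + b*x  (ordinary least squares)
b, a = np.polyfit(x, y, 1)
sse_line = np.sum((y - (a + b * x)) ** 2)

# model 2: exponential  y = c * exp(d*x), fitted by nonlinear least squares
def expo(x, c, d):
    return c * np.exp(d * x)

(c, d), _ = curve_fit(expo, x, y, p0=(1.0, 0.1))
sse_expo = np.sum((y - expo(x, c, d)) ** 2)

print(f"sum of squared residuals, line:        {sse_line:.1f}")
print(f"sum of squared residuals, exponential: {sse_expo:.1f}")
# the fancier model will usually fit better in-sample ... whether the gain
# is worth the extra complexity is a separate question (see the nested-model
# discussion elsewhere in this thread)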

At 05:53 PM 10/19/00 +0900, Choi, Young Sung wrote:
I am a statistically poor researcher and have a statistical problem.

I have two candidate distributions, A(theta1) and B(theta1, theta2), to model
my data.
Then how should I determine the best distribution for my data?
Could you suggest an easy book that explains how to select a distribution when
making a probability model, and how to test the goodness of the selected
distribution against other ones?

Thanks in advance.







Re: How to select a distribution?

2000-10-19 Thread Rich Ulrich

On Thu, 19 Oct 2000 17:53:41 +0900, "Choi, Young Sung"
[EMAIL PROTECTED] wrote:

 I am a statistically poor researcher and have a statistical problem.
 
 I have two candidate distributions, A(theta1) and B(theta1, theta2), to model
 my data.
 Then how should I determine the best distribution for my data?
 Could you suggest an easy book that explains how to select a distribution when
 making a probability model, and how to test the goodness of the selected
 distribution against other ones?
 

"Data Analysis, A Model Comparison Approach" by Judd and McClelland.

What you describe, assuming your notation is intentional, is a nesting
of one model within another.  So the one with the greater number of
parameters will have the better "fit" (at least, no worse) in an
absolute sense, and the question is whether the fit achieved by using
the extra parameters improves more than you should expect for that
increase in parameters.

Assume that "fit" is measured by finding parameters satisfying
least-squares error, or by maximum likelihood.  (There are
other possibilities, but a similar logic generally applies.)
If we further assume independence and homogeneity, then the
improvement can be tested.  Testing is often done with an F-test
that uses the number of added parameters as the number of
"degrees of freedom" in the numerator.  Various texts will have
this as the "Chow" test.
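
For concreteness, here is a minimal sketch in Python (invented data; a
quadratic term stands in for the extra parameter theta2, and "fit" is
taken to mean least squares) of the nested-model F-test described
above:

# Minimal sketch, invented data: compare a reduced model to a full model
# that contains it, using
#   F = ((SSE_reduced - SSE_full) / q) / (SSE_full / (n - p_full)),
# where q is the number of added parameters (the numerator df).
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 60
x = rng.uniform(0, 10, size=n)
y = 1.0 + 0.8 * x + 0.05 * x**2 + rng.normal(0, 1.0, size=n)

def sse(X, y):
    # Least-squares fit of y on the columns of X; return the residual SS.
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum((y - X @ beta) ** 2)

X_reduced = np.column_stack([np.ones(n), x])          # theta1-only analogue
X_full    = np.column_stack([np.ones(n), x, x**2])    # adds a theta2 analogue

sse_r, sse_f = sse(X_reduced, y), sse(X_full, y)
q = X_full.shape[1] - X_reduced.shape[1]              # added parameters = 1
df_resid = n - X_full.shape[1]

F = ((sse_r - sse_f) / q) / (sse_f / df_resid)
p = stats.f.sf(F, q, df_resid)
print(f"F({q}, {df_resid}) = {F:.2f}, p = {p:.4f}")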

Finally, you SELECT a distribution according to what sense it makes,
and what purpose is served, and whether any good purpose is served by
using the more complex parameterization.  In some fashion, you need to
justify the complexity or other costs of using more parameters.  See
Robert Abelson, "Statistics as Principled Argument."

-- 
Rich Ulrich, [EMAIL PROTECTED]
http://www.pitt.edu/~wpilib/index.html





Re: How to select a distribution?

2000-10-19 Thread Herman Rubin

In article 8smcpv$41r$[EMAIL PROTECTED],
Choi, Young Sung [EMAIL PROTECTED] wrote:
I am a statistically poor researcher and have a statistical problem.

I have two candidate distributions, A(theta1) and B(theta1, theta2), to model
my data.
Then how should I determine the best distribution for my data?
Could you suggest an easy book that explains how to select a distribution when
making a probability model, and how to test the goodness of the selected
distribution against other ones?

The decision as to what probability models are appropriate
must come from understanding your subject, not from any
use of simple distributions from probability or statistics
textbooks.  Above all, do not use what you know or do not
know about statistical methods to influence this stage; a
good statistician might be able to tell you that certain
assumptions are NOT important, but as a statistician he must
not suggest a model.  However, he may be able to ask you
the questions which must be answered to produce a good model.

As for the choice among formulated models, I suggest you
consult a statistician after the models are formulated.
In this generality, it is not possible to advise you in
a short article.

-- 
This address is for information only.  I do not claim that these views
are those of the Statistics Department or of Purdue University.
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907-1399
[EMAIL PROTECTED] Phone: (765)494-6054   FAX: (765)494-0558

