Re: How to select a distribution?
In article [EMAIL PROTECTED], Robert J. MacG. Dawson [EMAIL PROTECTED] wrote:

Herman Rubin wrote: In article 8smcpv$41r$[EMAIL PROTECTED], Choi, Young Sung [EMAIL PROTECTED] wrote: I am a statistically poor researcher and have a statistical problem. I have two candidate distributions, A(theta1) and B(theta1, theta2), to model my data. How should I determine the best distribution for my data? Please suggest an easy book that explains how to select a distribution when making a probability model and how to test the goodness of the selected distribution against other candidates.

The decision as to what probability models are appropriate must come from understanding your subject, not from any use of simple distributions from probability or statistics textbooks. Above all, do not let what you know or do not know about statistical methods influence this stage; a good statistician might be able to tell you that certain assumptions are NOT important, but, as a statistician, must not suggest a model. However, he may be able to ask you the questions which must be answered to produce a good model.

Herman's advice may be good in "mature" disciplines in which the processes introducing randomness are truly and completely understood. Thermodynamics, for instance, or... I'm sure there was another one somewhere? But what if one wants to model (say) rainfall, human heights, or the number of ticks on a sheep? By the time one has a complete enough understanding of meteorology, human growth processes, or tick ecology to come up with an _a_priori_ model that one trusts at least as well as one trusts the data, one doesn't really need to do statistics any more, just probability theory. (As in thermodynamics...)

There are always constants to be estimated. Also, that great a trust of the data is rare; the only accurate characterization of outliers is that they are observations which are not covered by the model. If the observations were fully to be trusted, there would be no such thing as an outlier.
Suppose somebody *does* come up with a theoretical argument that shows that (say) birth weights ought to be normally distributed. And suppose the data disagree?

I would be very surprised if someone came up with a good theoretical argument that ANYTHING is exactly normally distributed; at best it could be approximately normal.

What should one do? It would seem as if Herman's advice would lead one to say either "Then so much the worse for the data", or "That is what comes of trying to do statistics when one is not yet infallible", or at most "As our theoretical model does not fit the data, we cannot proceed and will go out to the pub instead."

One can have lots of theoretical models, including approximate normality. But beware of making too many assumptions. Sometimes it matters, and sometimes it does not. The early scientific investigators looked for mathematically simple relations, but they had these few "laws" in their minds. This is still a theoretical construction of the laws. Planck's law of radiation, a much better fit than either of the two laws (valid at high and low frequencies respectively) between which he was interpolating, was not obtained from the data, and neither were the two laws giving approximate fits. They were obtained from theoretical arguments, of course informed by previous studies. Poor fits send people back to reexamine their theories. In the case of the laws for imperfect gases, simple theories gave fair fits, but while the data were adequate to show that the theories were not quite right, they were not adequate to come up with better ones. Quantitative fits required the use of better nuclear theory.

I would argue that in _most_ areas where statistics is needed, there are no theories capable of justifying a certain model _a_priori_, and there never will be. (There may be theories capable of justifying an approximate model, but as argued above such a model must still be tested to see if it works!)
Thus, in reality, the "understanding of your subject" will reduce to using the distribution that your colleagues used last year. And why did _they_ use it? Eventually, either because it fit some related data set or for some worse reason.

We will never have exact theories; that idea went out of physics with relativity and quantum mechanics. However, it seems that the social scientists believe that they can do it, based on the normal distribution.

I would certainly agree that one must not choose models in the teeth of the data _because_ they are simple, and one must not accept models merely because one has a small and toothless data set that has not got the power to defend itself against baseless allegations. However, if one has a large enough data set that one can say that any model that fits it must be very _close_ to a certain simple model, I do not see the harm (and I do see the utility) of using that simple model.

What about Ptolemaic astronomy? It depends on what one means by
Re: How to select a distribution?
Herman Rubin [EMAIL PROTECTED] wrote: As we get more complex situations, like those happening in biology, and especially in the social sciences, it is necessary to consider that models may have substantial errors and still be "accepted", as one can only get some understanding by using models.

"All models are wrong. Some models are useful." -- George Box

I think what a lot of people forget (or never realized in the first place) is that a model is by definition an oversimplification of the state of nature. A model that fit perfectly would be of no use, as it would be just as complicated as the state of nature itself. As Stephen Jay Gould pointed out in his discussion of factor analysis in _The Mismeasure of Man_, when we build models we are *deliberately* throwing out *information* (not just "noise") in the hope that we can deal conceptually with what remains. We really can't do otherwise, simply because our brains aren't infinitely powerful. But we have to remember that that's what we're doing, and (again a major point of Gould's) disabuse ourselves of the notion that we're discovering something that's more real than the real world.

Models are not Platonic ideals. They are conceptual shortcuts, heuristics if you will. They help us cope with uncertainty, but do not make it magically disappear. (I find phraseology like "this data was generated by that model" extremely offensive, as it subtly plays into both the Platonic-ideal notion and the postmodern notion that reality is purely a social or linguistic construct.)

= Instructions for joining and leaving this list and remarks about the problem of INAPPROPRIATE MESSAGES are available at http://jse.stat.ncsu.edu/ =
Re: How to select a distribution?
Herman Rubin wrote: In article 8smcpv$41r$[EMAIL PROTECTED], Choi, Young Sung [EMAIL PROTECTED] wrote: I am a statistically poor researcher and have a statistical problem. I have two candidate distributions, A(theta1) and B(theta1, theta2), to model my data. How should I determine the best distribution for my data? Please suggest an easy book that explains how to select a distribution when making a probability model and how to test the goodness of the selected distribution against other candidates.

The decision as to what probability models are appropriate must come from understanding your subject, not from any use of simple distributions from probability or statistics textbooks. Above all, do not let what you know or do not know about statistical methods influence this stage; a good statistician might be able to tell you that certain assumptions are NOT important, but, as a statistician, must not suggest a model. However, he may be able to ask you the questions which must be answered to produce a good model.

Herman's advice may be good in "mature" disciplines in which the processes introducing randomness are truly and completely understood. Thermodynamics, for instance, or... I'm sure there was another one somewhere? But what if one wants to model (say) rainfall, human heights, or the number of ticks on a sheep? By the time one has a complete enough understanding of meteorology, human growth processes, or tick ecology to come up with an _a_priori_ model that one trusts at least as well as one trusts the data, one doesn't really need to do statistics any more, just probability theory. (As in thermodynamics...)

Suppose somebody *does* come up with a theoretical argument that shows that (say) birth weights ought to be normally distributed. And suppose the data disagree? What should one do?
It would seem as if Herman's advice would lead one to say either "Then so much the worse for the data", or "That is what comes of trying to do statistics when one is not yet infallible", or at most "As our theoretical model does not fit the data, we cannot proceed and will go out to the pub instead."

I would argue that in _most_ areas where statistics is needed, there are no theories capable of justifying a certain model _a_priori_, and there never will be. (There may be theories capable of justifying an approximate model, but as argued above such a model must still be tested to see if it works!) Thus, in reality, the "understanding of your subject" will reduce to using the distribution that your colleagues used last year. And why did _they_ use it? Eventually, either because it fit some related data set or for some worse reason.

I would certainly agree that one must not choose models in the teeth of the data _because_ they are simple, and one must not accept models merely because one has a small and toothless data set that has not got the power to defend itself against baseless allegations. However, if one has a large enough data set that one can say that any model that fits it must be very _close_ to a certain simple model, I do not see the harm (and I do see the utility) of using that simple model.

With small data sets, unless one has a model justified by a larger and closely related data set, nonparametric or robust techniques are safer. For very small data sets, in many cases, you cannot proceed and should go off to the pub...

-Robert Dawson
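Dawson's closing point about robust techniques for small samples can be illustrated with a toy example (the numbers below are invented for illustration): with one wild observation in a small sample, the mean is dragged toward it, while the median and a trimmed mean barely move.

```python
import numpy as np

# A small sample with one outlier -- the situation Dawson warns about.
data = np.array([4.8, 5.1, 5.0, 4.9, 5.2, 12.0])

mean = data.mean()            # dragged toward the outlier
median = np.median(data)      # a robust location estimate

# A crude trimmed mean: drop the lowest and highest values before averaging.
trimmed = np.sort(data)[1:-1].mean()

print(f"mean={mean:.2f}  median={median:.2f}  trimmed mean={trimmed:.2f}")
```

Here the mean lands above 6 while both robust estimates stay near 5, even though five of the six observations are between 4.8 and 5.2.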
Re: How to select a distribution?
As a general strategy ... you apply both models to the observed data ... look at the (squared) residuals of the fits to the real data points ... and see which model produces the smaller amount of squared error ... sometimes this is rather obvious if you look at the data ... for example, what if you have a relationship graph ... X on the baseline and Y on the vertical ... and it has a curvilinear look to it ... kind of like a banana plot ... you could try fitting a straight line to the data ... find the squared residuals ... then go to a fancier exponential equation ... find the squared residuals ... and we would see in this case that the fancier model produces smaller errors, on average ... now, this does not give you the BEST model perhaps but, it is the strategy one uses (iterating) to converge on what seems to be the best (model) you can do.

At 05:53 PM 10/19/00 +0900, Choi, Young Sung wrote: I am a statistically poor researcher and have a statistical problem. I have two candidate distributions, A(theta1) and B(theta1, theta2), to model my data. How should I determine the best distribution for my data? Please suggest an easy book that explains how to select a distribution when making a probability model and how to test the goodness of the selected distribution against other candidates. Thanks in advance.
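The compare-the-squared-residuals strategy described above can be sketched as follows. This is a minimal illustration with synthetic "banana-shaped" data, and the exponential fit uses a crude log-linear shortcut (least squares on log y) rather than full nonlinear least squares:

```python
import numpy as np

# Synthetic curvilinear data, invented for illustration.
rng = np.random.default_rng(0)
x = np.linspace(0, 4, 40)
y = 2.0 * np.exp(0.5 * x) + rng.normal(0, 0.2, size=x.size)

# Model 1: straight line, fit by least squares.
a1, b1 = np.polyfit(x, y, 1)
sse_line = np.sum((y - (a1 * x + b1)) ** 2)

# Model 2: exponential y = a*exp(b*x), fit by least squares on log(y)
# (a common shortcut, valid here because all y > 0).
b2, log_a2 = np.polyfit(x, np.log(y), 1)
sse_exp = np.sum((y - np.exp(log_a2) * np.exp(b2 * x)) ** 2)

print(f"SSE (line):        {sse_line:.2f}")
print(f"SSE (exponential): {sse_exp:.2f}")
```

For data with genuine curvature the exponential model's summed squared error comes out much smaller, which is exactly the comparison the post describes; one would iterate with other candidate forms in the same way.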
Re: How to select a distribution?
On Thu, 19 Oct 2000 17:53:41 +0900, "Choi, Young Sung" [EMAIL PROTECTED] wrote: I am a statistically poor researcher and have a statistical problem. I have two candidate distributions, A(theta1) and B(theta1, theta2), to model my data. How should I determine the best distribution for my data? Please suggest an easy book that explains how to select a distribution when making a probability model and how to test the goodness of the selected distribution against other candidates.

"Data Analysis, A Model Comparison Approach" by Judd and McClelland.

What you describe, assuming your notation is intentional, is a nesting of one model within another. So the one with the greater number of parameters will have the better "fit" (at least, no worse) in an absolute sense, and the question is whether the fit achieved by using more parameters improves more than you should expect for that increase in parameters. Assume that "fit" is measured by finding parameters satisfying least-squares error, or by maximum likelihood. (There are other possibilities, but a similar logic generally applies.) If we further assume independence and homogeneity, then the improvement can be tested. Testing is often by an F-test that uses the number of added parameters as the number of "degrees of freedom" in the numerator. Various texts will have this as the "Chow" test.

Finally, you SELECT a distribution according to what sense it makes, and what purpose is served, and whether any good purpose is served by using the more complex parameterization. In some fashion, you need to justify the complexity or other costs of using more parameters. See Robert Abelson, "Statistics as Principled Argument."

-- Rich Ulrich, [EMAIL PROTECTED] http://www.pitt.edu/~wpilib/index.html
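A minimal sketch of the nested-model comparison described above, using synthetic data and least-squares fits: the reduced model is a constant mean (one parameter), the full model adds a slope (two parameters), and the F statistic asks whether the extra parameter improves the fit more than chance alone would.

```python
import numpy as np

# Synthetic data with a real linear trend, invented for illustration.
rng = np.random.default_rng(1)
n = 50
x = np.linspace(0, 10, n)
y = 1.5 + 0.8 * x + rng.normal(0, 1.0, size=n)

# Reduced model, one parameter: a constant mean.
sse_reduced = np.sum((y - y.mean()) ** 2)

# Full model, two parameters: a straight line. It nests the reduced model
# (set the slope to zero), so its SSE can never be larger.
a, b = np.polyfit(x, y, 1)
sse_full = np.sum((y - (a * x + b)) ** 2)

# F statistic: numerator df = number of added parameters, as the post says.
df_extra = 1
df_resid = n - 2              # residual df of the full model
f_stat = ((sse_reduced - sse_full) / df_extra) / (sse_full / df_resid)
print(f"F({df_extra}, {df_resid}) = {f_stat:.1f}")
```

The full model always fits at least as well; the F test asks whether it fits *enough* better to justify the extra parameter (here, with a genuine trend in the data, it overwhelmingly does).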
Re: How to select a distribution?
In article 8smcpv$41r$[EMAIL PROTECTED], Choi, Young Sung [EMAIL PROTECTED] wrote: I am a statistically poor researcher and have a statistical problem. I have two candidate distributions, A(theta1) and B(theta1, theta2), to model my data. How should I determine the best distribution for my data? Please suggest an easy book that explains how to select a distribution when making a probability model and how to test the goodness of the selected distribution against other candidates.

The decision as to what probability models are appropriate must come from understanding your subject, not from any use of simple distributions from probability or statistics textbooks. Above all, do not let what you know or do not know about statistical methods influence this stage; a good statistician might be able to tell you that certain assumptions are NOT important, but, as a statistician, must not suggest a model. However, he may be able to ask you the questions which must be answered to produce a good model.

As for the choice among formulated models, I suggest you consult a statistician after the models are formulated. In this generality, it is not possible to advise you in a short article.

-- This address is for information only. I do not claim that these views are those of the Statistics Department or of Purdue University. Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN 47907-1399 [EMAIL PROTECTED] Phone: (765)494-6054 FAX: (765)494-0558