I am interested in getting some feedback from the list members on the
following commentary on section 12.4.3 ("Unknown probability for
success") of Jaynes's book PT:TLOS (Probability Theory: The Logic of
Science).  Chapter 12 is entitled "Ignorance priors and transformation
groups".

----------

I found Chapters 11 and 12 quite exciting and useful, as the construction of
reasonable priors is a subject that seems to get short shrift in most books
on Bayesian methods, and the notion of an objective prior, one that encodes
exactly the information one has at hand and nothing more, is quite
appealing.

However, I must disagree with Jaynes's construction in 12.4.3 of an
ignorance prior for an "unknown probability for success" theta, which he
concludes should be an improper prior proportional to

  theta^{-1} * (1 - theta)^{-1}

over the interval [0,1].  In fact, I will argue that his own rules
point to the uniform distribution over [0,1] as the appropriate ignorance
prior.  I'll begin by critiquing specific passages in 12.4.3.


** p. 383, second full paragraph: "For example, in a chemical laboratory we
find a jar containing an unknown and unlabeled compound.  We are at first
completely ignorant as to whether a small sample of this compound will
dissolve in water or not.  But, having observed that one small sample does
dissolve, we infer immediately that all samples of this compound are water
soluble, and although this conclusion does not carry quite the force of
deductive proof, we feel strongly that the inference was justified.  Yet the
Bayes-Laplace rule [uniform prior] leads to a negligibly small probability
for this being true, and yields only a probability of 2/3 that the next
sample tested will dissolve."

MY RESPONSE: This example is irrelevant for evaluating proposed ignorance
priors over theta, because it describes a situation in which we have quite
substantial prior information.  We know that the relevant factors in
determining whether a sample of some solid compound will dissolve in water
are

- the chemical identity of the sample
- the quantity of sample
- the quantity of water
- the temperature

All of these are factors we can easily control, and so if we repeat the
experiment with the same unknown compound, holding the other factors fixed,
we strongly expect to get the same result.  That is, this prior information
tells us that theta should be (nearly?) 0 or (nearly?) 1 for any particular
values of the above four factors.


** p. 383, third full paragraph and onward: "[...] There is a conceptual
difficulty here, since f(theta) d theta is a `probability for a probability'.
However, it can be removed by carrying the notion of a split personality to
extremes; instead of supposing that f(theta) describes the state of
knowledge of any one person, imagine that we have a large population of
individuals who hold varying beliefs about the probability for success, and
that f(theta) describes the distribution of their beliefs."

MY RESPONSE: This artifice is unnecessary.  Following Jaynes's advice to
start with the finite and take the infinite only as a well-defined limit, we
can begin by considering a case of n trials, and define
theta = (# successes) / n.  Our distribution for theta is then a probability
of a frequency, not a probability of a probability, and there is no
conceptual difficulty.  We then take the limit as n -> infinity.


** Continuing: "Is it possible that, although each individual holds a
definite opinion, the population as a whole is completely ignorant of theta?
What distribution f(theta) describes a population in a state of total
confusion on the issue? [...]

"Now suppose that, before the experiment is performed, one more definite 
piece
of evidence E is given simultaneously to all of them.  Each individual will
change his state of belief according to Bayes' theorem; Mr. X, who had
previously held the probability for success to be

  theta = p(S | X)                 (12.42)

will change it to
                       
  theta' = p(S | E,X) = [omitted]  (12.43)

[...] This new evidence thus generates a mapping of the parameter space 0 <=
theta <= 1 onto itself, given from (12.43) by

  theta' = a * theta / (1 - theta + a * theta)    (12.44)

"[...] If the population as a whole can learn nothing from this new 
evidence,
then it would seem reasonable to say that the population has been 
reduced, by
conflicting propaganda, to a state of total confusion on the issue.  We
therefore define the state of `total confusion' or `complete ignorance' 
by the
condition that, after the transformation (12.44), the number of individuals
who hold beliefs in any given range theta_1 < theta < theta_2 is the same as
before."

MY RESPONSE: I find this characterization of complete ignorance to be quite
puzzling.  I just don't see any reason why this corresponds to any notion of
complete ignorance.  If anyone can enlighten me on this point, I would
appreciate it.
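
For what it's worth, the mathematics of the construction does check out: one
can verify numerically that the prior proportional to
theta^{-1} * (1 - theta)^{-1} is invariant under the mapping (12.44) for
every a.  A minimal sketch in Python (the grid and the test values of a are
arbitrary choices of mine):

  import numpy as np

  def f(t):
      # Jaynes's candidate prior, up to a constant: 1 / (theta * (1 - theta))
      return 1.0 / (t * (1.0 - t))

  def push(t, a):
      # the mapping (12.44): theta' = a*theta / (1 - theta + a*theta)
      return a * t / (1.0 - t + a * t)

  def jacobian(t, a):
      # d theta' / d theta for the mapping (12.44)
      return a / (1.0 - t + a * t) ** 2

  theta = np.linspace(0.01, 0.99, 99)
  for a in [0.1, 0.5, 2.0, 10.0]:
      # invariance condition: f(theta) d theta = f(theta') d theta'
      assert np.allclose(f(push(theta, a)) * jacobian(theta, a), f(theta))

My puzzlement is with the interpretation, not the algebra.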

Furthermore, there are certain possible new pieces of evidence E that MUST
change the overall distribution of beliefs -- for example, E might be
frequency data for the first N trials, or even a definite statement about
the value of theta itself.  There is also some ambiguity here.  Inference
about theta only makes sense in the context of repeated trials; so, does "S"
above really mean "S_i" (success at the i-th trial) for some arbitrary i?
If so, we must also assume that E is carefully chosen so that

  p(E | S_i, X)

has no dependence on (unobserved values of) i, so that p(S_i | E, X) remains
independent of i.


** p. 384, sentence following equation (12.43): "This new evidence thus
generates a mapping of the parameter space 0 <= theta <= 1 onto itself,
given from (12.43) by

  theta' = a theta / (1 - theta + a theta)        (12.44)

where

  a = p(E | S, X) / p(E | F, X).                  (12.45)"

MY RESPONSE: It seems to me that Jaynes is here committing an error that he
warns against elsewhere: erroneously identifying distinct states of
information as the same.  In particular, a is a function of the particular
individual X, since we are conditioning on a different state of information
for each individual.  In my view, this destroys the entire construction,
since there is no longer a single transformation (12.44) shared by the whole
population.
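
For concreteness, here is (presumably, up to notation) the step the quoted
passage omits; it is just Bayes' theorem, written with theta = p(S | X):

  theta' = p(S | E,X)
         = p(E | S,X) p(S | X) / [p(E | S,X) p(S | X) + p(E | F,X) p(F | X)]
         = a * theta / (1 - theta + a * theta)

This makes explicit that a = p(E | S,X) / p(E | F,X) is built from X's own
sampling distributions for E, and so in general differs from one individual
to the next.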


Now let's move on to my alternative proposal for an ignorance prior,
following Jaynes's own advice.  We begin with section 12.3, "Continuous
distributions", wherein Jaynes writes,

"In the discrete entropy expression

  H_I^d = -(SUM i: 1 <= i <= n: p_i log[p_i])

we suppose that the discrete points x_i, i = 1,2,...,n, become more and more
numerous, in such a way that, in the limit n -> infinity,

  lim_{n->infty} (1/n)(# of points in a < x < b) = INTEGRAL_a^b dx m(x).

If this passage to the limit is sufficiently well-behaved, [...] [t]he
discrete probability distribution p_i will go over into a continuous
probability p(x | I) [...]  The `invariant measure' function, m(x) is
proportional to the limiting density of discrete points."
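
As a quick numerical illustration of this passage to the limit, here is a
sketch of mine in Python, using a uniform grid so that m(x) = 1 on [0,1]:
the discrete entropy should behave like log(n) plus the continuous entropy
-INTEGRAL p log p dx.

  import numpy as np

  def pdf(x):
      # an arbitrary smooth test density on [0,1] (here Beta(2,2))
      return 6.0 * x * (1.0 - x)

  # continuous entropy -INTEGRAL p log p dx, by a simple Riemann sum
  xs = np.linspace(1e-9, 1.0 - 1e-9, 100001)
  h_cont = -np.sum(pdf(xs) * np.log(pdf(xs))) * (xs[1] - xs[0])

  # discrete entropy on an n-point uniform grid; since the points have
  # limiting density m(x) = 1, H_d - log(n) should approach h_cont
  for n in [10, 100, 10000]:
      x = (np.arange(n) + 0.5) / n
      p = pdf(x) / pdf(x).sum()
      print(n, -np.sum(p * np.log(p)) - np.log(n), h_cont)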

Then at the beginning of p. 377, Jaynes writes,

"Except for a constant factor, the measure m(x) is also the prior 
distribution
describing `complete ignorance' of x."

On p. 376, last complete paragraph, Jaynes motivates the introduction of
invariance transformations by writing,

"If the parameter space is not the result of any obvious limiting process,
what determines the proper measure m(x)?"

thus strongly implying that if there is an obvious limiting process, this is
the preferred method for constructing m(x).

But in this problem there is, in fact, an obvious limiting process -- the
one mentioned at the beginning of this note.  That is, we start by
considering a finite case of n trials, define theta = (# successes) / n, and
define

  p(x_1, ..., x_n | theta, I)

as in section 3.1 (sampling without replacement).  (x_i is 1 if the i-th
trial is a success, and 0 otherwise.)  Since theta has a finite set of
possible values, and "ignorance" means we are placing no constraints on the
distribution over theta, Chapter 11 tells us that we should use the
maximum-entropy distribution for theta, i.e., the uniform distribution over

  0, 1/n, 2/n, ..., (n-1)/n, 1.

In the limit as n -> infinity we get

  p(x_1, ..., x_k | theta, I) =
    theta^{(SUM i: 1 <= i <= k: x_i)} * (1 - theta)^{(SUM i: 1 <= i <= k: 1 - x_i)}

and the prior over theta turns into a uniform pdf over [0,1].
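
As a sanity check of this limit, here is a minimal sketch in Python (the
function name and grid sizes are mine) computing the predictive probability
of success under the discrete uniform prior.  With one success observed in
one trial it converges to Laplace's 2/3, the very number Jaynes objects to
in the solubility example:

  import numpy as np

  def predictive(n, s, k):
      # P(next trial is a success | s successes in k trials), under a
      # uniform prior on theta in {0, 1/n, 2/n, ..., 1}
      theta = np.arange(n + 1) / n
      like = theta**s * (1.0 - theta)**(k - s)   # likelihood of the data
      post = like / like.sum()                   # posterior over the grid
      return np.sum(post * theta)                # predictive = posterior mean

  for n in [10, 100, 10000]:
      print(n, predictive(n, s=1, k=1))          # -> 2/3 as n grows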


Finally, I have some misgivings about even this solution.  The problem is
that we are not, in fact, completely ignorant about theta.  We know of some
additional structure to the problem -- namely, that theta is derived from
the results of the trials x_i via theta = (SUM i: 1 <= i <= n: x_i) / n.
One could argue that we should therefore derive the prior over theta from
the ignorance prior over x_1,...,x_n.  As Jaynes discusses in Chapter 3 (?),
in the limit n -> infinity this amounts to a prior that gives probability 1
to theta = 1/2, and we find that we are incapable of learning --

  p(x_{k+1} | x_1,...,x_k, I) = p(x_{k+1} | I) = 1/2.
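
The no-learning property is easy to exhibit for finite n.  A minimal sketch
(Python; the names are mine), assuming the ignorance prior that puts equal
weight on all 2^n outcome sequences:

  from itertools import product

  def next_prob(data, n):
      # P(x_{k+1} = 1 | x_1..x_k = data), under a uniform prior over
      # all 2^n binary outcome sequences of length n
      k = len(data)
      seqs = [s for s in product([0, 1], repeat=n) if s[:k] == tuple(data)]
      return sum(s[k] for s in seqs) / len(seqs)

  print(next_prob((1, 1, 1, 1, 1), n=10))   # 0.5, despite five successes
  print(next_prob((0, 0, 0, 0, 0), n=10))   # 0.5, despite five failures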

Thus it seems that any nondegenerate prior for theta is, in some sense,
informative.  At the very least, it tells us that the various trials are
subject to some common logical influence.
