Ronald E. Parr wrote:
> [Bayes'] rule makes a statement about distributions. To apply the rule
> correctly, we must make sure that we are talking about the same
> distribution.
No, Bayes' Rule does not make a statement about distributions. Here it is:
For any propositions A, B, and prior information X, we have
P(A|B,X) = P(B|A,X) * P(A|X) / P(B|X).
This is a statement about propositions. There is no mention of distributions in
it. Of course, it is commonly the case that B is a proposition of form
x_1 = d_1 and x_2 = d_2 and ... and x_n = d_n,
where the x_i are independent and identically distributed, and that A is a
statement of the form
theta = t,
where theta is the set of parameters describing the shared distribution of the
x_i. But none of this is required by Bayes' Rule.
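To make this concrete, here is a toy calculation (a Python sketch; the
propositions and the numbers in the joint table are invented for illustration)
that applies Bayes' Rule to two bare propositions, with no data model and no
i.i.d. assumption anywhere in sight:

    # Toy check of Bayes' Rule on bare propositions -- no data model, no
    # i.i.d. assumption, just one joint distribution over (A, B).
    # The numbers are invented for illustration.

    joint = {  # P(A, B) for the four truth assignments
        (True,  True ): 0.30,
        (True,  False): 0.10,
        (False, True ): 0.20,
        (False, False): 0.40,
    }

    p_A  = sum(p for (a, b), p in joint.items() if a)   # 0.40
    p_B  = sum(p for (a, b), p in joint.items() if b)   # 0.50
    p_AB = joint[(True, True)]                          # 0.30

    direct = p_AB / p_B                 # P(A|B) by direct conditioning
    bayes  = (p_AB / p_A) * p_A / p_B   # P(B|A) * P(A) / P(B)

    print(direct, bayes)                # both 0.6 -- the rule holds

Nothing here is a sample from anything; A and B are just propositions with a
joint distribution over them.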
> Whenever you condition, you are assuming that evidence was drawn from
> the distribution about which you are making a probability statement.
No, at most you are simply assuming one large joint probability distribution
over all the variables of interest (including data variables). There is no
i.i.d. assumption required. Although the assumption that the data are i.i.d. is
essential to frequentist methods, with their reliance on sampling distributions
and central limit theorems, such an assumption is completely unnecessary with
Bayesian methods (although it does simplify the problem).
> For example, if you are trying to determine the bias of a coin, you
> update your posterior for the bias under the assumption that your data
> actually came from the distribution of the coin.
The problem is certainly simpler if you can make such an assumption, but you can
still do a Bayesian analysis even if the assumption is not true. Suppose that
your data come from flips of coin A, but you are trying to infer something about
coin B. If knowing the bias of coin A tells you nothing about the bias of coin
B, then you are out of luck -- you won't learn anything about coin B. Let's
model this situation and see what Bayes' Rule tells us:
x_i : i-th coin flip on coin A
theta_A : variable describing bias of coin A
theta_B : variable describing bias of coin B
P(x_i = H | theta_A = t) = t
The x_i are independent given theta_A, and theta_B is independent of both
theta_A and the x_i.
Call the above information X. Let D be (x_1 = d_1 and x_2 = d_2 and ... and
x_n = d_n), where d_1,...,d_n is our experimental data. Then D and the proposition
(theta_B = t) are independent for all values of t, and hence we have
P(theta_B = t | D, X) = P(D | theta_B = t, X) * P(theta_B = t | X) / P(D | X)
                      = P(D | X) * P(theta_B = t | X) / P(D | X)
                      = P(theta_B = t | X),
i.e., our posterior distribution over theta_B is the same as the prior. So
Bayes' Rule did not deceive us -- it told us that the data have no relevance at
all to theta_B, just as expected.
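If you want to check this numerically, here is a grid sketch (Python; the
uniform prior and the particular value chosen for P(D | X) are arbitrary,
illustrative choices):

    import numpy as np

    # Grid sketch of the irrelevant-coin example. Since theta_B is
    # independent of the flips of coin A, the likelihood P(D | theta_B = t, X)
    # is the same number for every t, so the posterior must equal the prior.
    # The uniform prior and the value of P(D | X) are arbitrary choices.

    t_grid = np.linspace(0.01, 0.99, 99)
    prior  = np.ones_like(t_grid) / len(t_grid)

    likelihood = np.full_like(t_grid, 0.17)  # P(D | theta_B = t, X) = P(D | X)

    posterior = likelihood * prior
    posterior /= posterior.sum()

    print(np.allclose(posterior, prior))     # True: posterior == prior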
Now let's look at a more interesting situation where knowing the bias of coin A
*does* tell us something about the bias of coin B. Let's look again at Bayes'
Rule for this situation:
P(theta_B = t | D, X) = P(D | theta_B = t, X) * P(theta_B = t | X) / P(D | X)
posterior for theta_B = likelihood for theta_B * prior for theta_B / constant
The only thing that changes from the previous example is the likelihood function
for theta_B:
P(D | theta_B = t, X)
= INTEGRAL P(D, theta_A = u | theta_B = t, X) du
= INTEGRAL P(D | theta_A = u, theta_B = t, X) P(theta_A = u | theta_B = t, X) du
= INTEGRAL P(D | theta_A = u, X) P(theta_A = u | theta_B = t, X) du
So f(t) = P(D | theta_B = t, X) (the likelihood function for theta_B) is a
smeared-out version of g(u) = P(D | theta_A = u, X) (the likelihood function
for theta_A). If the distribution for theta_A conditional on theta_B is highly
concentrated about the value of theta_B -- that is, we expect the values of
theta_A and theta_B to be close -- then we get little smearing, and the data are
almost as informative about theta_B as they are about theta_A. If the
distribution for theta_A conditional on theta_B is diffuse (knowing theta_B
tells us very little about theta_A), then we get a lot of smearing, and the data
tell us very little about theta_B. Note that the degree of smearing (the width
of the conditional distribution h(u|t) = P(theta_A = u | theta_B = t, X)) is
independent of the amount of data we have. Thus, as intuition suggests, there
is a limit to how much we can learn about coin B by doing experiments with coin
A -- once the width of the likelihood function g(u) is much less than the width
of h(u|t), additional data tell us very little.
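You can see the smearing numerically with a sketch like the following (Python;
the particular forms -- binomial data for coin A, a normal-shaped h(u|t)
renormalized on the grid -- are my own illustrative choices, not anything
Bayes' Rule demands):

    import numpy as np
    from scipy.stats import binom, norm

    u  = np.linspace(0.001, 0.999, 999)   # grid over theta_A values
    t  = np.linspace(0.001, 0.999, 999)   # grid over theta_B values
    du = u[1] - u[0]

    n, k = 100, 70                        # data D: 70 heads in 100 flips of A
    g = binom.pmf(k, n, u)                # g(u) = P(D | theta_A = u, X)

    def likelihood_for_theta_B(width):
        # h(u|t) = P(theta_A = u | theta_B = t, X), normalized over the grid
        h = norm.pdf(u[None, :], loc=t[:, None], scale=width)
        h /= h.sum(axis=1, keepdims=True) * du
        # f(t) = INTEGRAL g(u) h(u|t) du -- a smeared copy of g
        return (h * g[None, :]).sum(axis=1) * du

    f_tight   = likelihood_for_theta_B(0.02)  # theta_A tightly tied to theta_B
    f_diffuse = likelihood_for_theta_B(0.30)  # theta_A barely tied to theta_B

    # f_tight is nearly as sharp as g; f_diffuse is nearly flat in t, so
    # the data say almost nothing about theta_B, and more flips won't help.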
> If you are simply choosing between models in a world without time, then
> there is no point in talking about induction
That's a pretty strong statement; would you care to elaborate? I can point to a
number of examples of induction where there is no inherent temporal ordering on
the data and the hypothesis, or where the data may be temporally ordered *after*
the hypothesis. For example, mathematicians make conjectures based on analogies
and patterns their minds perceive in known proofs and theorems, then try to
prove these conjectures; but these theorems are timeless, lacking any inherent
temporal ordering. As another example, historical science (paleontology,
archaeology, etc.) is all about making inferences about the past from data in
the present -- precisely the opposite of the flow of causality.
> Looking at a mass of prior
> data can give you a hypothesis about how rules held in the world in the
> past. If you try to update your posterior on induction being true in
> the present or future, you are assuming stationarity, i.e. that evidence
> you have about induction in the past is germane to the present. This
> means that you are assuming an underlying distribution that does not
> change across time
That's one possibility, but you don't have to do it that way. If you suspect
that the rules might change over time, then the thing to do is to propose one or
more models of how they might change over time. You can then use Bayesian model
comparison techniques to compare these models with each other and with a model
in which the rules are invariant over time.
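For instance (a Python sketch; the coin data and the simple two-epoch change
model are invented for illustration), you can compute the marginal likelihood
of the data under a stationary model and a changing-rules model and compare:

    import numpy as np
    from scipy.stats import binom

    # Sketch of Bayesian model comparison between "rules don't change" and
    # "rules change at a known point". Data and both model forms are
    # invented for illustration.

    theta  = np.linspace(0.001, 0.999, 999)
    dtheta = theta[1] - theta[0]
    prior  = np.ones_like(theta)              # uniform prior density on bias

    k1, n1 = 8, 10                            # first epoch: 8 heads in 10
    k2, n2 = 2, 10                            # second epoch: 2 heads in 10

    # M1 (stationary): one bias governs both epochs.
    evid_M1 = ((binom.pmf(k1, n1, theta) * binom.pmf(k2, n2, theta) * prior)
               .sum() * dtheta)

    # M2 (changing): independent biases before and after the change point.
    evid_M2 = ((binom.pmf(k1, n1, theta) * prior).sum() * dtheta *
               (binom.pmf(k2, n2, theta) * prior).sum() * dtheta)

    print(evid_M2 / evid_M1)   # Bayes factor; > 1 favors the changing model

The data decide which model wins; no unexamined stationarity assumption is
needed.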
> The template for the particular form of circular reasoning in question here
> requires a premise about an inference rule (or precondition thereof) and
> then the use of that rule to establish the premise. The classic example
> of how to fill this template is the "counter-induction" rule. We assume
> that things that have worked in the past will *not* work in the future
> (and vice versa). We apply this rule in a universe where induction
> holds and observe that the rule has failed. From this, we conclude, due
> to the counter-inductive principle, that the rule must work in the
> future. Each new failure only strengthens our belief in
> counter-induction.
I don't think you can make this argument at all rigorous. The moment you try to
translate this argument into mathematics you're going to run into trouble. The
self-referential nature of the argument itself should be a big red warning flag
to you -- it's a cousin to Russell's paradoxical "set of all sets that do not
contain themselves."
> To make this a more Bayesian argument, we would need to replace the
> stationarity assumption we normally make when making predictions about
> the future with one that replaces the probabilities with their
> complements, i.e P(A) -> P(~A), etc. Call this "counter-Bayes" rule.
> Now we entertain the hypothesis, H, that our counter-stationarity
> assumption is valid. We examine our data and see that in the past, the
> counter-stationarity assumption did terribly; P(H) is very low. We
> apply our non-stationarity assumption to the hypothesis and we conclude
> that it has great evidential support for holding in the present and
> future since 1-P(H) is very high. As before, each new failure only
> strengthens the posterior for H.
Several things to note here:
1. You can't use Bayes' Rule to compute P(H) unless you have some alternative to
compare it to.
2. Again you're getting self-referential, by having your counter-stationarity
hypothesis H say something about itself. I doubt you can formalize this.
3. It's not at all clear what your counter-stationarity hypothesis really
*means*. From what you've said so far it sounds like it might be
self-contradictory, in which case you can't apply Bayes' rule -- you get the
probabilistic equivalent of dividing by zero when you take a probability
conditional on a known false hypothesis.
In conclusion, you need to translate this argument into rigorous mathematics
before it will be at all convincing. As it stands, it is far too vague.
English is simply too slippery and ambiguous to use for discussing this kind of
thing.