Ronald E. Parr wrote:
> 
> However you interpret Bayes rule, you make the assumption that the
> evidence upon which you are conditioning is germane to the proposition
> in question.

Not at all, as I demonstrated in another posting.  Bayes' Rule is perfectly
capable of handling situations where the evidence is irrelevant to the
proposition of interest, in which case the posterior is identical to the prior.

I should be clear, by the way, that I was merely addressing a technical point
about Bayes' Rule; I wasn't really addressing the question you and Kathy were
debating over whether you can use Bayes' Rule to determine whether induction
works.  The discussion struck me as too vague -- too dependent on slippery
English -- for any useful conclusion to be reached.

However... now my interest is piqued.  Can we turn this into a precise
mathematical question?  I'm going to take a shot at it.

First of all, our universe is far too messy, so I'm going to look at a much
simpler universe, following a long tradition in mathematics and physics of
tackling simple problems before you even consider more complicated ones.  In
this universe, time is discrete, and there is only one observable: a Boolean
value.  So at time t, the past is described by

  x_1, x_2, ..., x_t

and we want to know if we can make any inferences from this as to whether
"induction works."

I don't know exactly what it means to say that "induction works" or "induction
doesn't work," so I'm going to just propose a definition to get the discussion
off the ground.  It's perfectly reasonable for you to object to my definition,
but then you'll have to help me find a different -- and equally precise --
definition that is acceptable to you.

My interpretation of "induction works" is that there is some regularity to the
universe. Let L be some algorithmic language and p be some program in that
language that takes as input some *state* s and produces as output a new state
s' and a Boolean value y.  We define

 out(q,s_0,t) is the sequence y_1,...,y_t produced by setting
 (s_(i+1),y_(i+1)) = q(s_i) for all 0 <= i < t.

Then I take "induction works" to mean that there is some program p in language L
and initial state s_0 such that

- x_1,...,x_t = out(p,s_0,t) for all t.
- out(p,s_0,t) is defined for all t (p never gets hung up in an infinite loop
when computing the next state and value).

That is, our universe has a hidden state at each time step, and both the next
state and the observable x_t at the current time step can be computed from the
current state using p.  To complete the hypothesis of "induction works", we'll
also have to specify a prior f over the programs p and initial states s_0.  For
example, we could use a prior that assigns higher probabilities to (program,
initial state) pairs representable in fewer bits, in keeping with our assumption
that the universe is regular.
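To make the out(q, s_0, t) definition above concrete, here is a minimal Python
sketch, representing a program as an ordinary function from a state to a (next
state, output) pair.  The "alternator" program is my own toy example, not part
of the definitions above.

```python
def out(q, s0, t):
    """out(q, s_0, t): the sequence y_1, ..., y_t produced by
    iterating (s_(i+1), y_(i+1)) = q(s_i) starting from s_0."""
    s, ys = s0, []
    for _ in range(t):
        s, y = q(s)
        ys.append(y)
    return ys

# A toy program: flip the Boolean state and emit the new value,
# so the observable sequence alternates forever.
def alternator(s):
    return (not s, not s)

print(out(alternator, True, 6))  # -> [False, True, False, True, False, True]
```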

My interpretation of "induction doesn't work" is that the x_i are completely
unrelated, that is, they are independent random variables, so knowing any finite
subset of the x_i tells you absolutely nothing about any of the remaining x_i. 
As to what probability we should assign to x_i, we choose the least informative
distribution: x_i and ~x_i equally probable.

Let I be the hypothesis that induction works, and R be the hypothesis that it
does not.  Let X be the information about I and R described above.  We can
formalize all of this as follows:

  1. P(I or R | X) = 1

  2. P(x_i | x_(k_1), ..., x_(k_n), R, X) = P(x_i | R, X)
     for any indices k_1, ..., k_n distinct from i
     (that is, the x_i are independent of each other, given R)

  3. P(x_i | R, X) = 1/2.

  4. P(x_1, ..., x_t | p, s_0, I, X) = 1 if x_1,...,x_t = out(p, s_0, t);
     otherwise it is 0.

  5. P(p, s_0 | I, X) = f(p, s_0).

  6. P(I | X) = a
     (we have to assign some prior probability to the hypothesis that induction
works).

Then

  P(x_1, ..., x_t | R, X) = 2^(-t)
  P(x_1, ..., x_t | I, X) = (SUM p, s_0: out(p, s_0, t) = x_1,...,x_t: f(p,s_0))
                          = S(x_1,...,x_t)  [definition]
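The true S(x_1,...,x_t) sums f over *all* (program, initial state) pairs, which
is uncomputable in general.  Purely as an illustration, here is a sketch that
enumerates a small finite class of table-driven machines -- a class and a
bit-weighted prior f that I am inventing for this example, normalized over the
class -- and sums the weight of the pairs consistent with the observed prefix.

```python
from itertools import product

def toy_S(xs, max_states=2):
    """Toy stand-in for S(x_1..x_t): sum of prior weight over the
    (program, initial state) pairs whose output matches xs.
    A "program" here is a tiny machine: in state s it emits out_tab[s]
    and moves to next_tab[s].  The prior f is proportional to
    2^-(rough description bits), normalized over this finite class
    -- an assumption made only for this illustration."""
    matched, total = 0.0, 0.0
    for n in range(1, max_states + 1):
        bits = 2 * n  # crude size estimate for an n-state machine
        for next_tab in product(range(n), repeat=n):
            for out_tab in product([False, True], repeat=n):
                for s0 in range(n):
                    w = 2.0 ** (-bits)
                    total += w
                    s, ok = s0, True
                    for x in xs:
                        y, s = out_tab[s], next_tab[s]
                        if y != x:
                            ok = False
                            break
                    if ok:
                        matched += w
    return matched / total

# An alternating prefix is produced by some 2-state machine, so it gets
# positive weight; this non-periodic prefix of the same length gets none.
print(toy_S([False, True, False, True]) > toy_S([False, True, True, False]))
```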

So now we can compute the probability that "induction works," i.e., that there
is some regularity to our universe:

  P(I | x_1,...,x_t, X) =         a * S(x_1,...x_t)
                          -----------------------------------
                          (1-a) * 2^(-t) + a * S(x_1,...,x_t)
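The posterior formula above is straightforward to evaluate once a and S are in
hand; here is a minimal sketch, with made-up values of a and S just to show the
behavior as t grows.

```python
def posterior_I(a, S, t):
    """P(I | x_1..x_t, X) = a*S / ((1-a)*2^-t + a*S)."""
    return a * S / ((1 - a) * 2.0 ** (-t) + a * S)

# With any fixed S > 0, the 2^(-t) term contributed by R is eventually
# swamped, so the posterior for I climbs toward 1 as t grows.
for t in (5, 10, 20):
    print(t, posterior_I(0.5, 0.01, t))
```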

For example, if x_1,...,x_t = false,true,false,true,..., then for large enough
t, S(x_1,...,x_t) will converge to a constant > 0, whereas 2^(-t) will become
arbitrarily small, so we become more and more sure that I is true as t
increases.

On the other hand, if x_1,...,x_t is a random sequence for all t (using the
technical definition of randomness from Kolmogorov complexity theory, i.e.,
there is no program significantly shorter than t bits that produces
x_1,...,x_t), then we will find S(x_1,...,x_t) going to zero faster than 2^(-t),
so we become more and more sure that R is true as t increases.

We have not made any circular assumptions here -- we have explicitly allowed for
the possibility that the past tells you nothing about the future, and we are
capable of reaching either conclusion (I true, or R true) depending on the data. 
The point is that Bayes' Rule allows us to *compare* different assumptions.  We
have used it here to compare the assumption of no order whatsoever (the past
tells you nothing about the future) with an assumption of a regular universe.
