Here are two different purposes for which one might use the cross-entropy, and
the theoretical justifications for such use.

1. Maximum-likelihood parameter estimation.  In this case you want to find
the set of parameters for which the data are most probable.  The cross-entropy
of the data, given a parameter set, is the negative log-likelihood of the
parameter set (the log of the probability of the fixed data given the variable
parameters).  So you can find the maximum-likelihood parameter set by
minimizing the cross-entropy.
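
As a concrete (if contrived) numerical sketch, take a Bernoulli model; the
data, the model, and the grid search below are my own made-up illustration,
not anything implied above:

    # Maximum-likelihood estimation of a Bernoulli parameter by minimizing
    # the cross-entropy, i.e. the average negative log-likelihood.
    # (Made-up data and model, purely for illustration.)
    import numpy as np

    data = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])   # fixed observed outcomes

    def cross_entropy(theta):
        # -(1/n) sum_i log P(x_i | theta) for a Bernoulli(theta) model
        return -np.mean(data * np.log(theta) + (1 - data) * np.log(1 - theta))

    thetas = np.linspace(0.001, 0.999, 999)
    theta_hat = min(thetas, key=cross_entropy)
    print(theta_hat)      # close to 0.7 ...
    print(data.mean())    # ... the known closed-form MLE, for comparison

The grid search stands in for whatever optimizer you would actually use; the
point is only that the quantity being minimized is the cross-entropy.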

2. Optimal approximation of a probability distribution.  Suppose that you wish
to approximate a probability mass function p over a finite set X by some pmf
in a set Q of candidate distributions.  What figure of merit do you use to
rank the pmfs in Q?  One way of approaching this is to treat it as a problem
in decision theory, i.e., to maximize the expected utility of your choice.  That is,
given a utility function u(q, x), where q ranges over Q and x ranges over X,
choose that distribution q in Q that maximizes

         E[u(q,x) ; x ~ p]

(the expected value of u(q,x) with x a random variable with pmf p).
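
For concreteness, here is that expectation computed over a small finite X;
the pmfs and the placeholder utility are made-up examples, and nothing is
being recommended yet:

    # E[u(q,x); x ~ p] over a finite outcome set X = {0, 1, 2}.
    # p, q, and the example utility u are arbitrary placeholders.
    import numpy as np

    p = np.array([0.5, 0.3, 0.2])      # target pmf
    q = np.array([0.4, 0.4, 0.2])      # one candidate pmf from Q

    def expected_utility(u, q, p):
        return sum(p[x] * u(q, x) for x in range(len(p)))

    print(expected_utility(lambda q, x: q[x], q, p))   # with placeholder u(q,x) = q(x)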

But what utility function should we use?  Here are some qualitative properties
that one might like the utility function to have:

- u(q,x) is a local score function, i.e., u(q,x) = f(q(x), x) for some
  function f.  In other words, you only care about the probability that q
  assigns to the outcome that actually occurs; the probabilities q assigns to
  hypothetical outcomes that could have occurred -- but did not -- are of no
  interest.

- u(q,x) is smooth, i.e., f(y, x) is continuously differentiable in y.

- u(q,x) is proper, i.e., for any strictly positive pmf p over X and q in Q,
  E[u(q,x) ; x ~ p] <= E[u(p,x) ; x ~ p], with equality only when p and q are
  identical.  That is to say, the best you can do is to match the target pmf
  exactly.
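
To see that propriety is a genuine restriction, here is a quick check (same
made-up pmfs as above) that the linear score u(q,x) = q(x), although local and
smooth, fails it: a point mass on the likeliest outcome beats reporting p
itself.

    # The linear score u(q, x) = q(x) is local and smooth but not proper:
    # putting all mass on the most probable outcome scores better in
    # expectation than reporting the target pmf p itself.
    import numpy as np

    p = np.array([0.5, 0.3, 0.2])             # target pmf (made up)
    point_mass = np.array([1.0, 0.0, 0.0])    # all mass on argmax p

    def expected_linear_score(q, p):
        return float(np.sum(p * q))           # E[q(x); x ~ p]

    print(expected_linear_score(p, p))            # 0.38 for reporting the truth
    print(expected_linear_score(point_mass, p))   # 0.50 -- better, so improper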

If you insist on these qualitative properties, then Bernardo and Smith
(_Bayesian Theory_, pp. 151--154) show that

     u(q,x) = A log q(x) + B(x)

for some positive constant A and function B : X -> R.  We can then maximize
E[u(q,x); x ~ p] (i.e., find the optimal approximating pmf q) by minimizing
-E[log q(x); x ~ p], which is just the cross-entropy of q relative to p.  (This
generalizes to distributions over continuous domains as well.)
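
Here is a small numerical sketch of that recipe; the target pmf and the
binomial family Q are my own arbitrary choices, purely for illustration:

    # Optimal approximation of a pmf p over X = {0, 1, 2, 3} by a member of
    # a restricted family Q (here Binomial(3, theta)), found by minimizing
    # the cross-entropy -E[log q(x); x ~ p].
    import numpy as np
    from math import comb

    p = np.array([0.10, 0.20, 0.45, 0.25])    # target pmf (made up)

    def binom_pmf(theta, n=3):
        return np.array([comb(n, k) * theta**k * (1 - theta)**(n - k)
                         for k in range(n + 1)])

    def cross_entropy(q, p):
        return -float(np.sum(p * np.log(q)))

    thetas = np.linspace(0.01, 0.99, 99)
    theta_star = min(thetas, key=lambda t: cross_entropy(binom_pmf(t), p))
    print(theta_star)                              # parameter of the best q in Q
    print(cross_entropy(binom_pmf(theta_star), p)) # cross-entropy at the optimum ...
    print(cross_entropy(p, p))                     # ... never less than the entropy of p

The last two lines are Gibbs' inequality in action: the cross-entropy of any q
relative to p is bounded below by the entropy of p, with equality only at
q = p, which is exactly the propriety property above.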
