David --
BIC is a large-sample approximation that arises in the study of the
marginal likelihood and the deviance. It is known to select for overly
simple explanations of small data-sets. The original MDL theory, which
you are invoking, often bottoms out in approximations with this problem,
and Rissanen has moved on to stochastic complexity. Related
density-estimation approaches include minimum message length and entropy
minimization. All invoke the code-length minimizing prescriptions of
Kolmogorov complexity theory, but differ in their relations to this
non-computable gold standard. The gold standard itself is somewhat
overrated: careful analysis of the current mathematics shows that the
probability that the minimum K-complexity coding is actually the best
one can vary over several orders of magnitude.
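To make the BIC point concrete: BIC comes from a Laplace approximation
of the log marginal likelihood around the maximum-likelihood estimate.
Writing k for the number of free parameters, n for the sample size, and
\hat I for the per-datum observed information, the standard expansion
is roughly

  \ln p(D \mid M) = \ln \int p(D \mid \theta, M)\, p(\theta \mid M)\, d\theta
                  \approx \ln p(D \mid \hat\theta, M) - (k/2) \ln n
                    + (k/2) \ln 2\pi - (1/2) \ln \det \hat I(\hat\theta)
                    + \ln p(\hat\theta \mid M).

BIC keeps only the first two terms on the right; everything it drops,
including the prior, is O(1) in n, which is exactly the part that
matters when n is small.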
In short, minimizing expected coding length is the best policy, but the
guarantees are weak, and some of the more popular approximations cannot
be trusted with a small dataset, which, unfortunately, is exactly the
circumstance that calls for a good complexity penalty.
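If you want to see the size of what gets thrown away, here is a small
toy of my own (not from any of the papers above): a Bernoulli model
with a uniform prior, where the exact log evidence has a closed form,
compared against the BIC-style approximation. The function names and
the particular head/tail counts are purely illustrative.

  import math

  def exact_log_evidence(h, t):
      # Bernoulli likelihood with a Uniform(0,1) prior on theta:
      # p(D | M1) = integral of theta^h (1 - theta)^t dtheta = Beta(h+1, t+1)
      return math.lgamma(h + 1) + math.lgamma(t + 1) - math.lgamma(h + t + 2)

  def bic_log_evidence(h, t):
      # BIC-style approximation: log p(D | theta_hat) - (k/2) log n, with k = 1
      n = h + t
      theta = h / n
      loglik = (h * math.log(theta) if h else 0.0) \
             + (t * math.log(1 - theta) if t else 0.0)
      return loglik - 0.5 * math.log(n)

  for h, t in [(5, 1), (15, 5), (150, 50)]:
      exact = exact_log_evidence(h, t)
      approx = bic_log_evidence(h, t)
      print(f"n={h + t:4d}  exact={exact:9.3f}  BIC={approx:9.3f}  gap={approx - exact:7.3f}")

The gap is the O(1) material BIC drops; in this toy it settles toward a
constant as n grows, while at small n it is not negligible next to the
(1/2) ln n penalty itself.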
More specifically, with very limited training data you should expect an
ideal learner to infer a model that is simpler than the true
data-generating mechanism, simply because the data does not encompass
all the interesting behavior the data-generating mechanism is capable
of.
BIC uses just the first few terms of a Taylor expansion; the analytic
form and meaning of the next few terms have been explored by
Balasubramanian and might profitably be applied to your problem.
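From memory, and assuming a Jeffreys prior, his expansion comes out as
something like

  -\ln p(D \mid M) \approx -\ln p(D \mid \hat\theta)
      + (k/2) \ln (n / 2\pi)
      + \ln \int d\theta \sqrt{\det g(\theta)}
      + (1/2) \ln [ \det \hat J(\hat\theta) / \det g(\hat\theta) ] + \dots

where g is the Fisher-information metric and \hat J the per-datum
observed information. The third term is a prior-independent volume of
the model, and the fourth measures how far the observed information
sits from the expected one; both are O(1) corrections of the sort BIC
discards, which is why they matter in exactly your small-sample regime.
I am quoting that from memory, so check his paper for the exact form.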
-- Matt