Many thanks to everyone who responded to my recent plea regarding BBN
complexity and likelihood measures (BIC et al). Here is an attempt to
summarise the main points of advice I received. At the end I've included my
original question for reference.

Thanks again.

David
------------------------------------------------------

Ross D. Shachter
I am sure that you will get more complete answers, but let me give you a
simple one.  How complex a model should you want to fit to a small data set?
Personally, unless you have enough data to support complexity, go with the
simplest model you can.

David Heckerman
The BIC is a large-data approximation and should not be used on such small
data sets. If you have no missing data or hidden variables, you can use a
closed-form Bayesian criterion described in -- for example -- "A Tutorial on
Learning With Bayesian Networks", David Heckerman (MSR-TR-95-06). If you have
missing data or hidden variables, you can use a Markov Chain Monte Carlo
approximation of the Bayesian criterion, also described in the above paper.

Matthew Brand
BIC is a large-sample approximation that arises in the study of the marginal
likelihood and the deviance.  It is known to select for overly simple
explanations of small data sets.  More specifically, with very limited
training data you should expect an ideal learner to infer a model that is
simpler than the true data-generating mechanism, simply because the data
does not encompass all the interesting behaviour the data-generating
mechanism is capable of.

Kathryn Blackmond Laskey
It helps on very small data sets to use models that allow for local
structure (Friedman and Goldszmidt, "Learning Bayesian networks with local
structure", originally in UAI-96). But have you tried the Bayesian Dirichlet
score?  I've compared belief net learning with naive Bayes and a number of
simple heuristic learning algorithms, on data sets with very small numbers
of observations, and have found that Bayes net learning with the Bayesian
Dirichlet score outperforms the other methods on holdout samples.
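
[For concreteness: the Bayesian Dirichlet score has a closed form for
complete multinomial data. Below is a minimal Python sketch for a single
node, assuming a uniform Dirichlet hyperparameter per cell (the K2-style
variant of the score); the names are illustrative, not from any of the
papers cited:]

  import numpy as np
  from scipy.special import gammaln

  def bd_node_score(counts, alpha=1.0):
      # counts[j, k] = number of records with parent configuration j
      # and node state k; alpha = 1.0 per cell gives the K2 (uniform) prior.
      counts = np.asarray(counts, dtype=float)
      a_jk = np.full(counts.shape, alpha)   # per-cell pseudo-counts
      a_j = a_jk.sum(axis=1)                # per-configuration totals
      n_j = counts.sum(axis=1)
      # One node's contribution to log P(data | structure):
      return ((gammaln(a_j) - gammaln(a_j + n_j)).sum()
              + (gammaln(a_jk + counts) - gammaln(a_jk)).sum())

  # Example: a binary node with one binary parent, 20 records:
  # bd_node_score(np.array([[7, 3], [2, 8]]))
  # A whole network scores as the sum of its node scores; structures
  # are compared directly on this marginal likelihood, with no
  # separate complexity penalty.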

David L Dowe
Look out for a forthcoming issue of the Computer Journal, which is sort of
meant to be a 30th anniversary issue of the seminal Wallace and Boulton 1968
MML paper. The issue will contain several articles, discussions and
rejoinders, but, of course :-) , one of the ones I encourage you to read is:
Wallace, C.S. and D.L. Dowe, "Minimum Message Length and Kolmogorov
Complexity", to appear in the Computer Journal.

Moises Goldszmidt
I would suggest that you check "Data Analysis with Bayesian Networks: A
Bootstrap Approach" by Nir Friedman, Moises Goldszmidt, and Abraham Wyner in
UAI-99, where we explore the bootstrap precisely for learning (at least)
parts of the network reliably when the data set is small.
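
[The idea, roughly: resample the data with replacement, relearn the network
on each replicate, and trust only the features that recur across most
replicates. A minimal Python sketch, where learn_structure stands in for
whatever structure learner is being used (a hypothetical placeholder):]

  import random
  from collections import Counter

  def edge_confidence(records, learn_structure, n_boot=200):
      # learn_structure: any routine mapping a data set to a set of
      # (parent, child) edges -- a hypothetical placeholder here.
      counts = Counter()
      for _ in range(n_boot):
          resample = [random.choice(records) for _ in records]
          for edge in learn_structure(resample):
              counts[edge] += 1
      # Fraction of bootstrap replicates in which each edge appeared:
      return {edge: c / n_boot for edge, c in counts.items()}

  # Edges with confidence near 1.0 are well supported even by a
  # 20-record data set; low-confidence edges are likely noise.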

Brad Venner
BIC is an ad-hoc criterion outside the domain of probability theory and is
not truly Bayesian. He recommends: Bretthorst, G. Larry (1996) "An
Introduction To Model Selection Using Probability Theory As Logic", in
Maximum Entropy and Bayesian Methods, G. R. Heidbreder (ed), Kluwer Academic
Publishers, Dordrecht, The Netherlands; also available from
http://bayes.wustl.edu/glb/bib.html.

Greg Cooper
Try what Kathy Laskey calls the "Dirichlet score". I suggest the following
papers as good sources of background reading on the scoring metric: 
Cooper, G.F. and Herskovits, E. (1992) A Bayesian method for the induction
of probabilistic networks from data, Machine Learning 9, 309-347.
Heckerman, D., Geiger, D. and Chickering, D. (1995) Learning Bayesian
networks: The combination of knowledge and statistical data, Machine
Learning 20, 197-243.
Stefano Monti, with whom I work, has developed a Bayesian method for
discretising continuous variables during the search over Bayesian network
structures. See:  Monti, S. and Cooper, G.F. (1998) A multivariate
discretisation method for learning Bayesian networks from mixed data, In:
Proceedings of the Conference on Uncertainty in Artificial Intelligence,
404-413. Stefano's doctoral dissertation, which will be completed soon,
contains much more detail.

------------------------------------------------------
<Original message:>
Hi,

I'm a PhD student looking for some advice about model selection of Bayes
Belief Networks.

I'm building BBNs using very small data sets (e.g. 20 records), and I'm
using the Bayesian Information Criterion (BIC) to measure the quality of
each network. The BIC is asymptotically equivalent to the minimum
description length (MDL) criterion and has two terms:

BIC = -log p(data|model) + 0.5 * log(N) * model_dimension

or, more simply: 
BIC = error + complexity_penalty

As the number of records (N) increases, the error measure grows faster than
the complexity penalty. This is because the error measure comes from the
*sum* of the log-likelihoods over all data points, whereas the complexity
penalty grows only with the *log* of the number of records, N. Thus for
large N, the error dominates and accurate models are selected.

For my 20-record data set, with around 4-10 variables, I typically get error
measures of 8-15 and model dimensions of 15-30. So the BIC becomes (with
natural logs):
e.g. BIC = 8 + 0.5*log(20)*15
         = 8 + 22.47
Thus the complexity penalty (22.47) is far larger than the error (8), so the
system always selects the simplest model possible, regardless of error.
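
[A quick Python check of the arithmetic and the scaling argument above; the
0.4 average error per record is an assumption extrapolated from the 8/20
figure, and the crossover point will shift with the model dimension:]

  import math

  def bic(neg_log_lik, n_records, dim):
      # error term + complexity penalty, natural logs throughout
      return neg_log_lik + 0.5 * math.log(n_records) * dim

  print(bic(8, 20, 15))   # 30.47; the penalty (22.47) dwarfs the error (8)

  # The error is a sum over N records, so it grows roughly linearly
  # in N, while the penalty grows only as log(N). Assuming the same
  # average error of 0.4 per record (8/20) and dimension 15:
  for n in (20, 100, 1000):
      print(n, 0.4 * n, 0.5 * math.log(n) * 15)
  # 20:   error   8.0 vs penalty 22.5 -> penalty dominates
  # 100:  error  40.0 vs penalty 34.5 -> error overtakes
  # 1000: error 400.0 vs penalty 51.8 -> error dominates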

So my questions are:
1) is the above understanding of BIC correct?
2) how can BIC be applied to very small data sets?
3) is there an alternative measure appropriate for such small sets?

Thanks in advance for any help/pointers/advice etc.

Regards,

David Corney
Dept. Computer Science
University College London
www.cs.ucl.ac.uk/staff/D.Corney
