Many thanks to everyone who responded to my recent plea regarding BBN complexity and likelihood measures (BIC et al.). Here is an attempt to summarise the main points of advice I received. At the end I've included my original question for reference. Thanks again.

David

------------------------------------------------------

Ross D. Shachter

I am sure that you will get more complete answers, but let me give you a simple one. How complex a model should you want to fit to a small data set? Personally, unless you have enough data to support complexity, go with the simplest model you can.

David Heckerman

The BIC is a large-data approximation and should not be used on such small data sets. If you have no missing data or hidden variables, you can use a closed-form Bayesian criterion described in, for example, "A Tutorial on Learning With Bayesian Networks", David Heckerman (MSR-TR-95-06). If you have missing data or hidden variables, you can use a Markov chain Monte Carlo approximation of the Bayesian criterion, also described in the above paper.

Matthew Brand

BIC is a large-sample approximation that arises in the study of the marginal likelihood and the deviance. It is known to select overly simple explanations of small data sets. More specifically, with very limited training data you should expect an ideal learner to infer a model that is simpler than the true data-generating mechanism, simply because the data does not encompass all the interesting behaviour the data-generating mechanism is capable of.

Kathryn Blackmond Laskey

It helps on very small data sets to use models that allow for local structure (Friedman and Goldszmidt, "Learning Bayesian networks with local structure", originally in UAI-96). But have you tried the Bayesian Dirichlet score? I've compared belief net learning with naive Bayes and a number of simple heuristic learning algorithms, on data sets with very small numbers of observations, and have found that Bayes net learning with the Bayesian Dirichlet score outperforms the other methods on holdout samples.

David L Dowe

Look out for a forthcoming issue of the Computer Journal, which is sort of meant to be a 30th anniversary issue of the seminal Wallace and Boulton 1968 MML paper. The issue will contain several articles, discussions and rejoinders, but, of course :-) , one of the ones I encourage you to read is: Wallace, C.S. and Dowe, D.L., Minimum Message Length and Kolmogorov Complexity, accepted, in press, to appear, Computer Journal.

Moises Goldszmidt

I would suggest that you check "Data Analysis with Bayesian Networks: A Bootstrap Approach", Nir Friedman, Moises Goldszmidt and Abraham Wyner, UAI-99, where we explore the bootstrap precisely for learning (at least) parts of the network reliably when the data is small.

Brad Venner

BIC is an ad hoc criterion outside the domain of probability theory and is not truly Bayesian. Recommends: Bretthorst, G. Larry (1996). "An Introduction To Model Selection Using Probability Theory As Logic". In: Maximum Entropy and Bayesian Methods, G. R. Heidbreder (ed.), Kluwer Academic Publishers, Dordrecht, the Netherlands; also available from http://bayes.wustl.edu/glb/bib.html.

Greg Cooper

Try what Kathy Laskey calls the "Dirichlet score". I suggest the following papers as good sources of background reading on the scoring metric:

Cooper, G.F. and Herskovits, E. (1992). A Bayesian method for the induction of probabilistic networks from data. Machine Learning 9, 309-347.

Heckerman, D., Geiger, D. and Chickering, D. (1995). Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning 20, 197-243.

Stefano Monti, with whom I work, has developed a Bayesian method for discretising continuous variables during the search over Bayesian network structures. See: Monti, S. and Cooper, G.F. (1998). A multivariate discretisation method for learning Bayesian networks from mixed data. In: Proceedings of the Conference on Uncertainty in Artificial Intelligence, 404-413. Stefano's doctoral dissertation, which will be completed soon, contains much more detail.
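Since several of the replies recommend the Bayesian Dirichlet score, here is a minimal Python sketch of the BDeu log marginal likelihood for a single discrete node with complete data, following the formulation in Heckerman, Geiger and Chickering (1995). This is an illustration only, not code from any of the replies; the function name and data layout are my own.

    from math import lgamma
    from itertools import product

    def bdeu_node_score(data, node, parents, arity, ess=1.0):
        # data   : list of dicts, variable name -> state index (complete data)
        # arity  : dict, variable name -> number of states
        # ess    : equivalent sample size of the Dirichlet prior
        r = arity[node]                      # states of the child node
        q = 1
        for p in parents:
            q *= arity[p]                    # number of parent configurations
        a_ijk = ess / (q * r)                # BDeu prior pseudo-counts
        a_ij = ess / q

        # N_ijk: how often each (parent configuration, child state) occurs
        counts = {}
        for record in data:
            key = (tuple(record[p] for p in parents), record[node])
            counts[key] = counts.get(key, 0) + 1

        score = 0.0
        for j in product(*(range(arity[p]) for p in parents)):
            n_ij = sum(counts.get((j, k), 0) for k in range(r))
            score += lgamma(a_ij) - lgamma(a_ij + n_ij)
            for k in range(r):
                score += lgamma(a_ijk + counts.get((j, k), 0)) - lgamma(a_ijk)
        return score

    # e.g. three records over binary variables, scoring B with parent A:
    data = [{"A": 0, "B": 0}, {"A": 0, "B": 1}, {"A": 1, "B": 1}]
    print(bdeu_node_score(data, "B", ["A"], {"A": 2, "B": 2}))

The score of a whole network is the sum of this quantity over all nodes given their parents in the candidate structure (larger is better). Unlike BIC it is an exact marginal likelihood rather than a large-sample approximation, so it remains meaningful on 20-record data sets.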
------------------------------------------------------

<Original message:>

Hi, I'm a PhD student looking for some advice about model selection for Bayesian belief networks. I'm building BBNs using very small data sets (e.g. 20 records), and I'm using the Bayesian Information Criterion (BIC) to measure the quality of each network. The BIC is equivalent to the minimum description length (MDL) and has two terms:

    BIC = -log p(data | model) + 0.5 * log(N) * model_dimension

(natural logarithms), or, more simply:

    BIC = error + complexity_penalty

As the number of records (N) increases, the error measure grows faster than the complexity penalty: the error is the *sum* of the log-likelihoods over all the data points, so it grows roughly linearly in N, whereas the complexity penalty grows only with the *log* of N. Thus for large N the error dominates and accurate models are selected.

For my 20-record data set, with around 4-10 variables, I typically get error measures of 8-15 and model dimensions of 20-30. So the BIC becomes, e.g.:

    BIC = 8 + 0.5 * log(20) * 15 = 8 + 22.47

Thus the complexity penalty (22.47) is far larger than the error (8), so the system always selects the simplest model possible, regardless of error.
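For concreteness, this is the whole calculation in code (a minimal Python sketch mirroring the formula above; the function name is my own):

    from math import log

    def bic(log_likelihood, n_records, dim):
        error = -log_likelihood               # negated sum of per-record log-likelihoods
        penalty = 0.5 * log(n_records) * dim  # grows only logarithmically in N
        return error + penalty

    # the 20-record example above: the penalty dwarfs the error term
    print(bic(log_likelihood=-8.0, n_records=20, dim=15))  # 8 + 22.47 = 30.47

At N = 20 each extra parameter costs 0.5 * log(20), about 1.5, in penalty, so even a handful of parameters quickly outweighs any plausible reduction in the error term.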
So my questions are:

1) is the above understanding of BIC correct?
2) how can BIC be applied to very small data sets?
3) is there an alternative measure appropriate for such small sets?

Thanks in advance for any help/pointers/advice etc.

Regards,
David Corney
Dept. Computer Science
University College London
www.cs.ucl.ac.uk/staff/D.Corney