An extended version of the paper that Kathy refers to below (Friedman and
Goldszmidt, "Learning Bayesian networks with local structure", originally in
UAI-96) appeared in the recent collection edited by Jordan, "Learning in
Graphical Models".
Also, I would suggest that you check "Data Analysis with Bayesian Networks: A
Bootstrap Approach" by Nir Friedman, Moises Goldszmidt, and Abraham Wyner in
UAI-99, where we explore the bootstrap precisely for learning (at least) parts
of the network reliably when the data set is small.
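The idea, roughly: resample the data set with replacement, relearn the
structure on each replicate, and keep only those features (e.g. edges)
that appear in a large fraction of the replicates. A minimal sketch in
Python, where learn_structure is a placeholder for whatever
structure-learning routine you already use (assumed to return a set of
directed edges):

import random
from collections import Counter

def bootstrap_edge_confidence(records, learn_structure, n_boot=200, seed=0):
    rng = random.Random(seed)
    counts = Counter()
    n = len(records)
    for _ in range(n_boot):
        # resample the data set with replacement
        replicate = [records[rng.randrange(n)] for _ in range(n)]
        for edge in learn_structure(replicate):
            counts[edge] += 1
    # fraction of replicates in which each edge was recovered
    return {edge: c / n_boot for edge, c in counts.items()}

Edges recovered in nearly all replicates are the parts of the network the
data supports reliably; the rest are better treated as unresolved.
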
Moises
Kathryn Blackmond Laskey wrote:
> As Matthew Brand pointed out, there are fundamental limits on what any
> learner can do with very small data sets.
>
> But have you tried the Bayesian Dirichlet score? It gives exact posterior
> probabilities when the structure-specific prior knowledge is represented by
> the conjugate Dirichlet distribution. I've compared belief net learning
> with naive Bayes and a number of simple heuristic learning algorithms, on
> data sets with very small numbers of observations, and have found that
> Bayes net learning with the Bayesian Dirichlet score outperforms the other
> methods on holdout samples. It helps on very small data sets to use models
> that allow for local structure (see Friedman, UAI 97 I think, but don't
> have the ref handy).
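> 
> To make the Dirichlet score concrete: with complete discrete data and a
> uniform Dirichlet (BDeu-style) prior, the marginal likelihood of each
> node given its parents has a closed form. A rough sketch in Python (not
> any particular package's API; counts[j][k] is the number of records with
> parent configuration j and node state k, and alpha is the equivalent
> sample size):
> 
> from math import lgamma
> 
> def bdeu_family_score(counts, alpha=1.0):
>     q = len(counts)         # number of parent configurations
>     r = len(counts[0])      # number of node states
>     a_jk = alpha / (q * r)  # Dirichlet hyperparameter per cell
>     a_j = alpha / q         # per parent configuration
>     score = 0.0
>     for row in counts:
>         n_j = sum(row)
>         score += lgamma(a_j) - lgamma(a_j + n_j)
>         for n_jk in row:
>             score += lgamma(a_jk + n_jk) - lgamma(a_jk)
>     return score
> 
> The score of a whole network is the sum of these family terms (plus the
> log of any structure prior), so candidate structures can be compared
> directly without a separate complexity penalty.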
>
> Kathy Laskey
>
> At 5:56 PM +0100 8/17/99, Corney, David wrote:
> >Hi,
> >
> >I'm a PhD student looking for some advice about model selection of Bayes
> >Belief Networks.
> >
> >I'm building BBNs using very small data sets (e.g. 20 records), and I'm
> >using the Bayesian Information Criterion (BIC) to measure the quality of
> >each network. The BIC is equivalent to the minimum description length (MDL)
> >score and has two terms:
> >
> >BIC = -log p(data|model) + 0.5 * log(N) * model_dimension
> >
> >or, more simply:
> >BIC = error + complexity_penalty
> >
> >As the number of records (N) increases, the error measure grows faster
> >than the complexity penalty. This is because the error measure is the
> >(negated) *sum* of the log-likelihoods of the individual data points,
> >whereas the complexity penalty grows only with the *log* of the number
> >of records, N. Thus for large N, the error term dominates and accurate
> >models are selected.
> >
> >For my 20-record data set, with around 4-10 variables, I typically get error
> >measures of 8-15 and model dimensions of 20-30. So the BIC becomes:
> >e.g. BIC = 8 + 0.5*log(20)*15
> > = 8 + 22.47
> >Thus the complexity penalty (22.47) is far larger than the error (8), so the
> >system always selects the simplest model possible, regardless of error.
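> >
> >To make the arithmetic explicit (natural log, as in the MDL form above;
> >the figures are just the example numbers from my data):
> >
> >from math import log
> >
> >def bic(neg_log_lik, n_records, dim):
> >    # BIC as a cost: error term plus complexity penalty
> >    return neg_log_lik + 0.5 * log(n_records) * dim
> >
> >print(bic(8.0, 20, 15))   # 8 + 22.47 = 30.47: the penalty dominates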
> >
> >So my questions are:
> >1) is the above understanding of BIC correct?
> >2) how can BIC be applied to very small data sets?
> >3) is there an alternative measure appropriate for such small sets?
> >
> >Thanks in advance for any help/pointers/advice etc.
> >
> >Regards,
> >
> >David
> >
> >
> >
> >David Corney
> >Dept. Computer Science
> >University College London
> >www.cs.ucl.ac.uk/staff/D.Corney