Claudio,

In general, this is a reasonable approach; it amounts to building a language
model of some kind, which may or may not be based on HMMs.  A token trigram
model, for instance, would be easier to train than an HMM and would probably
give very comparable performance.

Also, it might be helpful to build two language models, one for boilerplate
and one for content.  That would also be amenable to partially supervised
training in which you mark up a small sample, tag a larger sample and train
on both for your final model.
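The two-model idea can be sketched by scoring a token run under each model
and keeping the higher likelihood.  The models and training tokens below are
made up for illustration (toy unigram models in place of the real trigram or
HMM models you would train on your marked-up samples):

```python
import math
from collections import Counter

def train_unigram(tokens):
    """Return an add-one-smoothed log-probability function for one class."""
    counts = Counter(tokens)
    total = sum(counts.values())
    vocab = len(counts)
    return lambda t: math.log((counts[t] + 1) / (total + vocab))

# Hypothetical training samples for each class.
boiler = train_unigram("copyright all rights reserved privacy policy terms".split())
content = train_unigram("the model predicts the next token in the sequence".split())

def classify(tokens):
    lb = sum(boiler(t) for t in tokens)
    lc = sum(content(t) for t in tokens)
    return "boilerplate" if lb > lc else "content"

print(classify("all rights reserved".split()))  # prints "boilerplate"
```

The partially supervised variant just trains each model first on the small
marked-up sample, tags the larger sample with the result, and retrains on
both.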

In any case, the Mahout software may be helpful to you, but we still lack an
HMM implementation.  One was proposed recently, but I haven't seen any
progress yet.

Where the Mahout project can help is with implementation advice for getting
you to a scalable implementation, especially if your efforts were to produce
a generally useful HMM implementation.

Also, if you look at the Boilerplate implementation, you will see that it is
actually somewhat related to your idea, except that it uses lines instead of
tokens.  The idea of a sequence appears in Boilerplate analogously to your
approach, but is somewhat fuzzier because of the size of the units being
analyzed.  You can read this two different ways: either Boilerplate confirms
your suspicions and indicates that your approach would work well, or it
confirms your suspicions and means you don't need your exact approach.
Either way, you should take it as confirmation.

On Fri, May 21, 2010 at 4:02 AM, Claudio Martella <
[email protected]> wrote:

> 4) take a sliding window of (N-1) tokens and try to predict the Nth
> token with the HMM.
> 5) calculate the error of prediction
> 6) analyze the error function (curve)
>
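The quoted steps 4-6 could be sketched as follows; `predict_prob` here is a
hypothetical stand-in for the trained HMM, and the error at each position is
the surprisal of the token being predicted:

```python
import math

def error_curve(tokens, n, predict_prob):
    """Slide a window of n-1 tokens over the sequence and record the
    negative log-probability of each predicted nth token.
    predict_prob(context, token) stands in for the HMM and must
    return P(token | context)."""
    errors = []
    for i in range(n - 1, len(tokens)):
        context = tuple(tokens[i - (n - 1):i])
        p = predict_prob(context, tokens[i])
        errors.append(-math.log(max(p, 1e-12)))  # guard against zero probability
    return errors

# Toy stand-in model: uniform over a 10-token vocabulary.
uniform = lambda ctx, tok: 0.1
errs = error_curve("a b c d e f".split(), 3, uniform)
# one error value per predicted position; spikes in the curve would mark
# boundaries between boilerplate and content
```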
