Claudio,

In general this is a reasonable approach; it amounts to building a language model of some kind, which may or may not be an HMM. A token trigram model, for instance, would be easier to train than an HMM and would probably give very comparable performance.
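To make the trigram suggestion concrete, here is a minimal sketch (not Mahout code, and all names here are my own invention) of a token trigram model with add-one smoothing:

```python
from collections import Counter

def train_trigram(tokens):
    """Count trigrams and their bigram contexts, and collect the vocabulary."""
    tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
    bi = Counter(zip(tokens, tokens[1:]))
    vocab = set(tokens)
    return tri, bi, vocab

def prob(tri, bi, vocab, w1, w2, w3):
    """P(w3 | w1, w2) with add-one (Laplace) smoothing over the vocabulary."""
    return (tri[(w1, w2, w3)] + 1) / (bi[(w1, w2)] + len(vocab))
```

Training is just counting, which is why this is so much cheaper than fitting an HMM; the smoothing keeps unseen trigrams from getting zero probability.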
Also, it might be helpful to build two language models, one for boilerplate and one for content. That would also be amenable to partially supervised training, in which you mark up a small sample by hand, use it to tag a larger sample, and train your final model on both.

In any case, Mahout software may be helpful to you, but we still lack an HMM implementation. One was proposed recently, but I haven't seen any progress yet. Where the Mahout project can help is with implementation advice for getting you to a scalable implementation, especially if your efforts were to produce a generally useful HMM implementation.

Also, if you look at the Boilerplate implementation, you will see that it is actually somewhat related to your idea, except that it uses lines instead of tokens. The idea of sequence appears in Boilerplate analogously to your approach, but it is somewhat fuzzier because of the size of the units being analyzed.

You can read this two different ways. One is that Boilerplate confirms your suspicions and indicates that your approach would work well; the other is that Boilerplate confirms your suspicions and that you don't need your exact approach. Either way, you should take it as confirmation.

On Fri, May 21, 2010 at 4:02 AM, Claudio Martella <[email protected]> wrote:
> 4) take a sliding window of (N-1) tokens and try to predict the Nth
> token with the HMM.
> 5) calculate the error of prediction
> 6) analize the error function (curve)
>
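The two-model idea combines naturally with the sliding-window steps you quoted: instead of measuring prediction error against one model, label each window by which model likes it better. A rough sketch under the trigram assumption (all function names are hypothetical, not part of Mahout):

```python
import math
from collections import Counter

def train(tokens):
    """Build trigram/bigram counts plus the vocabulary for one language model."""
    return (Counter(zip(tokens, tokens[1:], tokens[2:])),
            Counter(zip(tokens, tokens[1:])),
            set(tokens))

def logprob(model, w1, w2, w3):
    """log P(w3 | w1, w2) with add-one smoothing."""
    tri, bi, vocab = model
    return math.log((tri[(w1, w2, w3)] + 1) / (bi[(w1, w2)] + len(vocab)))

def classify_windows(tokens, content_lm, boiler_lm):
    """Slide a 3-token window and label each predicted token by which
    model assigns it the higher conditional log-likelihood."""
    labels = []
    for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
        c = logprob(content_lm, w1, w2, w3)
        b = logprob(boiler_lm, w1, w2, w3)
        labels.append("content" if c >= b else "boilerplate")
    return labels
```

In practice you would smooth the per-token decisions over runs of tokens (or lines, as Boilerplate does) rather than switching labels at every window, but the log-likelihood comparison is the core of it.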
