Ted Dunning wrote:
> Claudio,
>
> In general, this approach is a reasonable one and can be said to be based on
> building a language model of some kind, which might or might not be based on
> HMMs. A token trigram model, for instance, would be easier to train than
> an HMM and would probably give very comparable performance.

Hi Ted, thanks for your detailed feedback. Yes, a token trigram model would be
reasonable. Do you have any reference for a MapReduce token trigram model?
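To make the discussion concrete, here is a minimal single-machine sketch of a token trigram model with add-one smoothing (hypothetical code, not Mahout; the class and method names are my own). A MapReduce version would emit (trigram, 1) pairs in the map phase and sum them in the reduce phase.

```python
from collections import defaultdict

class TrigramModel:
    """Token trigram language model with add-one (Laplace) smoothing.
    Single-machine sketch; counts would come from a MapReduce job at scale."""

    def __init__(self):
        self.trigram_counts = defaultdict(int)
        self.bigram_counts = defaultdict(int)
        self.vocab = set()

    def train(self, tokens):
        # Pad with start/end markers so the first real token has a history.
        padded = ["<s>", "<s>"] + tokens + ["</s>"]
        self.vocab.update(padded)
        for i in range(len(padded) - 2):
            self.bigram_counts[(padded[i], padded[i + 1])] += 1
            self.trigram_counts[(padded[i], padded[i + 1], padded[i + 2])] += 1

    def prob(self, w1, w2, w3):
        # P(w3 | w1, w2) with add-one smoothing over the vocabulary.
        num = self.trigram_counts[(w1, w2, w3)] + 1
        den = self.bigram_counts[(w1, w2)] + len(self.vocab)
        return num / den
```

Add-one smoothing is the simplest choice; a real system would likely use Kneser-Ney or similar, but the training/lookup structure stays the same.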
> Also, it might be helpful to build two language models, one for boilerplate
> and one for content. That would also be amenable to partially supervised
> training in which you mark up a small sample, tag a larger sample and train
> on both for your final model.

The idea of having two language models is actually pretty smart. The problem I
see is finding a reasonable corpus for the boilerplate (which is not a problem
for the content).

> In any case, Mahout software may be helpful to you but we still lack an HMM
> implementation. It was proposed recently, but I haven't seen any progress
> yet.
>
> Where the Mahout project can help is with implementation advice for getting
> you to a scalable implementation, especially if your efforts were to produce
> a generally useful HMM implementation.

I'm reconsidering my architecture. The trained model wouldn't be used in a
distributed way (who would want to use a whole cluster to extract content from
a single page?), meaning that one page should be manageable by a single node
(a BIG HMM would mean I'd be trying to extract text from a HUGE corpus all at
once). The scalability would come in at the n-gram generation (which works on
the huge training corpus) and the model training (which works on the huge set
of generated n-grams). The constraint would be that the final model is usable
by a single node (which works on a tiny page), or by a workstation outside the
cluster. Does that make sense?

> Also, if you look at the Boilerplate implementation, you will see that it is
> actually somewhat related to your idea except that it uses lines instead of
> tokens. The idea of sequence appears in Boilerplate analogously to your
> approach, but is somewhat fuzzier because of the size of the units being
> analyzed. You can read this two different ways.
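(As an aside on the two-language-model idea quoted above: the classification step could be sketched as scoring a token span under both models and keeping the higher log-likelihood. This is hypothetical illustration code, not Mahout; unigram models keep it short, and trigram models would plug into `logprob` the same way.)

```python
import math
from collections import defaultdict

class UnigramLM:
    """Unigram language model with add-one smoothing.
    Unigrams for brevity; a trigram model would expose the same interface."""

    def __init__(self):
        self.counts = defaultdict(int)
        self.total = 0
        self.vocab = set()

    def train(self, tokens):
        for t in tokens:
            self.counts[t] += 1
            self.total += 1
            self.vocab.add(t)

    def logprob(self, tokens, vocab_size):
        # Log-likelihood of the span under this model, add-one smoothed.
        return sum(
            math.log((self.counts[t] + 1) / (self.total + vocab_size))
            for t in tokens
        )

def classify(tokens, content_lm, boiler_lm):
    """Label a token span by whichever model assigns higher likelihood."""
    vocab_size = len(content_lm.vocab | boiler_lm.vocab)
    if content_lm.logprob(tokens, vocab_size) >= boiler_lm.logprob(tokens, vocab_size):
        return "content"
    return "boilerplate"
```

The partially supervised scheme Ted describes would fit here: seed both models from the small marked-up sample, tag the larger sample with `classify`, then retrain on both.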
> One way is to say that Boilerplate confirms your suspicions and indicates
> that your approach would work well, and the other way is to say that
> Boilerplate confirms your suspicions and that you don't need to do your
> exact approach. Either way, you should take it as confirmation, though.

That makes me very happy. Also, the paper, in the related work section,
reports some similar approaches with high accuracy. I don't understand the
line-by-line approach, though. Wouldn't a window-based approach be more
general?

> On Fri, May 21, 2010 at 4:02 AM, Claudio Martella <
> [email protected]> wrote:
>
>> 4) take a sliding window of (N-1) tokens and try to predict the Nth
>> token with the HMM.
>> 5) calculate the error of prediction
>> 6) analyze the error function (curve)

--
Claudio Martella
Digital Technologies Unit
Research & Development - Analyst

TIS innovation park
Via Siemens 19 | Siemensstr. 19
39100 Bolzano | 39100 Bozen
Tel. +39 0471 068 123
Fax +39 0471 068 129
[email protected] http://www.tis.bz.it
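P.S. Steps 4-6 quoted above could be prototyped on a workstation before scaling up, roughly like this (a hypothetical sketch: a bigram next-token predictor stands in for the HMM, i.e. a window of N-1 = 1, and `train_bigram`/`error_curve` are names I made up):

```python
from collections import defaultdict

def train_bigram(tokens):
    """Count successors of each token so we can predict the next one.
    Window of N-1 = 1 for brevity; the proposal uses larger windows."""
    nxt = defaultdict(lambda: defaultdict(int))
    for a, b in zip(tokens, tokens[1:]):
        nxt[a][b] += 1
    return nxt

def error_curve(model, tokens):
    """Steps 4-6: slide over the page, predict each token from its
    predecessor, record 0 (hit) or 1 (miss) per position. Runs of low
    error suggest the region resembles the training corpus; the curve
    would then be analyzed for transitions."""
    errors = []
    for a, b in zip(tokens, tokens[1:]):
        candidates = model.get(a)
        predicted = max(candidates, key=candidates.get) if candidates else None
        errors.append(0 if predicted == b else 1)
    return errors
```

In practice one would smooth the 0/1 curve (e.g. a moving average) before looking for the content/boilerplate transitions in step 6.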
