Hi Claudio,

Have you checked BoilerPipe? It should be integrated into Tika soon (https://issues.apache.org/jira/browse/TIKA-420) and hence usable from Nutch, which you use if I am not mistaken.
Of course this is also something that could be done in Mahout; maybe someone on the list can comment.

HTH

Julien
--
DigitalPebble Ltd
http://www.digitalpebble.com

On 21 May 2010 12:02, Claudio Martella <[email protected]> wrote:

> Hello list,
>
> I'm trying to write a content extraction tool for web crawlers. The idea
> is to automatically separate content from navigation.
> My attempt is:
>
> 1) get N-grams from somewhere (extract them from my corpus or use some
> dataset like Google's)
> 2) train an HMM on these N-grams.
> 3) take a webpage, tokenize it
> 4) take a sliding window of (N-1) tokens and try to predict the Nth
> token with the HMM.
> 5) calculate the error of prediction
> 6) analyze the error function (curve)
>
> The idea behind this is that blocks of content should have a low error,
> which should grow steadily as you move the sliding window into
> navigational blocks and fall again towards new content blocks. By
> analyzing these curves I should be able to find which block is which.
> (This is what I believe spinn3r is doing:
> http://spinn3r.com/content-extract)
> The algorithms behind these steps are not particularly innovative, so
> I'd like to ask the list what code I can re-use out of Mahout or any
> MR/Hadoop attempt they're aware of.
>
> Thanks in advance,
>
> Claudio
>
> --
> Claudio Martella
> Digital Technologies
> Unit Research & Development - Analyst
>
> TIS innovation park
> Via Siemens 19 | Siemensstr. 19
> 39100 Bolzano | 39100 Bozen
> Tel. +39 0471 068 123
> Fax +39 0471 068 129
> [email protected] http://www.tis.bz.it
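
As a rough illustration of steps 3) to 6) quoted above, here is a minimal Python sketch of the sliding-window error curve. It is not Mahout, BoilerPipe or spinn3r code, and it deliberately swaps the HMM for a plain N-gram count model with add-one smoothing, which is easier to show in a few lines. The function names (train_ngrams, error_curve) and the toy corpus and page are made up for the example.

```python
import math
from collections import Counter, defaultdict


def train_ngrams(tokens, n=3):
    """Count which token follows each (n-1)-token context in the training text."""
    counts = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        counts[context][tokens[i + n - 1]] += 1
    return counts


def error_curve(tokens, counts, vocab_size, n=3):
    """Slide a window of n-1 tokens over the page and score the nth token.

    The error is the surprisal -log P(token | context) under add-one
    smoothing, so unseen continuations get a high but finite error.
    """
    errors = []
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        following = counts.get(context, Counter())
        seen = following[tokens[i + n - 1]]
        p = (seen + 1) / (sum(following.values()) + vocab_size)
        errors.append(-math.log(p))
    return errors


# Toy data (assumption): a tiny "content" corpus and a page whose second
# half looks like navigation links.
corpus = "the cat sat on the mat and the dog sat on the rug".split()
page = "the cat sat on the mat home about contact login".split()

model = train_ngrams(corpus)
curve = error_curve(page, model, vocab_size=len(set(corpus)))

# One error value per window position; compare the two halves of the page.
half = len(curve) // 2
content_error = sum(curve[:half]) / half
navigation_error = sum(curve[half:]) / (len(curve) - half)
print(curve)
print(content_error, navigation_error)
```

On real pages the raw curve would be noisy, so some smoothing (for example a moving average) before looking for the rises and falls is probably needed; a real HMM or a much larger N-gram set would replace the toy model here.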
