Hi Claudio,

Have you checked BoilerPipe? It should be integrated in Tika soon (
https://issues.apache.org/jira/browse/TIKA-420) and hence usable from Nutch,
which you use if I am not mistaken.

Of course this is also something that could be done in Mahout; maybe someone
on the list can comment.

HTH

Julien
-- 
DigitalPebble Ltd
http://www.digitalpebble.com

On 21 May 2010 12:02, Claudio Martella <[email protected]> wrote:

> Hello list,
>
>
> I'm trying to write a content extraction tool for web crawlers. The idea
> is to automatically separate content from navigation.
> My approach is:
>
>
> 1) get N-grams from somewhere (extract them from my corpus or use some
> dataset like Google's)
> 2) train an HMM on these N-grams
> 3) take a webpage, tokenize it
> 4) take a sliding window of (N-1) tokens and try to predict the Nth
> token with the HMM
> 5) calculate the error of prediction
> 6) analyze the error function (curve)
>
>
> The idea behind this is that blocks of content should have a low error,
> which should grow steadily as the sliding window moves into
> navigational blocks and drop again towards new content blocks. By
> analyzing these curves I should be able to find which block is which.
> (This is what I believe spinn3r is doing:
> http://spinn3r.com/content-extract)
> The algorithms behind these steps are not particularly innovative, so
> I'd like to ask the list what code I can re-use from Mahout or any
> MR/Hadoop attempt they're aware of.
>
>
> Thanks in advance,
>
> Claudio
>
> --
> Claudio Martella
> Digital Technologies
> Unit Research & Development - Analyst
>
> TIS innovation park
> Via Siemens 19 | Siemensstr. 19
> 39100 Bolzano | 39100 Bozen
> Tel. +39 0471 068 123
> Fax  +39 0471 068 129
> [email protected] http://www.tis.bz.it
>
>
>
>
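
For reference, steps 1-6 of the quoted message could be sketched roughly as
below. This is only a sketch under stated assumptions, not what spinn3r or
BoilerPipe actually do: a plain count-based N-gram predictor stands in for
the HMM, the prediction error is taken to be the surprisal (-log p) of each
token, and the names train_ngrams, error_curve and smooth, as well as the
floor and width parameters, are made up for illustration.

```python
import math
from collections import Counter, defaultdict


def train_ngrams(tokens, n=3):
    """Count how often each token follows each (n-1)-token context."""
    model = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        model[context][tokens[i + n - 1]] += 1
    return model


def error_curve(model, tokens, n=3, floor=1e-6):
    """Surprisal (-log p) of each token given the previous n-1 tokens.

    `floor` is an assumed minimum probability for unseen events, so the
    error stays finite when the model has never seen a context or token.
    """
    errors = []
    for i in range(n - 1, len(tokens)):
        context = tuple(tokens[i - n + 1:i])
        counts = model.get(context)  # .get() avoids creating empty entries
        if counts:
            p = counts[tokens[i]] / sum(counts.values())
        else:
            p = 0.0
        errors.append(-math.log(max(p, floor)))
    return errors


def smooth(values, width=5):
    """Moving average, so a single odd token does not dominate the curve."""
    out = []
    for i in range(len(values)):
        lo = max(0, i - width // 2)
        hi = min(len(values), i + width // 2 + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out


if __name__ == "__main__":
    # Toy data: the "page" starts with corpus-like text, then a menu.
    corpus = "the cat sat on the mat and the cat sat on the rug".split()
    page = "the cat sat on the mat home about contact login".split()
    model = train_ngrams(corpus, n=3)
    curve = smooth(error_curve(model, page, n=3), width=3)
    print([round(e, 2) for e in curve])
```

On real pages the corpus would have to be far larger, and the threshold
separating "content" from "navigation" on the smoothed curve would need to
be tuned on labelled examples.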
