Hi,

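Have a look at boilerpipe: http://code.google.com/p/boilerpipe/
It is a Java library that strips boilerplate (headers, menus, footers) from
HTML pages and keeps the main text.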
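
If you would rather go the route you describe and pick the tags yourself, the
idea can be sketched roughly as below. This is only an illustration in Python
using the standard library's html.parser; it is not boilerpipe's algorithm and
not a Nutch plugin. The names SKIP_TAGS, ContentExtractor and extract_content
are made up for the example, and the list of tags to skip is an assumption you
would have to tune per site.

```python
from html.parser import HTMLParser

# Assumption: these tags hold the "rubbish" on the crawled sites.
# Adjust per site; real pages often mark menus with <div class="..."> instead.
SKIP_TAGS = {"header", "nav", "footer", "aside", "script", "style"}


class ContentExtractor(HTMLParser):
    """Collects text that is not nested inside any tag listed in SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside at least one skipped tag
        self.chunks = []     # text fragments that are kept

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def extract_content(html):
    """Return the kept text of an HTML string, joined with single spaces."""
    parser = ContentExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


if __name__ == "__main__":
    page = (
        "<html><body><header>Site title</header>"
        "<nav><a href='/'>Home</a></nav>"
        "<div><p>Real article text.</p></div>"
        "<footer>Copyright</footer></body></html>"
    )
    print(extract_content(page))
```

The extracted string is what you would then put into your new index field.
A fixed tag list like this breaks as soon as a site changes its layout, which
is why a content-based extractor such as boilerpipe is usually less work when
every site is different.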

BR

Hannes

On Fri, Apr 29, 2011 at 11:26 AM, jotta <[email protected]> wrote:

> Hi!
>
> I have to crawl a couple of sites, each of them different. The problem is
> that most of the crawled content is rubbish (page headers, menus, adverts,
> footers, etc.).
> How do you extract the real content from such pages? Is it a good approach
> to parse the HTML in my own plugin, take content only from the tags I want,
> and then index that content into a new field?
>
> Regards,
> Jotta
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Getting-content-from-crawling-site-s-tp2878602p2878602.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>
