Hi, have a look at boilerpipe: http://code.google.com/p/boilerpipe/
BR
Hannes

On Fri, Apr 29, 2011 at 11:26 AM, jotta <[email protected]> wrote:
> Hi!
>
> I have to crawl a couple of sites (each of them is different). The problem
> is that most of the crawled content is rubbish (page headers, menus,
> adverts, footers, etc.).
> I want to ask how you get the right content out of this. Is it a good
> approach to parse the HTML in my own plugin and take content only from the
> HTML tags I want, and then index this content into a new field?
>
> Regards,
> Jotta
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Getting-content-from-crawling-site-s-tp2878602p2878602.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
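
For illustration, the tag-based approach asked about in the quoted message (keep text only from the tags you want, drop headers, menus and footers) could look roughly like the sketch below. It is neither boilerpipe nor a Nutch plugin; it uses only Python's standard html.parser, and the SKIP_TAGS set and the names MainTextExtractor and extract_main_text are made up for this example. Real sites often mark boilerplate with site-specific div classes or ids rather than semantic tags, so the list would need tuning per site.

```python
from html.parser import HTMLParser

# Tags whose contents are treated as boilerplate.
# This set is an assumption for the sketch; adjust it per site.
SKIP_TAGS = {"header", "nav", "footer", "aside", "script", "style"}


class MainTextExtractor(HTMLParser):
    """Collects text that is not nested inside any of SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []     # text fragments that were kept

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def extract_main_text(html):
    """Return the kept text fragments joined by single spaces."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


if __name__ == "__main__":
    page = (
        "<html><body><nav>Home | About</nav>"
        "<div id='content'><p>Real article text.</p></div>"
        "<footer>Copyright</footer></body></html>"
    )
    print(extract_main_text(page))
```

The extracted string is what would then be indexed into the separate field. A fixed tag list like this is brittle across differently structured sites, which is the case a heuristic extractor such as boilerpipe is meant to handle.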

