See https://issues.apache.org/jira/browse/NUTCH-961
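
For the approach raised in the question quoted below (parsing the HTML yourself and keeping text only from the parts you want), here is a rough sketch of the idea. It uses only Python's standard library. It is not Nutch's plugin API and not what boilerpipe does internally. The names `SKIP_TAGS`, `MainTextExtractor` and `extract_main_text` are made up for illustration, and the list of tags treated as boilerplate is an assumption you would need to adjust per site.

```python
from html.parser import HTMLParser

# Assumption: these tags hold boilerplate (menus, headers, footers, adverts).
# Real sites often use <div> with site-specific ids/classes instead.
SKIP_TAGS = {"header", "nav", "footer", "aside", "script", "style"}


class MainTextExtractor(HTMLParser):
    """Collects text that is not nested inside any tag listed in SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while we are inside a skipped element
        self.chunks = []     # text fragments that were kept

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def extract_main_text(html):
    """Return the kept text fragments joined by single spaces."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


if __name__ == "__main__":
    page = (
        "<html><body><nav>Home | About</nav>"
        "<div><p>Real article text.</p></div>"
        "<footer>Copyright</footer></body></html>"
    )
    print(extract_main_text(page))
```

A fixed tag list like this only works when the sites mark up their boilerplate consistently. For sites that each differ, a statistical extractor such as boilerpipe is likely the less fragile choice.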

On 29 April 2011 15:46, Hannes Carl Meyer <[email protected]> wrote:

> Hi,
>
> have a look at boilerpipe http://code.google.com/p/boilerpipe/
>
> BR
>
> Hannes
>
> On Fri, Apr 29, 2011 at 11:26 AM, jotta <[email protected]> wrote:
>
> > Hi!
> >
> > I have to crawl a couple of sites (each of them is different). The
> > problem is that most of the crawled content is rubbish (page headers,
> > menus, adverts, footers etc.).
> > I want to ask how you get the right content out of this. Is it a good
> > approach to parse the HTML in my own plugin and take content only from
> > the HTML tags I want, and then index this content into a new field?
> >
> > Regards,
> > Jotta
> >
> > --
> > View this message in context:
> > http://lucene.472066.n3.nabble.com/Getting-content-from-crawling-site-s-tp2878602p2878602.html
> > Sent from the Nutch - User mailing list archive at Nabble.com.

--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com

