Hi! I have to crawl a couple of sites, each with a different layout. The problem is that most of the crawled content is rubbish (page headers, menus, adverts, footers, etc.). How do you extract just the relevant content from such pages? Is it a good approach to parse the HTML in my own plugin, take content only from the HTML tags I want, and then index that content into a new field?
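To make the idea concrete, here is a rough sketch of the kind of tag-based filtering I have in mind. It is plain Python using only the standard library's html.parser, not Nutch plugin code, and the tag sets (KEEP_TAGS, SKIP_TAGS) and the function name extract_main_content are assumptions made up for illustration; real sites would each need their own rules.

```python
from html.parser import HTMLParser

# Hypothetical choices for illustration: tags assumed to wrap the real
# content, and tags assumed to hold page chrome that should be dropped.
KEEP_TAGS = {"article"}
SKIP_TAGS = {"nav", "header", "footer", "aside", "script", "style"}


class MainContentExtractor(HTMLParser):
    """Collects text inside KEEP_TAGS, ignoring anything under SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self.keep_depth = 0  # how many KEEP_TAGS elements are currently open
        self.skip_depth = 0  # how many SKIP_TAGS elements are currently open
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in KEEP_TAGS:
            self.keep_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag in KEEP_TAGS and self.keep_depth > 0:
            self.keep_depth -= 1

    def handle_data(self, data):
        # Keep text only when inside a wanted element and outside page chrome.
        if self.keep_depth > 0 and self.skip_depth == 0:
            text = data.strip()
            if text:
                self.chunks.append(text)


def extract_main_content(html):
    """Return the text of the wanted elements as one space-joined string."""
    parser = MainContentExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

The returned string would then be what goes into the new index field (the field name itself, say "main_content", is also just a placeholder here).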
Regards,
Jotta

--
View this message in context: http://lucene.472066.n3.nabble.com/Getting-content-from-crawling-site-s-tp2878602p2878602.html
Sent from the Nutch - User mailing list archive at Nabble.com.

