Hi!

I have to crawl a couple of sites, each with a different layout. The problem
is that most of the crawled content is rubbish (page headers, menus, adverts,
footers, etc.).
How do you extract just the relevant content from such pages? Is it a good
approach to parse the HTML in my own plugin, take the content only from the
tags I want, and then index that content into a new field?
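
To make the idea concrete, here is a rough sketch of the kind of extraction I
mean. It is plain Python using only the standard library, not a real Nutch
plugin, and the names in it (WantedContentExtractor, extract_main_content, the
"content" id) are made up for illustration. It assumes the wanted text sits
inside a single element with a known id and that the markup closes its
non-void tags properly.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not change nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class WantedContentExtractor(HTMLParser):
    """Collects text only from inside the element with the wanted tag and id."""

    def __init__(self, wanted_tag, wanted_id):
        super().__init__()
        self.wanted_tag = wanted_tag  # hypothetical, e.g. "div"
        self.wanted_id = wanted_id    # hypothetical, e.g. "content"
        self.depth = 0                # > 0 while inside the wanted element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth > 0:
            # Already inside the wanted element: track nested tags.
            self.depth += 1
        elif tag == self.wanted_tag and dict(attrs).get("id") == self.wanted_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Keep text only while inside the wanted element.
        if self.depth > 0 and data.strip():
            self.chunks.append(data.strip())


def extract_main_content(html, wanted_tag="div", wanted_id="content"):
    """Return the text found inside the wanted element, joined by spaces."""
    parser = WantedContentExtractor(wanted_tag, wanted_id)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

The returned string is what I would then put into the new index field. Since
every site is different, the tag and id would have to be configured per site,
and pages that leave tags such as <p> unclosed would need a more forgiving
parser than this depth counter.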

Regards,
Jotta