Getting content from crawling site's

jotta Fri, 29 Apr 2011 07:11:29 -0700

Hi!

I have to crawl a couple of sites (each of them is different). The problem is
that most of the crawled content is rubbish (page headers, menus, adverts,
footers, etc.).
How do you extract only the relevant content from such pages? Is it a good
approach to parse the HTML in my own plugin, take content only from the HTML
tags I want, and then index that content into a new field?
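
To make the question concrete, here is a rough sketch of the kind of tag
filtering I mean. It is not Nutch plugin code: it uses only the JDK's DOM
parser on a small well-formed sample page, and the class name, the "content"
id and the sample markup are all made up for illustration. As far as I
understand, a Nutch HTML parse filter is handed the page as a DOM tree, so
the same kind of walk should apply there, but that is an assumption on my side.

```java
import java.io.StringReader;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class MainContentExtractor {
    // Hypothetical ids of the elements whose text should be kept;
    // every site would need its own list.
    private static final Set<String> WANTED_IDS = Set.of("content");

    // Walks the DOM and appends text nodes that sit inside a wanted element.
    static void collect(Node node, boolean inside, StringBuilder out) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && WANTED_IDS.contains(((Element) node).getAttribute("id"))) {
            inside = true;
        }
        if (inside && node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) {
                if (out.length() > 0) {
                    out.append(' ');
                }
                out.append(text);
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collect(child, inside, out);
        }
    }

    // Assumes well-formed markup; real pages would need a tolerant HTML parser.
    static String extract(String xhtml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        StringBuilder out = new StringBuilder();
        collect(doc.getDocumentElement(), false, out);
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body>"
                + "<div id=\"header\">Site name</div>"
                + "<div id=\"menu\">Home | About</div>"
                + "<div id=\"content\"><h1>Article title</h1><p>Article body.</p></div>"
                + "<div id=\"footer\">Copyright</div>"
                + "</body></html>";
        System.out.println(extract(page)); // prints: Article title Article body.
    }
}
```

The extracted string is what I would then want to put into the new index
field instead of the full page text.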


Regards,
Jotta

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Getting-content-from-crawling-site-s-tp2878602p2878602.html
Sent from the Nutch - User mailing list archive at Nabble.com.
