Hi everyone.

I wonder if anyone can help me.

I am crawling our site with Nutch 1.9, and would like to parse the main 
content of each page while excluding the header, navbar and footer.

The reason is that when the pages are posted to Solr, the content field 
starts with the same text for every page, and a query for text that appears in 
the navbar, for instance, matches every page.

Is there any way of configuring Nutch to do this?

Kind Regards

Mark

