Does anyone know of a way to crawl a website while ignoring headers and footers, or to include only the main content of each page, for example only the content inside a <div class="main">?
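To make the goal concrete, here is a rough sketch of the filtering I have in mind, written in plain Python with only the standard library. It is not Nutch code and not how any Nutch plugin works. It assumes reasonably well-formed markup and that the wanted content sits in a single <div class="main">. The names MainDivExtractor and extract_main_text are made up for illustration.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not change nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class MainDivExtractor(HTMLParser):
    """Collects text inside the first <div> carrying the target class (illustrative helper)."""

    def __init__(self, target_class="main"):
        super().__init__(convert_charrefs=True)
        self.target_class = target_class
        self.depth = 0       # > 0 while the parser is inside the target div
        self.done = False    # True once the target div has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth > 0:
            # Any nested element deepens the nesting inside the target div.
            self.depth += 1
        elif not self.done and tag == "div":
            classes = (dict(attrs).get("class") or "").split()
            if self.target_class in classes:
                self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        # Keep only non-blank text seen while inside the target div.
        if self.depth > 0 and data.strip():
            self.chunks.append(data.strip())


def extract_main_text(html, target_class="main"):
    """Return the text of the first <div class="..."> matching target_class."""
    parser = MainDivExtractor(target_class)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

Calling extract_main_text(page_html) on a page with header, main, and footer divs would return only the text from the main div. The depth counter relies on every non-void tag being closed, so pages with unclosed <p> or <li> tags would need a more forgiving parser, and any <script> or <style> text inside the div would also be picked up.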
I have tried using https://github.com/BayanGroup/nutch-custom-search in Nutch 1.9, but I can't get it to work. Any ideas greatly appreciated.

Thanks,
Regards,
Mark Wilson
[email protected]

