Hi Mark,

You can use the Boilerpipe feature that comes with Tika, which tries to detect the main content (text) of the page and ignore the noise around it. Although current versions of Tika support this, Nutch doesn't expose a configuration option to enable it. You could apply the patch in [1]; it needs updating to work with the Nutch 1.9 source code, but that shouldn't be hard.

Another option is [2]. You'll also need to apply a patch, and then configure the property "parser.html.NodesToExclude" in your nutch-site.xml file, setting it to a list of nodes, separated by |, that will not be indexed. The JIRA issue describes the format of this setting.
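
For the second option, the entry in nutch-site.xml would look roughly like the sketch below. The property name is the one mentioned above; NODE_1 and NODE_2 are placeholders I made up, not real values, and each one has to be written in the format described in the JIRA issue [2].

```xml
<configuration>
  <!-- Only takes effect once the patch from NUTCH-585 has been applied. -->
  <property>
    <name>parser.html.NodesToExclude</name>
    <!-- NODE_1 and NODE_2 are hypothetical placeholders: replace each with a
         node description in the format given in NUTCH-585. Entries are
         separated by "|". -->
    <value>NODE_1|NODE_2</value>
  </property>
</configuration>
```

After changing the file, it is worth re-parsing a page or two and checking that the header and footer text is really gone from the parsed content.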

Regards,

[1] https://issues.apache.org/jira/browse/NUTCH-961
[2] https://issues.apache.org/jira/browse/NUTCH-585

----- Original Message -----
From: "Mark Wilson" <[email protected]>
To: [email protected]
Sent: Wednesday, June 3, 2015 11:43:46 AM
Subject: [MASSMAIL]Crawling pages but ignoring header and footer

Does anyone know of a way to crawl a website, but ignore headers and footers, or include just the main content of a site by say only including content in a <div class="main">, for example.

I have tried using https://github.com/BayanGroup/nutch-custom-search in Nutch 1.9 but I can't get it to work.

Any ideas greatly appreciated.

Thanks

Regards
Mark Wilson
[email protected]

