OK, this was a basic knowledge gap on my part. It looks like the parser.html.whitelist setting will not prevent sections of the HTML from being indexed into Solr.
How can I accomplish my goal of preventing the header, footer, and a few divs from being indexed into the content field of Solr?

Kris

----- Original Message -----
From: "KRIS MUSSHORN" <[email protected]>
To: [email protected]
Sent: Friday, September 30, 2016 11:54:40 AM
Subject: Re: control order of operations

Would a better option be to use this property?

indexer.max.content.length = 32765

----- Original Message -----
From: "KRIS MUSSHORN" <[email protected]>
To: [email protected]
Sent: Friday, September 30, 2016 9:25:17 AM
Subject: control order of operations

I've got nutch-site.xml set to http.content.limit = 32765 (1 short of the Solr max). I also have parser.html.whitelist set to ignore a bunch of irrelevant tags.

Can I set Nutch so that the whitelist applies before truncation?

Kris
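
For reference, the two limits mentioned in this thread could be written in nutch-site.xml roughly as follows. This is only a sketch. The values are the ones quoted in the messages above, and the two properties are shown together for comparison, not because the thread says they were combined. The parser.html.whitelist entry is left out because the thread never lists the tags involved.

```xml
<!-- nutch-site.xml (sketch; values taken from the messages above) -->
<configuration>
  <!-- Limit applied when the page is downloaded, i.e. before parsing (in bytes) -->
  <property>
    <name>http.content.limit</name>
    <value>32765</value>
  </property>
  <!-- Alternative raised in the follow-up: limit applied at indexing time (in characters) -->
  <property>
    <name>indexer.max.content.length</name>
    <value>32765</value>
  </property>
</configuration>
```

Because http.content.limit acts on the raw download, it cuts the page before any tag filtering could run, whereas indexer.max.content.length acts on the already-parsed text. Note that one counts bytes and the other characters, so the same number may not give the same headroom against the Solr limit.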

