Blackice suggested using an existing plugin or building a custom plugin. That’s not going to work for me. Are there any other solutions?
Why does the whitelist html not work?

Kris

-----Original Message-----
From: KRIS MUSSHORN [mailto:[email protected]]
Sent: Friday, September 30, 2016 1:36 PM
To: [email protected]
Subject: Re: control order of operations

Ok, basic knowledge deficit. Looks like the parser.html.whitelist settings will not prevent sections of the HTML from being indexed into Solr.

How can I accomplish my goal of preventing the header, footer and a few divs from being indexed into the content field of Solr?

Kris

----- Original Message -----
From: "KRIS MUSSHORN" <[email protected]>
To: [email protected]
Sent: Friday, September 30, 2016 11:54:40 AM
Subject: Re: control order of operations

Would a better option be to use this property?

indexer.max.content.length = 32765

----- Original Message -----
From: "KRIS MUSSHORN" <[email protected]>
To: [email protected]
Sent: Friday, September 30, 2016 9:25:17 AM
Subject: control order of operations

I've got nutch-site.xml set to http.content.limit = 32765 (1 short of the Solr max). I also have parser.html.whitelist set to ignore a bunch of irrelevant tags.

Can I set Nutch so that the whitelist applies before truncation?

Kris
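To make the order of operations being asked about concrete, here is a minimal standalone sketch in Python. It is not Nutch code and does not use any Nutch API; the names `SectionStripper` and `clean_and_truncate` and the `sidebar` id are hypothetical. It only illustrates the intended behaviour: drop the header, footer and selected divs first, then truncate what is left to the 32765 limit quoted above. It assumes reasonably well-formed HTML, since an unclosed tag inside a skipped section would throw off the depth counter.

```python
from html.parser import HTMLParser

MAX_CONTENT_LENGTH = 32765  # limit quoted in the thread

# Elements that never have a closing tag; they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class SectionStripper(HTMLParser):
    """Collects page text while skipping unwanted sections (hypothetical helper)."""

    def __init__(self, skip_tags, skip_ids):
        super().__init__(convert_charrefs=True)
        self.skip_tags = set(skip_tags)
        self.skip_ids = set(skip_ids)
        self.skip_depth = 0   # > 0 while inside a skipped section
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            # Nested element inside a skipped section: track its depth.
            self.skip_depth += 1
        elif tag in self.skip_tags or dict(attrs).get("id") in self.skip_ids:
            self.skip_depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def clean_and_truncate(html, skip_tags=("header", "footer"),
                       skip_ids=("sidebar",), limit=MAX_CONTENT_LENGTH):
    """Strip unwanted sections first, then truncate the remaining text."""
    parser = SectionStripper(skip_tags, skip_ids)
    parser.feed(html)
    parser.close()
    text = " ".join(" ".join(parser.parts).split())  # collapse whitespace
    return text[:limit]


if __name__ == "__main__":
    page = ("<html><body><header>Site nav</header>"
            "<div id='sidebar'>Ads</div>"
            "<div id='main'><p>Real content<br>here</p></div>"
            "<footer>Copyright</footer></body></html>")
    print(clean_and_truncate(page))
```

Because the stripping happens before the slice, a very long header cannot push the real content past the limit, which is the effect the original question was after.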

