OK, this was a basic knowledge gap on my part. 

It looks like the parser.html.whitelist settings will not prevent sections of the HTML 
from being indexed into Solr. 

How can I accomplish my goal of preventing the header, footer and a few divs from 
being indexed into the content field of Solr? 

Kris 


----- Original Message -----

From: "KRIS MUSSHORN" <[email protected]> 
To: [email protected] 
Sent: Friday, September 30, 2016 11:54:40 AM 
Subject: Re: control order of operations 

Would a better option be to use this property? 

indexer.max.content.length = 32765 
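
For reference, a rough sketch of how that would be written in nutch-site.xml, assuming the usual property/name/value layout of the Nutch config files. The value is the one from the line above; the comment reflects my understanding of when the limit applies and is not something I have verified:

```xml
<!-- Sketch only: goes inside the <configuration> element of nutch-site.xml. -->
<!-- Assumption: this limit is applied at indexing time, to the parsed text, -->
<!-- rather than to the raw fetched page. -->
<property>
  <name>indexer.max.content.length</name>
  <value>32765</value>
</property>
```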

----- Original Message ----- 

From: "KRIS MUSSHORN" <[email protected]> 
To: [email protected] 
Sent: Friday, September 30, 2016 9:25:17 AM 
Subject: control order of operations 

I've got nutch-site.xml set to http.content.limit = 32765 (1 short of the Solr 
max). 

I also have parser.html.whitelist set to ignore a bunch of irrelevant tags. 

Can I set Nutch so that the whitelist applies before truncation? 

Kris 

