RE: control order of operations

Kris Musshorn Sun, 02 Oct 2016 09:14:02 -0700

Blackice suggested using an existing plugin or building a custom one.
That’s not going to work for me.
Any other solutions?

Why does the parser.html.whitelist setting not work?

Kris

-----Original Message-----
From: KRIS MUSSHORN [mailto:[email protected]] 
Sent: Friday, September 30, 2016 1:36 PM
To: [email protected]
Subject: Re: control order of operations

OK, this is a basic knowledge deficit on my part.

It looks like the parser.html.whitelist settings will not prevent sections of
the HTML from being indexed into Solr.

How can I accomplish my goal of preventing the header, footer and a few divs
from being indexed into the content field of Solr?

Kris 

----- Original Message -----

From: "KRIS MUSSHORN" <[email protected]> 
To: [email protected] 
Sent: Friday, September 30, 2016 11:54:40 AM 
Subject: Re: control order of operations 

Would a better option be to use this property?

indexer.max.content.length = 32765 

----- Original Message ----- 

From: "KRIS MUSSHORN" <[email protected]> 
To: [email protected] 
Sent: Friday, September 30, 2016 9:25:17 AM 
Subject: control order of operations 

In nutch-site.xml I've set http.content.limit = 32765 (1 short of the Solr
max).

I also have parser.html.whitelist set to ignore a bunch of irrelevant tags.

Can I configure Nutch so that the whitelist is applied before truncation?

Kris
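
To make the ordering question concrete, here is a minimal Python sketch (not Nutch code) of why it matters whether truncation or tag filtering runs first. The names `SKIPPED_TAGS`, `extract_text`, `truncate_then_filter` and `filter_then_truncate` are hypothetical, the tag list is only an example, and the limit is counted in characters for simplicity, whereas http.content.limit is, as far as I understand it, a byte limit applied when the page is fetched.

```python
from html.parser import HTMLParser

# Hypothetical set of elements to drop; not the actual list from this thread.
SKIPPED_TAGS = {"header", "footer"}


class TextExtractor(HTMLParser):
    """Collects text that is not nested inside any skipped element."""

    def __init__(self, skipped_tags):
        super().__init__()
        self.skipped_tags = skipped_tags
        self.depth = 0  # number of skipped elements currently open
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.skipped_tags:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.skipped_tags and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside the skipped elements.
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html, skipped_tags=SKIPPED_TAGS):
    """Return the page text with the skipped elements removed."""
    parser = TextExtractor(skipped_tags)
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return " ".join(parser.chunks)


def truncate_then_filter(html, limit):
    """Cut the raw markup first, then strip the unwanted elements."""
    return extract_text(html[:limit])


def filter_then_truncate(html, limit):
    """Strip the unwanted elements first, then cut the remaining text."""
    return extract_text(html)[:limit]


if __name__ == "__main__":
    page = "<header>" + "nav " * 20 + "</header><p>body text</p><footer>legal</footer>"
    print(repr(truncate_then_filter(page, 40)))
    print(repr(filter_then_truncate(page, 40)))
```

With a long header, cutting the raw markup first can spend the whole limit on content that is thrown away afterwards, while filtering first keeps the limit for the text that is actually wanted.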
