Thanks BlackIce, but I can't use a plugin that hasn't been maintained in a year
in my production environment.

-----Original Message-----
From: BlackIce [mailto:[email protected]] 
Sent: Friday, September 30, 2016 2:42 PM
To: [email protected]
Subject: Re: control order of operations

Try these. I don't remember which one I used, and I don't have access to my
setup right now. (There used to be a whitelist/blacklist plugin, but I can't
seem to find it on Google at the moment.)

https://github.com/BayanGroup/nutch-custom-search

On Sep 30, 2016 7:35 PM, "KRIS MUSSHORN" <[email protected]> wrote:

OK, basic knowledge deficit.

Looks like the parser.html.whitelist settings will not prevent sections of the
HTML from being indexed into Solr.

How can I accomplish my goal of preventing the header, footer, and a few divs
from being indexed into the content field of Solr?

Kris


----- Original Message -----

From: "KRIS MUSSHORN" <[email protected]>
To: [email protected]
Sent: Friday, September 30, 2016 11:54:40 AM
Subject: Re: control order of operations

Would a better option be to use this property?

indexer.max.content.length = 32765
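
For reference, a sketch of how that property might be written in
nutch-site.xml. The value is the one from this message; the comment paraphrases
the description in nutch-default.xml and is worth verifying against the Nutch
version in use.

```xml
<!-- nutch-site.xml fragment; belongs inside the <configuration> element -->
<property>
  <name>indexer.max.content.length</name>
  <!-- Maximum number of characters of content that are indexed;
       content beyond the limit is truncated at indexing time. -->
  <value>32765</value>
</property>
```

Note that this limit appears to count characters of the extracted text, whereas
http.content.limit counts bytes of the raw download, so the two are not
interchangeable one for one.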

----- Original Message -----

From: "KRIS MUSSHORN" <[email protected]>
To: [email protected]
Sent: Friday, September 30, 2016 9:25:17 AM
Subject: control order of operations

I've got nutch-site.xml set to http.content.limit = 32765 (1 short of the Solr
max).
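
For reference, a sketch of what that setting presumably looks like in
nutch-site.xml. The value is taken from this message; the comment paraphrases
nutch-default.xml and should be checked against the Nutch version in use.

```xml
<!-- nutch-site.xml fragment; belongs inside the <configuration> element -->
<property>
  <name>http.content.limit</name>
  <!-- Length limit, in bytes, for content downloaded over http://;
       longer content is truncated at fetch time, before parsing. -->
  <value>32765</value>
</property>
```

Because this limit applies to the raw download, any tag filtering done at parse
time would presumably only ever see HTML that has already been truncated.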

I also have parser.html.whitelist set to ignore a bunch of irrelevant tags.

Can I set Nutch so that the whitelist applies before truncation?

Kris
