RE: control order of operations

Kris Musshorn Sun, 02 Oct 2016 08:42:06 -0700

Any other options for this issue?

-----Original Message-----
From: BlackIce [mailto:[email protected]] 
Sent: Saturday, October 1, 2016 2:11 AM
To: [email protected]
Subject: RE: control order of operations


Then make your own :)

On Sep 30, 2016 11:13 PM, "Kris Musshorn" <[email protected]> wrote:

> Thanks, BlackIce, but I can't use a plugin that hasn't been maintained in
> a year in my production environment.
>
> -----Original Message-----
> From: BlackIce [mailto:[email protected]]
> Sent: Friday, September 30, 2016 2:42 PM
> To: [email protected]
> Subject: Re: control order of operations
>
> Try these; I don't remember which I used and don't have access to my
> setup right now. (There used to be a whitelist/blacklist plugin, but I
> can't seem to find it on Google right now.)
>
> https://github.com/BayanGroup/nutch-custom-search
>
> On Sep 30, 2016 7:35 PM, "KRIS MUSSHORN" <[email protected]> wrote:
>
> OK, basic knowledge deficit on my part.
>
> It looks like the parser.html.whitelist settings will not prevent sections
> of the HTML from being indexed into Solr.
>
> How can I accomplish my goal of preventing the header, footer and a few
> divs from being indexed into the content field of Solr?
>
> Kris
>
>
> ----- Original Message -----
>
> From: "KRIS MUSSHORN" <[email protected]>
> To: [email protected]
> Sent: Friday, September 30, 2016 11:54:40 AM
> Subject: Re: control order of operations
>
> Would a better option be to use this property?
>
> indexer.max.content.length = 32765
>
> ----- Original Message -----
>
> From: "KRIS MUSSHORN" <[email protected]>
> To: [email protected]
> Sent: Friday, September 30, 2016 9:25:17 AM
> Subject: control order of operations
>
> I've got nutch-site.xml set to http.content.limit = 32765 (1 short of
> the Solr max).
>
> I also have parser.html.whitelist set to ignore a bunch of irrelevant tags.
>
> Can I configure Nutch so that the whitelist is applied before truncation?
>
> Kris
>
>
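
The ordering asked about in the original question (filter the markup first, truncate afterwards) can be illustrated with a small standalone sketch. This is not Nutch code and does not use any Nutch or Solr API; it only uses Python's standard html.parser module. The class name, function name, and the "ads" div id are hypothetical, chosen for illustration. The 32765 limit is the value quoted in the thread.

```python
from html.parser import HTMLParser

MAX_CONTENT_LENGTH = 32765  # limit quoted in the thread (assumed to be in characters here)


class SectionStrippingExtractor(HTMLParser):
    """Collects visible text while skipping unwanted sections (hypothetical helper)."""

    def __init__(self, skip_tags, skip_div_ids):
        super().__init__()
        self.skip_tags = set(skip_tags)
        self.skip_div_ids = set(skip_div_ids)
        self._skip_tag = None  # name of the tag that opened the skipped section
        self._depth = 0        # nesting depth of that same tag name
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if self._skip_tag is not None:
            # Already skipping: only track nesting of the tag that started it.
            if tag == self._skip_tag:
                self._depth += 1
            return
        div_id = dict(attrs).get("id")
        if tag in self.skip_tags or (tag == "div" and div_id in self.skip_div_ids):
            self._skip_tag = tag
            self._depth = 1

    def handle_endtag(self, tag):
        if self._skip_tag is not None and tag == self._skip_tag:
            self._depth -= 1
            if self._depth == 0:
                self._skip_tag = None  # left the skipped section

    def handle_data(self, data):
        # Keep text only when we are outside every skipped section.
        if self._skip_tag is None and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def filter_then_truncate(html, skip_tags=("header", "footer"),
                         skip_div_ids=(), limit=MAX_CONTENT_LENGTH):
    """Strip unwanted sections first, then truncate the remaining text."""
    extractor = SectionStrippingExtractor(skip_tags, skip_div_ids)
    extractor.feed(html)
    extractor.close()
    return extractor.text()[:limit]


if __name__ == "__main__":
    page = ('<html><body><header>Site nav</header>'
            '<div id="ads"><div>Buy</div> now</div>'
            '<p>Real <b>content</b> here</p>'
            '<footer>Copyright</footer></body></html>')
    print(filter_then_truncate(page, skip_div_ids=("ads",)))
```

Because the truncation is applied to the already filtered text, header and footer material never uses up any of the length budget. Truncating the raw fetched bytes first would give the opposite result: boilerplate near the top of the page could crowd out the real content before any filter runs.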
