Re: control order of operations

Comcast Sat, 01 Oct 2016 10:37:52 -0700

Someday


> On Oct 1, 2016, at 2:11 AM, BlackIce <[email protected]> wrote:
> 
> Then make your own :)
> 
>> On Sep 30, 2016 11:13 PM, "Kris Musshorn" <[email protected]> wrote:
>> 
>> Thanks BlackIce, but I can't use a plugin that hasn't been maintained in a
>> year in my production environment.
>> 
>> -----Original Message-----
>> From: BlackIce [mailto:[email protected]]
>> Sent: Friday, September 30, 2016 2:42 PM
>> To: [email protected]
>> Subject: Re: control order of operations
>> 
>> Try these. I don't remember which I used, and I don't have access to my
>> setup right now (there used to be a whitelist/blacklist plugin, but I can't
>> seem to find it on Google right now).
>> 
>> https://github.com/BayanGroup/nutch-custom-search
>> 
>> On Sep 30, 2016 7:35 PM, "KRIS MUSSHORN" <[email protected]> wrote:
>> 
>> OK, basic knowledge deficit on my part.
>> 
>> It looks like the parser.html.whitelist settings will not prevent sections
>> of the HTML from being indexed into Solr.
>> 
>> How can I accomplish my goal of preventing the header, the footer, and a
>> few divs from being indexed into the content field of Solr?
>> 
>> Kris
>> 
>> 
>> ----- Original Message -----
>> 
>> From: "KRIS MUSSHORN" <[email protected]>
>> To: [email protected]
>> Sent: Friday, September 30, 2016 11:54:40 AM
>> Subject: Re: control order of operations
>> 
>> Would a better option be to use this property?
>> 
>> indexer.max.content.length = 32765
>> 
>> ----- Original Message -----
>> 
>> From: "KRIS MUSSHORN" <[email protected]>
>> To: [email protected]
>> Sent: Friday, September 30, 2016 9:25:17 AM
>> Subject: control order of operations
>> 
>> I've got http.content.limit = 32765 set in nutch-site.xml (1 short of
>> the Solr max).
>> 
>> I also have parser.html.whitelist set to ignore a bunch of irrelevant tags.
>> 
>> Can I configure Nutch so that the whitelist is applied before truncation?
>> 
>> Kris
>> 
>>
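
To illustrate the ordering asked about in this thread (strip the unwanted markup first, truncate afterwards), here is a minimal Python sketch. It is not Nutch code and uses no Nutch API: the class `BoilerplateStripper`, the function `clean_then_truncate`, and the `ads` div id are hypothetical names chosen for the example. Only the 32765 limit comes from the thread. It assumes reasonably well-formed HTML, since unclosed tags inside a skipped region would throw off the nesting count.

```python
from html.parser import HTMLParser

# Void elements never get an end tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class BoilerplateStripper(HTMLParser):
    """Collects page text while skipping selected elements (hypothetical helper)."""

    def __init__(self, skip_tags=("header", "footer", "script", "style"),
                 skip_div_ids=()):
        super().__init__(convert_charrefs=True)
        self.skip_tags = set(skip_tags)
        self.skip_div_ids = set(skip_div_ids)
        self._skip_depth = 0   # > 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._skip_depth:
            # Already inside a skipped element: track nesting only.
            self._skip_depth += 1
            return
        is_skipped_div = tag == "div" and dict(attrs).get("id") in self.skip_div_ids
        if tag in self.skip_tags or is_skipped_div:
            self._skip_depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Collapse runs of whitespace into single spaces.
        return " ".join(" ".join(self._chunks).split())


def clean_then_truncate(html, limit=32765, skip_div_ids=()):
    """Remove unwanted sections first, then cut the remaining text to `limit` characters."""
    parser = BoilerplateStripper(skip_div_ids=skip_div_ids)
    parser.feed(html)
    parser.close()
    return parser.text()[:limit]


if __name__ == "__main__":
    page = ('<html><body><header>Site nav</header>'
            '<div id="ads">Buy <b>now</b><br></div>'
            '<p>Real content</p><footer>Copyright</footer></body></html>')
    print(clean_then_truncate(page, skip_div_ids={"ads"}))
```

Because the limit is applied to the cleaned text rather than to the raw page, header and footer markup does not use up any of the 32765 characters.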
