Any other options for this issue?

-----Original Message-----
From: BlackIce [mailto:[email protected]]
Sent: Saturday, October 1, 2016 2:11 AM
To: [email protected]
Subject: RE: control order of operations

Then make your own :)

On Sep 30, 2016 11:13 PM, "Kris Musshorn" <[email protected]> wrote:

> Thanks BlackIce, but I can't use a plugin that's not been maintained in
> a year in my production environment.
>
> -----Original Message-----
> From: BlackIce [mailto:[email protected]]
> Sent: Friday, September 30, 2016 2:42 PM
> To: [email protected]
> Subject: Re: control order of operations
>
> Try these. I don't remember which I used and don't have access to my
> setup right now (there used to be a whitelist/blacklist plugin, but I
> don't seem to be able to find it on Google right now).
>
> https://github.com/BayanGroup/nutch-custom-search
>
> On Sep 30, 2016 7:35 PM, "KRIS MUSSHORN" <[email protected]> wrote:
>
> OK, basic knowledge deficit.
>
> Looks like the parser.html.whitelist settings will not prevent sections of
> the HTML from being indexed into Solr.
>
> How can I accomplish my goal of preventing the header, footer and a few
> divs from being indexed into the content field of Solr?
>
> Kris
>
> ----- Original Message -----
>
> From: "KRIS MUSSHORN" <[email protected]>
> To: [email protected]
> Sent: Friday, September 30, 2016 11:54:40 AM
> Subject: Re: control order of operations
>
> Would a better option be to use this property?
>
> indexer.max.content.length = 32765
>
> ----- Original Message -----
>
> From: "KRIS MUSSHORN" <[email protected]>
> To: [email protected]
> Sent: Friday, September 30, 2016 9:25:17 AM
> Subject: control order of operations
>
> I've got nutch-site.xml set to http.content.limit = 32765 (1 short of
> the Solr max).
>
> I also have parser.html.whitelist set to ignore a bunch of irrelevant tags.
>
> Can I set Nutch so that the whitelist applies before truncation?
>
> Kris
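
For reference, the order of operations asked about in the first message (drop the unwanted sections first, truncate second) can be sketched outside Nutch. The following is a minimal standalone Python illustration using only the standard library. It is not Nutch code or a Nutch plugin, and every name in it (`EXCLUDED_TAGS`, `EXCLUDED_DIV_IDS`, `ContentExtractor`, `extract_then_truncate`) is hypothetical. The only value taken from the thread is the 32765 limit. It assumes reasonably well-formed HTML in which non-void tags are closed.

```python
from html.parser import HTMLParser

# Hypothetical settings for illustration only; these are NOT Nutch properties.
EXCLUDED_TAGS = {"header", "footer", "script", "style"}
EXCLUDED_DIV_IDS = {"sidebar", "nav"}  # assumed ids of the divs to drop
MAX_CONTENT_LENGTH = 32765             # the limit quoted in the thread

# Void elements have no closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ContentExtractor(HTMLParser):
    """Collects text, skipping everything inside excluded elements."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # > 0 while inside an excluded element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._skip_depth:
            # Already skipping: track nesting so we know when the region ends.
            self._skip_depth += 1
        elif tag in EXCLUDED_TAGS or (
            tag == "div" and dict(attrs).get("id") in EXCLUDED_DIV_IDS
        ):
            self._skip_depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Collapse runs of whitespace into single spaces.
        return " ".join(" ".join(self._chunks).split())


def extract_then_truncate(html, limit=MAX_CONTENT_LENGTH):
    """Filter unwanted sections first, then truncate the remaining text."""
    parser = ContentExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()[:limit]
```

The point of the sketch is the ordering: if truncation ran on the raw page first, header and navigation markup could use up the budget before any real content was reached. Note that the slice counts characters, not bytes; if the limit is meant in bytes, the text would need to be encoded before it is cut.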

