RE: control order of operations

Markus Jelsma Tue, 04 Oct 2016 09:38:12 -0700

To my knowledge, there is no such thing, and it would probably never work 
generically. If you want to prevent sections from being extracted, Nutch has 
support for Boilerpipe, an open source extractor. It has major drawbacks, but 
can work fine in some cases.
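
For reference, a minimal sketch of what enabling Boilerpipe might look like in 
nutch-site.xml. It assumes HTML is parsed by the parse-tika plugin (included in 
plugin.includes and mapped to text/html in parse-plugins.xml) instead of 
parse-html, since the Boilerpipe hook lives in the Tika parser. The algorithm 
value is only an illustration; check nutch-default.xml in your version for the 
exact properties and accepted values.

```xml
<!-- nutch-site.xml: illustrative sketch, not a tested configuration -->
<property>
  <name>tika.extractor</name>
  <!-- "none" is the default; "boilerpipe" turns the extractor on -->
  <value>boilerpipe</value>
</property>
<property>
  <name>tika.extractor.boilerpipe.algorithm</name>
  <!-- assumed choice for article-like pages; other algorithms exist -->
  <value>ArticleExtractor</value>
</property>
```

Boilerpipe decides heuristically what counts as boilerplate, so you cannot 
name the exact header, footer or divs to drop. Inspect the parsed text of a few 
pages before reindexing.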


M.

 
 
-----Original message-----
> From:KRIS MUSSHORN <[email protected]>
> Sent: Friday 30th September 2016 19:36
> To: [email protected]
> Subject: Re: control order of operations
> 
> Ok basic knowledge deficit. 
> 
> It looks like the parser.html.whitelist settings will not prevent sections of 
> the HTML from being indexed into Solr. 
> 
> How can I accomplish my goal of preventing the header, footer and a few divs 
> from being indexed into the content field of Solr? 
> 
> Kris 
> 
> 
> ----- Original Message -----
> 
> From: "KRIS MUSSHORN" <[email protected]> 
> To: [email protected] 
> Sent: Friday, September 30, 2016 11:54:40 AM 
> Subject: Re: control order of operations 
> 
> Would a better option be to use this property? 
> 
> indexer.max.content.length = 32765 
> 
> ----- Original Message ----- 
> 
> From: "KRIS MUSSHORN" <[email protected]> 
> To: [email protected] 
> Sent: Friday, September 30, 2016 9:25:17 AM 
> Subject: control order of operations 
> 
> I've got nutch-site.xml set to http.content.limit = 32765 (1 short of the Solr 
> max). 
> 
> I also have parser.html.whitelist set to ignore a bunch of irrelevant tags. 
> 
> Can I configure Nutch so that the whitelist applies before truncation? 
> 
> Kris 
> 
> 
>
