In nutch-site.xml I have http.content.limit set to 32765 (1 short of the Solr max).
I also have parser.html.whitelist set to ignore a bunch of irrelevant tags. Can I configure Nutch so that the whitelist is applied before the content is truncated?

Kris