Hi Everyone!
We are running Nutch 1.15 and are trying to implement the nutch-585-excludeNodes.patch described at https://issues.apache.org/jira/browse/NUTCH-585. It's acting like it's not running: the crawl completes with no errors, nothing appears in the Hadoop logs, but the content is not excluded from the page. We installed the patch in the plugins/parse-html directory.

We added the following to our nutch-site.xml to exclude div id=sidebar:

<property>
  <name>parser.html.NodesToExclude</name>
  <value>div;id;sidebar</value>
  <description>
    A list of nodes whose content will not be indexed, separated by "|".
    Use this to tell the HTML parser to ignore, for example, site navigation
    text. Each node has three elements: the first is the tag name, the second
    the attribute name, and the third the value of the attribute. Note that
    nodes with these attributes, and their children, will be silently ignored
    by the parser, so verify the indexed content with Luke to confirm results.
  </description>
</property>

Here is our plugin.includes property from nutch-site.xml:

<property>
  <name>plugin.includes</name>
  <value>exchange-jexl|protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|indexer-solr|urlnormalizer-(pass|regex|basic)</value>
  <description>plugins</description>
</property>

One question I have: would having Tika configured in nutch-site.xml as follows cause any problems with the parse-html plugin not running?

<property>
  <name>tika.extractor</name>
  <value>boilerpipe</value>
  <description>
    Which text extraction algorithm to use. Valid values are: boilerpipe or none.
  </description>
</property>

<!-- DMB added -->
<property>
  <name>tika.extractor.boilerpipe.algorithm</name>
  <value>ArticleExtractor</value>
  <description>
    Which Boilerpipe algorithm to use. Valid values are: DefaultExtractor,
    ArticleExtractor or CanolaExtractor.
  </description>
</property>

We don't have a lot to go on to debug the issue. The plugin has logic to emit trace logging:

if (LOG.isTraceEnabled())
  LOG.trace("Stripping " + pNode.getNodeName() + "#" + idNode.getNodeValue());

But nothing shows in the log files when we crawl. I updated log4j.properties, setting these two loggers to TRACE, thinking I had to enable trace before the logging would work:

log4j.logger.org.apache.nutch.crawl.Crawl=TRACE,cmdstdout
log4j.logger.org.apache.nutch.parse.html=TRACE,cmdstdout

I reran the crawl and no logging occurred, and of course the content we didn't want crawled and indexed is still showing up in Solr.

I could really use some help and suggestions! Thank you!

Dave Beckstrom

--
Fig Leaf Software is now Collective FLS, Inc.

Collective FLS, Inc.
https://www.collectivefls.com/
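P.S. After writing this up, it occurred to me that the real question may be which plugin actually handles the text/html content type. If I understand the docs correctly, conf/parse-plugins.xml decides that, and if parse-tika is mapped ahead of parse-html for text/html, the NUTCH-585 code would never run even though parse-html is in plugin.includes. I think the mapping would need to look something like the fragment below (my guess from reading the default config, not yet tested on our install):

```xml
<!-- conf/parse-plugins.xml (fragment): make parse-html the first choice
     for text/html so the NUTCH-585 exclude logic has a chance to run. -->
<parse-plugins>
  <mimeType name="text/html">
    <plugin id="parse-html" />
  </mimeType>

  <!-- everything else can still fall through to Tika -->
  <mimeType name="*">
    <plugin id="parse-tika" />
  </mimeType>

  <!-- aliases map each plugin id to its implementing extension class -->
  <aliases>
    <alias name="parse-html"
           extension-id="org.apache.nutch.parse.html.HtmlParser" />
    <alias name="parse-tika"
           extension-id="org.apache.nutch.parse.tika.TikaParser" />
  </aliases>
</parse-plugins>
```

Relatedly, as far as I can tell the tika.extractor/boilerpipe properties only affect parse-tika, so they shouldn't break parse-html by themselves; but if Tika is the plugin actually parsing our HTML pages, that would explain why parser.html.NodesToExclude has no effect and no trace logging appears.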
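P.P.S. One more idea I plan to try: testing the parse step in isolation instead of running a full crawl and re-indexing into Solr each time. If I've read the usage right, the parsechecker tool fetches and parses a single URL and can dump the extracted text, which should show directly whether the sidebar content is being stripped (the URL below is just a placeholder for one of our pages):

  bin/nutch parsechecker -dumpText https://www.example.com/some-page

It should also report which parser plugin handled the page, which would confirm or rule out the Tika question above.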

