Hello Dave,

You have both TikaParser and HtmlParser enabled. This probably means HtmlParser is never used and TikaParser always handles the page. In parse-plugins.xml you can tell Nutch which Parser implementation to choose based on MIME type. If you select HtmlParser for HTML and XHTML, Nutch should use HtmlParser instead.
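A minimal sketch of such a mapping in conf/parse-plugins.xml follows. It assumes the plugin ids and extension ids match the defaults shipped with Nutch 1.x, so please verify them against your own copy of the file before relying on it:

```xml
<parse-plugins>
  <!-- Assumed fallback: everything not listed below goes to Tika -->
  <mimeType name="*">
    <plugin id="parse-tika" />
  </mimeType>

  <!-- Route HTML and XHTML to HtmlParser (the parse-html plugin) -->
  <mimeType name="text/html">
    <plugin id="parse-html" />
  </mimeType>
  <mimeType name="application/xhtml+xml">
    <plugin id="parse-html" />
  </mimeType>

  <!-- Aliases map plugin ids to extension ids; the values here are assumed
       from the stock configuration and should be checked locally -->
  <aliases>
    <alias name="parse-html"
           extension-id="org.apache.nutch.parse.html.HtmlParser" />
    <alias name="parse-tika"
           extension-id="org.apache.nutch.parse.tika.TikaParser" />
  </aliases>
</parse-plugins>
```

With a mapping like this, pages served as text/html or application/xhtml+xml should go through parse-html, which is where the excludeNodes patch was installed.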
Regards,
Markus

-----Original message-----
> From: Dave Beckstrom <[email protected]>
> Sent: Wednesday 9th October 2019 22:10
> To: [email protected]
> Subject: Nutch excludeNodes Patch
>
> Hi Everyone!
>
> We are running Nutch 1.15.
>
> We are trying to implement the nutch-585-excludeNodes.patch described on:
> https://issues.apache.org/jira/browse/NUTCH-585
>
> It's acting like it's not running. We don't get an error when the crawl
> runs, no errors in the hadoop logs, it just doesn't exclude the content
> from the page.
>
> We installed it in the directory plugins>parse-html
>
> We added the following to our nutch-site.xml to exclude div id=sidebar
>
> <property>
> <name>parser.html.NodesToExclude</name>
> <value>div;id;sidebar</value>
> <description>
> A list of nodes whose content will not be indexed separated by "|". Use
> this to tell
> the HTML parser to ignore, for example, site navigation text.
> Each node has three elements: the first one is the tag name, the second
> one the
> attribute name, the third one the value of the attribute.
> Note that nodes with these attributes, and their children, will be
> silently ignored by the parser
> so verify the indexed content with Luke to confirm results.
> </description>
> </property>
>
> Here is our plugin.includes property from nutch-site.xml
>
> <property>
> <name>plugin.includes</name>
>
> <value>exchange-jexl|protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|indexer-solr|urlnormalizer-(pass|regex|basic)</value>
> <description> plugins
> </description>
> </property>
>
> One question I have is would having Tika configured in nutch-site.xml like
> the following cause any problems with the parse-html plugin not running?
>
> <property>
> <name>tika.extractor</name>
> <value>boilerpipe</value>
> <description>
> Which text extraction algorithm to use. Valid values are: boilerpipe or
> none.
> </description>
> </property>
> <!-- DMB added -->
> <property>
> <name>tika.extractor.boilerpipe.algorithm</name>
> <value>ArticleExtractor</value>
> <description>
> Which Boilerpipe algorithm to use. Valid values are: DefaultExtractor,
> ArticleExtractor
> or CanolaExtractor.
> </description>
> </property>
>
> We don't have a lot to go on to debug the issue. The plugin has logic to
> enable logging:
>
> if (LOG.isTraceEnabled())
> + LOG.trace("Stripping " + pNode.getNodeName() + "#" +
> idNode.getNodeValue());
>
> But nothing shows in the log files when we crawl. I
> updated log4j.properties setting these two values to TRACE thinking I had
> to enable trace before the logging would work:
>
> log4j.logger.org.apache.nutch.crawl.Crawl=TRACE,cmdstdout
> log4j.logger.org.apache.nutch.parse.html=TRACE,cmdstdout
>
> I reran the crawl and no logging occurred and of course the content we
> didn't want crawled and indexed is still showing up in SOLR.
>
> I could really use some help and suggestions!
>
> Thank you!
>
> Dave Beckstrom
>
> --
> *Fig Leaf Software is now Collective FLS, Inc.*
>
> *Collective FLS, Inc.*
>
> https://www.collectivefls.com/ <https://www.collectivefls.com/>

