Markus,

Thank you so much for the reply!
I made the change to parse-plugins.xml and the plug-in is being called now.
That plug-in didn't work, so I changed to the blacklist-whitelist plug-in and
I've got it working thanks to your help!

Dave

On Wed, Oct 9, 2019 at 4:00 PM Markus Jelsma <[email protected]> wrote:

> Hello Dave,
>
> You have both TikaParser and HtmlParser enabled. This probably means you
> never use HtmlParser but always TikaParser. You can instruct Nutch via
> parse-plugins.xml which Parser implementation to choose based on MIME
> type. If you select HtmlParser for html and xhtml, Nutch should use
> HtmlParser instead.
>
> Regards,
> Markus
>
> -----Original message-----
> > From: Dave Beckstrom <[email protected]>
> > Sent: Wednesday 9th October 2019 22:10
> > To: [email protected]
> > Subject: Nutch excludeNodes Patch
> >
> > Hi Everyone!
> >
> > We are running Nutch 1.15.
> >
> > We are trying to implement the nutch-585-excludeNodes.patch described
> > on: https://issues.apache.org/jira/browse/NUTCH-585
> >
> > It's acting like it's not running. We don't get an error when the
> > crawl runs, and there are no errors in the Hadoop logs; it just
> > doesn't exclude the content from the page.
> >
> > We installed it in the directory plugins/parse-html.
> >
> > We added the following to our nutch-site.xml to exclude div id=sidebar:
> >
> > <property>
> >   <name>parser.html.NodesToExclude</name>
> >   <value>div;id;sidebar</value>
> >   <description>
> >     A list of nodes whose content will not be indexed, separated by
> >     "|". Use this to tell the HTML parser to ignore, for example, site
> >     navigation text. Each node has three elements: the first one is
> >     the tag name, the second one the attribute name, the third one the
> >     value of the attribute. Note that nodes with these attributes, and
> >     their children, will be silently ignored by the parser, so verify
> >     the indexed content with Luke to confirm results.
> >   </description>
> > </property>
> >
> > Here is our plugin.includes property from nutch-site.xml:
> >
> > <property>
> >   <name>plugin.includes</name>
> >   <value>exchange-jexl|protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|indexer-solr|urlnormalizer-(pass|regex|basic)</value>
> >   <description>plugins</description>
> > </property>
> >
> > One question I have is: would having Tika configured in nutch-site.xml
> > like the following cause any problems with the parse-html plugin not
> > running?
> >
> > <property>
> >   <name>tika.extractor</name>
> >   <value>boilerpipe</value>
> >   <description>
> >     Which text extraction algorithm to use. Valid values are:
> >     boilerpipe or none.
> >   </description>
> > </property>
> > <!-- DMB added -->
> > <property>
> >   <name>tika.extractor.boilerpipe.algorithm</name>
> >   <value>ArticleExtractor</value>
> >   <description>
> >     Which Boilerpipe algorithm to use. Valid values are:
> >     DefaultExtractor, ArticleExtractor or CanolaExtractor.
> >   </description>
> > </property>
> >
> > We don't have a lot to go on to debug the issue. The plugin has logic
> > to enable logging:
> >
> > if (LOG.isTraceEnabled())
> > +  LOG.trace("Stripping " + pNode.getNodeName() + "#" +
> >        idNode.getNodeValue());
> >
> > But nothing shows in the log files when we crawl. I updated
> > log4j.properties, setting these two values to TRACE, thinking I had to
> > enable trace before the logging would work:
> >
> > log4j.logger.org.apache.nutch.crawl.Crawl=TRACE,cmdstdout
> > log4j.logger.org.apache.nutch.parse.html=TRACE,cmdstdout
> >
> > I reran the crawl and no logging occurred, and of course the content
> > we didn't want crawled and indexed is still showing up in SOLR.
> >
> > I could really use some help and suggestions!
> >
> > Thank you!
> >
> > Dave Beckstrom
> >
> > --
> > Fig Leaf Software is now Collective FLS, Inc.
> > Collective FLS, Inc.
> > https://www.collectivefls.com/
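For reference, the parse-plugins.xml change Markus describes maps the HTML MIME types onto the parse-html plugin so TikaParser no longer claims them. A minimal sketch, modeled on the stock parse-plugins.xml that ships with Nutch (check the plugin id and extension id against your 1.15 install before relying on them):

```xml
<parse-plugins>
  <!-- Route html and xhtml pages to HtmlParser instead of TikaParser -->
  <mimeType name="text/html">
    <plugin id="parse-html" />
  </mimeType>
  <mimeType name="application/xhtml+xml">
    <plugin id="parse-html" />
  </mimeType>

  <!-- Map the plugin id used above to its extension implementation -->
  <aliases>
    <alias name="parse-html"
           extension-id="org.apache.nutch.parse.html.HtmlParser" />
  </aliases>
</parse-plugins>
```

With a mapping like this in place, html/xhtml pages are handed to parse-html, which is what gives the excludeNodes patch (or the blacklist-whitelist plugin) a chance to run.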

