Hi Everyone!

We are running Nutch 1.15.

We are trying to apply the nutch-585-excludeNodes.patch described at:
https://issues.apache.org/jira/browse/NUTCH-585

It acts as though the patch isn't running: the crawl completes without
errors, the Hadoop logs show no errors, and the unwanted content is
simply not excluded from the pages.

We installed the patched plugin in the plugins/parse-html directory.
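
For context, applying the patch means rebuilding the plugin from the
1.15 source tree; roughly like this (a sketch only; the -p level
depends on how the paths in the patch file are rooted):

cd apache-nutch-1.15
patch -p0 < nutch-585-excludeNodes.patch
ant runtime    # rebuilds the plugins into runtime/local/plugins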

We added the following to our nutch-site.xml to exclude <div id="sidebar">:

<property>
  <name>parser.html.NodesToExclude</name>
  <value>div;id;sidebar</value>
  <description>
  A list of nodes whose content will not be indexed, separated by "|".
  Use this to tell the HTML parser to ignore, for example, site
  navigation text. Each node has three elements: the first is the tag
  name, the second the attribute name, and the third the attribute
  value. Note that nodes with these attributes, and their children,
  will be silently ignored by the parser, so verify the indexed content
  with Luke to confirm the results.
  </description>
</property>
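
As I read that description, excluding more than one node would use "|"
as the separator, something like this (the div;class;footer-nav entry
here is just an invented example):

<property>
  <name>parser.html.NodesToExclude</name>
  <value>div;id;sidebar|div;class;footer-nav</value>
</property>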

Here is our plugin.includes property from nutch-site.xml

 <property>
  <name>plugin.includes</name>
  <value>exchange-jexl|protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|indexer-solr|urlnormalizer-(pass|regex|basic)</value>
  <description>Enabled plugins.</description>
 </property>
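
One sanity check we can run: if I remember the log wording right,
hadoop.log lists the registered plugins when the crawl starts, so
grepping there should at least confirm parse-html is loaded:

grep -n "Registered Plugins" logs/hadoop.log
grep -n "parse-html" logs/hadoop.log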

One question I have: would having Tika configured in nutch-site.xml as
follows cause the parse-html plugin not to run?

<property>
  <name>tika.extractor</name>
  <value>boilerpipe</value>
  <description>
  Which text extraction algorithm to use. Valid values are: boilerpipe or
none.
  </description>
</property>
 <!-- DMB added -->
<property>
  <name>tika.extractor.boilerpipe.algorithm</name>
  <value>ArticleExtractor</value>
  <description>
  Which Boilerpipe algorithm to use. Valid values are: DefaultExtractor,
ArticleExtractor
  or CanolaExtractor.
  </description>
</property>
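
Related to that question: my understanding (which may be wrong) is
that when both parse-html and parse-tika are enabled, conf/parse-plugins.xml
decides which parser actually gets text/html, so if that file routes
HTML to Tika, the patched parse-html code would never see the pages.
Sketched from memory of the stock file (exact contents may differ),
the mapping we would want looks roughly like:

<parse-plugins>
  <!-- route HTML to the patched parse-html plugin -->
  <mimeType name="text/html">
    <plugin id="parse-html" />
  </mimeType>
  <!-- let everything else fall through to Tika -->
  <mimeType name="*">
    <plugin id="parse-tika" />
  </mimeType>
  <aliases>
    <alias name="parse-html"
           extension-id="org.apache.nutch.parse.html.HtmlParser" />
    <alias name="parse-tika"
           extension-id="org.apache.nutch.parse.tika.TikaParser" />
  </aliases>
</parse-plugins>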

We don't have much to go on for debugging. The patched code contains
trace logging:

if (LOG.isTraceEnabled()) {
  LOG.trace("Stripping " + pNode.getNodeName() + "#" + idNode.getNodeValue());
}

But nothing shows up in the log files when we crawl. I updated
log4j.properties, setting these two loggers to TRACE, thinking trace
had to be enabled before the logging would work:

 log4j.logger.org.apache.nutch.crawl.Crawl=TRACE,cmdstdout
 log4j.logger.org.apache.nutch.parse.html=TRACE,cmdstdout
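
While I was at it, I also enabled the Tika parser's logger, to see
whether HTML pages were being handed to it instead (assuming that
plugin logs under this package):

 log4j.logger.org.apache.nutch.parse.tika=TRACE,cmdstdout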

I reran the crawl: no logging appeared, and of course the content we
didn't want crawled and indexed is still showing up in Solr.
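
The next thing I plan to try is parsing a single page without running
a full crawl, using the parsechecker tool (the URL below is just a
placeholder):

bin/nutch parsechecker -dumpText http://www.example.com/page-with-sidebar.html

If the sidebar text still shows up in the dumped output, at least that
narrows the problem down to the parse step rather than indexing.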

I could really use some help and suggestions!

Thank you!

Dave Beckstrom

-- 
*Fig Leaf Software is now Collective FLS, Inc.*

*Collective FLS, Inc.*
https://www.collectivefls.com/
