RE: Nutch excludeNodes Patch

Markus Jelsma Wed, 09 Oct 2019 14:00:07 -0700

Hello Dave,

You have both TikaParser and HtmlParser enabled. This probably means
HtmlParser is never used and TikaParser always handles the page, so the
patched code in parse-html never runs. You can tell Nutch via
parse-plugins.xml which Parser implementation to choose based on MIME type.
If you select HtmlParser for HTML and XHTML, Nutch should use HtmlParser
instead.
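
As a rough sketch of what that mapping could look like in
conf/parse-plugins.xml (the MIME types and alias entries below are how I
would expect them to look in a stock 1.x install, so please check them
against your own copy rather than pasting this in as-is):

```xml
<parse-plugins>
  <!-- Assumption: keep Tika as the fallback for everything else -->
  <mimeType name="*">
    <plugin id="parse-tika" />
  </mimeType>

  <!-- Route HTML and XHTML to parse-html so the patched HtmlParser runs -->
  <mimeType name="text/html">
    <plugin id="parse-html" />
  </mimeType>
  <mimeType name="application/xhtml+xml">
    <plugin id="parse-html" />
  </mimeType>

  <!-- Aliases map the plugin ids used above to the parser extension ids -->
  <aliases>
    <alias name="parse-html"
           extension-id="org.apache.nutch.parse.html.HtmlParser" />
    <alias name="parse-tika"
           extension-id="org.apache.nutch.parse.tika.TikaParser" />
  </aliases>
</parse-plugins>
```

With a mapping like this, parse-html is the first choice for HTML pages
and parse-tika is only consulted for the remaining content types.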


Regards,
Markus
 
-----Original message-----
> From:Dave Beckstrom <[email protected]>
> Sent: Wednesday 9th October 2019 22:10
> To: [email protected]
> Subject: Nutch excludeNodes Patch
> 
> Hi Everyone!
> 
> 
> We are running Nutch 1.15.
> 
> We are trying to implement the nutch-585-excludeNodes.patch described on:
> https://issues.apache.org/jira/browse/NUTCH-585
> 
> It's acting like it's not running.  We don't get an error when the crawl
> runs, no errors in the hadoop logs, it just doesn't exclude the content
> from the page.
> 
> We installed it in the directory plugins>parse-html
> 
> We added the following to our nutch-site.xml to exclude div id=sidebar
> 
> <property>
>   <name>parser.html.NodesToExclude</name>
>   <value>div;id;sidebar</value>
>   <description>
>   A list of nodes whose content will not be indexed separated by "|".  Use this to tell
>   the HTML parser to ignore, for example, site navigation text.
>   Each node has three elements: the first one is the tag name, the second one the
>   attribute name, the third one the value of the attribute.
>   Note that nodes with these attributes, and their children, will be silently ignored by the parser
>   so verify the indexed content with Luke to confirm results.
>   </description>
> </property>
> 
> Here is our plugin.includes property from nutch-site.xml
> 
>  <property>
>   <name>plugin.includes</name>
> 
> <value>exchange-jexl|protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|indexer-solr|urlnormalizer-(pass|regex|basic)</value>
>   <description> plugins
>   </description>
>  </property>
> 
> One question I have: would having Tika configured in nutch-site.xml like
> the following keep the parse-html plugin from running?
> 
> <property>
>   <name>tika.extractor</name>
>   <value>boilerpipe</value>
>   <description>
>   Which text extraction algorithm to use. Valid values are: boilerpipe or none.
>   </description>
> </property>
>  <!-- DMB added -->
> <property>
>   <name>tika.extractor.boilerpipe.algorithm</name>
>   <value>ArticleExtractor</value>
>   <description>
>   Which Boilerpipe algorithm to use. Valid values are: DefaultExtractor, ArticleExtractor
>   or CanolaExtractor.
>   </description>
> </property>
> 
> We don't have a lot to go on to debug the issue.  The plugin has logic to
> enable logging:
> 
> if (LOG.isTraceEnabled())
> +        LOG.trace("Stripping " + pNode.getNodeName() + "#" + idNode.getNodeValue());
> 
> But nothing shows in the log files when we crawl. I
> updated log4j.properties setting these two values to TRACE thinking I had
> to enable trace before the logging would work:
> 
>  log4j.logger.org.apache.nutch.crawl.Crawl=TRACE,cmdstdout
>  log4j.logger.org.apache.nutch.parse.html=TRACE,cmdstdout
> 
> I reran the crawl and no logging occurred, and of course the content we
> didn't want crawled and indexed is still showing up in Solr.
> 
> I could really use some help and suggestions!
> 
> Thank you!
> 
> Dave Beckstrom
> 
> -- 
> *Fig Leaf Software is now Collective FLS, Inc.*
> *Collective FLS, Inc.* 
> 
> https://www.collectivefls.com/
> 
> 
> 
>
