Markus,

Thank you so much for the reply!
I made the change to parse-plugins.xml and the plug-in is being called now.
That plug-in didn't work, so I changed to the blacklist-whitelist plug-in and
I've got it working thanks to your help!

Dave

On Wed, Oct 9, 2019 at 4:00 PM Markus Jelsma <[email protected]> wrote:

> Hello Dave,
>
> You have both TikaParser and HtmlParser enabled. This probably means you
> never use HtmlParser but always TikaParser. You can instruct Nutch via
> parse-plugins.xml which Parser implementation to choose based on MIME
> type. If you select HtmlParser for html and xhtml, Nutch should use
> HtmlParser instead.
>
> Regards,
> Markus
>
> -----Original message-----
> > From: Dave Beckstrom <[email protected]>
> > Sent: Wednesday 9th October 2019 22:10
> > To: [email protected]
> > Subject: Nutch excludeNodes Patch
> >
> > Hi Everyone!
> >
> > We are running Nutch 1.15.
> >
> > We are trying to implement the nutch-585-excludeNodes.patch described
> > on: https://issues.apache.org/jira/browse/NUTCH-585
> >
> > It's acting like it's not running. We don't get an error when the
> > crawl runs, and there are no errors in the Hadoop logs; it just
> > doesn't exclude the content from the page.
> >
> > We installed it in the directory plugins/parse-html.
> >
> > We added the following to our nutch-site.xml to exclude div id=sidebar:
> >
> > <property>
> >   <name>parser.html.NodesToExclude</name>
> >   <value>div;id;sidebar</value>
> >   <description>
> >     A list of nodes whose content will not be indexed, separated by
> >     "|". Use this to tell the HTML parser to ignore, for example, site
> >     navigation text. Each node has three elements: the first one is
> >     the tag name, the second one the attribute name, the third one the
> >     value of the attribute. Note that nodes with these attributes, and
> >     their children, will be silently ignored by the parser, so verify
> >     the indexed content with Luke to confirm results.
> >   </description>
> > </property>
> >
> > Here is our plugin.includes property from nutch-site.xml:
> >
> > <property>
> >   <name>plugin.includes</name>
> >   <value>exchange-jexl|protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|indexer-solr|urlnormalizer-(pass|regex|basic)</value>
> >   <description>plugins</description>
> > </property>
> >
> > One question I have is: would having Tika configured in nutch-site.xml
> > like the following cause any problems with the parse-html plugin not
> > running?
> >
> > <property>
> >   <name>tika.extractor</name>
> >   <value>boilerpipe</value>
> >   <description>
> >     Which text extraction algorithm to use. Valid values are:
> >     boilerpipe or none.
> >   </description>
> > </property>
> > <!-- DMB added -->
> > <property>
> >   <name>tika.extractor.boilerpipe.algorithm</name>
> >   <value>ArticleExtractor</value>
> >   <description>
> >     Which Boilerpipe algorithm to use. Valid values are:
> >     DefaultExtractor, ArticleExtractor or CanolaExtractor.
> >   </description>
> > </property>
> >
> > We don't have a lot to go on to debug the issue. The plugin has logic
> > to enable logging:
> >
> > if (LOG.isTraceEnabled())
> > +  LOG.trace("Stripping " + pNode.getNodeName() + "#" +
> >        idNode.getNodeValue());
> >
> > But nothing shows in the log files when we crawl. I updated
> > log4j.properties, setting these two values to TRACE, thinking I had to
> > enable trace before the logging would work:
> >
> > log4j.logger.org.apache.nutch.crawl.Crawl=TRACE,cmdstdout
> > log4j.logger.org.apache.nutch.parse.html=TRACE,cmdstdout
> >
> > I reran the crawl and no logging occurred, and of course the content
> > we didn't want crawled and indexed is still showing up in SOLR.
> >
> > I could really use some help and suggestions!
> >
> > Thank you!
> >
> > Dave Beckstrom
> >
> > --
> > Fig Leaf Software is now Collective FLS, Inc.
> > Collective FLS, Inc.
> > https://www.collectivefls.com/
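For reference, the parse-plugins.xml change Markus describes maps the HTML MIME types onto the parse-html plugin so TikaParser no longer claims them. A minimal sketch, modeled on the stock parse-plugins.xml that ships with Nutch (check the plugin id and extension id against your 1.15 install before relying on them):

```xml
<parse-plugins>
  <!-- Route html and xhtml pages to HtmlParser instead of TikaParser -->
  <mimeType name="text/html">
    <plugin id="parse-html" />
  </mimeType>
  <mimeType name="application/xhtml+xml">
    <plugin id="parse-html" />
  </mimeType>

  <!-- Map the plugin id used above to its extension implementation -->
  <aliases>
    <alias name="parse-html"
           extension-id="org.apache.nutch.parse.html.HtmlParser" />
  </aliases>
</parse-plugins>
```

With a mapping like this in place, html/xhtml pages are handed to parse-html, which is what gives the excludeNodes patch (or the blacklist-whitelist plugin) a chance to run.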

