[SECURITY] Nutch 2.3.1 affected by downstream dependency CVE-2016-6809
Title: Nutch 2.3.1 affected by downstream dependency CVE-2016-6809
Vulnerable Versions: 2.3.1 (1.16 is not vulnerable)
Disclosure date: 2018-10-22
Credit: Pierre Ernst, Salesforce

Summary: Remote code execution in Apache Nutch 2.3.1 when crawling a web site containing malicious content.

Description: The reporter found an RCE security vulnerability in Nutch 2.3.1 when crawling a web site that links a doctored MATLAB file. It is caused by unsafe deserialization of user-generated content. The root cause is two outdated third-party dependencies:

1. Apache Tika version 1.10 (CVE-2016-6809)
2. Apache Commons Collections 4 version 4.0 (COLLECTIONS-580)

Upgrading these two dependencies to their latest versions fixes the issue.

Resolution: The Apache Nutch Project Management Committee released Apache Nutch 2.4 on 2019-10-11 (https://s.apache.org/uw8i3). All users of the 2.x branch should upgrade to this version immediately. In addition, note that we expect v2.4 to be the last release in the 2.x series: the Nutch PMC has decided to freeze development on the 2.x branch for now, as no committers are actively working on it. See the hyperlink above for more information on upgrading and the 2.x retirement decision.

Contact: either dev[at] or private[at]nutch[dot]apache[dot]org, depending on the nature of your contact.

Regards,
lewismc (on behalf of the Apache Nutch PMC)
--
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc
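The dependency upgrade described above would look roughly like the following ivy.xml fragment. This is a hypothetical sketch, not the change actually shipped in Nutch 2.4: the module names and revision numbers shown are assumptions, so check the 2.4 release itself for the versions the PMC verified.

```xml
<!-- Hypothetical ivy.xml fragment: pin Tika and Commons Collections 4 to
     releases containing the deserialization fixes. Revisions are examples,
     not the versions verified by the Nutch PMC. -->
<dependency org="org.apache.tika" name="tika-core" rev="1.22" conf="*->default"/>
<dependency org="org.apache.tika" name="tika-parsers" rev="1.22" conf="*->default"/>
<dependency org="org.apache.commons" name="commons-collections4" rev="4.1" conf="*->default"/>
```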
Re: Unable to index on Hadoop 3.2.0 with 1.16
Hi Markus,

I've tested in pseudo-distributed mode with Hadoop 3.2.1, including indexing into Solr. It worked. It could be a dependency version issue similar to the one causing NUTCH-2706, but that's only an assumption.

Since IndexWriters.describe() is for help output only, I would just deactivate this method and open an issue to investigate the cause. We also need to think about when and where to output the index writer options. Maybe it's better to call the describe() methods of the indexer plugins explicitly via "IndexingJob --help" or similar.

Best,
Sebastian

On 14.10.19 17:08, Markus Jelsma wrote:
> Hello,
>
> We're upgrading our stuff to 1.16 and got a peculiar problem when we started
> indexing:
>
> 2019-10-14 13:50:30,586 WARN [main] org.apache.hadoop.mapred.YarnChild:
> Exception running child : java.lang.IllegalStateException: text width is less
> than 1, was <-41>
>     at org.apache.commons.lang3.Validate.validState(Validate.java:829)
>     at de.vandermeer.skb.interfaces.transformers.textformat.Text_To_FormattedText.transform(Text_To_FormattedText.java:215)
>     at de.vandermeer.asciitable.AT_Renderer.renderAsCollection(AT_Renderer.java:250)
>     at de.vandermeer.asciitable.AT_Renderer.render(AT_Renderer.java:128)
>     at de.vandermeer.asciitable.AsciiTable.render(AsciiTable.java:191)
>     at org.apache.nutch.indexer.IndexWriters.describe(IndexWriters.java:326)
>     at org.apache.nutch.indexer.IndexerOutputFormat.getRecordWriter(IndexerOutputFormat.java:45)
>     at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.<init>(ReduceTask.java:542)
>     at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:615)
>     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:390)
>     at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:174)
>     at java.base/java.security.AccessController.doPrivileged(Native Method)
>     at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>     at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:168)
>
> The only IndexWriter we use is SolrIndexer, and locally everything is just
> fine.
>
> Any thoughts?
>
> Thanks,
> Markus
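Sebastian's suggestion to deactivate the help-only describe() call could be sketched as a defensive wrapper like the one below. This is illustrative only, not the actual Nutch code: the class and method names are hypothetical, and the stand-in describe() merely simulates the IllegalStateException that the asciitable renderer throws when the computed column width is negative.

```java
// Hedged sketch (not Nutch source): guard a help-only description so a
// table-rendering failure cannot kill the whole reduce task.
public class SafeDescribe {

    // Stand-in for IndexWriters.describe(): simulates the asciitable
    // failure mode seen in the stack trace (width < 1 -> IllegalStateException).
    static String describe(int terminalWidth) {
        if (terminalWidth < 1) {
            throw new IllegalStateException(
                "text width is less than 1, was <" + terminalWidth + ">");
        }
        return "indexer-solr: writes documents to Solr (width=" + terminalWidth + ")";
    }

    // Wrapper: since the output is informational only, log-and-continue
    // instead of letting the exception propagate into the job.
    static String describeSafely(int terminalWidth) {
        try {
            return describe(terminalWidth);
        } catch (IllegalStateException e) {
            return "(index writer description unavailable: " + e.getMessage() + ")";
        }
    }

    public static void main(String[] args) {
        System.out.println(describeSafely(80));   // normal case
        System.out.println(describeSafely(-41));  // the failing case from the stack trace
    }
}
```

The same effect could be had by simply not calling describe() from IndexerOutputFormat, which matches the "deactivate this method" suggestion above.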
Unable to index on Hadoop 3.2.0 with 1.16
Hello,

We're upgrading our stuff to 1.16 and got a peculiar problem when we started indexing:

2019-10-14 13:50:30,586 WARN [main] org.apache.hadoop.mapred.YarnChild:
Exception running child : java.lang.IllegalStateException: text width is less than 1, was <-41>
    at org.apache.commons.lang3.Validate.validState(Validate.java:829)
    at de.vandermeer.skb.interfaces.transformers.textformat.Text_To_FormattedText.transform(Text_To_FormattedText.java:215)
    at de.vandermeer.asciitable.AT_Renderer.renderAsCollection(AT_Renderer.java:250)
    at de.vandermeer.asciitable.AT_Renderer.render(AT_Renderer.java:128)
    at de.vandermeer.asciitable.AsciiTable.render(AsciiTable.java:191)
    at org.apache.nutch.indexer.IndexWriters.describe(IndexWriters.java:326)
    at org.apache.nutch.indexer.IndexerOutputFormat.getRecordWriter(IndexerOutputFormat.java:45)
    at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.<init>(ReduceTask.java:542)
    at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:615)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:390)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:174)
    at java.base/java.security.AccessController.doPrivileged(Native Method)
    at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:168)

The only IndexWriter we use is SolrIndexer, and locally everything is just fine.

Any thoughts?

Thanks,
Markus
Re: metatags missing with parse-html
Hi Dave,

could you share an example document? Which Nutch version is used?

I tried to reproduce the problem without success using Nutch v1.16:

- example document: Test metatags test for metatag extraction

- using parse-html (works)

  > bin/nutch indexchecker -Dmetatags.names='*' \
      -Dindex.parse.md='metatag.language,metatag.subject,metatag.category' \
      -Dplugin.includes='protocol-http|parse-(html|metatags)|index-(basic|metadata)' \
      http://localhost/nutch/test_metatags.html
  fetching: http://localhost/nutch/test_metatags.html
  robots.txt whitelist not configured.
  parsing: http://localhost/nutch/test_metatags.html
  contentType: text/html
  tstamp           : Mon Oct 14 13:24:14 CEST 2019
  metatag.language : en
  metatag.language : en
  metatag.category : meta data
  metatag.category : meta data
  digest           : 50d08494ba791bb52fcdeebfc08ba640
  host             : localhost
  metatag.subject  : test
  metatag.subject  : test
  id               : http://localhost/nutch/test_metatags.html
  title            : Test metatags
  url              : http://localhost/nutch/test_metatags.html
  content          : Test metatags test for metatag extraction

- using parse-tika (works)

  > bin/nutch indexchecker -Dmetatags.names='*' \
      -Dindex.parse.md='metatag.language,metatag.subject,metatag.category' \
      -Dplugin.includes='protocol-http|parse-(tika|metatags)|index-(basic|metadata)' \
      http://localhost/nutch/test_metatags.html
  fetching: http://localhost/nutch/test_metatags.html
  robots.txt whitelist not configured.
  parsing: http://localhost/nutch/test_metatags.html
  contentType: text/html
  tstamp           : Mon Oct 14 13:25:34 CEST 2019
  metatag.language : en
  metatag.language : en
  metatag.category : meta data
  metatag.category : meta data
  digest           : 50d08494ba791bb52fcdeebfc08ba640
  host             : localhost
  metatag.subject  : test
  metatag.subject  : test
  id               : http://localhost/nutch/test_metatags.html
  title            : Test metatags
  url              : http://localhost/nutch/test_metatags.html
  content          : Test metatags test for metatag extraction

There are currently two issues open around metatags:

https://issues.apache.org/jira/browse/NUTCH-1559
https://issues.apache.org/jira/browse/NUTCH-2525

Maybe it's related to one of those?

Best,
Sebastian

On 11.10.19 22:38, Dave Beckstrom wrote:
> Hi Everyone,
>
> It seems like I take 1 step forward and 2 steps backwards.
>
> I was using parse-tika and I needed to change to parse-html in order to use
> a plug-in for excluding content such as headers and footers.
>
> I have the excludes working with the plug-in. But now I see that all of
> the metatags are missing from solr. The metatag fields are defined in SOLR
> but not populated.
>
> Metatags were working prior to the change to parse-html. What would
> explain the metatags not being indexed when the configuration
> parameters didn't change? Is there some other setting for parse-html that
> I need to look into?
>
> Thanks!
>
> <property>
>   <name>plugin.includes</name>
>   <value>exchange-jexl|protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|indexer-solr|urlnormalizer-(pass|regex|basic)|index-blacklist-whitelist</value>
> </property>
>
> <property>
>   <name>metatags.names</name>
>   <value>*</value>
> </property>
>
> <property>
>   <name>index.parse.md</name>
>   <value>metatag.language,metatag.subject,metatag.category</value>
> </property>
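The example document referenced in the reply above lost its markup in the archive. A plausible reconstruction, consistent with the indexchecker output shown (the exact meta element names and layout are assumptions inferred from the printed fields, not the file Sebastian actually used):

```html
<!-- Hypothetical test_metatags.html: meta names and values inferred from
     the indexchecker output (language=en, subject=test, category=meta data). -->
<!DOCTYPE html>
<html>
<head>
  <title>Test metatags</title>
  <meta name="language" content="en"/>
  <meta name="subject" content="test"/>
  <meta name="category" content="meta data"/>
</head>
<body>
  test for metatag extraction
</body>
</html>
```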
RE: [ANNOUNCE] Apache Nutch 1.16 Release
Thanks Sebastian!

-Original message-
> From: Sebastian Nagel
> Sent: Friday 11th October 2019 17:03
> To: user@nutch.apache.org
> Cc: d...@nutch.apache.org; annou...@apache.org
> Subject: [ANNOUNCE] Apache Nutch 1.16 Release
>
> Hi folks!
>
> The Apache Nutch [0] Project Management Committee are pleased to announce
> the immediate release of Apache Nutch v1.16. We advise all current users
> and developers to upgrade to this release.
>
> Nutch is a well matured, production ready Web crawler. Nutch 1.x enables
> fine grained configuration, relying on Apache Hadoop™ [1] data structures,
> which are great for batch processing.
>
> As usual in the 1.X series, release artifacts are made available as both
> source and binary and also available within Maven Central [2] as a Maven
> dependency. The release is available from our downloads page [3].
>
> This release includes more than 100 bug fixes and improvements; the full
> list of changes can be seen in the release report [4]. Please also check
> the changelog [5] for breaking changes.
>
> Thanks to all Nutch contributors who made this release possible,
> Sebastian (on behalf of the Nutch PMC)
>
> [0] https://nutch.apache.org/
> [1] https://hadoop.apache.org/
> [2] https://search.maven.org/search?q=g:org.apache.nutch%20AND%20a:nutch%20AND%20v:1.16
> [3] https://nutch.apache.org/downloads.html
> [4] https://s.apache.org/l2j94
> [5] https://dist.apache.org/repos/dist/release/nutch/1.16/CHANGES.txt