[SECURITY] Nutch 2.3.1 affected by downstream dependency CVE-2016-6809

2019-10-14 Thread lewis john mcgibbney
Title: Nutch 2.3.1 affected by downstream dependency CVE-2016-6809

Vulnerable Versions: 2.3.1 (1.16 is not vulnerable)

Disclosure date: 2018-10-22

Credit: Pierre Ernst, Salesforce

Summary: Remote Code Execution in Apache Nutch 2.3.1 when crawling a web site
containing malicious content

Description: The reporter found an RCE security vulnerability in Nutch
2.3.1 when crawling a web site that links to a doctored Matlab file. This was
due to unsafe deserialization of user-generated content. The root cause is
two outdated third-party dependencies:
1. Apache Tika version 1.10 (CVE-2016-6809)
2. Apache Commons Collections 4 version 4.0 (COLLECTIONS-580)
Upgrading these two dependencies to the latest version will fix the issue.

Resolution: The Apache Nutch Project Management Committee released Apache
Nutch 2.4 on 2019-10-11 (https://s.apache.org/uw8i3). All users of the 2.x
branch should upgrade to this version immediately. In addition, note that
we expect v2.4 to be the last release in the 2.x series. The Nutch PMC
decided to freeze development on the 2.x branch for now, as no
committers are actively working on it. See the above hyperlink for more
information on upgrading and the 2.x retirement decision.

Contact: either dev[at] or private[at]nutch[dot]apache[dot]org depending on
the nature of your contact.

Regards lewismc
(On behalf of the Apache Nutch PMC)
-- 
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc


Re: Unable to index on Hadoop 3.2.0 with 1.16

2019-10-14 Thread Sebastian Nagel
Hi Markus,

I've tested in pseudo-distributed mode with Hadoop 3.2.1,
including indexing into Solr. It worked.

It could be a dependency version issue similar to the one
causing NUTCH-2706, but that's only an assumption.

Since IndexWriters.describe() is for help only,
I would just deactivate this method and open an issue to
investigate the reason. We also need to think about when and
where to output the index writer options. Maybe it is better to
call the describe() methods of the indexer plugins explicitly
via IndexingJob --help or similar.
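
As an alternative to deactivating the call outright, a help-only step can be guarded so that a rendering failure does not abort the reduce task. The following is only a minimal sketch of that idea, not Nutch code: the class SafeDescribe, the method describeOrFallback and the Supplier standing in for the table renderer are all hypothetical names chosen for illustration.

```java
import java.util.function.Supplier;

public class SafeDescribe {

    /**
     * Runs a help-only description step and returns its text. If the step
     * throws a runtime exception, a short fallback message is returned
     * instead of letting the exception propagate.
     */
    public static String describeOrFallback(Supplier<String> describer) {
        try {
            return describer.get();
        } catch (RuntimeException e) {
            // The description is informational only, so report and continue.
            return "(index writer description unavailable: " + e.getMessage() + ")";
        }
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for a renderer failing as in the reported trace.
        Supplier<String> failing = () -> {
            throw new IllegalStateException("text width is less than 1, was <-41>");
        };
        System.out.println(describeOrFallback(failing));

        // A renderer that succeeds is passed through unchanged.
        System.out.println(describeOrFallback(() -> "description text"));
    }
}
```

This only hides the symptom; the reason for the negative text width would still need to be investigated in a separate issue.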

Best,
Sebastian

On 14.10.19 17:08, Markus Jelsma wrote:
> Hello,
> 
> We're upgrading our stuff to 1.16 and got a peculiar problem when we started 
> indexing:
> 
> 2019-10-14 13:50:30,586 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.lang.IllegalStateException: text width is less than 1, was <-41>
>   at org.apache.commons.lang3.Validate.validState(Validate.java:829)
>   at de.vandermeer.skb.interfaces.transformers.textformat.Text_To_FormattedText.transform(Text_To_FormattedText.java:215)
>   at de.vandermeer.asciitable.AT_Renderer.renderAsCollection(AT_Renderer.java:250)
>   at de.vandermeer.asciitable.AT_Renderer.render(AT_Renderer.java:128)
>   at de.vandermeer.asciitable.AsciiTable.render(AsciiTable.java:191)
>   at org.apache.nutch.indexer.IndexWriters.describe(IndexWriters.java:326)
>   at org.apache.nutch.indexer.IndexerOutputFormat.getRecordWriter(IndexerOutputFormat.java:45)
>   at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.<init>(ReduceTask.java:542)
>   at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:615)
>   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:390)
>   at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:174)
>   at java.base/java.security.AccessController.doPrivileged(Native Method)
>   at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>   at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:168)
> 
> The only IndexWriter we use is SolrIndexer, and locally everything is just 
> fine. 
> 
> Any thoughts?
> 
> Thanks,
> Markus
> 



Unable to index on Hadoop 3.2.0 with 1.16

2019-10-14 Thread Markus Jelsma
Hello,

We're upgrading our stuff to 1.16 and got a peculiar problem when we started 
indexing:

2019-10-14 13:50:30,586 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.lang.IllegalStateException: text width is less than 1, was <-41>
  at org.apache.commons.lang3.Validate.validState(Validate.java:829)
  at de.vandermeer.skb.interfaces.transformers.textformat.Text_To_FormattedText.transform(Text_To_FormattedText.java:215)
  at de.vandermeer.asciitable.AT_Renderer.renderAsCollection(AT_Renderer.java:250)
  at de.vandermeer.asciitable.AT_Renderer.render(AT_Renderer.java:128)
  at de.vandermeer.asciitable.AsciiTable.render(AsciiTable.java:191)
  at org.apache.nutch.indexer.IndexWriters.describe(IndexWriters.java:326)
  at org.apache.nutch.indexer.IndexerOutputFormat.getRecordWriter(IndexerOutputFormat.java:45)
  at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.<init>(ReduceTask.java:542)
  at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:615)
  at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:390)
  at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:174)
  at java.base/java.security.AccessController.doPrivileged(Native Method)
  at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
  at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:168)

The only IndexWriter we use is SolrIndexer, and locally everything is just 
fine. 

Any thoughts?

Thanks,
Markus


Re: metatags missing with parse-html

2019-10-14 Thread Sebastian Nagel
Hi Dave,

could you share an example document? Which Nutch version is used?

I tried to reproduce the problem without success using Nutch v1.16:

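- example document:

<html>
<head>
<title>Test metatags</title>
<meta name="language" content="en" />
<meta name="subject" content="test" />
<meta name="category" content="meta data" />
</head>
<body>
test for metatag extraction
</body>
</html>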

- using parse-html (works)

> bin/nutch indexchecker -Dmetatags.names='*' \
   -Dindex.parse.md='metatag.language,metatag.subject,metatag.category' \
   -Dplugin.includes='protocol-http|parse-(html|metatags)|index-(basic|metadata)' \
   http://localhost/nutch/test_metatags.html
fetching: http://localhost/nutch/test_metatags.html
robots.txt whitelist not configured.
parsing: http://localhost/nutch/test_metatags.html
contentType: text/html
tstamp :Mon Oct 14 13:24:14 CEST 2019
metatag.language :  en
metatag.language :  en
metatag.category :  meta data
metatag.category :  meta data
digest :50d08494ba791bb52fcdeebfc08ba640
host :  localhost
metatag.subject :   test
metatag.subject :   test
id :http://localhost/nutch/test_metatags.html
title : Test metatags
url :   http://localhost/nutch/test_metatags.html
content :   Test metatags
test for metatag extraction

- using parse-tika (works)

> bin/nutch indexchecker -Dmetatags.names='*' \
   -Dindex.parse.md='metatag.language,metatag.subject,metatag.category' \
   -Dplugin.includes='protocol-http|parse-(tika|metatags)|index-(basic|metadata)' \
   http://localhost/nutch/test_metatags.html
fetching: http://localhost/nutch/test_metatags.html
robots.txt whitelist not configured.
parsing: http://localhost/nutch/test_metatags.html
contentType: text/html
tstamp :Mon Oct 14 13:25:34 CEST 2019
metatag.language :  en
metatag.language :  en
metatag.category :  meta data
metatag.category :  meta data
digest :50d08494ba791bb52fcdeebfc08ba640
host :  localhost
metatag.subject :   test
metatag.subject :   test
id :http://localhost/nutch/test_metatags.html
title : Test metatags
url :   http://localhost/nutch/test_metatags.html
content :   Test metatags
test for metatag extraction
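
The two runs above differ only in the parser named in plugin.includes. As a rough illustration of how such a value selects plugins, the sketch below assumes that each plugin id must match the whole regular expression; that matching rule, the class name PluginIncludesCheck and the list of candidate ids are assumptions made for this example, not code taken from Nutch.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class PluginIncludesCheck {

    /** Returns the ids that the given plugin.includes regex matches in full. */
    public static List<String> included(String pluginIncludes, List<String> ids) {
        Pattern pattern = Pattern.compile(pluginIncludes);
        return ids.stream()
                .filter(id -> pattern.matcher(id).matches())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Value used in the parse-html run above.
        String includes = "protocol-http|parse-(html|metatags)|index-(basic|metadata)";

        // Candidate plugin ids, listed here only for illustration.
        List<String> ids = Arrays.asList(
                "protocol-http", "parse-html", "parse-tika", "parse-metatags",
                "index-basic", "index-metadata", "indexer-solr");

        System.out.println(included(includes, ids));
    }
}
```

Under that assumption, parse-tika and indexer-solr are not selected by the parse-html value, which is a quick way to check whether parse-metatags is still active after editing a longer plugin.includes expression.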


There are currently two issues open around metatags:
 https://issues.apache.org/jira/browse/NUTCH-1559
 https://issues.apache.org/jira/browse/NUTCH-2525

Maybe it's related to one of those?


Best,
Sebastian


On 11.10.19 22:38, Dave Beckstrom wrote:
> Hi Everyone,
> 
> It seems like I take 1 step forward and 2 steps backwards.
> 
> I was using parse-tika and I needed to change to parse-html in order to use
> a plug-in for excluding content such as headers and footers.
> 
> I have the excludes working with the plug-in.  But now I see that all of
> the metatags are missing from solr.  The metatag fields are defined in SOLR
> but not populated.
> 
> Metatags were working prior to the change to parse-html.  What would
> explain the metatags not being indexed when the configuration
> parameters didn't change?  Is there some other setting for parse-html that
> I need to look into?
> 
> Thanks!
> 
> 
> <property>
>   <name>plugin.includes</name>
>   <value>exchange-jexl|protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|indexer-solr|urlnormalizer-(pass|regex|basic)|index-blacklist-whitelist</value>
> </property>
>
> <property>
>   <name>metatags.names</name>
>   <value>*</value>
> </property>
>
> <property>
>   <name>index.parse.md</name>
>   <value>metatag.language,metatag.subject,metatag.category</value>
> </property>
> 
> 



RE: [ANNOUNCE] Apache Nutch 1.16 Release

2019-10-14 Thread Markus Jelsma
Thanks Sebastian!
 
-Original message-
> From:Sebastian Nagel 
> Sent: Friday 11th October 2019 17:03
> To: user@nutch.apache.org
> Cc: d...@nutch.apache.org; annou...@apache.org
> Subject: [ANNOUNCE] Apache Nutch 1.16 Release
> 
> Hi folks!
> 
> The Apache Nutch [0] Project Management Committee are pleased to announce
> the immediate release of Apache Nutch v1.16. We advise all current users
> and developers to upgrade to this release.
> 
> Nutch is a well-matured, production-ready Web crawler. Nutch 1.x enables
> fine-grained configuration, relying on Apache Hadoop™ [1] data structures,
> which are great for batch processing.
> 
> As usual in the 1.X series, release artifacts are made available as both
> source and binary and also available within Maven Central [2] as a Maven
> dependency. The release is available from our downloads page [3].
> 
> This release includes more than 100 bug fixes and improvements. The full
> list of changes can be seen in the release report [4]. Please also check
> the changelog [5] for breaking changes.
> 
> 
> Thanks to all Nutch contributors who made this release possible,
> Sebastian (on behalf of the Nutch PMC)
> 
> 
> [0] https://nutch.apache.org/
> [1] https://hadoop.apache.org/
> [2]
> https://search.maven.org/search?q=g:org.apache.nutch%20AND%20a:nutch%20AND%20v:1.16
> [3] https://nutch.apache.org/downloads.html
> [4] https://s.apache.org/l2j94
> [5] https://dist.apache.org/repos/dist/release/nutch/1.16/CHANGES.txt
>