[jira] [Updated] (NUTCH-2290) Update licenses of bundled libraries

2019-10-11 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2290:
---
Fix Version/s: (was: 2.5)

> Update licenses of bundled libraries
> 
>
> Key: NUTCH-2290
> URL: https://issues.apache.org/jira/browse/NUTCH-2290
> Project: Nutch
>  Issue Type: Bug
>  Components: deployment
>Affects Versions: 2.3.1, 1.12
>Reporter: Sebastian Nagel
>Priority: Major
> Fix For: 1.17
>
> Attachments: 3rd-party-licenses-nutch-1.15.txt
>
>
> The files LICENSE.txt and NOTICE.txt were last edited 5 years ago and should 
> be updated to include all licenses of dependencies (and their dependencies) 
> in accordance with the [Assembling LICENSE and NOTICE 
> HOWTO|http://www.apache.org/dev/licensing-howto.html]:
> # check for missing or obsolete licenses due to added or removed dependencies
> # update year in NOTICE.txt -- should be a range according to the licensing 
> HOWTO
> # bundled libraries are referenced with path and version number, e.g. 
> {{lib/icu4j-4_0_1.jar}}. This would require updating LICENSE.txt with 
> every dependency upgrade. A more generic reference ("ICU4J") would be easier 
> to maintain, but the HOWTO requires us to "specify the version of the dependency 
> as licenses are sometimes changed".
> # try to reduce the size of LICENSE.txt (currently 5800 lines). Mainly, 
> according to the HOWTO there is no need to repeat the Apache license again 
> and again.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (NUTCH-1086) Rewrite protocol-httpclient

2019-10-11 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-1086:
---
Fix Version/s: (was: 2.5)

> Rewrite protocol-httpclient
> ---
>
> Key: NUTCH-1086
> URL: https://issues.apache.org/jira/browse/NUTCH-1086
> Project: Nutch
>  Issue Type: Improvement
>  Components: protocol
>Affects Versions: nutchgora, 1.5
>Reporter: Markus Jelsma
>Assignee: Fabio Santagostino
>Priority: Major
> Fix For: 1.17
>
> Attachments: Http.java, HttpResponse.java
>
>
> There are several issues about protocol-httpclient and several comments about 
> rewriting the plugin with the new http client libraries. There is, however, 
> not yet an issue for rewriting/reimplementing protocol-httpclient.
> http://hc.apache.org/httpcomponents-client-ga/





[jira] [Updated] (NUTCH-2671) Upgrade ant ivy library

2019-10-11 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2671:
---
Fix Version/s: (was: 2.5)

> Upgrade ant ivy library
> ---
>
> Key: NUTCH-2671
> URL: https://issues.apache.org/jira/browse/NUTCH-2671
> Project: Nutch
>  Issue Type: Sub-task
>  Components: build
>Affects Versions: 2.4, 1.16
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.17
>
>
> Upgrade the [ant ivy library|https://ant.apache.org/ivy/index.html] to latest 
> release (2.5.0-rc1) to address NUTCH-2669.





[jira] [Updated] (NUTCH-2669) Reliable solution for javax.ws packaging.type

2019-10-11 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2669:
---
Fix Version/s: (was: 2.5)

> Reliable solution for javax.ws packaging.type
> -
>
> Key: NUTCH-2669
> URL: https://issues.apache.org/jira/browse/NUTCH-2669
> Project: Nutch
>  Issue Type: Bug
>  Components: build
>Affects Versions: 2.4, 1.16
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Blocker
> Fix For: 1.17
>
>
> The upgrade of Tika to v1.19.1 (NUTCH-2651, NUTCH-2665, NUTCH-2667) raises an 
> ant/ivy issue during build when resolving/fetching dependencies:
> {noformat}
> [ivy:resolve] [FAILED ] javax.ws.rs#javax.ws.rs-api;2.1!javax.ws.rs-api.${packaging.type}:  (0ms)
> [ivy:resolve]  local: tried
> [ivy:resolve]   /home/jenkins/.ivy2/local/javax.ws.rs/javax.ws.rs-api/2.1/${packaging.type}s/javax.ws.rs-api.${packaging.type}
> [ivy:resolve]  maven2: tried
> [ivy:resolve]   http://repo1.maven.org/maven2/javax/ws/rs/javax.ws.rs-api/2.1/javax.ws.rs-api-2.1.${packaging.type}
> [ivy:resolve]  apache-snapshot: tried
> [ivy:resolve]   https://repository.apache.org/content/repositories/snapshots/javax/ws/rs/javax.ws.rs-api/2.1/javax.ws.rs-api-2.1.${packaging.type}
> [ivy:resolve]  sonatype: tried
> [ivy:resolve]   http://oss.sonatype.org/content/repositories/releases/javax/ws/rs/javax.ws.rs-api/2.1/javax.ws.rs-api-2.1.${packaging.type}
> [ivy:resolve] ::
> [ivy:resolve] ::  FAILED DOWNLOADS::
> [ivy:resolve] :: ^ see resolution messages for details  ^ ::
> [ivy:resolve] ::
> [ivy:resolve] :: javax.ws.rs#javax.ws.rs-api;2.1!javax.ws.rs-api.${packaging.type}
> [ivy:resolve] ::
> [ivy:resolve]  ERRORS
> ...
> BUILD FAILED
> {noformat}
> More information about this issue is linked on 
> [jax-rs#576|https://github.com/jax-rs/api/pull/576]. 
> A work-around is to define a property {{packaging.type}} and set it to 
> {{jar}}. This can be done
> - in command-line {{ant -Dpackaging.type=jar ...}}
> - in default.properties
> - in ivysettings.xml
> The last work-around is active in current master/1.x. However, some Jenkins 
> builds still fail while a few succeed:
> ||#build||status jax-rs||machine||work-around||
> |3578|success|H28|ivysettings.xml|
> |3577|failed|H28|ivysettings.xml|
> |3576|failed|H33|ivysettings.xml|
> |3575|success|ubuntu-4|ivysettings.xml|
> |3574|failed|ubuntu-4|-Dpackaging.type=jar + default.properties|
> |3571|failed|?|-Dpackaging.type=jar + default.properties|
> |3568|failed|?|-Dpackaging.type=jar + default.properties|
> Builds which failed for other reasons are left out. The only pattern I see 
> is that only the second build on each of the Jenkins machines succeeds. A 
> possible reason could be that the build environments on the machines persist 
> state (the Nutch build directory, local ivy cache, etc.). If this is the 
> case, it may take some time until builds succeed on all Jenkins machines.
> The ivysettings.xml work-around was the first which succeeded on a Jenkins 
> build but it may be the case that all three work-arounds apply.
> The issue is supposed to be resolved (without work-arounds) by IVY-1577. 
> However, it looks like it isn't:
> - get rc2 of ivy 2.5.0 (the URL may change):
> {noformat}
> % wget -O ivy/ivy-2.5.0-rc2-test.jar \
>     https://builds.apache.org/job/Ivy/lastSuccessfulBuild/artifact/build/artifact/org.apache.ivy_2.5.0.cr2_20181023065327.jar
> {noformat}
> - edit default.properties and set {{ivy.version=2.5.0-rc2-test}}
> - remove work-around in ivysettings.xml (or default.properties)
> - run {{ant clean runtime}} and check whether the build fails and whether the 
> javax.ws library is in place: {{ls build/lib/javax.ws.rs-api*.jar}}
> This solution fails for 
> [ivy-2.5.0-rc1.jar|http://repo2.maven.org/maven2/org/apache/ivy/ivy/2.5.0-rc1/ivy-2.5.0-rc1.jar]
>  and the mentioned rc2 jar as of 2018-10-23. But maybe the procedure is 
> wrong; I'll contact the ant/ivy team to solve this.
>  





[jira] [Created] (NUTCH-2744) CrawlDbReader: improved reporting of syntactic errors in Jexl expression

2019-10-11 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-2744:
--

 Summary: CrawlDbReader: improved reporting of syntactic errors in 
Jexl expression
 Key: NUTCH-2744
 URL: https://issues.apache.org/jira/browse/NUTCH-2744
 Project: Nutch
  Issue Type: Bug
  Components: crawldb
Affects Versions: 1.16
Reporter: Sebastian Nagel
 Fix For: 1.17


CrawlDbReader reports syntactic errors in Jexl expressions only in task logs 
(hadoop.log in local mode) and continues as if there were no Jexl expression 
set. It should report the error more verbosely and probably also fail the job, 
at least if the error can be detected at job start.
In my case, after a trivial error ({{score > .9}} instead of {{score > 0.9}}), 
the CrawlDb was just left unfiltered.
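
The fail-fast behaviour asked for above could be sketched roughly as follows. This is only an illustration, not the CrawlDbReader code: the class `FailFastExpression`, the helper `parseOrFail`, and the stand-in parser are hypothetical, and the stand-in merely mimics the reported `.9` case instead of calling the real Jexl library.

```java
import java.util.function.Function;

public class FailFastExpression {

  /**
   * Hypothetical helper: parse the configured expression once at job setup
   * and abort with a clear message instead of logging and continuing.
   */
  static <T> T parseOrFail(String expr, Function<String, T> parser) {
    if (expr == null) {
      return null; // no expression configured, nothing to validate
    }
    try {
      return parser.apply(expr);
    } catch (RuntimeException e) {
      // Surface the problem to the caller so the job can fail early.
      throw new IllegalArgumentException("Invalid Jexl expression: " + expr, e);
    }
  }

  public static void main(String[] args) {
    // Stand-in parser (assumption): rejects a number written without a
    // leading digit, the case described in the issue.
    Function<String, String> standIn = expr -> {
      if (expr.matches(".*[^0-9]\\.[0-9].*")) {
        throw new IllegalStateException("unexpected '.'");
      }
      return expr;
    };

    System.out.println(parseOrFail("score > 0.9", standIn));
    try {
      parseOrFail("score > .9", standIn);
    } catch (IllegalArgumentException e) {
      System.out.println(e.getMessage());
    }
  }
}
```

Validating once before the job is submitted keeps the error next to the command that caused it, rather than in per-task logs.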





[jira] [Closed] (NUTCH-1522) Upgrade to Tika 1.3

2019-10-11 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel closed NUTCH-1522.
--

> Upgrade to Tika 1.3
> ---
>
> Key: NUTCH-1522
> URL: https://issues.apache.org/jira/browse/NUTCH-1522
> Project: Nutch
>  Issue Type: Task
>  Components: parser
>Reporter: Julien Nioche
>Priority: Minor
> Fix For: 1.7, 2.2.1
>
>
> http://www.apache.org/dist/tika/CHANGES-1.3.txt





[jira] [Closed] (NUTCH-1126) JUnit test for urlfilter-prefix

2019-10-11 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel closed NUTCH-1126.
--

> JUnit test for urlfilter-prefix
> ---
>
> Key: NUTCH-1126
> URL: https://issues.apache.org/jira/browse/NUTCH-1126
> Project: Nutch
>  Issue Type: Sub-task
>  Components: build
>Affects Versions: 1.4
>Reporter: Lewis John McGibbney
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.8, 2.2.1
>
> Attachments: test_case_for_urlfilter-prefix.patch
>
>
> This issue is part of the larger attempt to provide a JUnit test case for 
> every Nutch plugin.





[jira] [Closed] (NUTCH-1578) Upgrade to Hadoop 1.2.0

2019-10-11 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel closed NUTCH-1578.
--

> Upgrade to Hadoop 1.2.0
> ---
>
> Key: NUTCH-1578
> URL: https://issues.apache.org/jira/browse/NUTCH-1578
> Project: Nutch
>  Issue Type: Task
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Trivial
> Fix For: 1.7, 2.2.1
>
>
> Hadoop 1.2.0 finally has the ability to run mappers in parallel when running 
> in local mode. In trunk at least the generator seems to run slightly faster.





[jira] [Closed] (NUTCH-1475) Index-More Plugin -- A better fall back value for date field

2019-10-11 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel closed NUTCH-1475.
--

> Index-More Plugin -- A better fall back value for date field
> 
>
> Key: NUTCH-1475
> URL: https://issues.apache.org/jira/browse/NUTCH-1475
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 2.1, 1.5.1
> Environment: All
>Reporter: James Sullivan
>Assignee: Sebastian Nagel
>Priority: Minor
>  Labels: index-more, plugins
> Fix For: 1.7, 2.2.1
>
> Attachments: NUTCH-1475-trunk-v1.patch, NUTCH-1475-trunk-v2.patch, 
> index-more-1xand2x.patch, index-more-2x.patch, index-more-2x.patch
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Among other fields, the more plugin for Nutch 2.x provides a "last modified" 
> and a "date" field for the Solr index. The "last modified" field is the last 
> modified date from the HTTP headers if available; otherwise it is left 
> empty. Currently, the "date" field is the same as the "last modified" field 
> unless that field is empty, in which case getFetchTime is used as a fallback. 
> I think getFetchTime is not a good fallback as it is the next fetch time and 
> often a month or more in the future, which doesn't make sense for the date 
> field. Users do not expect webpages/documents with future dates. A more 
> sensible fallback would be the current date at the time it is indexed. 
> This is possible by simply changing line 97 of 
> https://svn.apache.org/repos/asf/nutch/branches/2.x/src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java
>  from
> time = page.getFetchTime(); // use fetch time
> to
> time = new Date().getTime();
> Users interested in the getFetchTime value can still get it from the "tstamp" 
> field.
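
The proposed fallback rule can be sketched in isolation as below. The class `DateFallback` and the helper `resolveDate` are hypothetical names, and treating a non-positive value as "header missing" is an assumption made for this sketch; it is not the MoreIndexingFilter code.

```java
public class DateFallback {

  /**
   * Hypothetical helper: use the Last-Modified time when it is known,
   * otherwise fall back to the time of indexing (not the next fetch time).
   * Assumption: a value <= 0 means the header was missing.
   */
  static long resolveDate(long lastModified, long indexTime) {
    return lastModified > 0 ? lastModified : indexTime;
  }

  public static void main(String[] args) {
    long now = System.currentTimeMillis();
    // Header missing: the "date" field gets the current time.
    System.out.println(resolveDate(0L, now) == now);
    // Header present: the "date" field mirrors "last modified".
    System.out.println(resolveDate(1_000L, now) == 1_000L);
  }
}
```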





[jira] [Closed] (NUTCH-1591) Incorrect conversion of ByteBuffer to String

2019-10-11 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel closed NUTCH-1591.
--

> Incorrect conversion of ByteBuffer to String
> 
>
> Key: NUTCH-1591
> URL: https://issues.apache.org/jira/browse/NUTCH-1591
> Project: Nutch
>  Issue Type: Bug
>  Components: crawldb, indexer, parser, storage
>Affects Versions: 2.2
> Environment: Mac O/S 10.8.4, JDK 1.6.0_51
>Reporter: Jason Howes
>Priority: Critical
> Fix For: 2.2.1
>
> Attachments: NUTCH-1591.patch, NUTCH-1591.zip, Nutch1591Test.java
>
>
> There are many occurrences of the following ByteBuffer-to-String conversion 
> throughout the Nutch codebase:
> {code}
> ByteBuffer buf = ...;
> return new String(buf.array());
> {code}
> This approach assumes that the ByteBuffer and its underlying array are aligned 
> (i.e. ByteBuffer.arrayOffset() is equal to 0 and the length of the underlying 
> array is the same as ByteBuffer.remaining()). In many cases this is not the 
> case. The correct way to convert a ByteBuffer to a String (or stream thereof) 
> is the following:
> {code}
> ByteBuffer buf = ...;
> return new String(buf.array(), buf.arrayOffset() + buf.position(), 
> buf.remaining());
> {code}
> I noticed this bug when using Nutch with Cassandra. In most cases, the parsed 
> content contains data from other columns (as well as garbage content) since 
> the Cassandra client library returns ByteBuffers that are views on top of a 
> larger byte[]. It also seems that others have hit this as well:
> http://grokbase.com/p/nutch/user/132jnq8s4r/slow-parse-on-hadoop
> I've attached a patch based on the release-2.2 tag of the 2.x branch on 
> GitHub:
> https://github.com/apache/nutch/tree/release-2.2
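
A minimal, self-contained sketch of the difference described above. The class name `ByteBufferToString`, the sample data, and the explicit UTF-8 charset are assumptions for illustration (the snippets in the issue use the platform default charset). Like those snippets, it only applies to array-backed buffers.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class ByteBufferToString {

  /** Problematic conversion: decodes the whole backing array. */
  static String naive(ByteBuffer buf) {
    return new String(buf.array(), StandardCharsets.UTF_8);
  }

  /** Conversion from the issue: decodes only the buffer's remaining bytes. */
  static String correct(ByteBuffer buf) {
    return new String(buf.array(), buf.arrayOffset() + buf.position(),
        buf.remaining(), StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    // One large array holding several values, as a client library might return.
    byte[] backing = "key1|value|key2".getBytes(StandardCharsets.UTF_8);

    // A view on the 5 bytes of "value": arrayOffset() is 5, remaining() is 5.
    ByteBuffer view = ByteBuffer.wrap(backing, 5, 5).slice();

    System.out.println(naive(view));   // prints "key1|value|key2"
    System.out.println(correct(view)); // prints "value"
  }
}
```

The naive variant leaks the neighbouring data, which matches the symptom reported for Cassandra-backed buffers.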





[jira] [Closed] (NUTCH-2360) HTTP Basic Authentication in SolrIndexerPlugin is gone

2019-10-11 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel closed NUTCH-2360.
--

> HTTP Basic Authentication in SolrIndexerPlugin is gone
> --
>
> Key: NUTCH-2360
> URL: https://issues.apache.org/jira/browse/NUTCH-2360
> Project: Nutch
>  Issue Type: Bug
>  Components: docker, indexer, plugin
>Affects Versions: 1.12
>Reporter: Patrick Schirch
>Priority: Critical
>
> We upgraded Docker Nutch from 1.11 to 1.12. Now Nutch can no longer push to a 
> Solr instance protected by SSL and HTTP Basic Auth via SolrIndexerPlugin. After 
> some research we found the reason: HTTP Basic Authentication was removed.
> https://svn.apache.org/viewvc/nutch/trunk/src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrUtils.java?r1=1696506=1728313_format=h
> Is that intended?





[jira] [Commented] (NUTCH-2743) Add list of Nutch properties (nutch-default.xml) to documentation

2019-10-11 Thread Sebastian Nagel (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16949552#comment-16949552
 ] 

Sebastian Nagel commented on NUTCH-2743:


One benefit: this would make properties and their descriptions searchable via 
web search engines.

> Add list of Nutch properties (nutch-default.xml) to documentation
> -
>
> Key: NUTCH-2743
> URL: https://issues.apache.org/jira/browse/NUTCH-2743
> Project: Nutch
>  Issue Type: Improvement
>  Components: documentation
>Reporter: Sebastian Nagel
>Priority: Major
> Fix For: 1.17
>
>
> The file nutch-default.xml lists all Nutch properties. It should become part 
> of the documentation, similar to what is done for Hadoop (e.g. 
> [mapred-default.xml|https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml]),
>  including the XSL (configuration.xsl) required to render the file into a 
> table.





[jira] [Created] (NUTCH-2743) Add list of Nutch properties (nutch-default.xml) to documentation

2019-10-11 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-2743:
--

 Summary: Add list of Nutch properties (nutch-default.xml) to 
documentation
 Key: NUTCH-2743
 URL: https://issues.apache.org/jira/browse/NUTCH-2743
 Project: Nutch
  Issue Type: Improvement
  Components: documentation
Reporter: Sebastian Nagel
 Fix For: 1.17


The file nutch-default.xml lists all Nutch properties. It should become part of 
the documentation, similar to what is done for Hadoop (e.g. 
[mapred-default.xml|https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml]),
 including the XSL (configuration.xsl) required to render the file into a table.





[ANNOUNCE] Apache Nutch 1.16 Release

2019-10-11 Thread Sebastian Nagel
Hi folks!

The Apache Nutch [0] Project Management Committee are pleased to announce
the immediate release of Apache Nutch v1.16. We advise all current users
and developers to upgrade to this release.

Nutch is a well-matured, production-ready Web crawler. Nutch 1.x enables
fine-grained configuration, relying on Apache Hadoop™ [1] data structures,
which are great for batch processing.

As usual in the 1.X series, release artifacts are made available as both
source and binary and also available within Maven Central [2] as a Maven
dependency. The release is available from our downloads page [3].

This release includes more than 100 bug fixes and improvements; the full
list of changes can be seen in the release report [4]. Please also check
the changelog [5] for breaking changes.


Thanks to all Nutch contributors who made this release possible,
Sebastian (on behalf of the Nutch PMC)


[0] https://nutch.apache.org/
[1] https://hadoop.apache.org/
[2]
https://search.maven.org/search?q=g:org.apache.nutch%20AND%20a:nutch%20AND%20v:1.16
[3] https://nutch.apache.org/downloads.html
[4] https://s.apache.org/l2j94
[5] https://dist.apache.org/repos/dist/release/nutch/1.16/CHANGES.txt


[ANNOUNCE] Apache Nutch 2.4 Release

2019-10-11 Thread Sebastian Nagel
Hi,

The Apache Nutch [0] Project Management Committee are pleased to announce
the immediate release of Apache Nutch v2.4. We advise all current users
and developers to upgrade to this release or, alternatively, to switch to
Nutch 1.x (see below).

This release addresses 81 issues. For a complete overview of these
issues please see the release report [1].

As usual in the 2.X series, release artifacts are made available as source
only from our downloads page [2] and are also available within Maven
Central [3].

Please note that we expect v2.4 to be the last release in the 2.x series.
We've decided to freeze development on the 2.x branch for now, as no
committer is actively working on it. Also note that this wasn't an easy
decision and in any case it's reversible. The source repositories of the
2.x branch will be ready for future continuation of the development at any
time. To keep up with bug fixes and improvements we recommend using the
Nutch 1.x 'master' codebase (current release is 1.16) instead, as this
branch is under active development.
Thank you to everyone who contributed to Nutch 2.x over the years!

Thanks to all Nutch contributors who made this release possible,
Sebastian (on behalf of the Nutch PMC)


[0] https://nutch.apache.org/
[1] https://s.apache.org/bFfL
[2] https://nutch.apache.org/downloads.html
[3]
https://search.maven.org/search?q=g:org.apache.nutch%20AND%20a:nutch%20AND%20v:2.4