(nutch) branch master updated: NUTCH-3041 Address confusing logging in o.a.n.net.URLExemptionFilters (#813)

2024-05-15 Thread lewismc
This is an automated email from the ASF dual-hosted git repository.

lewismc pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git


The following commit(s) were added to refs/heads/master by this push:
 new 8abc78a65 NUTCH-3041 Address confusing logging in o.a.n.net.URLExemptionFilters (#813)
8abc78a65 is described below

commit 8abc78a653eb7970def10031d732fb4c7aa0fb6f
Author: Lewis John McGibbney 
AuthorDate: Wed May 15 20:07:15 2024 -0700

NUTCH-3041 Address confusing logging in o.a.n.net.URLExemptionFilters (#813)
---
 .../org/apache/nutch/net/URLExemptionFilters.java  |  7 +--
 src/plugin/urlfilter-ignoreexempt/README.md| 18 +++-
 .../urlfilter/ignoreexempt/ExemptionUrlFilter.java | 24 +-
 3 files changed, 26 insertions(+), 23 deletions(-)

diff --git a/src/java/org/apache/nutch/net/URLExemptionFilters.java b/src/java/org/apache/nutch/net/URLExemptionFilters.java
index c730228e4..ed401053e 100644
--- a/src/java/org/apache/nutch/net/URLExemptionFilters.java
+++ b/src/java/org/apache/nutch/net/URLExemptionFilters.java
@@ -24,6 +24,7 @@ import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
 
 import java.lang.invoke.MethodHandles;
+import java.util.Arrays;
 
 /** Creates and caches {@link URLExemptionFilter} implementing plugins. */
 public class URLExemptionFilters {
@@ -44,8 +45,10 @@ public class URLExemptionFilters {
 throw new IllegalStateException(e);
   }
 }
-LOG.info("Found {} extensions at point:'{}'", filters.length,
-URLExemptionFilter.X_POINT_ID);
+if (filters.length > 0) {
+  LOG.info("Found {} URLExemptionFilter implementations: '{}'", filters.length,
+Arrays.toString(filters));
+}
   }
 
   /**
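
The logging change in the hunk above (stay silent when no exemption filter is loaded, otherwise list the loaded implementations) can be sketched outside Nutch roughly as follows. This is only an illustration: the class `ExemptionFilterLogSketch` and its `describe` method are hypothetical names, and a plain `Object[]` stands in for the `URLExemptionFilter[]` array used in the commit.

```java
import java.util.Arrays;
import java.util.Optional;

/** Hypothetical stand-in for the logging decision; not part of Nutch. */
final class ExemptionFilterLogSketch {

  /**
   * Builds the log message only when at least one filter is present,
   * mirroring the "if (filters.length > 0)" guard added by the commit.
   */
  static Optional<String> describe(Object[] filters) {
    if (filters.length == 0) {
      return Optional.empty(); // nothing would be logged for an empty list
    }
    return Optional.of("Found " + filters.length
        + " URLExemptionFilter implementations: '"
        + Arrays.toString(filters) + "'");
  }

  public static void main(String[] args) {
    // No filters: no message at all.
    System.out.println(describe(new Object[0]).isPresent());
    // One filter: the message names it via Arrays.toString.
    System.out.println(describe(new Object[] {"ExemptionUrlFilter"}).orElse(""));
  }
}
```

Returning an `Optional` rather than calling a logger keeps the sketch free of an SLF4J dependency; in the real class the message goes to `LOG.info`.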
diff --git a/src/plugin/urlfilter-ignoreexempt/README.md b/src/plugin/urlfilter-ignoreexempt/README.md
index a8f932e75..374b29abd 100644
--- a/src/plugin/urlfilter-ignoreexempt/README.md
+++ b/src/plugin/urlfilter-ignoreexempt/README.md
@@ -17,8 +17,8 @@
 
 urlfilter-ignoreexempt
 ==
-  This plugin allows certain urls to be exempted when the external links are configured to be ignored.
-  This is useful when focused crawl is setup but some resources like static files are linked from CDNs (external domains).
+This plugin allows certain urls to be exempted when the external links are configured to be ignored.
+This is useful when focused crawl is setup but some resources like static files are linked from CDNs (external domains).
 
 # How to enable ?
 Add `urlfilter-ignoreexempt` value to `plugin.includes` property
@@ -36,25 +36,21 @@ open `conf/db-ignore-external-exemptions.txt` and add the regex rules.
 ## Format :
 
 The format is same same as `regex-urlfilter.txt`.
- Each non-comment, non-blank line contains a regular expression
- prefixed by '+' or '-'.  The first matching pattern in the file
- determines whether a URL is exempted or ignored.  If no pattern
- matches, the URL is ignored.
-
+Each non-comment, non-blank line contains a regular expression
+prefixed by '+' or '-'.  The first matching pattern in the file
+determines whether a URL is exempted or ignored.  If no pattern
+matches, the URL is ignored.
 
 ## Example :
 
- To exempt urls ending with image extensions, use this rule
+To exempt urls ending with image extensions, use this rule
 
 `+(?i)\.(jpg|png|gif)$`
 
-   
-   
 ## Testing the Rules :
 
 After enabling the plugin and adding your rules to `conf/db-ignore-external-exemptions.txt`, run:
 
 `bin/nutch plugin urlfilter-ignoreexempt  org.apache.nutch.urlfilter.ignoreexempt.ExemptionUrlFilter http://yoururl.here`
 
-
 This should print `true` for urls which are accepted by configured rules.
\ No newline at end of file
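
The rule format described in the README diff above (each rule is a regex prefixed by '+' or '-', the first matching rule decides, and a URL with no match is ignored) can be sketched as below. This is a minimal illustration, not the plugin's implementation: the class `ExemptionRuleSketch` is hypothetical, treating lines that start with `#` as comments is an assumption carried over from `regex-urlfilter.txt`, and using `find()` (a match anywhere in the URL) is likewise an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical first-match evaluator for '+'/'-' prefixed regex rules. */
final class ExemptionRuleSketch {

  /** One parsed rule: the sign and the compiled pattern. */
  private static final class Rule {
    final boolean exempt;
    final Pattern pattern;

    Rule(boolean exempt, Pattern pattern) {
      this.exempt = exempt;
      this.pattern = pattern;
    }
  }

  private final List<Rule> rules = new ArrayList<>();

  ExemptionRuleSketch(List<String> lines) {
    for (String line : lines) {
      // Assumption: blank lines and '#' comments are skipped.
      if (line.trim().isEmpty() || line.startsWith("#")) {
        continue;
      }
      char sign = line.charAt(0);
      if (sign != '+' && sign != '-') {
        throw new IllegalArgumentException("Rule must start with '+' or '-': " + line);
      }
      rules.add(new Rule(sign == '+', Pattern.compile(line.substring(1))));
    }
  }

  /** The first matching rule decides; no match means the URL is not exempted. */
  boolean isExempt(String url) {
    for (Rule rule : rules) {
      if (rule.pattern.matcher(url).find()) {
        return rule.exempt;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    List<String> lines = new ArrayList<>();
    lines.add("# exempt images served from external hosts");
    lines.add("+(?i)\\.(jpg|png|gif)$");
    ExemptionRuleSketch sketch = new ExemptionRuleSketch(lines);
    System.out.println(sketch.isExempt("http://cdn.example.com/logo.PNG"));
    System.out.println(sketch.isExempt("http://cdn.example.com/page.html"));
  }
}
```

Because evaluation stops at the first match, a narrower '-' rule has to come before a broader '+' rule for it to take effect.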
diff --git a/src/plugin/urlfilter-ignoreexempt/src/java/org/apache/nutch/urlfilter/ignoreexempt/ExemptionUrlFilter.java b/src/plugin/urlfilter-ignoreexempt/src/java/org/apache/nutch/urlfilter/ignoreexempt/ExemptionUrlFilter.java
index 96ca9b4ac..8028e3672 100644
--- a/src/plugin/urlfilter-ignoreexempt/src/java/org/apache/nutch/urlfilter/ignoreexempt/ExemptionUrlFilter.java
+++ b/src/plugin/urlfilter-ignoreexempt/src/java/org/apache/nutch/urlfilter/ignoreexempt/ExemptionUrlFilter.java
@@ -25,21 +25,25 @@ import java.io.Reader;
 import java.util.regex.Pattern;
 import java.util.List;
 
-
 /**
- * This implementation of {@link org.apache.nutch.net.URLExemptionFilter} uses regex configuration
- * to check if URL is eligible for exemption from 'db.ignore.external'.
- * When this filter is enabled, the external urls will be checked against configured sequence of regex rules.
+ * This implementation of {@link org.apache.nutch.net.URLExemptionFilter} 
+ * uses regex configuration to check if URL is eligible for exemption from 
+ * the db.ignore.external.links configuration property.
+ * When this filter is enabled, the external urls will be checked 
+ * against confi

(nutch) branch master updated: NUTCH-3054 Address deprecation of Node16 for all GitHub Actions (#817)

2024-04-30 Thread lewismc
This is an automated email from the ASF dual-hosted git repository.

lewismc pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git


The following commit(s) were added to refs/heads/master by this push:
 new 7ac3ce28e NUTCH-3054 Address deprecation of Node16 for all GitHub Actions (#817)
7ac3ce28e is described below

commit 7ac3ce28e065fb5160f96ce7bce1ec840f87d0dc
Author: Lewis John McGibbney 
AuthorDate: Tue Apr 30 07:35:39 2024 -0700

NUTCH-3054 Address deprecation of Node16 for all GitHub Actions (#817)
---
 .github/workflows/master-build.yml | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/.github/workflows/master-build.yml b/.github/workflows/master-build.yml
index e0af58df0..db24168b9 100644
--- a/.github/workflows/master-build.yml
+++ b/.github/workflows/master-build.yml
@@ -30,9 +30,9 @@ jobs:
 os: [ubuntu-latest]
 runs-on: ${{ matrix.os }}
 steps:
-  - uses: actions/checkout@v4
+  - uses: actions/checkout@v4.1.4
   - name: Set up JDK ${{ matrix.java }}
-uses: actions/setup-java@v3
+uses: actions/setup-java@v4.2.1
 with:
   java-version: ${{ matrix.java }}
   distribution: 'temurin'
@@ -45,9 +45,9 @@ jobs:
 os: [ubuntu-latest]
 runs-on: ${{ matrix.os }}
 steps:
-  - uses: actions/checkout@v4
+  - uses: actions/checkout@v4.1.4
   - name: Set up JDK ${{ matrix.java }}
-uses: actions/setup-java@v3
+uses: actions/setup-java@v4.2.1
 with:
   java-version: ${{ matrix.java }}
   distribution: 'temurin'
@@ -68,9 +68,9 @@ jobs:
 os: [ubuntu-latest, macos-latest]
 runs-on: ${{ matrix.os }}
 steps:
-  - uses: actions/checkout@v4
+  - uses: actions/checkout@v4.1.4
   - name: Set up JDK ${{ matrix.java }}
-uses: actions/setup-java@v3
+uses: actions/setup-java@v4.2.1
 with:
   java-version: ${{ matrix.java }}
   distribution: 'temurin'



(nutch) branch master updated: Boostrap Nutch 1.21 development drive.

2024-04-28 Thread lewismc
This is an automated email from the ASF dual-hosted git repository.

lewismc pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git


The following commit(s) were added to refs/heads/master by this push:
 new 817af69d4 Boostrap Nutch 1.21 development drive.
817af69d4 is described below

commit 817af69d451609d725fc7fb040bc32f1fa0052bc
Author: Lewis John McGibbney 
AuthorDate: Sun Apr 28 17:34:10 2024 -0700

Boostrap Nutch 1.21 development drive.
---
 conf/nutch-default.xml | 2 +-
 default.properties | 4 ++--
 src/bin/nutch  | 2 +-
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/conf/nutch-default.xml b/conf/nutch-default.xml
index edcaeb569..c00d9776b 100644
--- a/conf/nutch-default.xml
+++ b/conf/nutch-default.xml
@@ -203,7 +203,7 @@
 
 
   <name>http.agent.version</name>
-  <value>Nutch-1.20-SNAPSHOT</value>
+  <value>Nutch-1.21-SNAPSHOT</value>
   <description>A version string to advertise in the User-Agent
    header.</description>
 
diff --git a/default.properties b/default.properties
index 385e53e57..47041f465 100644
--- a/default.properties
+++ b/default.properties
@@ -14,9 +14,9 @@
 # limitations under the License.
 
 name=apache-nutch
-version=1.20-SNAPSHOT
+version=1.21-SNAPSHOT
 final.name=${name}-${version}
-year=2022
+year=2024
 
 basedir = ./
 src.dir = ./src/java
diff --git a/src/bin/nutch b/src/bin/nutch
index 561c79e77..b3e0a256b 100755
--- a/src/bin/nutch
+++ b/src/bin/nutch
@@ -61,7 +61,7 @@ done
 
 # if no args specified, show usage
 if [ $# = 0 ]; then
-  echo "nutch 1.20-SNAPSHOT"
+  echo "nutch 1.21-SNAPSHOT"
   echo "Usage: nutch COMMAND [-Dproperty=value]... [command-specific args]..."
   echo "where COMMAND is one of:"
   echo "  readdb            read / dump crawl db"



(nutch) branch master updated: Add GitHub CI badge to README

2024-04-28 Thread lewismc
This is an automated email from the ASF dual-hosted git repository.

lewismc pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git


The following commit(s) were added to refs/heads/master by this push:
 new c0b94614c Add GitHub CI badge to README
c0b94614c is described below

commit c0b94614ccf88cf1c55980bebd93bec357a31cac
Author: Lewis John McGibbney 
AuthorDate: Sun Apr 28 10:23:32 2024 -0700

Add GitHub CI badge to README
---
 README.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/README.md b/README.md
index e05f56ccd..28acfe8c7 100644
--- a/README.md
+++ b/README.md
@@ -1,6 +1,8 @@
 Apache Nutch README
 ===
 
+[![master pull request ci](https://github.com/apache/nutch/actions/workflows/master-build.yml/badge.svg)](https://github.com/apache/nutch/actions/workflows/master-build.yml)
+
 <img src="https://nutch.apache.org/assets/img/nutch_logo_tm.png" align="right" width="300" />
 
 For the latest information about Nutch, please visit our website at:



svn commit: r68753 - in /release/nutch: 1.19/ 1.20/apache-nutch-1.20-bin.tar.gz.sha512 1.20/apache-nutch-1.20-bin.zip.sha512 1.20/apache-nutch-1.20-src.tar.gz.sha512 1.20/apache-nutch-1.20-src.zip.sha

2024-04-24 Thread lewismc
Author: lewismc
Date: Thu Apr 25 04:27:39 2024
New Revision: 68753

Log:
Cleanup older Nutch release distributions and add sha512sums for 1.20 release.

Added:
release/nutch/1.20/apache-nutch-1.20-bin.tar.gz.sha512
release/nutch/1.20/apache-nutch-1.20-bin.zip.sha512
release/nutch/1.20/apache-nutch-1.20-src.tar.gz.sha512
release/nutch/1.20/apache-nutch-1.20-src.zip.sha512
Removed:
release/nutch/1.19/
release/nutch/2.4/

Added: release/nutch/1.20/apache-nutch-1.20-bin.tar.gz.sha512
==
--- release/nutch/1.20/apache-nutch-1.20-bin.tar.gz.sha512 (added)
+++ release/nutch/1.20/apache-nutch-1.20-bin.tar.gz.sha512 Thu Apr 25 04:27:39 2024
@@ -0,0 +1 @@
+871dc0a8cbfc61daf84ea08ce6987ffa4cfcec4e24d388ffeffd49e983426ba8dd218bc2cb4eba45e65cfe0e43ae72fad99e70850b83154ca3e86803c6bd1c01  apache-nutch-1.20-bin.tar.gz

Added: release/nutch/1.20/apache-nutch-1.20-bin.zip.sha512
==
--- release/nutch/1.20/apache-nutch-1.20-bin.zip.sha512 (added)
+++ release/nutch/1.20/apache-nutch-1.20-bin.zip.sha512 Thu Apr 25 04:27:39 2024
@@ -0,0 +1 @@
+b37761be4a5464d60ef97c2515944757a33e093d844415c6f0f1f2e0a81076e473cf58879f1e58d499c169b39d74f10a2936eb24d3250bc216ecf167bdaa4f8e  apache-nutch-1.20-bin.zip

Added: release/nutch/1.20/apache-nutch-1.20-src.tar.gz.sha512
==
--- release/nutch/1.20/apache-nutch-1.20-src.tar.gz.sha512 (added)
+++ release/nutch/1.20/apache-nutch-1.20-src.tar.gz.sha512 Thu Apr 25 04:27:39 2024
@@ -0,0 +1 @@
+dfd70c95f6eba5a9c843639433f77c0651e12d9075541330fa5d159b4698192a968d670ea14275a6560707ac22d79ab2bcbfe339ce7d6f51a2f52d90209e5de3  apache-nutch-1.20-src.tar.gz

Added: release/nutch/1.20/apache-nutch-1.20-src.zip.sha512
==
--- release/nutch/1.20/apache-nutch-1.20-src.zip.sha512 (added)
+++ release/nutch/1.20/apache-nutch-1.20-src.zip.sha512 Thu Apr 25 04:27:39 2024
@@ -0,0 +1 @@
+c4407accbcfc1bf67ea0f7121d3d726988c31e2bb90631ec892caf98aeebc946a2c72b303d42fdef020206da4509437ea3dbb5761e46fe541b81f39d4923c5ed  apache-nutch-1.20-src.zip
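
Each `.sha512` file added in this revision holds a single line: a hex SHA-512 digest followed by the archive name. A downloaded archive could be checked against such a line roughly as follows. This is a sketch under assumptions, not a Nutch tool: the class `Sha512LineSketch` and its methods are hypothetical names, and `main` reads the whole archive into memory for brevity.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Locale;

/** Hypothetical checker for "<hex digest>  <file name>" checksum lines. */
final class Sha512LineSketch {

  /** Returns the lowercase hex SHA-512 digest of the given bytes. */
  static String sha512Hex(byte[] data) {
    try {
      MessageDigest md = MessageDigest.getInstance("SHA-512");
      StringBuilder hex = new StringBuilder();
      for (byte b : md.digest(data)) {
        hex.append(String.format("%02x", b));
      }
      return hex.toString();
    } catch (NoSuchAlgorithmException e) {
      // SHA-512 is a required algorithm on every Java platform.
      throw new IllegalStateException(e);
    }
  }

  /** Extracts the digest (first whitespace-separated token) from a checksum line. */
  static String expectedDigest(String checksumLine) {
    return checksumLine.trim().split("\\s+", 2)[0].toLowerCase(Locale.ROOT);
  }

  /** True when the data hashes to the digest recorded in the checksum line. */
  static boolean matches(byte[] data, String checksumLine) {
    return sha512Hex(data).equals(expectedDigest(checksumLine));
  }

  public static void main(String[] args) throws Exception {
    if (args.length != 2) {
      System.err.println("usage: Sha512LineSketch <archive> <archive.sha512>");
      return;
    }
    byte[] archive = Files.readAllBytes(Paths.get(args[0]));
    String line = new String(Files.readAllBytes(Paths.get(args[1])), StandardCharsets.UTF_8);
    System.out.println(matches(archive, line) ? "OK" : "MISMATCH");
  }
}
```

For large archives a streaming read through `DigestInputStream` would avoid loading the whole file; the comparison logic stays the same.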




svn commit: r68752 - /dev/nutch/1.20/ /release/nutch/1.20/

2024-04-24 Thread lewismc
Author: lewismc
Date: Thu Apr 25 02:23:27 2024
New Revision: 68752

Log:
Release Apache Nutch 1.20

Added:
release/nutch/1.20/
  - copied from r68751, dev/nutch/1.20/
Removed:
dev/nutch/1.20/



svn commit: r68410 [1/3] - /dev/nutch/1.20/

2024-04-09 Thread lewismc
Author: lewismc
Date: Tue Apr  9 20:44:40 2024
New Revision: 68410

Log:
Stage Apache Nutch 1.20  RC#1

Added:
dev/nutch/1.20/
dev/nutch/1.20/CHANGES.md
dev/nutch/1.20/apache-nutch-1.20-bin.tar.gz   (with props)
dev/nutch/1.20/apache-nutch-1.20-bin.tar.gz.asc
dev/nutch/1.20/apache-nutch-1.20-bin.zip   (with props)
dev/nutch/1.20/apache-nutch-1.20-bin.zip.asc
dev/nutch/1.20/apache-nutch-1.20-src.tar.gz   (with props)
dev/nutch/1.20/apache-nutch-1.20-src.tar.gz.asc
dev/nutch/1.20/apache-nutch-1.20-src.zip   (with props)
dev/nutch/1.20/apache-nutch-1.20-src.zip.asc



svn commit: r68410 [2/3] - /dev/nutch/1.20/

2024-04-09 Thread lewismc
(snagel)
+
+* NUTCH-2177 Generator produces only one partition even in distributed mode (jnioche, snagel)
+
+* NUTCH-2158 Upgrade to Tika 1.11 (jnioche, snagel)
+
+* NUTCH-2175 Typos in property descriptions in nutch-default.xml (Roannel Fernández Hernández via snagel)
+
+* NUTCH-2069 Ignore external links based on domain (jnioche)
+
+* NUTCH-2173 String.join in FileDumper breaks the build (joyce)
+
+* NUTCH-2166 Add reverse URL format to dump tool (joyce)
+
+* NUTCH-2157 Addressing Miredot REST API Warnings (Sujen Shah)
+
+* NUTCH-2165 FileDumper Util hard codes part-# folder name (joyce)
+
+* NUTCH-2167 Backport TableUtil from 2.x for URL reversing (joyce)
+
+* NUTCH-2160 Upgrade Selenium Java to 2.48.2 (lewismc, kwhitehall)
+
+* NUTCH-2120 Remove MapWritable from trunk codebase (lewismc)
+
+* NUTCH-1911 Improve DomainStatistics tool command line parsing (joyce)
+
+* NUTCH-2064 URLNormalizer basic to encode reserved chars and decode non-reserved chars (markus, snagel)
+
+* NUTCH-2159 Ensure that all WebApp files are copied into generated artifacts for 1.X Webapp (lewismc)
+
+* NUTCH-2154 Nutch REST API (DB) suffering NullPointerException (Aron Ahmadia, Sujen Shah via mattmann)
+

svn commit: r68410 [3/3] - /dev/nutch/1.20/

2024-04-09 Thread lewismc
Added: dev/nutch/1.20/apache-nutch-1.20-bin.tar.gz
==
Binary file - no diff available.

Propchange: dev/nutch/1.20/apache-nutch-1.20-bin.tar.gz
--
svn:mime-type = application/octet-stream

Added: dev/nutch/1.20/apache-nutch-1.20-bin.tar.gz.asc
==
--- dev/nutch/1.20/apache-nutch-1.20-bin.tar.gz.asc (added)
+++ dev/nutch/1.20/apache-nutch-1.20-bin.tar.gz.asc Tue Apr  9 20:44:40 2024
@@ -0,0 +1,16 @@
+-----BEGIN PGP SIGNATURE-----
+
+iQIzBAABCAAdFiEE23tRmRIcCKXI9AUrOkcX8Ei66/YFAmYVpOQACgkQOkcX8Ei6
+6/buExAAwPh4uHBMGPvVUBLztSm5Ze+ZeRjHsxARVmiglyFUCKo9n1ZySHTaoqlW
+3f1I7c79dqrVZyqKMY9O5BjdA5K0w7scz3klHNOdrUc5Zal8GSY52sbOXq+CLka0
+fEYz3H3BMfB1eDn8F+dtFcYgfKqatVf+sFbvLdzfeorLzURZha/07WsGiXAtc629
+dOuNb9mweE5+BlEaeIm3ypYww294KZEvtQstouuvdal86Gm94KCenVb989CofQLb
+RHamuxjmVDOtb22G+PqCEFfPWZ3HSz9eOqzqn133glR88soWwG468MxzLAJZXpDU
+uB05ENvozkcIngj/emSZFy7Y1sY81VH0ErLxbxZDCIssxpVnOwI6N+5Un00T/nMz
+VbUeXv1Zq9XY2SHDZr9AP8wiWre4ae5wp2NAMVD2zlcTVo66jbDEiNSCzKmK/pPe
+gdexcS47lXQjCCYYe6rnUO8T5wEAeVn2Ctp+1mdjfDamN7liNExzvPtoUg07uDyx
+TM48F+5Es1c9wYC3nVyUvqadfKWFnCqfPIPogEeNTH5mwWTAtaXCcPcib+GxoCd+
+k5x5BEmB6wyQbmTKLjSVdDI6DL+suO4MtlIw1/2yHnj4uMPnAvABnG8uBKp2sCMc
+3GlQWJ5FiadkXASf6bbCv5+2iQof1BhRGJAu5PvYjRGEASG3IhM=
+=dpeR
+-----END PGP SIGNATURE-----

Added: dev/nutch/1.20/apache-nutch-1.20-bin.zip
==
Binary file - no diff available.

Propchange: dev/nutch/1.20/apache-nutch-1.20-bin.zip
--
svn:mime-type = application/octet-stream

Added: dev/nutch/1.20/apache-nutch-1.20-bin.zip.asc
==
--- dev/nutch/1.20/apache-nutch-1.20-bin.zip.asc (added)
+++ dev/nutch/1.20/apache-nutch-1.20-bin.zip.asc Tue Apr  9 20:44:40 2024
@@ -0,0 +1,16 @@
+-----BEGIN PGP SIGNATURE-----
+
+iQIzBAABCAAdFiEE23tRmRIcCKXI9AUrOkcX8Ei66/YFAmYVpUwACgkQOkcX8Ei6
+6/aj4RAAqeXW9QsddsFuxVu2el37aZhV4HOsGsCX66G/wxz5nj5s34O41IKxTPrv
+SJ0XRoekQ304uGYziAzDtDQUyXfAFo7gpF3w5TgK+5f8Mz8piPiW80uIMZYaUgXV
+kAr6dYlbLPtcbyzspxCBHFZlHPf0MC6YtnaHPFq5B9LBjLl3nE+u1HkCUlHjWm84
+dQqijPyaiFyYGhsuU4/xaAJcgluUNcQlmAcY6125vOtMGKJqHdTVU/rZvJ30Ym0V
+/k92t6+CgU4y8a/JyOToNFRD0f+3aGGNQUXKZIvAenzNIugv5wlubxF/CRht+J5L
+0bU48GcZjboNknKBc8tMewBwhHpAGAL5O5AS92j8naWUrZ1Wkur1y3EL7wiS39xJ
+fI0BRrTNcVapOoUnoQuXtxpoqRjiBmC2sEP9nH9T5dHNZaDljOielB4gi+1SGYYR
+DXiIpe6i/bMjMEO14At3ACwIoXknLo/gPQKUaIGQUTb+rlrFbZWVByZvcO826Az6
+0eEllycEzdvLpn0wv03zJhz9KwzJJCFJ4jgip/LIN5UXFHhUjzWykdJ2HUxHXq3v
+1zjee9o3/K0UqUn07d/rIG3pNdteja4PDo0AmLt2l/B8Pfi0pnZj9LjbL5DIWcNp
+oe41Ew6RFL7hjRZV2HwwBSmYCHNUSoL5HCR9dk10PcQFrH6phW0=
+=nfnj
+-----END PGP SIGNATURE-----

Added: dev/nutch/1.20/apache-nutch-1.20-src.tar.gz
==
Binary file - no diff available.

Propchange: dev/nutch/1.20/apache-nutch-1.20-src.tar.gz
--
svn:mime-type = application/octet-stream

Added: dev/nutch/1.20/apache-nutch-1.20-src.tar.gz.asc
==
--- dev/nutch/1.20/apache-nutch-1.20-src.tar.gz.asc (added)
+++ dev/nutch/1.20/apache-nutch-1.20-src.tar.gz.asc Tue Apr  9 20:44:40 2024
@@ -0,0 +1,16 @@
+-----BEGIN PGP SIGNATURE-----
+
+iQIzBAABCAAdFiEE23tRmRIcCKXI9AUrOkcX8Ei66/YFAmYVpPwACgkQOkcX8Ei6
+6/ZUGhAAjocHBJYQynpMuU+Geai8TC2sVBGUt33VuDPG5fHVnq5Y/QiwK3B/AL0u
+DtQdcajwnym3QMYBq0ZzzjOqXtE0B0Awwsz14KQYt+43AMpakLsVXBysZDXOTTcm
+yrSc3IJEYvxlDQg0DA9uU4qpw5AHcEP3gzQ5tqA8X9V0EWejf82+KRjpJmKwJi1j
+hS1rIdY0cCd15Ibo+jCf7PMSWZqYcEUdivy9+h1Zm+hV5mv49TMm4Js+fsNQrFyh
+2dS5EZSvommodgP4hjKCpW7EkNRcl20ZmlVntLNhULTEXDd8CCpweg/7iSNo0hD/
+MWS2YMtY2zf2lnid217YNhSG1a2LprZ3sqmMtEcM0/F8PsOrA1p1klsuTz6+S2FO
+ei89JdVQvOJbh6PdeaNkQqBTnc06seNQLTF+6iLtCPVQ3mojFJhqgnaMWP3W20A+
+ZElNLRe0Jw//5aX19YZilRoxAwA3aAxXSXIeNk9TukiRPOqvevxORDoXy3INosYj
+/8HrSESOXsZyCIyOQzHExYNDQA/SkH8BisxY9aVDDmJyaKTXgWAaraLVn1+/6thX
+zGhT3M349+bSrfR4BiMO7Cg3r0VcMgUkcfIUPfZtpLtOIV9bs+rGrxWlujor1vC6
+eS3hfSjMbQHLR3UuLMFRhWIAiunXAMHqnrRwWK20vOy5LiJo70I=
+=lrhO
+-----END PGP SIGNATURE-----

Added: dev/nutch/1.20/apache-nutch-1.20-src.zip
==
Binary file - no diff available.

Propchange: dev/nutch/1.20/apache-nutch-1.20-src.zip
--
svn:mime-type = application/octet-stream

Added: dev/nutch/1.20/apache-nutch-1.20-src.zip.asc
==
--- dev/nutch/1.20/apache-nutch-1.20-src.zip.asc (added)
+++ 

(nutch) annotated tag release-1.20 updated (a2cb6aa5d -> 6510cb241)

2024-04-09 Thread lewismc
This is an automated email from the ASF dual-hosted git repository.

lewismc pushed a change to annotated tag release-1.20
in repository https://gitbox.apache.org/repos/asf/nutch.git


*** WARNING: tag release-1.20 was modified! ***

from a2cb6aa5d (commit)
  to 6510cb241 (tag)
 tagging a2cb6aa5d3e90b7249e47323f2fa4cbf2aa9fa27 (commit)
 replaces release-1.13
  by Lewis John McGibbney
  on Tue Apr 9 09:44:29 2024 -0700

- Log -
Apache Nutch 1.20 RC#1 Tag
---


No new revisions were added by this update.

Summary of changes:



(nutch) branch branch-1.20 updated: Prepare Nutch 1.20 release candidate

2024-04-09 Thread lewismc
This is an automated email from the ASF dual-hosted git repository.

lewismc pushed a commit to branch branch-1.20
in repository https://gitbox.apache.org/repos/asf/nutch.git


The following commit(s) were added to refs/heads/branch-1.20 by this push:
 new a2cb6aa5d Prepare Nutch 1.20 release candidate
a2cb6aa5d is described below

commit a2cb6aa5d3e90b7249e47323f2fa4cbf2aa9fa27
Author: Lewis John McGibbney 
AuthorDate: Tue Apr 9 09:23:24 2024 -0700

Prepare Nutch 1.20 release candidate
---
 ivy/mvn.template | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/ivy/mvn.template b/ivy/mvn.template
index fafc79f83..43ecfbd6a 100644
--- a/ivy/mvn.template
+++ b/ivy/mvn.template
@@ -45,7 +45,7 @@
 https://github.com/apache/nutch.git
   
 
-  2
+  
  
   maven2 
   https://repo.maven.apache.org/maven2/ 



(nutch) branch branch-1.20 created (now f141a398c)

2024-04-09 Thread lewismc
This is an automated email from the ASF dual-hosted git repository.

lewismc pushed a change to branch branch-1.20
in repository https://gitbox.apache.org/repos/asf/nutch.git


  at f141a398c Prepare Nutch 1.20 release candidate

This branch includes the following new commits:

 new f141a398c Prepare Nutch 1.20 release candidate

The 1 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.




(nutch) 01/01: Prepare Nutch 1.20 release candidate

2024-04-09 Thread lewismc
This is an automated email from the ASF dual-hosted git repository.

lewismc pushed a commit to branch branch-1.20
in repository https://gitbox.apache.org/repos/asf/nutch.git

commit f141a398c1c0c4e2a1861cd2928fff6a58f53b1f
Author: Lewis John McGibbney 
AuthorDate: Tue Apr 9 09:16:40 2024 -0700

Prepare Nutch 1.20 release candidate
---
 .gitignore |   2 +
 CHANGES.md | 157 +
 conf/nutch-default.xml |   2 +-
 default.properties |   4 +-
 src/bin/nutch  |   2 +-
 5 files changed, 163 insertions(+), 4 deletions(-)

diff --git a/.gitignore b/.gitignore
index 8c521aa68..972a7cfcb 100644
--- a/.gitignore
+++ b/.gitignore
@@ -26,3 +26,5 @@ lib/spotbugs-*
 ivy/dependency-check-ant/*
 .gradle*
 ivy/apache-rat-*
+ivy/maven-ant-tasks-*
+pom.xml
diff --git a/CHANGES.md b/CHANGES.md
index adea4478f..0e9a0cf45 100644
--- a/CHANGES.md
+++ b/CHANGES.md
@@ -1,5 +1,162 @@
 # Nutch Change Log
 
+
+Nutch 1.20 Release 09/04/2024 (dd/mm/yyyy)
+Release Report: https://s.apache.org/ovjf3
+
+Sub-task
+
+
+[NUTCH-2596] - Upgrade from org.mortbay.jetty to org.eclipse.jetty
+
+[NUTCH-2852] - Method invokes System.exit(...) 9 bugs
+
+[NUTCH-2972] - Javadoc build fails using JDK 17
+
+[NUTCH-3007] - Fix impossible casts
+
+
+
+Bug
+
+
+[NUTCH-2634] - Some links marked as nofollow are followed anyway.
+
+[NUTCH-2820] - Review sample files used in any23 unit tests
+
+[NUTCH-2924] - Generate maxCount expr evaluated only once
+
+[NUTCH-2937] - parse-tika: review dependency exclusions and avoid dependency conflicts in distributed mode
+
+[NUTCH-2973] - Single domain names (eg https://localnet) can't be crawled - filtering fails
+
+[NUTCH-2974] - Ant build fails with Unparseable date on certain platforms
+
+[NUTCH-2979] - Upgrade Commons Text to 1.10.0
+
+[NUTCH-2982] - Generator: parameter for URL normalization not passed forward
+
+[NUTCH-2985] - Disable plugin urlfilter-validator by default
+
+[NUTCH-2992] - Fetcher: always block fetch queues when exceptions threshold is reached
+
+[NUTCH-3000] - protocol-selenium returns only the body,strips off the <head/> element
+
+[NUTCH-3001] - protocol-selenium requires Content-Type header
+
+[NUTCH-3002] - Protocol-okhttp HttpResponse: HTTP header metadata lookup should be case-insensitive
+
+[NUTCH-3008] - indexer-elastic: downgrade to ES 7.10.2 to address licensing issues
+
+[NUTCH-3012] - SegmentReader when dumping with option -recode: NPE on unparsed documents
+
+[NUTCH-3027] - Trivial resource leak patch in DomainSuffixes.java
+
+[NUTCH-3035] - Update license and notice file for release of 1.20
+
+
+
+New Feature
+
+
+[NUTCH-2832] - Create tutorial on sending Nutch logs to Elasticsearch
+
+[NUTCH-2888] - Selenium Protocol: Support for Selenium 4
+
+[NUTCH-2920] - Implement a indexer-opensearch plugin
+
+[NUTCH-2991] - Support HTTP/S Header Authorization for Solr connections
+
+[NUTCH-3029] - Host specific max. and min. intervals in adaptive scheduler
+
+
+
+Improvement
+
+
+[NUTCH-2853] - bin/nutch: remove deprecated commands solrindex, solrdedup, solrclean
+
+[NUTCH-2883] - Provide means to run server as a persistent service in Docker container
+
+[NUTCH-2897] - Do not supress deprecated API warnings
+
+[NUTCH-2961] - Upgrade dependencies of parsefilter-naivebayes
+
+[NUTCH-2980] - Upgrade Selenium Java to 4.7.2
+
+[NUTCH-2983] - nutch-default.xml improvements
+
+[NUTCH-2990] - HttpRobotRulesParser to follow 5 redirects as specified by RFC 9309
+
+[NUTCH-2993] - ScoringDepth plugin to skip depth check based on URL Pattern
+
+[NUTCH-2995] - Upgrade to crawler-commons 1.4
+
+[NUTCH-2996] - Use new SimpleRobotRulesParser API entry point (crawler-commons 1.4)
+
+[NUTCH-2997] - Add Override annotations where applicable
+
+[NUTCH-3004] - Avoid NPE in HttpResponse
+
+[NUTCH-3005] - Upgrade selenium as needed
+
+[NUTCH-3009] - Upgrade to Hadoop 3.3.6
+
+[NUTCH-3010] - Injector: count unique number of injected URLs
+
+[NUTCH-3011] - HttpRobotRulesParser: handle HTTP 429 Too Many Requests same as server errors (HTTP 5xx)
+
+[NUTCH-3013] - Employ commons-lang3's StopWatch to simplify timing logic
+
+[NUTCH-3014] - Standardize Job names
+
+[NUTCH-3015] - Add more CI steps to GitHub master-build.yml
+
+[NUTCH-3017] - Allow fast-urlfilter to load from HDFS/S3 and support gzipped input
+
+[NUTCH-3025] - urlfilter-fast to filter based on the length of the URL
+
+[NUTCH-3031] - ProtocolFactory host mapper to support domains

(nutch) branch master updated: NUTCH-3038 Address issues discovered during 1.20 release management dryrun (#811)

2024-04-08 Thread lewismc
This is an automated email from the ASF dual-hosted git repository.

lewismc pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git


The following commit(s) were added to refs/heads/master by this push:
 new 271f92e11 NUTCH-3038 Address issues discovered during 1.20 release management dryrun (#811)
271f92e11 is described below

commit 271f92e11c39b7a3583cfcd8d664262cfac59674
Author: Lewis John McGibbney 
AuthorDate: Mon Apr 8 16:21:13 2024 -0700

NUTCH-3038 Address issues discovered during 1.20 release management dryrun (#811)
---
 CHANGES.txt => CHANGES.md |  0
 build.xml |  6 +++---
 docker/Dockerfile |  2 +-
 docker/README.md  |  3 +--
 ivy/mvn.template  | 37 +++--
 5 files changed, 8 insertions(+), 40 deletions(-)

diff --git a/CHANGES.txt b/CHANGES.md
similarity index 100%
rename from CHANGES.txt
rename to CHANGES.md
diff --git a/build.xml b/build.xml
index 49187d3ba..845bdfce8 100644
--- a/build.xml
+++ b/build.xml
@@ -329,7 +329,7 @@
 
 
   
-  
+  
   
   
   
@@ -340,7 +340,7 @@
 
 
   
-  
+  
   
   
   
@@ -352,7 +352,7 @@
 
 
   
-  
+  
   
   
   
diff --git a/docker/Dockerfile b/docker/Dockerfile
index fb93fe98a..2eb218bad 100644
--- a/docker/Dockerfile
+++ b/docker/Dockerfile
@@ -22,7 +22,7 @@
 #  2 == Same as mode 1 with addition of Nutch WebApp
 ARG BUILD_MODE=0
 
-FROM alpine:3.13 AS base
+FROM alpine:3.19 AS base
 
 ARG SERVER_PORT=8081
 ARG SERVER_HOST=0.0.0.0
diff --git a/docker/README.md b/docker/README.md
index c8330bf9b..80e1a1d6d 100644
--- a/docker/README.md
+++ b/docker/README.md
@@ -3,7 +3,6 @@
 ![Docker Pulls](https://img.shields.io/docker/pulls/apache/nutch?style=for-the-badge)
 ![Docker Image Size (latest by date)](https://img.shields.io/docker/image-size/apache/nutch?style=for-the-badge)
 ![Docker Image Version (latest semver)](https://img.shields.io/docker/v/apache/nutch?style=for-the-badge)
-![MicroBadger Layers](https://img.shields.io/microbadger/layers/apache/nutch?style=for-the-badge)
 ![Docker Stars](https://img.shields.io/docker/stars/apache/nutch?style=for-the-badge)
 ![Docker Automated build](https://img.shields.io/docker/automated/apache/nutch?style=for-the-badge)
 
@@ -25,7 +24,7 @@ Current configuration of this image consists of components:
 
 ##  Base Image
 
-* [alpine:3.13](https://hub.docker.com/_/alpine/)
+* [alpine:3.19](https://hub.docker.com/_/alpine/tags)
 
 ## Tips
 
diff --git a/ivy/mvn.template b/ivy/mvn.template
index b38b37f6d..fafc79f83 100644
--- a/ivy/mvn.template
+++ b/ivy/mvn.template
@@ -22,7 +22,7 @@
   
 org.apache
 apache
-23
+31
   
   ${ivy.pom.groupId}
   ${ivy.pom.artifactId}
@@ -45,12 +45,7 @@
 https://github.com/apache/nutch.git
   
 
-  
-
-  miredot
-  MireDot Releases
-  http://nexus.qmino.com/content/repositories/miredot
-
+  2
  
   maven2 
   https://repo.maven.apache.org/maven2/ 
@@ -128,7 +123,7 @@
 
   org.apache.maven.plugins
   maven-compiler-plugin
-  3.8.1
+  3.13.0
   
 11
 11
@@ -136,31 +131,5 @@
 
   
 
-
-  
-com.qmino
-miredot-plugin
-2.4.0
-
-  
-
-  restdoc
-
-  
-
-
-  
cHJvamVjdHxvcmcuYXBhY2hlLm51dGNoLm51dGNofDIwMTktMTAtMzB8dHJ1ZXwtMSNNQ3dDRkJMb0FjM283ME1YRERRMkFJemY1QmxZUjAwK0FoUkJVMlJrVi81RlBQc25zMUZ2S2g0Q29weGFxZz09
-  
-
-  jax-rs
-
-  
-  
-
-  
-  
-
-  
-
   
 



(nutch) branch branch-1.20 deleted (was 9cfe3d7f9)

2024-04-05 Thread lewismc
This is an automated email from the ASF dual-hosted git repository.

lewismc pushed a change to branch branch-1.20
in repository https://gitbox.apache.org/repos/asf/nutch.git


 was 9cfe3d7f9 Prepare for Nutch 1.20 release

This change permanently discards the following revisions:

 discard 9cfe3d7f9 Prepare for Nutch 1.20 release



(nutch) 01/01: Prepare for Nutch 1.20 release

2024-04-05 Thread lewismc
This is an automated email from the ASF dual-hosted git repository.

lewismc pushed a commit to branch branch-1.20
in repository https://gitbox.apache.org/repos/asf/nutch.git

commit 9cfe3d7f9bf46a71f5473d7afb1dfc71f7ff2c1b
Author: Lewis John McGibbney 
AuthorDate: Fri Apr 5 19:33:51 2024 -0700

Prepare for Nutch 1.20 release
---
 CHANGES.txt| 150 +
 conf/nutch-default.xml |   2 +-
 default.properties |   4 +-
 src/bin/nutch  |   2 +-
 4 files changed, 154 insertions(+), 4 deletions(-)

diff --git a/CHANGES.txt b/CHANGES.txt
index adea4478f..6b032d798 100644
--- a/CHANGES.txt
+++ b/CHANGES.txt
@@ -1,5 +1,155 @@
 # Nutch Change Log
 
+Nutch 1.20 Release 05/04/2024 (dd/mm/yyyy)
+Release Report: https://s.apache.org/arvtl
+
+Release Notes - Nutch - Version 1.20
+
+Sub-task
+
+
+[NUTCH-2596] - Upgrade from org.mortbay.jetty to org.eclipse.jetty
+
+[NUTCH-2852] - Method invokes System.exit(...) 9 bugs
+
+[NUTCH-2972] - Javadoc build fails using JDK 17
+
+[NUTCH-3007] - Fix impossible casts
+
+
+
+Bug
+
+
+[NUTCH-2634] - Some links marked as nofollow are followed anyway.
+
+[NUTCH-2820] - Review sample files used in any23 unit tests
+
+[NUTCH-2924] - Generate maxCount expr evaluated only once
+
+[NUTCH-2973] - Single domain names (eg https://localnet) can't be crawled - filtering fails
+
+[NUTCH-2974] - Ant build fails with Unparseable date on certain platforms
+
+[NUTCH-2979] - Upgrade Commons Text to 1.10.0
+
+[NUTCH-2982] - Generator: parameter for URL normalization not passed forward
+
+[NUTCH-2985] - Disable plugin urlfilter-validator by default
+
+[NUTCH-2992] - Fetcher: always block fetch queues when exceptions threshold is reached
+
+[NUTCH-3000] - protocol-selenium returns only the body,strips off the <head/> element
+
+[NUTCH-3001] - protocol-selenium requires Content-Type header
+
+[NUTCH-3002] - Protocol-okhttp HttpResponse: HTTP header metadata lookup should be case-insensitive
+
+[NUTCH-3008] - indexer-elastic: downgrade to ES 7.10.2 to address licensing issues
+
+[NUTCH-3012] - SegmentReader when dumping with option -recode: NPE on unparsed documents
+
+[NUTCH-3027] - Trivial resource leak patch in DomainSuffixes.java
+
+[NUTCH-3035] - Update license and notice file for release of 1.20
+
+
+
+New Feature
+
+
+[NUTCH-2832] - Create tutorial on sending Nutch logs to Elasticsearch
+
+[NUTCH-2888] - Selenium Protocol: Support for Selenium 4
+
+[NUTCH-2920] - Implement a indexer-opensearch plugin
+
+[NUTCH-2991] - Support HTTP/S Header Authorization for Solr connections
+
+[NUTCH-3029] - Host specific max. and min. intervals in adaptive scheduler
+
+
+
+Improvement
+
+
+[NUTCH-2853] -   
  bin/nutch: remove deprecated commands solrindex, solrdedup, solrclean
+
+[NUTCH-2883] -   
  Provide means to run server as a persistent service in Docker container
+
+[NUTCH-2897] -   
  Do not suppress deprecated API warnings
+
+[NUTCH-2961] -   
  Upgrade dependencies of parsefilter-naivebayes
+
+[NUTCH-2980] -   
  Upgrade Selenium Java to 4.7.2
+
+[NUTCH-2983] -   
  nutch-default.xml improvements
+
+[NUTCH-2990] -   
  HttpRobotRulesParser to follow 5 redirects as specified by RFC 9309
+
+[NUTCH-2993] -   
  ScoringDepth plugin to skip depth check based on URL Pattern
+
+[NUTCH-2995] -   
  Upgrade to crawler-commons 1.4
+
+[NUTCH-2996] -   
  Use new SimpleRobotRulesParser API entry point (crawler-commons 1.4)
+
+[NUTCH-2997] -   
  Add Override annotations where applicable
+
+[NUTCH-3004] -   
  Avoid NPE in HttpResponse
+
+[NUTCH-3009] -   
  Upgrade to Hadoop 3.3.6
+
+[NUTCH-3010] -   
  Injector: count unique number of injected URLs
+
+[NUTCH-3011] -   
  HttpRobotRulesParser: handle HTTP 429 Too Many Requests same as server errors 
(HTTP 5xx)
+
+[NUTCH-3013] -   
  Employ commons-lang3's StopWatch to simplify timing logic
+
+[NUTCH-3014] -   
  Standardize Job names
+
+[NUTCH-3015] -   
  Add more CI steps to GitHub master-build.yml
+
+[NUTCH-3017] -   
  Allow fast-urlfilter to load from HDFS/S3 and support gzipped input
+
+[NUTCH-3025] -   
  urlfilter-fast to filter based on the length of the URL
+
+[NUTCH-3031] -   
  ProtocolFactory host mapper to support domains
+
+[NUTCH-3032] -   
  Indexing plugin as an adapter for end user's own POJO instances
+
+[NUTCH-3036] -   
  Upgrade org.seleniumhq.selenium:selenium-java dependency in lib-selenium
+
+
+
+Task
+
+
+[NUTCH-2959] -   
  Upgrade to Apache Tika 2.9.0
+
+[NUTCH-2977] -   
  Support for showing dependency tree
+
+[NUTCH-2978] -   
  Move to slf4j2 and remove

(nutch) branch branch-1.20 created (now 9cfe3d7f9)

2024-04-05 Thread lewismc

lewismc pushed a change to branch branch-1.20
in repository https://gitbox.apache.org/repos/asf/nutch.git


  at 9cfe3d7f9 Prepare for Nutch 1.20 release

This branch includes the following new commits:

 new 9cfe3d7f9 Prepare for Nutch 1.20 release

The 1 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.




(nutch) branch master updated: NUTCH-3032 Code for an ArbitraryIndexingFilter to index values resolved by user POJO code at index time (#810)

2024-04-04 Thread lewismc

lewismc pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git


The following commit(s) were added to refs/heads/master by this push:
 new c9e2f4ed6 NUTCH-3032 Code for an ArbitraryIndexingFilter to index 
values resolved by user POJO code at index time (#810)
c9e2f4ed6 is described below

commit c9e2f4ed693014e9dcb9d6f68ae918e0c0eedd26
Author: Joe Gilvary 
AuthorDate: Thu Apr 4 12:06:19 2024 -0400

NUTCH-3032 Code for an ArbitraryIndexingFilter to index values resolved by 
user POJO code at index time (#810)
---
 build.xml  |   4 +
 conf/nutch-default.xml |  66 +
 src/plugin/build.xml   |   3 +
 src/plugin/index-arbitrary/build.xml   |  22 ++
 src/plugin/index-arbitrary/ivy.xml |  39 +++
 src/plugin/index-arbitrary/plugin.xml  |  42 +++
 .../indexer/arbitrary/ArbitraryIndexingFilter.java | 286 +
 .../nutch/indexer/arbitrary/package-info.java  |  23 ++
 .../org/apache/nutch/indexer/arbitrary/Echo.java   |  40 +++
 .../apache/nutch/indexer/arbitrary/Multiplier.java |  47 
 .../arbitrary/TestArbitraryIndexingFilter.java | 222 
 11 files changed, 794 insertions(+)

diff --git a/build.xml b/build.xml
index 0a18682f8..49187d3ba 100644
--- a/build.xml
+++ b/build.xml
@@ -203,6 +203,7 @@
   
   
   
+  
   
   
   
@@ -646,6 +647,7 @@
   
   
   
+  
   
   
   
@@ -1173,6 +1175,8 @@
 
 
 
+
+
 
 
 
diff --git a/conf/nutch-default.xml b/conf/nutch-default.xml
index 8b24f092a..edcaeb569 100644
--- a/conf/nutch-default.xml
+++ b/conf/nutch-default.xml
@@ -2252,6 +2252,72 @@ CAUTION: Set the parser.timeout to -1 or a bigger value 
than 30, when using this
   
 
 
+
+
+  index.arbitrary.function.count
+  
+  The count of arbitrary additions/edits to the document.
+Specify the remaining properties (fieldName, className, constructorArgs,
+methodName, and methodArgs) independently in this file by appending a
+dot (.) followed by integer numerals (beginning with '0') to the property
+names, e.g.:
+
+index.arbitrary.fieldName.0
+for the field to add/set with the first arbitrary addition or:
+
+index.arbitrary.className.3
+for the POJO class name to use in setting the fourth arbitrary addition.
+  
+
+
+
+  index.arbitrary.fieldName.0
+  
+  The name of the field to add to the document with the value
+returned from the custom POJO.
+
+
+
+  index.arbitrary.className.0
+  
+  The fully qualified name of the POJO class that will supply
+values for the new field.
+
+
+
+  index.arbitrary.constructorArgs.0
+  
+  The values (as strings) to pass into the POJO constructor.
+The POJO must accept a String representation of the NutchDocument's URL
+as the first parameter in the constructor. The values you specify here 
+will populate the constructor arguments 1,..,n-1 where n=the count of
+arguments to the constructor. Argument #0 will be the NutchDocument's URL.
+  
+
+
+
+  index.arbitrary.methodName.0
+  
+  The name of the method to invoke on the instance of your custom
+class in order to determine the value to add to the document.
+  
+
+
+  index.arbitrary.methodArgs.0
+  
+  The values (as strings) to pass into the named method on the 
POJO
+instance. Unlike the constructor args, there is no required argument that 
this
+method in the POJO must accept, i.e., the Arbitrary Indexer doesn't supply 
any
+arguments taken from the NutchDocument values by default.
+
+
+
+  index.arbitrary.overwrite.0
+  Whether to overwrite any existing value in the doc for
+fieldName. Default is false if not specified in config.
+  
+
+
 
 
   metatags.names
diff --git a/src/plugin/build.xml b/src/plugin/build.xml
index 34688ed56..498259a95 100755
--- a/src/plugin/build.xml
+++ b/src/plugin/build.xml
@@ -40,6 +40,7 @@
 
 
 
+
 
 
 
@@ -117,6 +118,7 @@
  
  
  
+ 
  
  
  
@@ -179,6 +181,7 @@
 
 
 
+
 
 
 
diff --git a/src/plugin/index-arbitrary/build.xml 
b/src/plugin/index-arbitrary/build.xml
new file mode 100644
index 0..818020c84
--- /dev/null
+++ b/src/plugin/index-arbitrary/build.xml
@@ -0,0 +1,22 @@
+
+
+
+
+  
+
+
diff --git a/src/plugin/index-arbitrary/ivy.xml 
b/src/plugin/index-arbitrary/ivy.xml
new file mode 100644
index 0..9feb1e1b4
--- /dev/null
+++ b/src/plugin/index-arbitrary/ivy.xml
@@ -0,0 +1,39 @@
+
+
+
+  
+
+https://nutch.apache.org/"/>
+
+Apache Nutch
+
+  
+
+  
+
+  
+
+  
+
+
+  
+
+  
+  
+  
+
diff --git a/src/plugin/index-arbitrary/plugin.xml 
b/src/plugin/index-arbitrary/plugin.xml
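
The index.arbitrary.* properties added to conf/nutch-default.xml in the commit
above describe a contract for the user's POJO: the constructor receives the
document URL as argument #0, followed by the configured constructorArgs, and
the method named by methodName receives the configured methodArgs. The sketch
below is one way to picture that contract. HostLabel, its label method, and
ArbitraryCallSketch are hypothetical names invented for illustration, and the
reflection lookup assumes every argument is passed as a separate String
parameter; it is not taken from ArbitraryIndexingFilter, whose source is not
shown in this message.

```java
import java.lang.reflect.Constructor;
import java.lang.reflect.Method;
import java.net.URI;
import java.util.Arrays;

/** Hypothetical POJO; the class and method names are illustrative only. */
class HostLabel {
  private final String host;
  private final String prefix;

  /** Argument #0 is the document URL; prefix would come from constructorArgs. */
  public HostLabel(String url, String prefix) {
    String h = URI.create(url).getHost(); // throws on syntactically invalid URLs
    this.host = (h == null) ? "" : h;
    this.prefix = prefix;
  }

  /** Would be named by methodName; suffix would come from methodArgs. */
  public String label(String suffix) {
    return prefix + host + suffix;
  }
}

/** Reflection sketch of the lookup; an assumption, not the plugin's code. */
class ArbitraryCallSketch {
  static Object call(Class<?> pojo, String url, String[] ctorArgs,
      String methodName, String[] methodArgs) {
    // Prepend the URL so it becomes constructor argument #0.
    Object[] fullCtorArgs = new Object[ctorArgs.length + 1];
    fullCtorArgs[0] = url;
    System.arraycopy(ctorArgs, 0, fullCtorArgs, 1, ctorArgs.length);
    try {
      Constructor<?> ctor = pojo.getConstructor(stringTypes(fullCtorArgs.length));
      Object instance = ctor.newInstance(fullCtorArgs);
      Method method = pojo.getMethod(methodName, stringTypes(methodArgs.length));
      return method.invoke(instance, (Object[]) methodArgs);
    } catch (ReflectiveOperationException e) {
      throw new IllegalStateException(e);
    }
  }

  /** Builds a parameter-type array of n String types. */
  private static Class<?>[] stringTypes(int n) {
    Class<?>[] types = new Class<?>[n];
    Arrays.fill(types, String.class);
    return types;
  }

  public static void main(String[] args) {
    Object value = call(HostLabel.class, "https://example.org/page",
        new String[] {"site:"}, "label", new String[] {"/indexed"});
    System.out.println(value);
  }
}
```

Under these assumptions, the value returned by call is what would be stored
in the field named by index.arbitrary.fieldName.0, subject to the
index.arbitrary.overwrite.0 setting.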

(nutch) branch master updated (5a95bc653 -> 1563396d9)

2024-03-30 Thread lewismc

lewismc pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git


from 5a95bc653 NUTCH-3035 Update license and notice file for release of 
1.20 (#808)
 add 1563396d9 NUTCH-3036 Upgrade org.seleniumhq.selenium:selenium-java 
dependency i… (#807)

No new revisions were added by this update.

Summary of changes:
 README.md  |   1 +
 .../nutch/protocol/htmlunit/HtmlUnitWebDriver.java |  27 ++--
 .../apache/nutch/protocol/http/api/HttpBase.java   |  63 -
 src/plugin/lib-selenium/README.md  |   2 +-
 src/plugin/lib-selenium/howto_upgrade_selenium.md  |  34 +++--
 src/plugin/lib-selenium/ivy.xml|   2 +-
 src/plugin/lib-selenium/plugin.xml | 147 ++---
 .../nutch/protocol/selenium/HttpWebClient.java |  82 +---
 src/plugin/protocol-interactiveselenium/README.md  |   4 +-
 src/plugin/protocol-selenium/README.md |   2 +-
 .../org/apache/nutch/protocol/selenium/Http.java   |  16 +--
 11 files changed, 144 insertions(+), 236 deletions(-)



(nutch) branch master updated (3905a8df7 -> 5a95bc653)

2024-03-30 Thread lewismc

lewismc pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git


from 3905a8df7 NUTCH-3037 Upgrade org.apache.kafka:kafka_2.12: to v3.7.0 
(#809)
 add 5a95bc653 NUTCH-3035 Update license and notice file for release of 
1.20 (#808)

No new revisions were added by this update.

Summary of changes:
 LICENSE-binary | 193 +++--
 NOTICE-binary  | 667 +
 2 files changed, 416 insertions(+), 444 deletions(-)



(nutch) branch master updated (367988dfd -> 3905a8df7)

2024-03-30 Thread lewismc

lewismc pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git


from 367988dfd NUTCH-3008 indexer-elastic: downgrade to ES 7.10.2 to 
address licensing issues
 add 3905a8df7 NUTCH-3037 Upgrade org.apache.kafka:kafka_2.12: to v3.7.0 
(#809)

No new revisions were added by this update.

Summary of changes:
 src/plugin/indexer-kafka/ivy.xml|  4 +-
 src/plugin/indexer-kafka/plugin.xml | 73 +
 2 files changed, 59 insertions(+), 18 deletions(-)



(nutch) branch master updated: NUTCH-3033 Upgrade Ivy to v2.5.2 (#803)

2024-03-13 Thread lewismc

lewismc pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git


The following commit(s) were added to refs/heads/master by this push:
 new 4f62dec0f NUTCH-3033 Upgrade Ivy to v2.5.2 (#803)
4f62dec0f is described below

commit 4f62dec0f3001a8d41b236913346669ac7968133
Author: Lewis John McGibbney 
AuthorDate: Wed Mar 13 07:42:58 2024 -0700

NUTCH-3033 Upgrade Ivy to v2.5.2 (#803)
---
 .gitignore  | 5 +
 build.xml   | 2 +-
 default.properties  | 2 +-
 ivy/ivy.xml | 4 +++-
 ivy/ivysettings.xml | 4 ++--
 src/plugin/build-plugin.xml | 4 ++--
 src/plugin/creativecommons/ivy.xml  | 4 +++-
 src/plugin/exchange-jexl/ivy.xml| 4 +++-
 src/plugin/feed/ivy.xml | 4 +++-
 src/plugin/headings/ivy.xml | 4 +++-
 src/plugin/index-anchor/ivy.xml | 4 +++-
 src/plugin/index-basic/ivy.xml  | 4 +++-
 src/plugin/index-geoip/ivy.xml  | 4 +++-
 src/plugin/index-jexl-filter/ivy.xml| 4 +++-
 src/plugin/index-links/ivy.xml  | 4 +++-
 src/plugin/index-metadata/ivy.xml   | 4 +++-
 src/plugin/index-more/ivy.xml   | 4 +++-
 src/plugin/index-replace/ivy.xml| 4 +++-
 src/plugin/index-static/ivy.xml | 4 +++-
 src/plugin/indexer-cloudsearch/ivy.xml  | 4 +++-
 src/plugin/indexer-csv/ivy.xml  | 4 +++-
 src/plugin/indexer-dummy/ivy.xml| 4 +++-
 src/plugin/indexer-elastic/ivy.xml  | 4 +++-
 src/plugin/indexer-kafka/ivy.xml| 4 +++-
 src/plugin/indexer-opensearch-1x/ivy.xml| 4 +++-
 src/plugin/indexer-rabbit/ivy.xml   | 4 +++-
 src/plugin/indexer-solr/ivy.xml | 4 +++-
 src/plugin/language-identifier/ivy.xml  | 4 +++-
 src/plugin/lib-htmlunit/build-ivy.xml   | 2 +-
 src/plugin/lib-htmlunit/ivy.xml | 4 +++-
 src/plugin/lib-http/ivy.xml | 4 +++-
 src/plugin/lib-nekohtml/ivy.xml | 4 +++-
 src/plugin/lib-rabbitmq/ivy.xml | 4 +++-
 src/plugin/lib-regex-filter/ivy.xml | 4 +++-
 src/plugin/lib-selenium/ivy.xml | 4 +++-
 src/plugin/lib-xml/ivy.xml  | 4 +++-
 src/plugin/microformats-reltag/ivy.xml  | 4 +++-
 src/plugin/mimetype-filter/ivy.xml  | 4 +++-
 src/plugin/nutch-extensionpoints/ivy.xml| 4 +++-
 src/plugin/parse-ext/ivy.xml| 4 +++-
 src/plugin/parse-html/ivy.xml   | 4 +++-
 src/plugin/parse-js/ivy.xml | 4 +++-
 src/plugin/parse-metatags/ivy.xml   | 4 +++-
 src/plugin/parse-tika/ivy.xml   | 4 +++-
 src/plugin/parse-zip/ivy.xml| 4 +++-
 src/plugin/parsefilter-debug/ivy.xml| 4 +++-
 src/plugin/parsefilter-naivebayes/ivy.xml   | 4 +++-
 src/plugin/parsefilter-regex/ivy.xml| 4 +++-
 src/plugin/protocol-file/ivy.xml| 4 +++-
 src/plugin/protocol-foo/ivy.xml | 4 +++-
 src/plugin/protocol-ftp/ivy.xml | 4 +++-
 src/plugin/protocol-htmlunit/ivy.xml| 4 +++-
 src/plugin/protocol-http/ivy.xml| 4 +++-
 src/plugin/protocol-httpclient/ivy.xml  | 4 +++-
 src/plugin/protocol-interactiveselenium/ivy.xml | 4 +++-
 src/plugin/protocol-okhttp/ivy.xml  | 4 +++-
 src/plugin/protocol-selenium/ivy.xml| 4 +++-
 src/plugin/publish-rabbitmq/ivy.xml | 4 +++-
 src/plugin/scoring-depth/ivy.xml| 4 +++-
 src/plugin/scoring-link/ivy.xml | 4 +++-
 src/plugin/scoring-metadata/ivy.xml | 4 +++-
 src/plugin/scoring-opic/ivy.xml | 4 +++-
 src/plugin/scoring-orphan/ivy.xml   | 4 +++-
 src/plugin/scoring-similarity/ivy.xml   | 4 +++-
 src/plugin/subcollection/ivy.xml| 4 +++-
 src/plugin/tld/ivy.xml  | 4 +++-
 src/plugin/urlfilter-automaton/ivy.xml  | 4 +++-
 src/plugin/urlfilter-domain/ivy.xml | 4 +++-
 src/plugin/urlfilter-domaindenylist/ivy.xml | 4 +++-
 src/plugin/urlfilter-fast/ivy.xml   | 4 +++-
 src/plugin/urlfilter-ignoreexempt/ivy.xml   | 4 +++-
 src/plugin/urlfilter-prefix/ivy.xml | 4 +++-
 src/plugin/urlfilter-regex/ivy.xml  | 4 +++-
 src/plugin/urlfilter-suffix/ivy.xml | 4 +++-
 src/plugin/urlfilter-validator/ivy.xml  | 4 +++-
 src/plugin/urlmeta/ivy.xml  | 4 +++-
 src/plugin/urlnormalizer-ajax/ivy.xml   | 4 +++-
 src/plugin/urlnormalizer-basic

(nutch) branch master updated: Update Dockerfile / JAVA_HOME - 2nd try (#805)

2024-03-12 Thread lewismc

lewismc pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git


The following commit(s) were added to refs/heads/master by this push:
 new 42b55f6a9 Update Dockerfile / JAVA_HOME - 2nd try (#805)
42b55f6a9 is described below

commit 42b55f6a9b369d8e7f6b93735107abb187f65c39
Author: Jakob Berlin 
AuthorDate: Wed Mar 13 06:11:30 2024 +0100

Update Dockerfile / JAVA_HOME - 2nd try (#805)

* Nutch 1.19 release
- update current year in API docs etc.
- update version number
- add changes / release notes
- update links to Hadoop API docs

* Update Dockerfile / JAVA_HOME

Alpine uses the ash shell by default, which results in an unset JAVA_HOME 
environment variable

* Update Dockerfile

Remove empty line at the end

-

Co-authored-by: Sebastian Nagel 
---
 docker/Dockerfile | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/docker/Dockerfile b/docker/Dockerfile
index cffa00a95..fb93fe98a 100644
--- a/docker/Dockerfile
+++ b/docker/Dockerfile
@@ -46,6 +46,7 @@ RUN apk --no-cache add apache-ant bash git openjdk11 
supervisor
 
 # Establish environment variables
 RUN echo 'export JAVA_HOME=/usr/lib/jvm/java-11-openjdk' >> $HOME/.bashrc
+RUN echo 'export JAVA_HOME=/usr/lib/jvm/java-11-openjdk' >> $HOME/.ashrc
 ENV JAVA_HOME='/usr/lib/jvm/java-11-openjdk'
 ENV NUTCH_HOME='/root/nutch_source/runtime/local'
 
@@ -112,4 +113,4 @@ EXPOSE $WEBAPP_PORT
 ENTRYPOINT [ "supervisord", "--nodaemon", "--configuration", 
"/etc/supervisord.conf" ]
 
 FROM branch-version-$BUILD_MODE AS final
-RUN echo "Successfully built image, see https://s.apache.org/m5933 for 
guidance on running a container instance."
\ No newline at end of file
+RUN echo "Successfully built image, see https://s.apache.org/m5933 for 
guidance on running a container instance."



(nutch) branch revert-801-patch-2 deleted (was 54394b9ed)

2024-03-11 Thread lewismc

lewismc pushed a change to branch revert-801-patch-2
in repository https://gitbox.apache.org/repos/asf/nutch.git


 was 54394b9ed Revert "Update Dockerfile / JAVA_HOME (#801)"

The revisions that were on this branch are still contained in
other references; therefore, this change does not discard any commits
from the repository.



(nutch) branch branch-1.19 updated: Revert "Update Dockerfile / JAVA_HOME (#801)" (#804)

2024-03-11 Thread lewismc

lewismc pushed a commit to branch branch-1.19
in repository https://gitbox.apache.org/repos/asf/nutch.git


The following commit(s) were added to refs/heads/branch-1.19 by this push:
 new 19bfd00bb Revert "Update Dockerfile / JAVA_HOME (#801)" (#804)
19bfd00bb is described below

commit 19bfd00bbce1298a956c646798200df5ae89fb71
Author: Lewis John McGibbney 
AuthorDate: Mon Mar 11 13:25:01 2024 -0700

Revert "Update Dockerfile / JAVA_HOME (#801)" (#804)

This reverts commit 0b04db65ad32634aa1a63a191a404c52a5d29e46.
---
 docker/Dockerfile | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/docker/Dockerfile b/docker/Dockerfile
index ea734bd06..29ead46ba 100644
--- a/docker/Dockerfile
+++ b/docker/Dockerfile
@@ -23,8 +23,6 @@ RUN apk update
 RUN apk --no-cache add apache-ant bash git openjdk11
 
 RUN echo 'export JAVA_HOME=/usr/lib/jvm/java-11-openjdk' >> $HOME/.bashrc
-RUN echo 'export JAVA_HOME=/usr/lib/jvm/java-11-openjdk' >> $HOME/.ashrc
-ENV JAVA_HOME=/usr/lib/jvm/java-11-openjdk
 env NUTCH_HOME='/root/nutch_source/runtime/local'
 
 # Checkout and build the Nutch master branch (1.x)
@@ -36,4 +34,4 @@ RUN git clone https://github.com/apache/nutch.git 
nutch_source && \
 
 # Create symlinks for runtime/local/bin/nutch and runtime/local/bin/crawl
 RUN ln -sf $NUTCH_HOME/bin/nutch /usr/local/bin/
-RUN ln -sf $NUTCH_HOME/bin/crawl /usr/local/bin/
+RUN ln -sf $NUTCH_HOME/bin/crawl /usr/local/bin/
\ No newline at end of file



(nutch) 01/01: Revert "Update Dockerfile / JAVA_HOME (#801)"

2024-03-11 Thread lewismc

lewismc pushed a commit to branch revert-801-patch-2
in repository https://gitbox.apache.org/repos/asf/nutch.git

commit 54394b9ed860eda9e60fc3e469534cf447dd0518
Author: Lewis John McGibbney 
AuthorDate: Mon Mar 11 13:24:44 2024 -0700

Revert "Update Dockerfile / JAVA_HOME (#801)"

This reverts commit 0b04db65ad32634aa1a63a191a404c52a5d29e46.
---
 docker/Dockerfile | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/docker/Dockerfile b/docker/Dockerfile
index ea734bd06..29ead46ba 100644
--- a/docker/Dockerfile
+++ b/docker/Dockerfile
@@ -23,8 +23,6 @@ RUN apk update
 RUN apk --no-cache add apache-ant bash git openjdk11
 
 RUN echo 'export JAVA_HOME=/usr/lib/jvm/java-11-openjdk' >> $HOME/.bashrc
-RUN echo 'export JAVA_HOME=/usr/lib/jvm/java-11-openjdk' >> $HOME/.ashrc
-ENV JAVA_HOME=/usr/lib/jvm/java-11-openjdk
 env NUTCH_HOME='/root/nutch_source/runtime/local'
 
 # Checkout and build the Nutch master branch (1.x)
@@ -36,4 +34,4 @@ RUN git clone https://github.com/apache/nutch.git 
nutch_source && \
 
 # Create symlinks for runtime/local/bin/nutch and runtime/local/bin/crawl
 RUN ln -sf $NUTCH_HOME/bin/nutch /usr/local/bin/
-RUN ln -sf $NUTCH_HOME/bin/crawl /usr/local/bin/
+RUN ln -sf $NUTCH_HOME/bin/crawl /usr/local/bin/
\ No newline at end of file



(nutch) branch revert-801-patch-2 created (now 54394b9ed)

2024-03-11 Thread lewismc

lewismc pushed a change to branch revert-801-patch-2
in repository https://gitbox.apache.org/repos/asf/nutch.git


  at 54394b9ed Revert "Update Dockerfile / JAVA_HOME (#801)"

This branch includes the following new commits:

 new 54394b9ed Revert "Update Dockerfile / JAVA_HOME (#801)"





(nutch) branch branch-1.19 updated: Update Dockerfile / JAVA_HOME (#801)

2024-03-11 Thread lewismc

lewismc pushed a commit to branch branch-1.19
in repository https://gitbox.apache.org/repos/asf/nutch.git


The following commit(s) were added to refs/heads/branch-1.19 by this push:
 new 0b04db65a Update Dockerfile / JAVA_HOME (#801)
0b04db65a is described below

commit 0b04db65ad32634aa1a63a191a404c52a5d29e46
Author: Jakob Berlin 
AuthorDate: Mon Mar 11 21:24:21 2024 +0100

Update Dockerfile / JAVA_HOME (#801)

Alpine uses the ash shell by default, which results in an unset JAVA_HOME 
environment variable
---
 docker/Dockerfile | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/docker/Dockerfile b/docker/Dockerfile
index 29ead46ba..ea734bd06 100644
--- a/docker/Dockerfile
+++ b/docker/Dockerfile
@@ -23,6 +23,8 @@ RUN apk update
 RUN apk --no-cache add apache-ant bash git openjdk11
 
 RUN echo 'export JAVA_HOME=/usr/lib/jvm/java-11-openjdk' >> $HOME/.bashrc
+RUN echo 'export JAVA_HOME=/usr/lib/jvm/java-11-openjdk' >> $HOME/.ashrc
+ENV JAVA_HOME=/usr/lib/jvm/java-11-openjdk
 env NUTCH_HOME='/root/nutch_source/runtime/local'
 
 # Checkout and build the Nutch master branch (1.x)
@@ -34,4 +36,4 @@ RUN git clone https://github.com/apache/nutch.git 
nutch_source && \
 
 # Create symlinks for runtime/local/bin/nutch and runtime/local/bin/crawl
 RUN ln -sf $NUTCH_HOME/bin/nutch /usr/local/bin/
-RUN ln -sf $NUTCH_HOME/bin/crawl /usr/local/bin/
\ No newline at end of file
+RUN ln -sf $NUTCH_HOME/bin/crawl /usr/local/bin/



(nutch) branch master updated: NUTCH-3024 Remove flaky 'dependency check' target (#795)

2023-11-24 Thread lewismc

lewismc pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git


The following commit(s) were added to refs/heads/master by this push:
 new 85fea6e46 NUTCH-3024 Remove flaky 'dependency check' target (#795)
85fea6e46 is described below

commit 85fea6e46475cb74c61c13193fff008a7e7e6a37
Author: Lewis John McGibbney 
AuthorDate: Fri Nov 24 12:33:50 2023 -0800

NUTCH-3024 Remove flaky 'dependency check' target (#795)
---
 .github/workflows/dependency-check.yml | 37 --
 build.xml  | 47 --
 2 files changed, 84 deletions(-)

diff --git a/.github/workflows/dependency-check.yml 
b/.github/workflows/dependency-check.yml
deleted file mode 100644
index f07f746a0..0
--- a/.github/workflows/dependency-check.yml
+++ /dev/null
@@ -1,37 +0,0 @@
-# Licensed to the Apache Software Foundation (ASF) under one or more
-# contributor license agreements.  See the NOTICE file distributed with
-# this work for additional information regarding copyright ownership.
-# The ASF licenses this file to You under the Apache License, Version 2.0
-# (the "License"); you may not use this file except in compliance with
-# the License.  You may obtain a copy of the License at
-#
-#  http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-name: master pr build
-
-on:
-  schedule:
-- cron: '0 0 * * *'  # every day at midnight
-
-jobs:
-  dependency-check:
-strategy:
-  matrix:
-java: ['11']
-os: [ubuntu-latest]
-runs-on: ${{ matrix.os }}
-steps:
-  - uses: actions/checkout@v4
-  - name: Set up JDK ${{ matrix.java }}
-uses: actions/setup-java@v3
-with:
-  java-version: ${{ matrix.java }}
-  distribution: 'temurin'
-  - name: Dependency check
-run: ant clean dependency-check -buildfile build.xml
diff --git a/build.xml b/build.xml
index dd9797302..70c8e8a9e 100644
--- a/build.xml
+++ b/build.xml
@@ -38,10 +38,6 @@
   
   
 
-  
-  
-  
-
   
 
   
@@ -615,49 +611,6 @@
 
   
 
-  
-  
-
-
-  
-
-  
-https://github.com/jeremylong/DependencyCheck/releases/download/v${dependency-check-ant.version}/dependency-check-ant-${dependency-check-ant.version}-release.zip;
- 
dest="${ivy.dir}/dependency-check-ant-${dependency-check-ant.version}-release.zip"
 usetimestamp="false" />
-
-
-
-
-
-  
-
-  
-
-
-  
-
-  
-
-  
-
-  
-
-
-
-
-
-  
-  
-
-
-  
-
   
   
   



(nutch) branch master updated: NUTCH-3014 Standardize Job names (#789)

2023-11-02 Thread lewismc

lewismc pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git


The following commit(s) were added to refs/heads/master by this push:
 new bbf086726 NUTCH-3014 Standardize Job names (#789)
bbf086726 is described below

commit bbf0867263ed1764c56fe7794c17942d0e8bf1c4
Author: Lewis John McGibbney 
AuthorDate: Thu Nov 2 20:36:43 2023 -0700

NUTCH-3014 Standardize Job names (#789)
---
 src/java/org/apache/nutch/crawl/CrawlDb.java   |  3 +-
 src/java/org/apache/nutch/crawl/CrawlDbMerger.java |  3 +-
 src/java/org/apache/nutch/crawl/CrawlDbReader.java | 20 +
 .../org/apache/nutch/crawl/DeduplicationJob.java   |  3 +-
 src/java/org/apache/nutch/crawl/Generator.java | 13 -
 src/java/org/apache/nutch/crawl/Injector.java  |  2 +-
 src/java/org/apache/nutch/crawl/LinkDb.java|  3 +-
 src/java/org/apache/nutch/crawl/LinkDbMerger.java  |  3 +-
 src/java/org/apache/nutch/crawl/LinkDbReader.java  |  3 +-
 src/java/org/apache/nutch/fetcher/Fetcher.java |  2 +-
 src/java/org/apache/nutch/hostdb/ReadHostDb.java   |  3 +-
 src/java/org/apache/nutch/hostdb/UpdateHostDb.java |  3 +-
 src/java/org/apache/nutch/indexer/CleaningJob.java |  4 +--
 src/java/org/apache/nutch/indexer/IndexingJob.java |  3 +-
 src/java/org/apache/nutch/parse/ParseSegment.java  |  3 +-
 .../apache/nutch/scoring/webgraph/LinkDumper.java  |  6 ++--
 .../apache/nutch/scoring/webgraph/LinkRank.java| 15 --
 .../apache/nutch/scoring/webgraph/NodeDumper.java  |  3 +-
 .../nutch/scoring/webgraph/ScoreUpdater.java   |  3 +-
 .../apache/nutch/scoring/webgraph/WebGraph.java|  9 ++
 .../org/apache/nutch/segment/SegmentMerger.java|  3 +-
 .../org/apache/nutch/segment/SegmentReader.java|  3 +-
 src/java/org/apache/nutch/tools/FreeGenerator.java |  2 +-
 .../apache/nutch/tools/arc/ArcSegmentCreator.java  |  9 ++
 .../org/apache/nutch/tools/warc/WARCExporter.java  |  3 +-
 .../apache/nutch/util/CrawlCompletionStats.java|  6 ++--
 src/java/org/apache/nutch/util/NutchJob.java   |  4 ---
 .../nutch/util/ProtocolStatusStatistics.java   |  2 +-
 .../org/apache/nutch/util/SitemapProcessor.java| 34 ++
 .../apache/nutch/util/domain/DomainStatistics.java | 10 +++
 .../org/apache/nutch/crawl/TestCrawlDbFilter.java  |  3 +-
 .../org/apache/nutch/plugin/TestPluginSystem.java  |  5 ++--
 32 files changed, 74 insertions(+), 117 deletions(-)

diff --git a/src/java/org/apache/nutch/crawl/CrawlDb.java 
b/src/java/org/apache/nutch/crawl/CrawlDb.java
index 16394832b..2b609c0a6 100644
--- a/src/java/org/apache/nutch/crawl/CrawlDb.java
+++ b/src/java/org/apache/nutch/crawl/CrawlDb.java
@@ -165,8 +165,7 @@ public class CrawlDb extends NutchTool implements Tool {
 Path newCrawlDb = new Path(crawlDb, Integer.toString(new Random()
 .nextInt(Integer.MAX_VALUE)));
 
-Job job = NutchJob.getInstance(config);
-job.setJobName("crawldb " + crawlDb);
+Job job = Job.getInstance(config, "Nutch CrawlDb: " + crawlDb);
 
 Path current = new Path(crawlDb, CURRENT_NAME);
 if (current.getFileSystem(job.getConfiguration()).exists(current)) {
diff --git a/src/java/org/apache/nutch/crawl/CrawlDbMerger.java 
b/src/java/org/apache/nutch/crawl/CrawlDbMerger.java
index 1bf7243d3..6ee4b43cd 100644
--- a/src/java/org/apache/nutch/crawl/CrawlDbMerger.java
+++ b/src/java/org/apache/nutch/crawl/CrawlDbMerger.java
@@ -165,9 +165,8 @@ public class CrawlDbMerger extends Configured implements 
Tool {
 Path newCrawlDb = new Path(output,
 "merge-" + Integer.toString(new Random().nextInt(Integer.MAX_VALUE)));
 
-Job job = NutchJob.getInstance(conf);
+Job job = Job.getInstance(conf, "Nutch CrawlDbMerger: " + output);
 conf = job.getConfiguration();
-job.setJobName("crawldb merge " + output);
 
 job.setInputFormatClass(SequenceFileInputFormat.class);
 
diff --git a/src/java/org/apache/nutch/crawl/CrawlDbReader.java 
b/src/java/org/apache/nutch/crawl/CrawlDbReader.java
index bd3e6f38d..29e8efe17 100644
--- a/src/java/org/apache/nutch/crawl/CrawlDbReader.java
+++ b/src/java/org/apache/nutch/crawl/CrawlDbReader.java
@@ -564,9 +564,8 @@ public class CrawlDbReader extends AbstractChecker 
implements Closeable {
   throws IOException, InterruptedException, ClassNotFoundException {
 Path tmpFolder = new Path(crawlDb, "stat_tmp" + 
System.currentTimeMillis());
 
-Job job = NutchJob.getInstance(config);
+Job job = Job.getInstance(config, "Nutch CrawlDbReader: " + crawlDb);
 config = job.getConfiguration();
-job.setJobName("stats " + crawlDb);
 config.setBoolean("db.reader.stats.sort", sort);
 
 FileInputFormat.addInputPath(job, new Path(crawlDb, CrawlDb.CURRENT_NAME));
@@ -812,7 +811,7 @@ public class CrawlDbReader exte

(nutch) branch master updated: NUTCH-3015 Add more CI steps to GitHub master-build.yml (#790)

2023-10-27 Thread lewismc

lewismc pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git


The following commit(s) were added to refs/heads/master by this push:
 new 792ed2891 NUTCH-3015 Add more CI steps to GitHub master-build.yml 
(#790)
792ed2891 is described below

commit 792ed28914f4beb2fb8b8ce28eebe17196c92af1
Author: Lewis John McGibbney 
AuthorDate: Fri Oct 27 15:04:22 2023 -0700

NUTCH-3015 Add more CI steps to GitHub master-build.yml (#790)
---
 .../{master-build.yml => dependency-check.yml} | 25 -
 .github/workflows/master-build.yml | 64 +-
 .gitignore |  1 +
 build.xml  | 52 +++---
 .../dependency-check-suppressions.xml  |  5 --
 src/java/overview.html | 16 ++
 .../creativecommons/conf/crawl-urlfilter.txt   | 15 +
 src/plugin/creativecommons/conf/nutch-site.xml | 16 ++
 src/plugin/creativecommons/data/anchor.html| 16 ++
 src/plugin/creativecommons/data/rdf.html   | 16 ++
 src/plugin/creativecommons/data/rel.html   | 16 ++
 src/plugin/creativecommons/ivy.xml |  1 -
 src/plugin/exchange-jexl/README.md | 17 ++
 src/plugin/exchange-jexl/ivy.xml   |  1 -
 src/plugin/feed/ivy.xml|  1 -
 src/plugin/headings/ivy.xml|  1 -
 src/plugin/index-anchor/ivy.xml|  1 -
 src/plugin/index-basic/ivy.xml |  1 -
 src/plugin/index-geoip/ivy.xml |  1 -
 src/plugin/index-geoip/plugin.xml  |  1 +
 src/plugin/index-jexl-filter/ivy.xml   |  1 -
 src/plugin/index-links/README.md   | 17 ++
 src/plugin/index-links/ivy.xml |  1 -
 src/plugin/index-metadata/ivy.xml  |  1 -
 src/plugin/index-more/ivy.xml  |  1 -
 src/plugin/index-replace/ivy.xml   |  1 -
 .../index-replace/sample/testIndexReplace.html | 16 ++
 src/plugin/index-static/ivy.xml|  1 -
 src/plugin/indexer-cloudsearch/README.md   | 17 ++
 src/plugin/indexer-cloudsearch/createCSDomain.sh   | 15 +
 src/plugin/indexer-csv/README.md   | 17 ++
 src/plugin/indexer-csv/ivy.xml |  1 -
 src/plugin/indexer-dummy/README.md | 17 ++
 src/plugin/indexer-dummy/ivy.xml   |  1 -
 src/plugin/indexer-elastic/README.md   | 17 ++
 .../{howto_upgrade_es.txt => howto_upgrade_es.md}  | 17 ++
 src/plugin/indexer-kafka/ivy.xml   |  1 -
 src/plugin/indexer-opensearch-1x/README.md | 17 ++
 ..._opensearch.txt => howto_upgrade_opensearch.md} | 17 ++
 src/plugin/indexer-rabbit/README.md| 17 ++
 src/plugin/indexer-rabbit/ivy.xml  |  1 -
 src/plugin/indexer-solr/README.md  | 17 ++
 ...owto_upgrade_solr.txt => howto_upgrade_solr.md} | 17 ++
 src/plugin/indexer-solr/ivy.xml| 25 +
 src/plugin/indexer-solr/plugin.xml | 26 +
 src/plugin/language-identifier/ivy.xml |  1 -
 src/plugin/lib-htmlunit/ivy.xml|  1 -
 src/plugin/lib-http/ivy.xml|  1 -
 src/plugin/lib-nekohtml/ivy.xml|  1 -
 src/plugin/lib-rabbitmq/ivy.xml|  1 -
 src/plugin/lib-regex-filter/ivy.xml|  1 -
 src/plugin/lib-selenium/README.md  | 17 ++
 .../howto_upgrade_selenium.md} | 42 +-
 src/plugin/lib-selenium/howto_upgrade_selenium.txt | 15 -
 src/plugin/lib-selenium/ivy.xml|  1 -
 src/plugin/lib-xml/ivy.xml |  1 -
 src/plugin/microformats-reltag/ivy.xml |  1 -
 src/plugin/mimetype-filter/ivy.xml |  1 -
 src/plugin/nutch-extensionpoints/ivy.xml   |  1 -
 src/plugin/parse-ext/command   | 15 +
 src/plugin/parse-ext/ivy.xml   |  1 -
 src/plugin/parse-html/ivy.xml  |  1 -
 src/plugin/parse-js/ivy.xml|  1 -
 .../parse-js/sample/parse_embedded_js_test.html| 16 ++
 src/plugin/parse-js/sample/parse_pure_js_test.js   | 15 +
 src/plugin/parse-metatags/ivy.xml  |  1 -
 src/plugin/parse-metatags/sample/testMetatags.html | 16 ++
 .../sample/testMultivalueMetatags.html | 16 ++
 ...owto_upgrade_tika.txt => howto_upgrade_tika.md} | 17 ++
 src/plugin/parse-tika/ivy.xml  |  1 -
 src/plugin/parse-tika/sample/nutch.html| 16 ++
 src/plugin/pa

[nutch] branch master updated: NUTCH-3013 Employ commons-lang3's StopWatch to simplify timing logic (#788)

2023-10-21 Thread lewismc
This is an automated email from the ASF dual-hosted git repository.

lewismc pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git


The following commit(s) were added to refs/heads/master by this push:
 new 8431dcfe5 NUTCH-3013 Employ commons-lang3's StopWatch to simplify 
timing logic (#788)
8431dcfe5 is described below

commit 8431dcfe52f5395a0fd9e3c00db009dbb2bcf6f5
Author: Lewis John McGibbney 
AuthorDate: Sat Oct 21 11:09:31 2023 -0700

NUTCH-3013 Employ commons-lang3's StopWatch to simplify timing logic (#788)
---
 .github/workflows/master-build.yml |  1 -
 .gitignore |  1 +
 src/java/org/apache/nutch/crawl/CrawlDb.java   | 19 +
 src/java/org/apache/nutch/crawl/CrawlDbMerger.java | 16 +++
 .../org/apache/nutch/crawl/DeduplicationJob.java   | 16 +++
 src/java/org/apache/nutch/crawl/Generator.java | 17 +++
 src/java/org/apache/nutch/crawl/Injector.java  | 16 +++
 src/java/org/apache/nutch/crawl/LinkDb.java| 15 +++---
 src/java/org/apache/nutch/crawl/LinkDbMerger.java  | 16 +++
 src/java/org/apache/nutch/crawl/LinkDbReader.java  | 24 ++
 src/java/org/apache/nutch/fetcher/Fetcher.java | 17 +++
 src/java/org/apache/nutch/hostdb/ReadHostDb.java   | 15 +++---
 src/java/org/apache/nutch/hostdb/UpdateHostDb.java | 16 +++
 src/java/org/apache/nutch/indexer/CleaningJob.java | 16 +++
 src/java/org/apache/nutch/indexer/IndexingJob.java | 16 +++
 src/java/org/apache/nutch/parse/ParseSegment.java  | 21 ---
 .../apache/nutch/scoring/webgraph/LinkDumper.java  | 17 +++
 .../apache/nutch/scoring/webgraph/LinkRank.java| 16 +++
 .../apache/nutch/scoring/webgraph/NodeDumper.java  | 16 +++
 .../nutch/scoring/webgraph/ScoreUpdater.java   | 16 +++
 .../apache/nutch/scoring/webgraph/WebGraph.java| 24 ++
 src/java/org/apache/nutch/tools/FreeGenerator.java | 16 +++
 .../apache/nutch/tools/arc/ArcSegmentCreator.java  | 16 +++
 .../org/apache/nutch/tools/warc/WARCExporter.java  | 15 +++---
 .../apache/nutch/util/CrawlCompletionStats.java| 15 +++---
 .../nutch/util/ProtocolStatusStatistics.java   | 19 -
 .../org/apache/nutch/util/SitemapProcessor.java| 12 +++
 .../apache/nutch/util/domain/DomainStatistics.java | 16 +++
 .../urlfilter/api/RegexURLFilterBaseTest.java  | 11 +-
 .../regex/TestRegexURLNormalizer.java  |  8 ++--
 30 files changed, 234 insertions(+), 225 deletions(-)

diff --git a/.github/workflows/master-build.yml 
b/.github/workflows/master-build.yml
index e3ed11c86..ba1d470ec 100644
--- a/.github/workflows/master-build.yml
+++ b/.github/workflows/master-build.yml
@@ -22,7 +22,6 @@ on:
 branches: [ master ]
   pull_request:
 branches: [ master ]
-
 
 jobs:
   build:
diff --git a/.gitignore b/.gitignore
index 0612a99c2..b46690852 100644
--- a/.gitignore
+++ b/.gitignore
@@ -27,3 +27,4 @@ naivebayes-model
 csvindexwriter
 lib/spotbugs-*
 ivy/dependency-check-ant/*
+.gradle*
diff --git a/src/java/org/apache/nutch/crawl/CrawlDb.java 
b/src/java/org/apache/nutch/crawl/CrawlDb.java
index 3819bb3a0..16394832b 100644
--- a/src/java/org/apache/nutch/crawl/CrawlDb.java
+++ b/src/java/org/apache/nutch/crawl/CrawlDb.java
@@ -19,14 +19,15 @@ package org.apache.nutch.crawl;
 import java.io.File;
 import java.io.IOException;
 import java.lang.invoke.MethodHandles;
-import java.text.SimpleDateFormat;
 import java.util.ArrayList;
 import java.util.Arrays;
 import java.util.HashMap;
 import java.util.HashSet;
 import java.util.Map;
 import java.util.Random;
+import java.util.concurrent.TimeUnit;
 
+import org.apache.commons.lang3.time.StopWatch;
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
 import org.apache.hadoop.conf.Configuration;
@@ -49,7 +50,6 @@ import org.apache.nutch.util.LockUtil;
 import org.apache.nutch.util.NutchConfiguration;
 import org.apache.nutch.util.NutchJob;
 import org.apache.nutch.util.NutchTool;
-import org.apache.nutch.util.TimingUtil;
 
 /**
  * This class takes the output of the fetcher and updates the crawldb
@@ -85,10 +85,11 @@ public class CrawlDb extends NutchTool implements Tool {
   public void update(Path crawlDb, Path[] segments, boolean normalize,
   boolean filter, boolean additionsAllowed, boolean force)
   throws IOException, InterruptedException, ClassNotFoundException {
-Path lock = lock(getConf(), crawlDb, force);
 
-SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
-long start = System.currentTimeMillis();
+StopWatch stopWatch = new StopWatch();
+stopWatch.start();
+
+Path lock = lock(getConf(), crawlDb, force);
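
The hunk above swaps a manual `System.currentTimeMillis()` start stamp for a stopwatch object that is started before the lock is taken. The rest of the diff is cut off, so how the commit reports the elapsed time is not shown. The sketch below only illustrates the general pattern. `MiniStopWatch` and `TimingSketch` are hypothetical, stdlib-only stand-ins so the example runs without commons-lang3; the real `org.apache.commons.lang3.time.StopWatch` exposes `start()`, `stop()` and `getTime(TimeUnit)` with the same shape. Using `getTime(TimeUnit)` here is an assumption suggested by the added `TimeUnit` import.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in that mirrors the start/stop/getTime shape of a stopwatch. */
class MiniStopWatch {
  private long startNanos;
  private long stopNanos;
  private boolean running;

  void start() {
    startNanos = System.nanoTime();
    running = true;
  }

  void stop() {
    if (!running) {
      throw new IllegalStateException("stopwatch is not running");
    }
    stopNanos = System.nanoTime();
    running = false;
  }

  /** Elapsed time so far (if running) or between start and stop, in the given unit. */
  long getTime(TimeUnit unit) {
    long end = running ? System.nanoTime() : stopNanos;
    return unit.convert(end - startNanos, TimeUnit.NANOSECONDS);
  }
}

class TimingSketch {
  /** Times a unit of work; the Runnable stands in for a job such as a crawldb update. */
  static long timeMillis(Runnable work) {
    MiniStopWatch stopWatch = new MiniStopWatch();
    stopWatch.start();
    work.run();
    stopWatch.stop();
    return stopWatch.getTime(TimeUnit.MILLISECONDS);
  }

  public static void main(String[] args) {
    long elapsed = timeMillis(() -> {
      long sum = 0;
      for (int i = 0; i < 1_000_000; i++) {
        sum += i; // placeholder workload
      }
    });
    System.out.println("finished, elapsed: " + elapsed + " ms");
  }
}
```

Compared with keeping a raw `long start` around, the stopwatch object keeps the start stamp and the unit conversion in one place, which is presumably why the same change could be applied across the 30 files listed in the diffstat.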

[nutch] branch master updated: NUTCH-2936 Early registration of URL stream handlers provided by plugins may fail Hadoop jobs running in distributed mode (#726)

2022-05-20 Thread lewismc
This is an automated email from the ASF dual-hosted git repository.

lewismc pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git


The following commit(s) were added to refs/heads/master by this push:
 new 02dca3b6d NUTCH-2936 Early registration of URL stream handlers 
provided by plugins may fail Hadoop jobs running in distributed mode (#726)
02dca3b6d is described below

commit 02dca3b6d097af0f8fa76ce17f0a33267964bf19
Author: Lewis John McGibbney 
AuthorDate: Fri May 20 11:04:22 2022 -0700

NUTCH-2936 Early registration of URL stream handlers provided by plugins 
may fail Hadoop jobs running in distributed mode (#726)

* NUTCH-2936 Early registration of URL stream handlers provided by plugins 
may fail Hadoop jobs running in distributed mode
---
 src/java/org/apache/nutch/parse/ParserChecker.java |  48 +++
 src/java/org/apache/nutch/plugin/Extension.java|  12 +-
 .../org/apache/nutch/plugin/ExtensionPoint.java|  18 +--
 .../apache/nutch/plugin/PluginManifestParser.java  |   3 +-
 .../org/apache/nutch/plugin/PluginRepository.java  |  57 +
 .../nutch/plugin/URLStreamHandlerFactory.java  |  13 +-
 .../apache/nutch/protocol/http/api/HttpBase.java   | 140 +++--
 src/plugin/protocol-foo/plugin.xml |   2 +-
 src/plugin/protocol-okhttp/ivy.xml |   4 +-
 src/plugin/protocol-okhttp/plugin.xml  |  12 +-
 .../org/apache/nutch/protocol/okhttp/OkHttp.java   |  48 +++
 .../nutch/protocol/okhttp/OkHttpResponse.java  |  26 ++--
 12 files changed, 195 insertions(+), 188 deletions(-)

diff --git a/src/java/org/apache/nutch/parse/ParserChecker.java 
b/src/java/org/apache/nutch/parse/ParserChecker.java
index 6c82a516b..5da023fdc 100644
--- a/src/java/org/apache/nutch/parse/ParserChecker.java
+++ b/src/java/org/apache/nutch/parse/ParserChecker.java
@@ -114,15 +114,15 @@ public class ParserChecker extends AbstractChecker {
 int numConsumed;
 for (int i = 0; i < args.length; i++) {
   if (args[i].equals("-normalize")) {
-normalizers = new URLNormalizers(getConf(), 
URLNormalizers.SCOPE_DEFAULT);
+this.normalizers = new URLNormalizers(getConf(), 
URLNormalizers.SCOPE_DEFAULT);
   } else if (args[i].equals("-followRedirects")) {
-followRedirects = true;
+this.followRedirects = true;
   } else if (args[i].equals("-checkRobotsTxt")) {
-checkRobotsTxt = true;
+this.checkRobotsTxt = true;
   } else if (args[i].equals("-forceAs")) {
-forceAsContentType = args[++i];
+this.forceAsContentType = args[++i];
   } else if (args[i].equals("-dumpText")) {
-dumpText = true;
+this.dumpText = true;
   } else if (args[i].equals("-md")) {
 String k = null, v = null;
 String nextOne = args[++i];
@@ -132,7 +132,7 @@ public class ParserChecker extends AbstractChecker {
   v = nextOne.substring(firstEquals + 1);
 } else
   k = nextOne;
-metadata.put(k, v);
+this.metadata.put(k, v);
   } else if ((numConsumed = super.parseArgs(args, i)) > 0) {
 i += numConsumed - 1;
   } else if (i != args.length - 1) {
@@ -144,7 +144,7 @@ public class ParserChecker extends AbstractChecker {
   }
 }
 
-scfilters = new ScoringFilters(getConf());
+this.scfilters = new ScoringFilters(getConf());
 
 if (url != null) {
   return super.processSingle(url);
@@ -155,25 +155,25 @@ public class ParserChecker extends AbstractChecker {
   }
 
   protected int process(String url, StringBuilder output) throws Exception {
-if (normalizers != null) {
-  url = normalizers.normalize(url, URLNormalizers.SCOPE_DEFAULT);
+if (this.normalizers != null) {
+  url = this.normalizers.normalize(url, URLNormalizers.SCOPE_DEFAULT);
 }
 
 LOG.info("fetching: " + url);
 
 CrawlDatum datum = new CrawlDatum();
 
-Iterator<String> iter = metadata.keySet().iterator();
+Iterator<String> iter = this.metadata.keySet().iterator();
 while (iter.hasNext()) {
   String key = iter.next();
-  String value = metadata.get(key);
+  String value = this.metadata.get(key);
   if (value == null)
 value = "";
   datum.getMetaData().put(new Text(key), new Text(value));
 }
 
 int maxRedirects = getConf().getInt("http.redirect.max", 3);
-if (followRedirects) {
+if (this.followRedirects) {
   if (maxRedirects == 0) {
 LOG.info("Following max. 3 redirects (ignored http.redirect.max == 
0)");
 maxRedirects = 3;
@@ -183,30 +183,30 @@ public class ParserChecker extends AbstractChecker {
 }
 
 ProtocolOutput protocolOutput = getProtocolOutput(url, datum,
-checkRobotsTxt);
+this.checkRobotsTxt);
 Text turl = new Text(url);
 
 // Follo

[nutch] branch master updated: NUTCH-2919 Upgrade to Tika 2.2.1 and Any23 2.6 (#717)

2022-01-15 Thread lewismc
This is an automated email from the ASF dual-hosted git repository.

lewismc pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git


The following commit(s) were added to refs/heads/master by this push:
 new 847e19d  NUTCH-2919 Upgrade to Tika 2.2.1 and Any23 2.6 (#717)
847e19d is described below

commit 847e19d984503d333fd8fdd430fe347dd370dc4c
Author: Lewis John McGibbney 
AuthorDate: Sat Jan 15 15:24:21 2022 -0800

NUTCH-2919 Upgrade to Tika 2.2.1 and Any23 2.6 (#717)

* NUTCH-2919 Upgrade to Tika 2.2.1 and Any23 2.6
---
 ivy/ivy.xml|   2 +-
 src/plugin/any23/ivy.xml   |   2 +-
 src/plugin/any23/plugin.xml| 272 ++---
 .../apache/nutch/any23/Any23IndexingFilter.java|   2 +-
 .../org/apache/nutch/any23/Any23ParseFilter.java   |  35 +--
 src/plugin/build-plugin.xml|   3 +-
 src/plugin/language-identifier/ivy.xml |   2 +-
 src/plugin/language-identifier/plugin.xml  |   6 +-
 src/plugin/parse-tika/howto_upgrade_tika.txt   |   5 +-
 src/plugin/parse-tika/ivy.xml  |   2 +-
 src/plugin/parse-tika/plugin.xml   |  70 +++---
 .../org/apache/nutch/parse/tika/TestRTFParser.java |   4 +-
 12 files changed, 192 insertions(+), 213 deletions(-)

diff --git a/ivy/ivy.xml b/ivy/ivy.xml
index 8d154bf..34e298f 100644
--- a/ivy/ivy.xml
+++ b/ivy/ivy.xml
@@ -63,7 +63,7 @@


 
-   
+   
 


diff --git a/src/plugin/any23/ivy.xml b/src/plugin/any23/ivy.xml
index a5a0077..7220a25 100644
--- a/src/plugin/any23/ivy.xml
+++ b/src/plugin/any23/ivy.xml
@@ -36,7 +36,7 @@
   
 
   
-
+
   
   
   
diff --git a/src/plugin/any23/plugin.xml b/src/plugin/any23/plugin.xml
index cc941b2..40a42c7 100644
--- a/src/plugin/any23/plugin.xml
+++ b/src/plugin/any23/plugin.xml
@@ -26,194 +26,168 @@
[hunk body not preserved: the XML markup on these added and removed lines was stripped in extraction]
diff --git 
a/src/plugin/any23/src/java/org/apache/nutch/any23/Any23IndexingFilter.java 
b/src/plugin/any23/src/java/org/apache/nutch/any23/Any23IndexingFilter.java
index c0f1d6f..09dc32e 100644
--- a/src/plugin/any23/src/java/org/apache/nutch/any23/Any23IndexingFilter.java
+++ b/src/plugin/any23/src/java/org/apache/nutch/any23/Any23IndexingFilter.java
@@ -106,7 +106,7 @@ public class Any23IndexingFilter implements IndexingFilter

[nutch] branch master updated: NUTCH-2429 Fix Plugin System to allow protocol plugins to bundle their URLStreamHandlers (#720)

2022-01-07 Thread lewismc
This is an automated email from the ASF dual-hosted git repository.

lewismc pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git


The following commit(s) were added to refs/heads/master by this push:
 new e76d69f  NUTCH-2429 Fix Plugin System to allow protocol plugins to 
bundle their URLStreamHandlers (#720)
e76d69f is described below

commit e76d69fe13902fd2f3a98660dd2bac52c2ea568c
Author: Lewis John McGibbney 
AuthorDate: Fri Jan 7 20:07:54 2022 -0800

NUTCH-2429 Fix Plugin System to allow protocol plugins to bundle their 
URLStreamHandlers (#720)

* NUTCH-2429 Fix Plugin System to allow protocol plugins to bundle their 
URLStreamHandlers

Co-authored-by: Hiran Chaudhuri 
---
 build.xml  |   1 +
 src/java/org/apache/nutch/crawl/CrawlDbReader.java |  43 ++--
 src/java/org/apache/nutch/parse/ParserChecker.java |   5 +
 .../apache/nutch/plugin/PluginManifestParser.java  |  66 +++---
 .../org/apache/nutch/plugin/PluginRepository.java  | 244 +++--
 .../nutch/plugin/URLStreamHandlerFactory.java  | 115 ++
 .../apache/nutch/util/CrawlCompletionStats.java|  40 ++--
 src/java/org/apache/nutch/util/NutchJob.java   |  12 +-
 src/java/org/apache/nutch/util/NutchTool.java  |   9 +
 .../org/apache/nutch/util/SitemapProcessor.java|  10 +-
 .../apache/nutch/util/domain/DomainStatistics.java |  20 +-
 .../apache/nutch/any23/Any23IndexingFilter.java|   2 +-
 .../org/apache/nutch/any23/Any23ParseFilter.java   |   2 +-
 src/plugin/build.xml   |   2 +
 .../nutch/indexwriter/csv/CSVIndexWriter.java  |   2 +-
 .../indexwriter/rabbit/RabbitIndexWriter.java  |   2 +-
 src/plugin/protocol-foo/build.xml  |  22 ++
 src/plugin/protocol-foo/ivy.xml|  41 
 src/plugin/protocol-foo/plugin.xml |  48 
 .../java/org/apache/nutch/protocol/foo/Foo.java| 141 
 .../org/apache/nutch/protocol/foo/Handler.java |  28 +++
 21 files changed, 696 insertions(+), 159 deletions(-)

diff --git a/build.xml b/build.xml
index ecef1e7..2c0eef0 100644
--- a/build.xml
+++ b/build.xml
@@ -1272,6 +1272,7 @@
 
 
 
+
 
 
 
diff --git a/src/java/org/apache/nutch/crawl/CrawlDbReader.java 
b/src/java/org/apache/nutch/crawl/CrawlDbReader.java
index 2a20a56..f31210a 100644
--- a/src/java/org/apache/nutch/crawl/CrawlDbReader.java
+++ b/src/java/org/apache/nutch/crawl/CrawlDbReader.java
@@ -16,11 +16,12 @@
  */
 package org.apache.nutch.crawl;
 
+import java.io.Closeable;
 import java.io.DataOutputStream;
 import java.io.File;
 import java.io.IOException;
-import java.io.Closeable;
 import java.lang.invoke.MethodHandles;
+import java.net.MalformedURLException;
 import java.net.URL;
 import java.nio.ByteBuffer;
 import java.util.ArrayList;
@@ -32,16 +33,11 @@ import java.util.List;
 import java.util.Map;
 import java.util.Map.Entry;
 import java.util.Random;
+import java.util.TreeMap;
 import java.util.regex.Matcher;
 import java.util.regex.Pattern;
-import java.util.TreeMap;
-
-import org.slf4j.Logger;
-import org.slf4j.LoggerFactory;
-
-import com.tdunning.math.stats.MergingDigest;
-import com.tdunning.math.stats.TDigest;
 
+import org.apache.commons.jexl3.JexlScript;
 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.fs.FileStatus;
 import org.apache.hadoop.fs.FileSystem;
@@ -55,18 +51,18 @@ import org.apache.hadoop.io.Text;
 import org.apache.hadoop.io.Writable;
 import org.apache.hadoop.mapreduce.Job;
 import org.apache.hadoop.mapreduce.Mapper;
+import org.apache.hadoop.mapreduce.RecordWriter;
 import org.apache.hadoop.mapreduce.Reducer;
-import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
-import org.apache.hadoop.mapreduce.lib.output.MapFileOutputFormat;
+import org.apache.hadoop.mapreduce.TaskAttemptContext;
+import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
+import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
 import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
+import org.apache.hadoop.mapreduce.lib.output.MapFileOutputFormat;
+import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
 import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
-import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
-import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
 import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;
-import org.apache.hadoop.mapreduce.RecordWriter;
-import org.apache.hadoop.mapreduce.TaskAttemptContext;
-import org.apache.hadoop.util.ToolRunner;
 import org.apache.hadoop.util.StringUtils;
+import org.apache.hadoop.util.ToolRunner;
 import org.apache.nutch.util.AbstractChecker;
 import org.apache.nutch.util.JexlUtil;
 import org.apache.nutch.util.NutchConfiguration;
@@ -74,7 +70,8 @@ import
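
NUTCH-2429 concerns protocol plugins that bundle their own `URLStreamHandler`, with `protocol-foo` added as a sample. The sketch below is not the Nutch plugin mechanism and does not use `PluginRepository`; it is a minimal JDK-only illustration, under that assumption, of what a handler for a custom `foo:` scheme does. The class name `FooHandlerSketch` and the returned text are hypothetical. The handler is passed directly to the `URL` constructor (which newer JDKs mark as deprecated but still support) instead of being registered through a JVM-wide factory.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;
import java.nio.charset.StandardCharsets;

/** Hypothetical handler for a made-up "foo" scheme, using only JDK classes. */
class FooHandlerSketch extends URLStreamHandler {

  @Override
  protected URLConnection openConnection(URL u) {
    return new URLConnection(u) {
      @Override
      public void connect() {
        // nothing to connect to in this sketch
      }

      @Override
      public InputStream getInputStream() {
        // Serve canned content derived from the URL path.
        String body = "foo content for " + url.getPath();
        return new ByteArrayInputStream(body.getBytes(StandardCharsets.UTF_8));
      }
    };
  }

  /** Builds a foo: URL with this handler attached and reads its content. */
  static String fetch(String spec) {
    try {
      // Without a handler for the scheme, new URL("foo://...") would throw
      // MalformedURLException ("unknown protocol: foo").
      URL url = new URL(null, spec, new FooHandlerSketch());
      try (InputStream in = url.openConnection().getInputStream()) {
        return new String(in.readAllBytes(), StandardCharsets.UTF_8);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(fetch("foo://example.com/page.txt"));
  }
}
```

A JVM-wide alternative is `URL.setURLStreamHandlerFactory`, which can be set only once per JVM. That constraint may be part of why handler registration timing mattered for Hadoop jobs in the later NUTCH-2936 change, though the emails here do not spell that out.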

[nutch] branch master updated: NUTCH-2449 Replace Tika LanguageIdentifier in language-identifier (#716)

2021-12-17 Thread lewismc
This is an automated email from the ASF dual-hosted git repository.

lewismc pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git


The following commit(s) were added to refs/heads/master by this push:
 new a9b50a7  NUTCH-2449 Replace Tika LanguageIdentifier in 
language-identifier (#716)
a9b50a7 is described below

commit a9b50a7c7e0ab83865883bf87f2c98f1ce354388
Author: Lewis John McGibbney 
AuthorDate: Fri Dec 17 20:11:01 2021 -0800

NUTCH-2449 Replace Tika LanguageIdentifier in language-identifier (#716)
---
 src/plugin/language-identifier/build-ivy.xml | 47 
 src/plugin/language-identifier/build.xml |  4 +--
 2 files changed, 49 insertions(+), 2 deletions(-)

diff --git a/src/plugin/language-identifier/build-ivy.xml 
b/src/plugin/language-identifier/build-ivy.xml
new file mode 100644
index 0000000..c735501
--- /dev/null
+++ b/src/plugin/language-identifier/build-ivy.xml
@@ -0,0 +1,47 @@
+
+
+
+
+  
+  
+
+  
+
+
+
+  
+
+  
+
+  
+
+  
+
+
+
+  
+
+  
+
+  
+
+
diff --git a/src/plugin/language-identifier/build.xml 
b/src/plugin/language-identifier/build.xml
index 668075e..4efb786 100644
--- a/src/plugin/language-identifier/build.xml
+++ b/src/plugin/language-identifier/build.xml
@@ -20,9 +20,9 @@
   
 
   
-Copying language profiles
+Copying language mappings (language codes to names)
 
-  
+  
 
 Copying test files
 


[nutch-site] branch main updated (b720870 -> 198d962)

2021-11-24 Thread lewismc
This is an automated email from the ASF dual-hosted git repository.

lewismc pushed a change to branch main
in repository https://gitbox.apache.org/repos/asf/nutch-site.git.


from b720870  Add .asf.yaml file to Nutch website
 add 335dac0  Add public directory to SCM
 new 198d962  Remove public directory from main branch

The 1 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:
 .gitignore | 1 -
 1 file changed, 1 deletion(-)


[nutch-site] branch main updated (819de2a -> b720870)

2021-11-24 Thread lewismc
This is an automated email from the ASF dual-hosted git repository.

lewismc pushed a change to branch main
in repository https://gitbox.apache.org/repos/asf/nutch-site.git.


from 819de2a  Initial code import
 add b720870  Add .asf.yaml file to Nutch website

No new revisions were added by this update.

Summary of changes:
 .asf.yaml | 34 ++
 1 file changed, 34 insertions(+)
 create mode 100644 .asf.yaml


[nutch-site] branch asf-site created (now b720870)

2021-11-24 Thread lewismc
This is an automated email from the ASF dual-hosted git repository.

lewismc pushed a change to branch asf-site
in repository https://gitbox.apache.org/repos/asf/nutch-site.git.


  at b720870  Add .asf.yaml file to Nutch website

This branch includes the following new commits:

 new b720870  Add .asf.yaml file to Nutch website

The 1 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.



[nutch-site] 01/01: Add .asf.yaml file to Nutch website

2021-11-24 Thread lewismc
This is an automated email from the ASF dual-hosted git repository.

lewismc pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/nutch-site.git

commit b720870ebcc1a21abf3e9add6a2170560c423836
Author: Lewis John McGibbney 
AuthorDate: Wed Nov 24 08:41:06 2021 -0800

Add .asf.yaml file to Nutch website
---
 .asf.yaml | 34 ++
 1 file changed, 34 insertions(+)

diff --git a/.asf.yaml b/.asf.yaml
new file mode 100644
index 0000000..0cc84e6
--- /dev/null
+++ b/.asf.yaml
@@ -0,0 +1,34 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+# 
https://cwiki.apache.org/confluence/display/INFRA/.asf.yaml+features+for+git+repositories
+
+github:
+  description: "Apache Nutch Website"
+  homepage: https://nutch.apache.org/
+  labels:
+- apache
+- nutch
+- hugo
+
+  enabled_merge_buttons:
+squash: true
+merge:  false
+rebase: false
+
+publish:
+  whoami: asf-site
\ No newline at end of file


[nutch-site] branch main updated: Remove broken site

2021-11-23 Thread lewismc
This is an automated email from the ASF dual-hosted git repository.

lewismc pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/nutch-site.git


The following commit(s) were added to refs/heads/main by this push:
 new a2bcc5c  Remove broken site
a2bcc5c is described below

commit a2bcc5cf4ec05ed921506b702ff39befbdba4a39
Author: Lewis John McGibbney 
AuthorDate: Tue Nov 23 20:14:53 2021 -0800

Remove broken site
---
 .gitmodules |   3 -
 README.md   |  64 --
 archetypes/default.md   |   6 --
 config.toml |  55 
 public/categories/index.xml |  10 ---
 public/css/index.css|  86 ---
 public/css/navbar.css   |  53 
 public/img/IMG_0292.png | Bin 15728243 -> 0 bytes
 public/img/IMG_0295.png | Bin 16275273 -> 0 bytes
 public/img/server_rack.jpg  | Bin 214612 -> 0 bytes
 public/img/wave.png | Bin 4789 -> 0 bytes
 public/index.html   | 198 
 public/index.xml|  20 -
 public/posts/index.xml  |  20 -
 public/sitemap.xml  |  28 ---
 public/tags/index.xml   |  10 ---
 themes/SimpleIntro  |   1 -
 17 files changed, 554 deletions(-)

diff --git a/.gitmodules b/.gitmodules
index 6378a22..e69de29 100644
--- a/.gitmodules
+++ b/.gitmodules
@@ -1,3 +0,0 @@
-[submodule "themes/SimpleIntro"]
-   path = themes/SimpleIntro
-   url = https://github.com/gangjun06/SimpleIntro
diff --git a/README.md b/README.md
deleted file mode 100644
index b5e146f..0000000
--- a/README.md
+++ /dev/null
@@ -1,64 +0,0 @@
-Nutch Website
-=
-
-<img src="https://nutch.apache.org/assets/img/nutch_logo_tm.png" align="right" 
width="300" />
-
-This repository contains the website source code for the [Apache 
Nutch](https://nutch.apache.org) project.
-
-# Tooling
-
-The Website is built using [Hugo](https://gohugo.io/) a popular open-source 
static website generation framework.
-
-# Prerequisites
-* [Install Hugo](https://gohugo.io/getting-started/installing/)
-
-# Local Build and Deploy
-
-```bash
-$ hugo server
-...
-Start building sites …
-
-   | EN
--------------------+-----
-  Pages| 10
-  Paginator pages  |  0
-  Non-page files   |  0
-  Static files | 10
-  Processed images |  0
-  Aliases  |  0
-  Sitemaps |  1
-  Cleaned  |  0
-
-Built in 107 ms
-Watching for changes in 
/path/to/nutch_site/{archetypes,content,data,layouts,static,themes}
-Watching for config changes in /path/to/nutch_site/config.toml
-Environment: "development"
-Serving pages from memory
-Running in Fast Render Mode. For full rebuilds on change: hugo server 
--disableFastRender
-Web Server is available at http://localhost:1313/ (bind address 127.0.0.1)
-Press Ctrl+C to stop
-```
-
-# Contributing
-
-To contribute a patch, follow these instructions (note that installing
-[Hub](https://hub.github.com/) is not strictly required, but is recommended).
-
-```
-0. Download and install hub.github.com
-1. File JIRA issue for your fix at 
https://issues.apache.org/jira/projects/NUTCH/issues
-- you will get issue id NUTCH-xxx where xxx is the issue ID.
-2. git clone https://github.com/apache/nutch-site.git
-3. cd nutch-site
-4. git checkout -b NUTCH-xxx
-5. edit files
-6. git status (make sure it shows what files you expected to edit)
-7. git add 
-8. git commit -m “fix for NUTCH-xxx contributed by ”
-9. git fork
-10. git push -u  NUTCH-xxx
-11. git pull-request
-```
-
-# License
diff --git a/archetypes/default.md b/archetypes/default.md
deleted file mode 100644
index 00e77bd..000
--- a/archetypes/default.md
+++ /dev/null
@@ -1,6 +0,0 @@
----
-title: "{{ replace .Name "-" " " | title }}"
-date: {{ .Date }}
-draft: true
----
-
diff --git a/config.toml b/config.toml
deleted file mode 100644
index 80101b7..0000000
--- a/config.toml
+++ /dev/null
@@ -1,55 +0,0 @@
-baseURL = "http://nutch.apache.org/"
-languageCode = "en-us"
-theme = "SimpleIntro"
-#title = "Apache Nutch"
-publishDir = "public"
-
-[params]
-mainbg = "./img/IMG_0295.png"
-pagebg = "../img/background.jpg"
-name = "Apache Nutch"
-mainTitle = "Apache Nutch"
-mainText = "Highly extensible, highly scalable, production-ready Web 
crawler"
-
-[menus]
-[[menu.main]]
-identifier = "about"
-name = "About"
-url = "#about"
-[[menu.main]]
-identifier = "community"
-name = "Community"
-url = "/community"
-[[menu.main]]
-identifier = "development"
-name = "Development"
-url = "/development"
-[[menu.main]]
-identifier = &quo

[nutch] branch master updated: quick IntelliJ IDEA setup docs added (#698)

2021-10-19 Thread lewismc
This is an automated email from the ASF dual-hosted git repository.

lewismc pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git


The following commit(s) were added to refs/heads/master by this push:
 new b9a4856  quick IntelliJ IDEA setup docs added (#698)
b9a4856 is described below

commit b9a4856ac172f64659682d3e2e7437b780516f73
Author: Abu Sufyan 
AuthorDate: Tue Oct 19 21:35:49 2021 +0600

quick IntelliJ IDEA setup docs added (#698)

Co-authored-by: Abu Sufian 
---
 README.md | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index dccdb02..307ead3 100644
--- a/README.md
+++ b/README.md
@@ -48,7 +48,15 @@ ant eclipse
 
 and follow the instructions in [Importing existing 
projects](https://help.eclipse.org/2019-06/topic/org.eclipse.platform.doc.user/tasks/tasks-importproject.htm).
 
-IntelliJ IDEA users can also import Eclipse projects using the ["Eclipser" 
plugin](https://www.tutorialspoint.com/intellij_idea/intellij_idea_migrating_from_eclipse.htm)https://plugins.jetbrains.com/plugin/7153-eclipser),
 see also [Importing Eclipse Projects into IntelliJ 
IDEA](https://www.jetbrains.com/help/idea/migrating-from-eclipse-to-intellij-idea.html#migratingEclipseProject).
+For Intellij IDEA, first install the [IvyIDEA 
Plugin](https://plugins.jetbrains.com/plugin/3612-ivyidea). then run ```ant 
eclipse```. 
+
+Then open the project in IntelliJ. You may see popups like "Ant build scripts 
found", "Frameworks detected - IvyIDEA Framework detected". Just follow the 
simple steps in these dialogs.  
+
+You must [configure the 
nutch-site.xml](https://cwiki.apache.org/confluence/display/NUTCH/RunNutchInEclipse)
 before running. Make sure, you've added ```http.agent.name``` and 
```plugin.folders``` properties. The plugin.folders normally points to 
```/build/plugins```. 
+
+Now create a Java Application Configuration, choose 
org.apache.nutch.crawl.Injector, add two paths as arguments. First one is the 
crawldb directory, second one is the URL directory where, the injector can read 
urls. Now run your configuration. 
+
+If we still see the ```No plugins found on paths of property 
plugin.folders="plugins"```, update the plugin.folders in the 
nutch-default.xml, this is a quick fix, but should not be used.
 
 
 Export Control


[nutch] branch master updated: fireant upgrade dependency elasticsearch-rest-high-level-client in src/plugin/indexer-elastic/ivy.xml from 7.11.1 to 7.13.2 (#688)

2021-09-17 Thread lewismc
This is an automated email from the ASF dual-hosted git repository.

lewismc pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git


The following commit(s) were added to refs/heads/master by this push:
 new 004b62d  fireant upgrade dependency 
elasticsearch-rest-high-level-client in src/plugin/indexer-elastic/ivy.xml from 
7.11.1 to 7.13.2 (#688)
004b62d is described below

commit 004b62dedb8fd25fc3ae278b1647d7d2826f509e
Author: Lewis John McGibbney 
AuthorDate: Fri Sep 17 19:42:25 2021 -0700

fireant upgrade dependency elasticsearch-rest-high-level-client in 
src/plugin/indexer-elastic/ivy.xml from 7.11.1 to 7.13.2 (#688)

* fireant upgrade dependency elasticsearch-rest-high-level-client in 
src/plugin/indexer-elastic/ivy.xml from 7.11.1 to 7.13.2

* fireant upgrade dependency elasticsearch-rest-high-level-client in 
src/plugin/indexer-elastic/ivy.xml from 7.11.1 to 7.13.2
---
 src/plugin/indexer-elastic/ivy.xml|  18 ++---
 src/plugin/indexer-elastic/plugin.xml | 136 --
 2 files changed, 72 insertions(+), 82 deletions(-)

diff --git a/src/plugin/indexer-elastic/ivy.xml 
b/src/plugin/indexer-elastic/ivy.xml
index 3da98e3..9ee8e1c 100644
--- a/src/plugin/indexer-elastic/ivy.xml
+++ b/src/plugin/indexer-elastic/ivy.xml
@@ -1,6 +1,5 @@
-
-
-
-
 
   
-
-https://nutch.apache.org/"/>
+
+https://nutch.apache.org/; />
 
 Apache Nutch
 
   
 
   
-
+
   
 
   
 
-
+
   
 
   
-
+
   
   
   
@@ -44,4 +42,4 @@
 
   
   
-
+
\ No newline at end of file
diff --git a/src/plugin/indexer-elastic/plugin.xml 
b/src/plugin/indexer-elastic/plugin.xml
index 1e41b7e..387a3ac 100644
--- a/src/plugin/indexer-elastic/plugin.xml
+++ b/src/plugin/indexer-elastic/plugin.xml
@@ -1,84 +1,76 @@
[hunk body not preserved: the XML markup on these added and removed lines was stripped in extraction]
\ No newline at end of file


[nutch-site] branch main updated: Attempt to implement single page templating.

2021-08-27 Thread lewismc
This is an automated email from the ASF dual-hosted git repository.

lewismc pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/nutch-site.git


The following commit(s) were added to refs/heads/main by this push:
 new 69525d8  Attempt to implement single page templating.
69525d8 is described below

commit 69525d8ce6a4a7ff4fa92a1d30cc680090be9811
Author: Lewis John McGibbney 
AuthorDate: Fri Aug 27 15:41:56 2021 -0700

Attempt to implement single page templating.
---
 config.toml  | 33 
 content/community/mailing_lists/index.md |  0
 content/community/people/index.md|  0
 content/community/robots/index.md|  0
 content/development/scm/index.md |  0
 content/index.md |  0
 content/javadoc/inddex.md|  0
 content/posts/_index.md  |  0
 8 files changed, 25 insertions(+), 8 deletions(-)

diff --git a/config.toml b/config.toml
index 454ef27..80101b7 100644
--- a/config.toml
+++ b/config.toml
@@ -6,6 +6,7 @@ publishDir = "public"
 
 [params]
 mainbg = "./img/IMG_0295.png"
+pagebg = "../img/background.jpg"
 name = "Apache Nutch"
 mainTitle = "Apache Nutch"
 mainText = "Highly extensible, highly scalable, production-ready Web crawler"
@@ -18,21 +19,37 @@ publishDir = "public"
 [[menu.main]]
 identifier = "community"
 name = "Community"
-[[menu.main]]
-identifier = "reporting"
-name = "Board Reporting"
-parent = "community"
-url = "https://whimsy.apache.org/board/minutes/Nutch.html"
-weight = 1
+url = "/community"
 [[menu.main]]
 identifier = "development"
 name = "Development"
-url = "#development"
+url = "/development"
 [[menu.main]]
 identifier = "documentation"
 name = "Documentation"
-url = "#documentation"
+url = "/documentation"
 [[menu.main]]
 identifier = "downloads"
 name = "Downloads"
+url = "/downloads"
+[[menu.single]]
+identifier = "home"
+name = "Home"
+url = "/"
+weight = 20
+[[menu.single]]
+identifier = "community"
+name = "Community"
+url = "/community"
+[[menu.single]]
+identifier = "development"
+name = "Development"
+url = "/development"
+[[menu.single]]
+identifier = "documentation"
+name = "Documentation"
+url = "/documentation"
+[[menu.single]]
+identifier = "downloads"
+name = "Downloads"
 url = "/downloads"
\ No newline at end of file
diff --git a/content/community/mailing_lists/index.md b/content/community/mailing_lists/index.md
deleted file mode 100644
index e69de29..000
diff --git a/content/community/people/index.md b/content/community/people/index.md
deleted file mode 100644
index e69de29..000
diff --git a/content/community/robots/index.md b/content/community/robots/index.md
deleted file mode 100644
index e69de29..000
diff --git a/content/development/scm/index.md b/content/development/scm/index.md
deleted file mode 100644
index e69de29..000
diff --git a/content/index.md b/content/index.md
deleted file mode 100644
index e69de29..000
diff --git a/content/javadoc/inddex.md b/content/javadoc/inddex.md
deleted file mode 100644
index e69de29..000
diff --git a/content/posts/_index.md b/content/posts/_index.md
deleted file mode 100644
index e69de29..000


[nutch-site] branch main created (now ae6f9f2)

2021-08-26 Thread lewismc
This is an automated email from the ASF dual-hosted git repository.

lewismc pushed a change to branch main
in repository https://gitbox.apache.org/repos/asf/nutch-site.git.


  at ae6f9f2  NUTCH-2826 Migrate Nutch Site from Apache CMS to Hugo

This branch includes the following new commits:

 new ae6f9f2  NUTCH-2826 Migrate Nutch Site from Apache CMS to Hugo

The 1 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.



[nutch-site] 01/01: NUTCH-2826 Migrate Nutch Site from Apache CMS to Hugo

2021-08-26 Thread lewismc
This is an automated email from the ASF dual-hosted git repository.

lewismc pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/nutch-site.git

commit ae6f9f2cf51e7e6e84500cbcd8751ce99daa0ce3
Author: Lewis John McGibbney 
AuthorDate: Thu Aug 26 22:40:50 2021 -0700

NUTCH-2826 Migrate Nutch Site from Apache CMS to Hugo
---
 .gitmodules  |   3 +
 README.md|  64 ++
 archetypes/default.md|   6 +
 config.toml  |  38 ++
 content/community/mailing_lists/index.md |   0
 content/community/people/index.md|   0
 content/community/robots/index.md|   0
 content/development/scm/index.md |   0
 content/index.md |   0
 content/javadoc/inddex.md|   0
 content/posts/_index.md  |   0
 public/categories/index.xml  |  10 ++
 public/css/index.css |  86 ++
 public/css/navbar.css|  53 +
 public/img/IMG_0292.png  | Bin 0 -> 15728243 bytes
 public/img/IMG_0295.png  | Bin 0 -> 16275273 bytes
 public/img/server_rack.jpg   | Bin 0 -> 214612 bytes
 public/img/wave.png  | Bin 0 -> 4789 bytes
 public/index.html| 198 +++
 public/index.xml |  20 
 public/posts/index.xml   |  20 
 public/sitemap.xml   |  28 +
 public/tags/index.xml|  10 ++
 themes/SimpleIntro   |   1 +
 24 files changed, 537 insertions(+)

diff --git a/.gitmodules b/.gitmodules
new file mode 100644
index 000..6378a22
--- /dev/null
+++ b/.gitmodules
@@ -0,0 +1,3 @@
+[submodule "themes/SimpleIntro"]
+   path = themes/SimpleIntro
+   url = https://github.com/gangjun06/SimpleIntro
diff --git a/README.md b/README.md
new file mode 100644
index 000..b5e146f
--- /dev/null
+++ b/README.md
@@ -0,0 +1,64 @@
+Nutch Website
+=
+
+<img src="https://nutch.apache.org/assets/img/nutch_logo_tm.png" align="right" width="300" />
+
+This repository contains the website source code for the [Apache Nutch](https://nutch.apache.org) project.
+
+# Tooling
+
+The Website is built using [Hugo](https://gohugo.io/) a popular open-source static website generation framework.
+
+# Prerequisites
+* [Install Hugo](https://gohugo.io/getting-started/installing/)
+
+# Local Build and Deploy
+
+```bash
+$ hugo server
+...
+Start building sites …
+
+   | EN
+---+-
+  Pages| 10
+  Paginator pages  |  0
+  Non-page files   |  0
+  Static files | 10
+  Processed images |  0
+  Aliases  |  0
+  Sitemaps |  1
+  Cleaned  |  0
+
+Built in 107 ms
+Watching for changes in /path/to/nutch_site/{archetypes,content,data,layouts,static,themes}
+Watching for config changes in /path/to/nutch_site/config.toml
+Environment: "development"
+Serving pages from memory
+Running in Fast Render Mode. For full rebuilds on change: hugo server --disableFastRender
+Web Server is available at http://localhost:1313/ (bind address 127.0.0.1)
+Press Ctrl+C to stop
+```
+
+# Contributing
+
+To contribute a patch, follow these instructions (note that installing
+[Hub](https://hub.github.com/) is not strictly required, but is recommended).
+
+```
+0. Download and install hub.github.com
+1. File JIRA issue for your fix at https://issues.apache.org/jira/projects/NUTCH/issues
+- you will get issue id NUTCH-xxx where xxx is the issue ID.
+2. git clone https://github.com/apache/nutch-site.git
+3. cd nutch-site
+4. git checkout -b NUTCH-xxx
+5. edit files
+6. git status (make sure it shows what files you expected to edit)
+7. git add 
+8. git commit -m “fix for NUTCH-xxx contributed by ”
+9. git fork
+10. git push -u  NUTCH-xxx
+11. git pull-request
+```
+
+# License
diff --git a/archetypes/default.md b/archetypes/default.md
new file mode 100644
index 000..00e77bd
--- /dev/null
+++ b/archetypes/default.md
@@ -0,0 +1,6 @@
+---
+title: "{{ replace .Name "-" " " | title }}"
+date: {{ .Date }}
+draft: true
+---
+
diff --git a/config.toml b/config.toml
new file mode 100644
index 000..454ef27
--- /dev/null
+++ b/config.toml
@@ -0,0 +1,38 @@
+baseURL = "http://nutch.apache.org/"
+languageCode = "en-us"
+theme = "SimpleIntro"
+#title = "Apache Nutch"
+publishDir = "public"
+
+[params]
+mainbg = "./img/IMG_0295.png"
+name = "Apache Nutch"
+mainTitle = "Apache Nutch"
+mainText = "Highly extensible, highly scalable, production-ready Web crawler"
+
+[menus]
+[[menu.main]]
+identifier = "about"
+name = "About"

[nutch] branch master updated: NUTCH-2885 Upgrade to Log4j2 (#692)

2021-08-04 Thread lewismc
This is an automated email from the ASF dual-hosted git repository.

lewismc pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git


The following commit(s) were added to refs/heads/master by this push:
 new e4b7be9  NUTCH-2885 Upgrade to Log4j2 (#692)
e4b7be9 is described below

commit e4b7be9bc30935211c3e7e302788e488b811
Author: Lewis John McGibbney 
AuthorDate: Wed Aug 4 10:00:56 2021 -0700

NUTCH-2885 Upgrade to Log4j2 (#692)

* NUTCH-2885 Upgrade to Log4j2
---
 conf/log4j.properties | 123 --
 conf/log4j2.xml   |  51 +
 ivy/ivy.xml   |  13 ++
 3 files changed, 56 insertions(+), 131 deletions(-)

diff --git a/conf/log4j.properties b/conf/log4j.properties
deleted file mode 100644
index 7b010cb..000
--- a/conf/log4j.properties
+++ /dev/null
@@ -1,123 +0,0 @@
-# Licensed to the Apache Software Foundation (ASF) under one or more
-# contributor license agreements. See the NOTICE file distributed with
-# this work for additional information regarding copyright ownership.
-# The ASF licenses this file to You under the Apache License, Version 2.0
-# (the "License"); you may not use this file except in compliance with
-# the License. You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-# Define some default values that can be overridden by system properties
-hadoop.log.dir=.
-hadoop.log.file=hadoop.log
-
-# RootLogger - DailyRollingFileAppender
-log4j.rootLogger=INFO,DRFA
-
-# Logging Threshold
-log4j.threshold=ALL
-
-#special logging requirements for some commandline tools
-log4j.logger.org.apache.nutch.crawl.CrawlDb=INFO,cmdstdout
-log4j.logger.org.apache.nutch.crawl.CrawlDbMerger=INFO,cmdstdout
-log4j.logger.org.apache.nutch.crawl.CrawlDbReader=INFO,cmdstdout
-log4j.logger.org.apache.nutch.crawl.Crawl=INFO,cmdstdout
-log4j.logger.org.apache.nutch.crawl.DeduplicationJob=INFO,cmdstdout
-log4j.logger.org.apache.nutch.crawl.Generator=INFO,cmdstdout
-log4j.logger.org.apache.nutch.crawl.Injector=INFO,cmdstdout
-log4j.logger.org.apache.nutch.crawl.LinkDb=INFO,cmdstdout
-log4j.logger.org.apache.nutch.crawl.LinkDbMerger=INFO,cmdstdout
-log4j.logger.org.apache.nutch.crawl.LinkDbReader=INFO,cmdstdout
-log4j.logger.org.apache.nutch.fetcher.Fetcher=INFO,cmdstdout
-log4j.logger.org.apache.nutch.fetcher.FetcherItem=INFO,cmdstdout
-log4j.logger.org.apache.nutch.fetcher.FetcherItemQueue=INFO,cmdstdout
-log4j.logger.org.apache.nutch.fetcher.FetcherItemQueues=INFO,cmdstdout
-log4j.logger.org.apache.nutch.fetcher.FetcherThread=INFO,cmdstdout
-log4j.logger.org.apache.nutch.fetcher.QueueFeeder=INFO,cmdstdout
-log4j.logger.org.apache.nutch.hostdb.UpdateHostDb=INFO,cmdstdout
-log4j.logger.org.apache.nutch.hostdb.ReadHostDb=INFO,cmdstdout
-log4j.logger.org.apache.nutch.indexer.IndexingFiltersChecker=INFO,cmdstdout
-log4j.logger.org.apache.nutch.indexer.IndexingJob=INFO,cmdstdout
-log4j.logger.org.apache.nutch.indexer.IndexerOutputFormat=INFO,cmdstdout
-log4j.logger.org.apache.nutch.indexwriter.solr.SolrIndexWriter=INFO,cmdstdout
-log4j.logger.org.apache.nutch.indexwriter.solr.SolrUtils=INFO,cmdstdout
-log4j.logger.org.apache.nutch.exchange.Exchanges=INFO,cmdstdout
-log4j.logger.org.apache.nutch.parse.ParserChecker=INFO,cmdstdout
-log4j.logger.org.apache.nutch.parse.ParseSegment=INFO,cmdstdout
-log4j.logger.org.apache.nutch.plugin.PluginRepository=WARN
-log4j.logger.org.apache.nutch.protocol.RobotRulesParser=INFO,cmdstdout
-log4j.logger.org.apache.nutch.scoring.webgraph.LinkRank=INFO,cmdstdout
-log4j.logger.org.apache.nutch.scoring.webgraph.Loops=INFO,cmdstdout
-log4j.logger.org.apache.nutch.scoring.webgraph.ScoreUpdater=INFO,cmdstdout
-log4j.logger.org.apache.nutch.scoring.webgraph.WebGraph=INFO,cmdstdout
-log4j.logger.org.apache.nutch.scoring.webgraph.NodeDumper=INFO,cmdstdout
-log4j.logger.org.apache.nutch.segment.SegmentChecker=INFO,cmdstdout
-log4j.logger.org.apache.nutch.segment.SegmentMerger=INFO,cmdstdout
-log4j.logger.org.apache.nutch.segment.SegmentReader=INFO,cmdstdout
-log4j.logger.org.apache.nutch.service.NutchServer=INFO,cmdstdout
-log4j.logger.org.apache.nutch.tools.FreeGenerator=INFO,cmdstdout
-log4j.logger.org.apache.nutch.util.domain.DomainStatistics=INFO,cmdstdout
-log4j.logger.org.apache.nutch.util.SitemapProcessor=INFO,cmdstdout
-log4j.logger.org.apache.nutch.webui.NutchUiServer=INFO,cmdstdout
-
-log4j.logger.org.apache.nutch=INFO
-log4j.logger.org.apache.hadoop=WARN
-# log mapreduce job messages and counters
-log4j.logger.org.apache.hadoop.mapreduce.Job=INFO
-
-#
-# Daily R

[nutch-webapp] branch master updated: Add missing files

2021-07-13 Thread lewismc
This is an automated email from the ASF dual-hosted git repository.

lewismc pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch-webapp.git


The following commit(s) were added to refs/heads/master by this push:
 new 93e7b23  Add missing files
93e7b23 is described below

commit 93e7b23812cbfabdd4fca87fd01b3f82c64a4057
Author: Lewis John McGibbney 
AuthorDate: Tue Jul 13 20:35:53 2021 -0700

Add missing files
---
 .asf.yaml  |  16 ++
 .github/pull_request_template.md   |  13 ++
 .github/workflows/master-build.yml |  41 +
 KEYS   | 364 +
 NOTICE.txt |  13 ++
 5 files changed, 447 insertions(+)

diff --git a/.asf.yaml b/.asf.yaml
new file mode 100644
index 000..aa9a939
--- /dev/null
+++ b/.asf.yaml
@@ -0,0 +1,16 @@
+github:
+  description: "Apache Nutch is an extensible and scalable web crawler"
+  homepage: https://nutch.apache.org/
+  labels:
+- web-crawler
+- crawling
+- java
+- nutch
+- hadoop
+- apache
+
+notifications:
+  commits:  commits@nutch.apache.org
+  issues:   d...@nutch.apache.org
+  pullrequests: d...@nutch.apache.org
+  jira_options: link label comment
diff --git a/.github/pull_request_template.md b/.github/pull_request_template.md
new file mode 100644
index 000..d1f9c54
--- /dev/null
+++ b/.github/pull_request_template.md
@@ -0,0 +1,13 @@
+Thanks for your contribution to [Apache Nutch](https://nutch.apache.org/)! Your help is appreciated!
+
+Before opening the pull request, please verify that
+* there is an open issue on the [Nutch issue tracker](https://issues.apache.org/jira/projects/NUTCH) which describes the problem or the improvement. We cannot accept pull requests without an issue because the change wouldn't be listed in the release notes.
+* the issue ID (`NUTCH-`)
+  - is referenced in the title of the pull request
+  - and placed in front of your commit messages surrounded by square brackets (`[NUTCH-] Issue or pull request title`)
+* commits are squashed into a single one (or few commits for larger changes)
+* Java source code follows [Nutch Eclipse Code Formatting rules](https://github.com/apache/nutch/blob/master/eclipse-codeformat.xml)
+* Nutch is successfully built and unit tests pass by running `mvn clean install javadoc:aggregate`
+* there should be no conflicts when merging the pull request branch into the *recent* master branch. If there are conflicts, please try to rebase the pull request branch on top of a freshly pulled master branch.
+
+We will be able to faster integrate your pull request if these conditions are met. If you have any questions how to fix your problem or about using Nutch in general, please sign up for the [Nutch mailing list](https://nutch.apache.org/mailing_lists.html). Thanks!
diff --git a/.github/workflows/master-build.yml b/.github/workflows/master-build.yml
new file mode 100644
index 000..c1a409c
--- /dev/null
+++ b/.github/workflows/master-build.yml
@@ -0,0 +1,41 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#  http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+name: master pr build
+
+on:
+  push:
+branches: [ master ]
+  pull_request:
+branches: [ master ]
+
+
+jobs:
+  build:
+runs-on: ubuntu-latest
+strategy:
+  matrix:
+java: [ '11' ]
+
+steps:
+  - uses: actions/checkout@v2
+  - name: Set up JDK ${{ matrix.java }}
+uses: actions/setup-java@v1
+with:
+  java-version: ${{ matrix.java }}
+  - name: Build with Maven
+run: mvn clean install javadoc:aggregate
diff --git a/KEYS b/KEYS
new file mode 100644
index 000..a1331f9
--- /dev/null
+++ b/KEYS
@@ -0,0 +1,364 @@
+This file contains the PGP keys of various developers.
+Please don't use them for email unless you have to. Their main
+purpose is code signing.
+
+Examples of importing this file in your keystore:
+ gpg --import KEYS.txt
+ (need pgp and other examples here)
+
+Examples of adding your key to this file:
+ pgp -kxa  and append it to this file.
+ (pgpk -ll  && pgpk -xa ) >> this file.
+ (gpg --list-sigs 
+ && gpg --armor --expor

[nutch-webapp] branch master created (now da3c282)

2021-07-12 Thread lewismc
This is an automated email from the ASF dual-hosted git repository.

lewismc pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/nutch-webapp.git.


  at da3c282  Move Nutch WebApp to separate repository

This branch includes the following new commits:

 new da3c282  Move Nutch WebApp to separate repository

The 1 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.



[nutch] branch master updated: fireant upgrade dependency httpcore in ivy/ivy.xml from 4.4.9 to 4.4.14 (#681)

2021-07-01 Thread lewismc
This is an automated email from the ASF dual-hosted git repository.

lewismc pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git


The following commit(s) were added to refs/heads/master by this push:
 new 53ed506  fireant upgrade dependency httpcore in ivy/ivy.xml from 4.4.9 to 4.4.14 (#681)
53ed506 is described below

commit 53ed50626b371d163033015b4f8c87167393c33d
Author: Lewis John McGibbney 
AuthorDate: Wed Jun 30 23:07:39 2021 -0700

fireant upgrade dependency httpcore in ivy/ivy.xml from 4.4.9 to 4.4.14 (#681)
---
 ivy/ivy.xml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/ivy/ivy.xml b/ivy/ivy.xml
index e05c81c..2781c6c 100644
--- a/ivy/ivy.xml
+++ b/ivy/ivy.xml
@@ -143,7 +143,7 @@



-   
+   

 



[nutch] branch master updated: NUTCH-2882 Configure NutchUiServer for DEPLOYMENT and improve logging (#690)

2021-06-28 Thread lewismc
This is an automated email from the ASF dual-hosted git repository.

lewismc pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git


The following commit(s) were added to refs/heads/master by this push:
 new d6875e1  NUTCH-2882 Configure NutchUiServer for DEPLOYMENT and improve logging (#690)
d6875e1 is described below

commit d6875e13515328b759e204a6b4bba8725f2ea7c2
Author: Lewis John McGibbney 
AuthorDate: Mon Jun 28 09:12:27 2021 -0700

NUTCH-2882 Configure NutchUiServer for DEPLOYMENT and improve logging (#690)
---
 conf/log4j.properties   | 2 ++
 src/java/org/apache/nutch/webui/NutchUiApplication.java | 6 ++
 2 files changed, 8 insertions(+)

diff --git a/conf/log4j.properties b/conf/log4j.properties
index 67311d1..7b010cb 100644
--- a/conf/log4j.properties
+++ b/conf/log4j.properties
@@ -60,9 +60,11 @@ log4j.logger.org.apache.nutch.scoring.webgraph.NodeDumper=INFO,cmdstdout
 log4j.logger.org.apache.nutch.segment.SegmentChecker=INFO,cmdstdout
 log4j.logger.org.apache.nutch.segment.SegmentMerger=INFO,cmdstdout
 log4j.logger.org.apache.nutch.segment.SegmentReader=INFO,cmdstdout
+log4j.logger.org.apache.nutch.service.NutchServer=INFO,cmdstdout
 log4j.logger.org.apache.nutch.tools.FreeGenerator=INFO,cmdstdout
 log4j.logger.org.apache.nutch.util.domain.DomainStatistics=INFO,cmdstdout
 log4j.logger.org.apache.nutch.util.SitemapProcessor=INFO,cmdstdout
+log4j.logger.org.apache.nutch.webui.NutchUiServer=INFO,cmdstdout
 
 log4j.logger.org.apache.nutch=INFO
 log4j.logger.org.apache.hadoop=WARN
diff --git a/src/java/org/apache/nutch/webui/NutchUiApplication.java b/src/java/org/apache/nutch/webui/NutchUiApplication.java
index 67ac281..fc08874 100644
--- a/src/java/org/apache/nutch/webui/NutchUiApplication.java
+++ b/src/java/org/apache/nutch/webui/NutchUiApplication.java
@@ -18,6 +18,7 @@ package org.apache.nutch.webui;
 
 import org.apache.nutch.webui.pages.DashboardPage;
 import org.apache.nutch.webui.pages.assets.NutchUiCssReference;
+import org.apache.wicket.RuntimeConfigurationType;
 import org.apache.wicket.markup.html.WebPage;
 import org.apache.wicket.protocol.http.WebApplication;
 import org.apache.wicket.spring.injection.annot.SpringComponentInjector;
@@ -61,6 +62,11 @@ public class NutchUiApplication extends WebApplication implements
 new SpringComponentInjector(this, context));
   }
 
+  @Override
+  public RuntimeConfigurationType getConfigurationType() {
+return RuntimeConfigurationType.DEPLOYMENT;
+  }
+
   private void configureTheme(BootstrapSettings settings) {
 Theme theme = new Theme(THEME_NAME, BootstrapCssReference.instance(),
 FontAwesomeCssReference.instance(), NutchUiCssReference.instance());


[nutch] branch master updated: NUTCH-2881 bug in 'nutch' symlink in docker container (#689)

2021-06-26 Thread lewismc
This is an automated email from the ASF dual-hosted git repository.

lewismc pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git


The following commit(s) were added to refs/heads/master by this push:
 new 08de742  NUTCH-2881 bug in 'nutch' symlink in docker container (#689)
08de742 is described below

commit 08de74266b2e502d6915831a6e19fea21b099e28
Author: Lewis John McGibbney 
AuthorDate: Sat Jun 26 19:04:59 2021 -0700

NUTCH-2881 bug in 'nutch' symlink in docker container (#689)

* NUTCH-2881 bug in 'nutch' symlink in docker container
---
 docker/Dockerfile | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/docker/Dockerfile b/docker/Dockerfile
index 0f06894..29ead46 100644
--- a/docker/Dockerfile
+++ b/docker/Dockerfile
@@ -23,6 +23,7 @@ RUN apk update
 RUN apk --no-cache add apache-ant bash git openjdk11
 
 RUN echo 'export JAVA_HOME=/usr/lib/jvm/java-11-openjdk' >> $HOME/.bashrc
+env NUTCH_HOME='/root/nutch_source/runtime/local'
 
 # Checkout and build the Nutch master branch (1.x)
 RUN git clone https://github.com/apache/nutch.git nutch_source && \
@@ -31,5 +32,6 @@ RUN git clone https://github.com/apache/nutch.git nutch_source && \
  rm -rf build/ && \
  rm -rf /root/.ivy2/
 
-# Convenience symlink to Nutch runtime local
-RUN ln -s nutch_source/runtime/local $HOME/nutch
+# Create symlinks for runtime/local/bin/nutch and runtime/local/bin/crawl
+RUN ln -sf $NUTCH_HOME/bin/nutch /usr/local/bin/
+RUN ln -sf $NUTCH_HOME/bin/crawl /usr/local/bin/
\ No newline at end of file
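
The Dockerfile change above replaces the single convenience link with two links placed in `/usr/local/bin/`, each pointing at a launcher under `$NUTCH_HOME/bin`. A minimal sketch of that `ln -sf` pattern outside Docker follows, assuming a throwaway directory tree; `work_dir`, `nutch_home`, `bin_dir` and the stub script are hypothetical stand-ins for illustration, not part of Nutch.

```shell
#!/bin/sh
set -eu

# Hypothetical stand-ins for $NUTCH_HOME and /usr/local/bin, kept in a temp dir.
work_dir="$(mktemp -d)"
trap 'rm -rf "$work_dir"' EXIT
nutch_home="$work_dir/nutch_source/runtime/local"
bin_dir="$work_dir/bin"
mkdir -p "$nutch_home/bin" "$bin_dir"

# Stub launcher standing in for runtime/local/bin/nutch.
printf '#!/bin/sh\necho "nutch stub invoked"\n' > "$nutch_home/bin/nutch"
chmod +x "$nutch_home/bin/nutch"

# Same form as the Dockerfile: absolute target, directory destination,
# -f so a rebuild can overwrite an existing link.
ln -sf "$nutch_home/bin/nutch" "$bin_dir/"

# Running the link executes the script it points to.
result="$(sh "$bin_dir/nutch")"
echo "$result"
```

Because the link target is an absolute path, the link does not depend on the directory in which `ln` was run.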


[nutch] branch master updated: fireant upgrade dependency junit in ivy/ivy.xml from 4.13.1 to 4.13.2 (#666)

2021-06-13 Thread lewismc
This is an automated email from the ASF dual-hosted git repository.

lewismc pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git


The following commit(s) were added to refs/heads/master by this push:
 new 9c8ae8e  fireant upgrade dependency junit in ivy/ivy.xml from 4.13.1 to 4.13.2 (#666)
9c8ae8e is described below

commit 9c8ae8e9ad9c0e4b9a9b8cbe53b5021c5485762b
Author: Lewis John McGibbney 
AuthorDate: Sun Jun 13 19:57:30 2021 -0700

fireant upgrade dependency junit in ivy/ivy.xml from 4.13.1 to 4.13.2 (#666)
---
 ivy/ivy.xml | 94 -
 1 file changed, 49 insertions(+), 45 deletions(-)

diff --git a/ivy/ivy.xml b/ivy/ivy.xml
index 00d67eb..e05c81c 100644
--- a/ivy/ivy.xml
+++ b/ivy/ivy.xml
@@ -1,20 +1,24 @@
-
-
-
-
-http://ant.apache.org/ivy/maven;>
+
+
+
+
+http://ant.apache.org/ivy/maven; version="1.0">

-   https://www.apache.org/licenses/LICENSE-2.0.txt; />
+   https://www.apache.org/licenses/LICENSE-2.0.txt; />
https://nutch.apache.org/; />
https://nutch.apache.org/;>Nutch is an 
open source web-search
software. It builds on Hadoop, Tika and Solr, adding 
web-specifics,
@@ -46,7 +50,7 @@



-   
+   

 

@@ -58,14 +62,14 @@



-   
-   
-   
+   
+   
+   

 

 
-   
+   

 

@@ -78,36 +82,36 @@


 
-   
-   
-   
-   
-   
-   
-   
-   
-   
+   
+   
+   
+   
+   
+   
+   
+   
+   
 


-   
-   
-   
+   
+   
+   

-   
+   

-   
-   
+   
+   

 

-   
+   

-   
+   



@@ -130,7 +134,7 @@



-   
+   

 

@@ -138,18 +142,18 @@
 


-   
-   
-   
+   
+   
+   
 
-   
+   
 




-   
+   
 

 
-
+
\ No newline at end of file


[nutch] branch master updated: NUTCH-2864 Upgrade Dockerfile to use JDK 11 (#647)

2021-06-03 Thread lewismc
This is an automated email from the ASF dual-hosted git repository.

lewismc pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git


The following commit(s) were added to refs/heads/master by this push:
 new cc8d76a  NUTCH-2864 Upgrade Dockerfile to use JDK 11 (#647)
cc8d76a is described below

commit cc8d76afe4f86691008b5673b182bb0e54a59710
Author: Lewis John McGibbney 
AuthorDate: Thu Jun 3 13:15:03 2021 -0700

NUTCH-2864 Upgrade Dockerfile to use JDK 11 (#647)

* NUTCH-2864 Upgrade Dockerfile to use JDK 11
---
 docker/Dockerfile | 16 +---
 docker/README.md  |  9 -
 2 files changed, 17 insertions(+), 8 deletions(-)

diff --git a/docker/Dockerfile b/docker/Dockerfile
index 3077d1a..0f06894 100644
--- a/docker/Dockerfile
+++ b/docker/Dockerfile
@@ -13,21 +13,23 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
-FROM ubuntu:18.04
+FROM alpine:3.13
 MAINTAINER Apache Nutch Committers 
 
 WORKDIR /root/
 
-
 # Install dependencies
-RUN apt update
-RUN apt install -y ant git openjdk-8-jdk-headless
+RUN apk update
+RUN apk --no-cache add apache-ant bash git openjdk11
 
-# Set up JAVA_HOME
-RUN echo 'export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64' >> $HOME/.bashrc
+RUN echo 'export JAVA_HOME=/usr/lib/jvm/java-11-openjdk' >> $HOME/.bashrc
 
 # Checkout and build the Nutch master branch (1.x)
-RUN git clone https://github.com/apache/nutch.git nutch_source && cd nutch_source && ant runtime
+RUN git clone https://github.com/apache/nutch.git nutch_source && \
+ cd nutch_source && \
+ ant runtime && \
+ rm -rf build/ && \
+ rm -rf /root/.ivy2/
 
 # Convenience symlink to Nutch runtime local
 RUN ln -s nutch_source/runtime/local $HOME/nutch
diff --git a/docker/README.md b/docker/README.md
index 58a3b5e..2ac88cc 100644
--- a/docker/README.md
+++ b/docker/README.md
@@ -1,5 +1,12 @@
 # Nutch Dockerfile #
 
+![Docker Pulls](https://img.shields.io/docker/pulls/apache/nutch?style=for-the-badge)
+![Docker Image Size (latest by date)](https://img.shields.io/docker/image-size/apache/nutch?style=for-the-badge)
+![Docker Image Version (latest semver)](https://img.shields.io/docker/v/apache/nutch?style=for-the-badge)
+![MicroBadger Layers](https://img.shields.io/microbadger/layers/apache/nutch?style=for-the-badge)
+![Docker Stars](https://img.shields.io/docker/stars/apache/nutch?style=for-the-badge)
+![Docker Automated build](https://img.shields.io/docker/automated/apache/nutch?style=for-the-badge)
+
 Get up and running quickly with Nutch on Docker.
 
 ## What is Nutch?
@@ -18,7 +25,7 @@ Current configuration of this image consists of components:
 
 ##  Base Image
 
-* [ubuntu:18.04](https://hub.docker.com/_/ubuntu/)
+* [alpine:3.13](https://hub.docker.com/_/alpine/)
 
 ## Tips
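
The Dockerfile in this commit records `JAVA_HOME` by appending an `export` line to `$HOME/.bashrc`. A minimal sketch of what that line does is given below, assuming a temporary file stands in for `.bashrc`; the `rc_file`, `before` and `after` names are hypothetical. It shows that the variable reaches only shells that actually read the file.

```shell
#!/bin/sh
set -eu

# Temporary file standing in for $HOME/.bashrc (an assumption for this sketch).
rc_file="$(mktemp)"
trap 'rm -f "$rc_file"' EXIT

# Same line the Dockerfile appends.
echo 'export JAVA_HOME=/usr/lib/jvm/java-11-openjdk' >> "$rc_file"

# A shell that never reads the file does not see the variable.
before="$(unset JAVA_HOME; sh -c 'echo "${JAVA_HOME:-unset}"')"

# A shell that sources the file does.
after="$(sh -c '. "$1"; echo "$JAVA_HOME"' sh "$rc_file")"

echo "without sourcing: $before"
echo "after sourcing:   $after"
```

Bash reads `.bashrc` for interactive shells, so the setting applies when a user opens `bash` in the container rather than to non-interactive commands.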
 


[nutch] branch master updated: NUTCH-2855 Update org.elasticsearch.client (#577)

2021-04-01 Thread lewismc
This is an automated email from the ASF dual-hosted git repository.

lewismc pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git


The following commit(s) were added to refs/heads/master by this push:
 new 2837039  NUTCH-2855 Update org.elasticsearch.client (#577)
2837039 is described below

commit 2837039b9c5b52a88c2029a5e29c81cecd8953f3
Author: Lewis John McGibbney 
AuthorDate: Thu Apr 1 08:56:43 2021 -0700

NUTCH-2855 Update org.elasticsearch.client (#577)

* NUTCH-2855 Update org.elasticsearch.client
---
 src/plugin/indexer-elastic/ivy.xml|  2 +-
 src/plugin/indexer-elastic/plugin.xml | 79 ++-
 2 files changed, 41 insertions(+), 40 deletions(-)

diff --git a/src/plugin/indexer-elastic/ivy.xml b/src/plugin/indexer-elastic/ivy.xml
index 4b8d4a7..3da98e3 100644
--- a/src/plugin/indexer-elastic/ivy.xml
+++ b/src/plugin/indexer-elastic/ivy.xml
@@ -36,7 +36,7 @@
   
 
   
-
+
   
   
   
diff --git a/src/plugin/indexer-elastic/plugin.xml b/src/plugin/indexer-elastic/plugin.xml
index 45ac61e..1e41b7e 100644
--- a/src/plugin/indexer-elastic/plugin.xml
+++ b/src/plugin/indexer-elastic/plugin.xml
@@ -22,49 +22,50 @@
 
 
 
-
-
-
-
-
-
-
-
-
-
+
+
+
+
+
+
+
+
+
+
+
 
 
 
-
-
-
-
-
-
-
-
-
+
+
+
+
+
+
+
+
+
 
-
+
 
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
 
 
   
@@ -80,4 +81,4 @@
   class="org.apache.nutch.indexwriter.elastic.ElasticIndexWriter" />
   
 
-
\ No newline at end of file
+


[nutch] branch master updated: NUTCH-2857 Upgrade from JDK1.8 --> JDK11 (#573)

2021-03-21 Thread lewismc
This is an automated email from the ASF dual-hosted git repository.

lewismc pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git


The following commit(s) were added to refs/heads/master by this push:
 new b91fae5  NUTCH-2857 Upgrade from JDK1.8 --> JDK11 (#573)
b91fae5 is described below

commit b91fae53e7de1d4c240ba91c951024441f2ea01f
Author: Lewis John McGibbney 
AuthorDate: Sun Mar 21 08:30:41 2021 -0700

NUTCH-2857 Upgrade from JDK1.8 --> JDK11 (#573)

* NUTCH-2857 Upgrade from JDK1.8 --> JDK11
---
 .github/workflows/master-build.yml |  2 +-
 default.properties |  4 ++--
 ivy/mvn.template   |  4 ++--
 .../org/apache/nutch/indexer/IndexWriterParams.java|  6 +++---
 src/java/org/apache/nutch/metadata/MetaWrapper.java|  2 +-
 src/java/org/apache/nutch/net/URLNormalizers.java  |  4 ++--
 src/java/org/apache/nutch/parse/ParserChecker.java | 18 +-
 .../org/apache/nutch/segment/SegmentMergeFilter.java   |  2 +-
 .../org/apache/nutch/segment/SegmentMergeFilters.java  |  4 ++--
 .../net/urlnormalizer/regex/RegexURLNormalizer.java|  2 +-
 10 files changed, 24 insertions(+), 24 deletions(-)

diff --git a/.github/workflows/master-build.yml b/.github/workflows/master-build.yml
index 7e74840..e3ed11c 100644
--- a/.github/workflows/master-build.yml
+++ b/.github/workflows/master-build.yml
@@ -29,7 +29,7 @@ jobs:
 runs-on: ubuntu-latest
 strategy:
   matrix:
-java: [ '1.8' ]
+java: [ '11' ]
 
 steps:
   - uses: actions/checkout@v2
diff --git a/default.properties b/default.properties
index f250904..cf82c84 100644
--- a/default.properties
+++ b/default.properties
@@ -43,7 +43,7 @@ test.junit.output.format = plain
 # Proxy Host and Port to use for building JavaDoc
 javadoc.proxy.host=-J-DproxyHost=
 javadoc.proxy.port=-J-DproxyPort=
-javadoc.link.java=https://docs.oracle.com/javase/8/docs/api/
+javadoc.link.java=https://docs.oracle.com/en/java/javase/11/docs/api/
 javadoc.link.hadoop=https://hadoop.apache.org/docs/r3.1.3/api/
 #javadoc.link.lucene.core=https://lucene.apache.org/core/8_4_1/core/
 #javadoc.link.lucene.analyzers-common=https://lucene.apache.org/core/8_4_1/analyzers-common/
@@ -57,7 +57,7 @@ bin.dist.version.dir=${dist.dir}/${final.name}-bin
 javac.debug=on
 javac.optimize=on
 javac.deprecation=on
-javac.version=1.8
+javac.version=11
 
 runtime.dir=./runtime
 runtime.deploy=${runtime.dir}/deploy
diff --git a/ivy/mvn.template b/ivy/mvn.template
index edfb550..b38b37f 100644
--- a/ivy/mvn.template
+++ b/ivy/mvn.template
@@ -130,8 +130,8 @@
   maven-compiler-plugin
   3.8.1
   
-1.8
-1.8
+11
+11
   
 
   
diff --git a/src/java/org/apache/nutch/indexer/IndexWriterParams.java b/src/java/org/apache/nutch/indexer/IndexWriterParams.java
index e7b3152..52cc4f9 100644
--- a/src/java/org/apache/nutch/indexer/IndexWriterParams.java
+++ b/src/java/org/apache/nutch/indexer/IndexWriterParams.java
@@ -24,10 +24,10 @@ import java.util.Map;
 public class IndexWriterParams extends HashMap {
 
   /**
-   * Constructs a new HashMap with the same mappings as the
-   * specified Map.  The HashMap is created with
+   * Constructs a new HashMap with the same mappings as the
+   * specified Map.  The HashMap is created with
* default load factor (0.75) and an initial capacity sufficient to
-   * hold the mappings in the specified Map.
+   * hold the mappings in the specified Map.
*
* @param m the map whose mappings are to be placed in this map
* @throws NullPointerException if the specified map is null
diff --git a/src/java/org/apache/nutch/metadata/MetaWrapper.java b/src/java/org/apache/nutch/metadata/MetaWrapper.java
index a58253c..2547734 100644
--- a/src/java/org/apache/nutch/metadata/MetaWrapper.java
+++ b/src/java/org/apache/nutch/metadata/MetaWrapper.java
@@ -26,7 +26,7 @@ import org.apache.nutch.crawl.NutchWritable;
 
 /**
  * This is a simple decorator that adds metadata to any Writable-s that can be
- * serialized by NutchWritable. This is useful when data needs to be
+ * serialized by {@link NutchWritable}. This is useful when data needs to be
  * temporarily enriched during processing, but this temporary metadata doesn't
  * need to be permanently stored after the job is done.
  * 
diff --git a/src/java/org/apache/nutch/net/URLNormalizers.java b/src/java/org/apache/nutch/net/URLNormalizers.java
index 4ec904d..bf947f7 100644
--- a/src/java/org/apache/nutch/net/URLNormalizers.java
+++ b/src/java/org/apache/nutch/net/URLNormalizers.java
@@ -42,7 +42,7 @@ import org.apache.nutch.util.ObjectCache;
  * This class uses a "chained filter" pattern to run defined normalizers.
  * Different lists of normalizers may be defined for different "scopes", or
  * contexts wher

[nutch] branch master updated: NUTCH-2850 Method ignores exceptional return value (#570)

2021-02-18 Thread lewismc
This is an automated email from the ASF dual-hosted git repository.

lewismc pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git


The following commit(s) were added to refs/heads/master by this push:
 new 2724578  NUTCH-2850 Method ignores exceptional return value (#570)
2724578 is described below

commit 2724578ab41cb9e8098975bddbde7df2085b1c61
Author: Lewis John McGibbney 
AuthorDate: Thu Feb 18 07:21:43 2021 -0800

NUTCH-2850 Method ignores exceptional return value (#570)
---
 src/java/org/apache/nutch/tools/FileDumper.java | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/src/java/org/apache/nutch/tools/FileDumper.java b/src/java/org/apache/nutch/tools/FileDumper.java
index 4e7338e..65c7dca 100644
--- a/src/java/org/apache/nutch/tools/FileDumper.java
+++ b/src/java/org/apache/nutch/tools/FileDumper.java
@@ -234,7 +234,9 @@ public class FileDumper {
 File fullOutputDir = new File(org.apache.commons.lang3.StringUtils.join(Arrays.copyOf(splitPath, splitPath.length - 1), "/"));
 
 if (!fullOutputDir.exists()) {
-  fullOutputDir.mkdirs();
+  if(!fullOutputDir.mkdirs());
+throw new Exception("Unable to create: ["
+  + fullOutputDir.getAbsolutePath() + "]"); 
 }
   } else {
 outputFullPath = String.format("%s/%s", fullDir, DumpFileUtil.createFileName(md5Ofurl, baseName, extension));
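
Note on the FileDumper hunk above: as shown, the added `if` statement ends in a semicolon, so its body is empty and the `throw` on the following line runs unconditionally in Java, whether or not the directory was created. A minimal sketch of a return-value check without that pitfall follows. The names `DirUtil` and `ensureDir` are hypothetical and not part of Nutch, and an unchecked exception stands in for the `Exception` used in the patch.

```java
import java.io.File;

/** Hypothetical helper, not part of Nutch: fail loudly if a directory cannot be created. */
final class DirUtil {

  private DirUtil() {}

  /**
   * Ensures that dir exists as a directory, creating missing parents.
   * File.mkdirs() returns false both on failure and when the directory
   * already exists, so the outcome is confirmed with isDirectory().
   */
  static File ensureDir(File dir) {
    if (!dir.isDirectory() && !dir.mkdirs() && !dir.isDirectory()) {
      // Braces and no stray semicolon: the throw only runs on real failure.
      throw new IllegalStateException(
          "Unable to create: [" + dir.getAbsolutePath() + "]");
    }
    return dir;
  }

  public static void main(String[] args) {
    File dir = new File(System.getProperty("java.io.tmpdir"), "dirutil-demo/a/b");
    System.out.println(ensureDir(dir).isDirectory());
  }
}
```

The final `isDirectory()` check also covers the case where another thread or process creates the directory between the first check and the `mkdirs()` call.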



[nutch] branch master updated: NUTCH-2851 Random object created and used only once (#571)

2021-02-18 Thread lewismc
This is an automated email from the ASF dual-hosted git repository.

lewismc pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git


The following commit(s) were added to refs/heads/master by this push:
 new 5250d62  NUTCH-2851 Random object created and used only once (#571)
5250d62 is described below

commit 5250d62986468b23509a82d2aaa32bdc11cf02a8
Author: Lewis John McGibbney 
AuthorDate: Thu Feb 18 07:20:59 2021 -0800

NUTCH-2851 Random object created and used only once (#571)
---
 src/java/org/apache/nutch/crawl/Generator.java   | 5 +++--
 src/java/org/apache/nutch/indexer/IndexingJob.java   | 4 +++-
 src/java/org/apache/nutch/segment/SegmentReader.java | 5 -
 src/java/org/apache/nutch/tools/DmozParser.java  | 5 -
 4 files changed, 14 insertions(+), 5 deletions(-)

diff --git a/src/java/org/apache/nutch/crawl/Generator.java b/src/java/org/apache/nutch/crawl/Generator.java
index dcba9bf..00eb18f 100644
--- a/src/java/org/apache/nutch/crawl/Generator.java
+++ b/src/java/org/apache/nutch/crawl/Generator.java
@@ -35,7 +35,6 @@ import org.apache.hadoop.conf.Configurable;
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
 import org.apache.commons.jexl3.JexlExpression;
-import org.antlr.v4.parse.ANTLRParser.throwsSpec_return;
 import org.apache.commons.jexl3.JexlContext;
 import org.apache.commons.jexl3.MapContext;
 import org.apache.hadoop.mapreduce.Counter;
@@ -90,6 +89,8 @@ import org.apache.nutch.util.URLUtil;
  **/
 public class Generator extends NutchTool implements Tool {
 
+  private static final Random RANDOM = new Random();
+
   protected static final Logger LOG = LoggerFactory
   .getLogger(MethodHandles.lookup().lookupClass());
 
@@ -1013,7 +1014,7 @@ public class Generator extends NutchTool implements Tool {
 Job job = NutchJob.getInstance(getConf());
 job.setJobName("generate: partition " + segment);
 Configuration conf = job.getConfiguration();
-conf.setInt("partition.url.seed", new Random().nextInt());
+conf.setInt("partition.url.seed", RANDOM.nextInt());
 
 FileInputFormat.addInputPath(job, inputDir);
 job.setInputFormatClass(SequenceFileInputFormat.class);
diff --git a/src/java/org/apache/nutch/indexer/IndexingJob.java b/src/java/org/apache/nutch/indexer/IndexingJob.java
index 0966276..0fe29a7 100644
--- a/src/java/org/apache/nutch/indexer/IndexingJob.java
+++ b/src/java/org/apache/nutch/indexer/IndexingJob.java
@@ -54,6 +54,8 @@ import org.slf4j.LoggerFactory;
 
 public class IndexingJob extends NutchTool implements Tool {
 
+  private static final Random RANDOM = new Random();
+
   private static final Logger LOG = LoggerFactory
   .getLogger(MethodHandles.lookup().lookupClass());
 
@@ -136,7 +138,7 @@ public class IndexingJob extends NutchTool implements Tool {
 job.setReduceSpeculativeExecution(false);
 
 final Path tmp = new Path("tmp_" + System.currentTimeMillis() + "-"
-+ new Random().nextInt());
++ RANDOM.nextInt());
 
 FileOutputFormat.setOutputPath(job, tmp);
 try {
diff --git a/src/java/org/apache/nutch/segment/SegmentReader.java b/src/java/org/apache/nutch/segment/SegmentReader.java
index 284daed..2f2fefd 100644
--- a/src/java/org/apache/nutch/segment/SegmentReader.java
+++ b/src/java/org/apache/nutch/segment/SegmentReader.java
@@ -35,6 +35,7 @@ import java.util.HashMap;
 import java.util.Iterator;
 import java.util.List;
 import java.util.Map;
+import java.util.Random;
 
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
@@ -74,6 +75,8 @@ import org.apache.nutch.util.SegmentReaderUtil;
 /** Dump the content of a segment. */
 public class SegmentReader extends Configured implements Tool {
 
+  private static final Random RANDOM = new Random();
+
   private static final Logger LOG = LoggerFactory
   .getLogger(MethodHandles.lookup().lookupClass());
 
@@ -220,7 +223,7 @@ public class SegmentReader extends Configured implements Tool {
 job.setJarByClass(SegmentReader.class);
 
 Path tempDir = new Path(conf.get("hadoop.tmp.dir", "/tmp") + "/segread-"
-+ new java.util.Random().nextInt());
++ RANDOM.nextInt());
 FileSystem fs = tempDir.getFileSystem(conf);
 fs.delete(tempDir, true);
 
diff --git a/src/java/org/apache/nutch/tools/DmozParser.java b/src/java/org/apache/nutch/tools/DmozParser.java
index b68facb..8db4817 100644
--- a/src/java/org/apache/nutch/tools/DmozParser.java
+++ b/src/java/org/apache/nutch/tools/DmozParser.java
@@ -54,6 +54,9 @@ import org.apache.nutch.util.NutchConfiguration;
  * RDF into a flat file of URLs to be injected. 
  */
 public class DmozParser {
+
+  private static final Random RANDOM = new Random();
+
   private static final Logger LOG = LoggerFactory
   .getLogger(MethodHandles.lookup().lookupClass());
 
@@ -134,7 +137,7 @@ public class DmozParser {
   this.includeAdult = incl
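
The pattern applied across these four classes (one shared `Random` held in a static field instead of a new instance per call) can be sketched as follows. `TempNames` and `tempName` are hypothetical names used only for illustration. The sketch assumes, as the patch does, that a single `java.util.Random` shared by all callers is acceptable; `Random` is thread-safe, though it can become a point of contention under heavy concurrent use.

```java
import java.util.Random;

/** Hypothetical illustration, not part of Nutch: reuse one Random for temp names. */
final class TempNames {

  // Created once per class load rather than once per call.
  private static final Random RANDOM = new Random();

  private TempNames() {}

  /** Builds a name like the "tmp_<millis>-<int>" paths seen in the diff. */
  static String tempName(String prefix) {
    return prefix + System.currentTimeMillis() + "-" + RANDOM.nextInt();
  }

  public static void main(String[] args) {
    System.out.println(tempName("tmp_"));
  }
}
```

Because `nextInt()` may return a negative value, the generated name can contain two consecutive hyphens.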

[nutch] branch master updated: NUTCH-2849 Replace remaining package.html files with package-info.java (#569)

2021-02-16 Thread lewismc
This is an automated email from the ASF dual-hosted git repository.

lewismc pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git


The following commit(s) were added to refs/heads/master by this push:
 new 2fae4cd  NUTCH-2849 Replace remaining package.html files with 
package-info.java (#569)
2fae4cd is described below

commit 2fae4cde67a05cf1fa9ecdd6b6bd5307c0e46fe7
Author: Lewis John McGibbney 
AuthorDate: Tue Feb 16 10:40:00 2021 -0800

NUTCH-2849 Replace remaining package.html files with package-info.java 
(#569)
---
 build.xml  |  7 +++-
 .../org/apache/nutch/crawl/package-info.java}  |  8 ++--
 src/java/org/apache/nutch/crawl/package.html   |  5 ---
 .../org/apache/nutch/fetcher/package-info.java}|  8 ++--
 src/java/org/apache/nutch/fetcher/package.html |  5 ---
 .../org/apache/nutch/indexer/package-info.java}| 16 ---
 src/java/org/apache/nutch/indexer/package.html | 10 -
 .../org/apache/nutch/metadata/package-info.java}   | 11 ++---
 src/java/org/apache/nutch/metadata/package.html|  6 ---
 src/java/org/apache/nutch/plugin/package-info.java | 42 +++
 src/java/org/apache/nutch/plugin/package.html  | 40 --
 .../apache/nutch/util/domain/package-info.java}| 17 +---
 src/java/org/apache/nutch/util/domain/package.html | 14 ---
 .../org/creativecommons/nutch/package-info.java}   |  8 ++--
 .../java/org/creativecommons/nutch/package.html|  5 ---
 .../apache/nutch/indexer/anchor/package-info.java} |  8 ++--
 .../org/apache/nutch/indexer/anchor/package.html   |  5 ---
 .../apache/nutch/indexer/basic/package-info.java}  | 10 ++---
 .../org/apache/nutch/indexer/basic/package.html|  5 ---
 .../apache/nutch/indexer/more/package-info.java}   | 11 ++---
 .../org/apache/nutch/indexer/more/package.html |  6 ---
 .../nutch/indexer/staticfield/package-info.java}   | 12 +++---
 .../apache/nutch/indexer/staticfield/package.html  |  5 ---
 .../apache/nutch/analysis/lang/package-info.java}  | 13 +++---
 .../org/apache/nutch/analysis/lang/package.html|  6 ---
 .../nutch/protocol/http/api/package-info.java} | 11 ++---
 .../apache/nutch/protocol/http/api/package.html|  6 ---
 .../nutch/microformats/reltag/package-info.java}   | 11 ++---
 .../apache/nutch/microformats/reltag/package.html  |  8 
 .../org/apache/nutch/parse/html/package-info.java} | 11 ++---
 .../java/org/apache/nutch/parse/html/package.html  |  5 ---
 .../apache/nutch/protocol/file/package-info.java}  |  8 ++--
 .../org/apache/nutch/protocol/file/package.html|  5 ---
 .../apache/nutch/protocol/ftp/package-info.java}   |  8 ++--
 .../org/apache/nutch/protocol/ftp/package.html |  5 ---
 .../htmlunit/{package.html => package-info.java}   |  8 ++--
 .../apache/nutch/protocol/http/package-info.java}  |  8 ++--
 .../org/apache/nutch/protocol/http/package.html|  5 ---
 .../nutch/protocol/httpclient/package-info.java}   | 15 ---
 .../apache/nutch/protocol/httpclient/package.html  |  9 
 .../interactiveselenium/package-info.java} |  8 ++--
 .../protocol/interactiveselenium/package.html  |  5 ---
 .../nutch/protocol/selenium/package-info.java} |  8 ++--
 .../apache/nutch/protocol/selenium/package.html|  5 ---
 .../nutch/scoring/metadata/package-info.java   | 32 ++
 .../org/apache/nutch/scoring/metadata/package.html | 33 ---
 .../org/apache/nutch/collection/package-info.java  | 49 ++
 .../java/org/apache/nutch/collection/package.html  | 36 
 .../apache/nutch/indexer/tld/package-info.java}|  8 ++--
 .../java/org/apache/nutch/indexer/tld/package.html |  5 ---
 .../apache/nutch/scoring/tld/package-info.java}|  8 ++--
 .../java/org/apache/nutch/scoring/tld/package.html |  5 ---
 .../nutch/urlfilter/automaton/package-info.java}   | 12 +++---
 .../apache/nutch/urlfilter/automaton/package.html  |  9 
 .../nutch/urlfilter/prefix/package-info.java}  |  8 ++--
 .../org/apache/nutch/urlfilter/prefix/package.html |  5 ---
 .../nutch/urlfilter/regex/package-info.java}   | 10 ++---
 .../org/apache/nutch/urlfilter/regex/package.html  |  5 ---
 .../nutch/urlfilter/validator/package-info.java}   | 14 ---
 .../apache/nutch/urlfilter/validator/package.html  |  9 
 .../nutch/indexer/urlmeta/package-info.java}   | 16 ---
 .../org/apache/nutch/indexer/urlmeta/package.html  | 12 --
 .../nutch/scoring/urlmeta/package-info.java}   | 15 ---
 .../org/apache/nutch/scoring/urlmeta/package.html  | 11 -
 64 files changed, 292 insertions(+), 442 deletions(-)

diff --git a/build.xml b/build.xml
index ec003c3..dcb7b94 100644
--- a/build.xml
+++ b/build.xml
@@ -186,6 +186,7 @@
   doctitle="${name} ${version} API"
   bottom="Copyright &amp;copy; ${year} The Apache Software Foundation"
   failonerror="true

[nutch] branch master updated: NUTCH-2840 Fix 'report-vulnerabilities' ant target in build.xml (#561)

2021-01-31 Thread lewismc
This is an automated email from the ASF dual-hosted git repository.

lewismc pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git


The following commit(s) were added to refs/heads/master by this push:
 new 66bb62a  NUTCH-2840 Fix 'report-vulnerabilities' ant target in 
build.xml (#561)
66bb62a is described below

commit 66bb62a589ac2651771bf61b62786991e65539f8
Author: Lewis John McGibbney 
AuthorDate: Sun Jan 31 16:06:52 2021 -0800

NUTCH-2840 Fix 'report-vulnerabilities' ant target in build.xml (#561)

* NUTCH-2840 Fix 'report-vulnerabilities' ant target in build.xml
---
 .gitignore  |  2 ++
 build.xml   | 46 ++---
 ivy/dependency-check-ant/lib/.gitignore | 19 ++
 3 files changed, 52 insertions(+), 15 deletions(-)

diff --git a/.gitignore b/.gitignore
index 6d96644..0612a99 100644
--- a/.gitignore
+++ b/.gitignore
@@ -25,3 +25,5 @@ naivebayes-model
 *.iml
 *.swp
 csvindexwriter
+lib/spotbugs-*
+ivy/dependency-check-ant/*
diff --git a/build.xml b/build.xml
index 882a54a..02a7cdd 100644
--- a/build.xml
+++ b/build.xml
@@ -37,9 +37,11 @@
   
   
 
-  
+  
+  
+  
 
-  
+  
 
   
   
@@ -646,24 +648,38 @@
   
 
   
-  
-  
-  
-  
-  
-
-
+  
+
+
+  
+
+  
+https://github.com/jeremylong/DependencyCheck/releases/download/v${dependency-check-ant.version}/dependency-check-ant-${dependency-check-ant.version}-release.zip;
+ 
dest="${ivy.dir}/dependency-check-ant-${dependency-check-ant.version}-release.zip"
 usetimestamp="false" />
+
+
+
+
+
+  
+
+  
+
+
   
 
   
-  
-
-  
-  
+
+  
+
+  
+
 
-
+
 
 
   
diff --git a/ivy/dependency-check-ant/lib/.gitignore b/ivy/dependency-check-ant/lib/.gitignore
new file mode 100644
index 000..e2dec72
--- /dev/null
+++ b/ivy/dependency-check-ant/lib/.gitignore
@@ -0,0 +1,19 @@
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# Ignore everything in this directory
+*
+# Except this file
+!.gitignore



[nutch] branch master updated: NUTCH-2819 Move spotbugs "installation" directory to avoid that spotbugs is shipped in Nutch runtime (#565)

2021-01-31 Thread lewismc
This is an automated email from the ASF dual-hosted git repository.

lewismc pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git


The following commit(s) were added to refs/heads/master by this push:
 new cc0da7e  NUTCH-2819 Move spotbugs "installation" directory to avoid 
that spotbugs is shipped in Nutch runtime (#565)
cc0da7e is described below

commit cc0da7e860723f7b8e89429a8f1f11551ecf118f
Author: Sebastian Nagel 
AuthorDate: Mon Feb 1 01:05:27 2021 +0100

NUTCH-2819 Move spotbugs "installation" directory to avoid that spotbugs is 
shipped in Nutch runtime (#565)

- install spotbugs into to ivy/spotbugs-x.x.x/
- upgrade to Spotbugs 4.2.0
- move task definition into spotbugs target, otherwise
  running download/installation and bug spotting together fails
---
 .gitignore |  1 +
 build.xml  | 19 +--
 2 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/.gitignore b/.gitignore
index 249ca77..6d96644 100644
--- a/.gitignore
+++ b/.gitignore
@@ -11,6 +11,7 @@ ivy/ivy-2.3.0.jar
 ivy/ivy-2.4.0.jar
 ivy/ivy-2.5.0-rc1.jar
 ivy/ivy-2.5.0.jar
+ivy/spotbugs-*/
 naivebayes-model
 .naivebayes-model.crc
 .gitconfig
diff --git a/build.xml b/build.xml
index 68a0f44..882a54a 100644
--- a/build.xml
+++ b/build.xml
@@ -41,8 +41,8 @@
 
   
 
-  
-  
+  
+  
   
 
   
@@ -1066,20 +1066,19 @@
   
 https://github.com/spotbugs/spotbugs/releases/download/${spotbugs.version}/spotbugs-${spotbugs.version}.tgz
 "
- dest="${basedir}/lib/spotbugs-${spotbugs.version}.tgz" usetimestamp="false" />
+ dest="${ivy.dir}/spotbugs-${spotbugs.version}.tgz" usetimestamp="false" />
 
-
+
 
 
-
+
   
 
-  
-
   
+
 

[nutch] branch master updated: Prepare for Nutch 1.19-SNAPSHOT development

2021-01-25 Thread lewismc
This is an automated email from the ASF dual-hosted git repository.

lewismc pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git


The following commit(s) were added to refs/heads/master by this push:
 new ebf348c  Prepare for Nutch 1.19-SNAPSHOT development
ebf348c is described below

commit ebf348cc6ec88a15ca0243c12fe18c31157ede89
Author: Lewis John McGibbney 
AuthorDate: Mon Jan 25 20:05:00 2021 -0800

Prepare for Nutch 1.19-SNAPSHOT development
---
 CHANGES.txt| 49 +++--
 NOTICE.txt |  2 +-
 build.xml  | 25 -
 conf/nutch-default.xml |  2 +-
 default.properties |  4 ++--
 ivy/mvn.template   | 12 +++-
 src/bin/nutch  |  2 +-
 7 files changed, 79 insertions(+), 17 deletions(-)

diff --git a/CHANGES.txt b/CHANGES.txt
index e5c5984..9946bc9 100644
--- a/CHANGES.txt
+++ b/CHANGES.txt
@@ -1,10 +1,55 @@
 # Nutch Change Log
 
-Nutch 1.18 Development
+Nutch 1.18 Release 14/01/2021 (dd/mm/yyyy)
+Release Report: https://s.apache.org/lqara
 
 Breaking Changes
 
--  As part of NUTCH-2805, the plugin urlfilter-domainblacklist has been renamed to urlfilter-domaindenylist. And the fields required for the plugin urlfilter.domainblacklist.rules and urlfilter.domainblacklist.file has been replaced with urlfilter.domaindenylist.rules and urlfilter.domaindenylist.file respectively. See NUTCH-2802 for more details.
+- As part of NUTCH-2805, the plugin urlfilter-domainblacklist has been renamed to urlfilter-domaindenylist. And the fields required for the plugin urlfilter.domainblacklist.rules and urlfilter.domainblacklist.file has been replaced with urlfilter.domaindenylist.rules and urlfilter.domaindenylist.file respectively. See NUTCH-2802 for more details.
+
+Sub-task
+
+[NUTCH-2671] - Upgrade ant ivy library
+[NUTCH-2672] - Ant build erronously installs *-test.jar instead *.jar for target "nightly"
+[NUTCH-2805] - Rename plugin urlfilter-domainblacklist
+[NUTCH-2809] - Upgrade any23 plugin dependency to 2.4
+[NUTCH-2816] - Add Spotbugs target to ant build
+[NUTCH-2817] - Avoid check for equality of URL path and file part using ==/!=
+[NUTCH-2829] - Fix ant target "clean-cache"
+
+Bug
+
+[NUTCH-2669] - Reliable solution for javax.ws packaging.type
+[NUTCH-2697] - Upgrade Ivy to fix the issue of an unset packaging.type property
+[NUTCH-2801] - RobotsRulesParser command-line checker to use http.robots.agents as fall-back
+[NUTCH-2810] - FreeGenerator to actually apply configured number of fetch lists
+[NUTCH-2813] - MoreIndexingFilter - can't parse erroneous date - 2019-07-03T10:28:14
+[NUTCH-2814] - HttpDateFormat's internal time zone may change after parsing a date
+[NUTCH-2818] - Ant build: upgrade Apache Rat report task
+[NUTCH-2823] - IllegalStateException in IndexWriters.describe() when validating url param for SolrIndexer
+[NUTCH-2824] - urlnormalizer-basic to unescape percent-encoded host names
+
+Improvement
+
+[NUTCH-1190] - MoreIndexingFilter refactor: move data formats used to parse "lastModified" to a config file.
+[NUTCH-2582] - Set pool size of XML SAX parsers used for MIME detection in Tika 1.19
+[NUTCH-2730] - SitemapProcessor to treat sitemap URLs as Set instead of List
+[NUTCH-2782] - protocol-http / lib-http: support TLSv1.3
+[NUTCH-2796] - Upgrade to crawler-commons 1.1
+[NUTCH-2799] - Add .asf.yaml file
+[NUTCH-2833] - Upgrade to Tika 1.25
+[NUTCH-2835] - Upgrade commons-jexl from 2 --> 3
+[NUTCH-2836] - Upgrade various commons dependencies
+[NUTCH-2837] - Update multiple dependencies
+[NUTCH-2841] - Upgrade xercesImpl dependency
+
+Wish
+
+[NUTCH-2834] - Deduplication mode via command line in crawl script
+
+Task
+
+[NUTCH-2830] - Upgrade any23 to v2.4
 
 Nutch 1.17 Release 18/06/2020 (dd/mm/yyyy)
 Release Report: https://s.apache.org/ovhry
diff --git a/NOTICE.txt b/NOTICE.txt
index 71f29fa..1c9efd0 100644
--- a/NOTICE.txt
+++ b/NOTICE.txt
@@ -1,5 +1,5 @@
 Apache Nutch
-Copyright 2020 The Apache Software Foundation
+Copyright 2021 The Apache Software Foundation
 
 This product includes software developed by The Apache Software
 Foundation (http://www.apache.org/).
diff --git a/build.xml b/build.xml
index 62ed5d1..68a0f44 100644
--- a/build.xml
+++ b/build.xml
@@ -37,6 +37,8 @@
   
   
 
+  
+
   
 
   
@@ -311,8 +313,9 @@
   
 
   
-  
-
+  
+
+
 
 
 
@@ -321,8 +324,9 @@
   
 
   
-  
-
+  
+
+
 
 
 
@@ -332,8 +336,9 @@
   
 
   
-  
-
+  
+
+
 
 
 
@@ -362,10 +367,12 @@
   
 
 
-
-  
+
+  
+  
+  
   
-  
+  
   

svn commit: r45580 - /release/nutch/1.17/

2021-01-24 Thread lewismc
Author: lewismc
Date: Sun Jan 24 21:09:56 2021
New Revision: 45580

Log:
Remove Nutch 1.17 from release area


Removed:
release/nutch/1.17/



svn commit: r1885887 - in /nutch/cms_site/trunk/content: ./ apidocs/apidocs-1.18/ apidocs/apidocs-1.18/org/ apidocs/apidocs-1.18/org/apache/ apidocs/apidocs-1.18/org/apache/nutch/ apidocs/apidocs-1.18

2021-01-24 Thread lewismc
Author: lewismc
Date: Sun Jan 24 20:45:18 2021
New Revision: 1885887

URL: http://svn.apache.org/viewvc?rev=1885887&view=rev
Log:
Update Nutch CMS website for 1.18


[This commit notification would consist of 260 parts, 
which exceeds the limit of 50 ones, so it was shortened to the summary.]


svn commit: r45570 - in /release/nutch/1.18: apache-nutch-1.18-bin.zip.md5 apache-nutch-1.18-src.tar.gz.md5 apache-nutch-1.18-src.zip.md5

2021-01-24 Thread lewismc
Author: lewismc
Date: Sun Jan 24 20:04:00 2021
New Revision: 45570

Log:
Remove Nutch 1.18 .md5 artifacts


Removed:
release/nutch/1.18/apache-nutch-1.18-bin.zip.md5
release/nutch/1.18/apache-nutch-1.18-src.tar.gz.md5
release/nutch/1.18/apache-nutch-1.18-src.zip.md5



svn commit: r45569 - /release/nutch/1.18/apache-nutch-1.18-bin.tar.gz.md5

2021-01-24 Thread lewismc
Author: lewismc
Date: Sun Jan 24 20:02:57 2021
New Revision: 45569

Log:
Remove .md5


Removed:
release/nutch/1.18/apache-nutch-1.18-bin.tar.gz.md5



svn commit: r45568 - /dev/nutch/1.18/ /release/nutch/1.18/

2021-01-24 Thread lewismc
Author: lewismc
Date: Sun Jan 24 20:02:19 2021
New Revision: 45568

Log:
Publish Nutch 1.18 to release area.


Added:
release/nutch/1.18/
  - copied from r45567, dev/nutch/1.18/
Removed:
dev/nutch/1.18/



svn commit: r45520 [3/3] - /dev/nutch/1.18/

2021-01-20 Thread lewismc
Added: dev/nutch/1.18/apache-nutch-1.18-bin.tar.gz
==
Binary file - no diff available.

Propchange: dev/nutch/1.18/apache-nutch-1.18-bin.tar.gz
--
svn:mime-type = application/octet-stream

Added: dev/nutch/1.18/apache-nutch-1.18-bin.tar.gz.asc
==
--- dev/nutch/1.18/apache-nutch-1.18-bin.tar.gz.asc (added)
+++ dev/nutch/1.18/apache-nutch-1.18-bin.tar.gz.asc Thu Jan 21 00:44:46 2021
@@ -0,0 +1,16 @@
+-----BEGIN PGP SIGNATURE-----
+
+iQIzBAABCAAdFiEE23tRmRIcCKXI9AUrOkcX8Ei66/YFAmAHg5oACgkQOkcX8Ei6
+6/bhThAAu2KbUKiuaqomM4M+Kl9QLLfKU5fqwl5ffQ9I4ZOC/yWqaqpJbOriPNvX
+2t4hDpTEKFA6yJTE1DggxxTLugsCSNYapRQc1ZBCf2gcPEoGEbDdMIxDyvZsZeQ0
+/XSDqP+OOFbX/Ggpl6MsJjO+1dM1Xn/QpRkAG65aW4rP2b0xR0gO3Uv9yonld1Fr
+jrGbarItalZmKhuvlWQKidOYmpXeuIs1rQ0MBHgWbFVpgo/cLxbNRSk71nIrZKia
+CAWMVQx43CuukqvjSwBTbrb04lI3I2F6PMC8pIiQPcXhCi8oHSrZ11I2TOaw4LnC
+0WGN0qgQb/fJuI8nqCfOqaJY254r+Gy01BPO+boDH5XdcQy/OhlTm4smKaFOmACv
+KoY0Y/lpf3eWumn51saMGjzpkYRTGB/p8zkEOmYfIUoLDT8MdMDfTZzkxn7lYiw/
+eGJvv6hD+pPksvNQdIFa3yydTEVsWST+z2jvsUK2gI8TwUUlp9JR63NMNijg4sN9
+JW64TjAopuWQrciuq1mGcTAGK7b/uxdmGk4NSX76cHFbRu6J8FOPIBnY2IF3rx03
+30UPu9c9SI0dokzsTLNNSxnXmN5LGawZ1tmqi7SiL6kOARKWljNzu06C5ZslYPLZ
+eHJLzne64g6FuvkslIotkoEPVZ/fS3UetHh14jSr5QzYZ7KAyN8=
+=sEcc
+-----END PGP SIGNATURE-----

Added: dev/nutch/1.18/apache-nutch-1.18-bin.tar.gz.md5
==
--- dev/nutch/1.18/apache-nutch-1.18-bin.tar.gz.md5 (added)
+++ dev/nutch/1.18/apache-nutch-1.18-bin.tar.gz.md5 Thu Jan 21 00:44:46 2021
@@ -0,0 +1 @@
+MD5(apache-nutch-1.18-bin.tar.gz)= 6f024cd88ee098340f0667125ad0578c

Added: dev/nutch/1.18/apache-nutch-1.18-bin.tar.gz.sha512
==
--- dev/nutch/1.18/apache-nutch-1.18-bin.tar.gz.sha512 (added)
+++ dev/nutch/1.18/apache-nutch-1.18-bin.tar.gz.sha512 Thu Jan 21 00:44:46 2021
@@ -0,0 +1 @@
+SHA512(apache-nutch-1.18-bin.tar.gz)= f52d97a98c1facfb4586344068be8d57ff961abcfa8f4b416bde8d568fb5c44e78f9ecb5e4afa067b1162f01599fa52401d5eb31812f01c18e4fae5229968ff0
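
The `.md5` files were subsequently removed from the release area (r45569, r45570), which leaves the `.sha512` files as the checksums to compare a download against. A minimal sketch of computing such a digest with the JDK's `MessageDigest` follows. `Sha512` and `hex` are hypothetical names, and the sketch hashes an in-memory byte array rather than streaming a release archive; it is an illustration, not the procedure used to produce the files above.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical illustration: lower-case hex SHA-512 digest of a byte array. */
final class Sha512 {

  private Sha512() {}

  static String hex(byte[] data) {
    try {
      MessageDigest md = MessageDigest.getInstance("SHA-512");
      byte[] digest = md.digest(data);
      StringBuilder sb = new StringBuilder(digest.length * 2);
      for (byte b : digest) {
        // %02x renders each byte as two lower-case hex digits.
        sb.append(String.format("%02x", b));
      }
      return sb.toString();
    } catch (NoSuchAlgorithmException e) {
      // Assumption: the running JDK ships a SHA-512 provider.
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(hex("abc".getBytes(StandardCharsets.US_ASCII)));
  }
}
```

To verify a real download, the bytes of the archive would be fed to the digest (for example in chunks via `MessageDigest.update`) and the resulting hex string compared with the value in the matching `.sha512` file.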

Added: dev/nutch/1.18/apache-nutch-1.18-bin.zip
==
Binary file - no diff available.

Propchange: dev/nutch/1.18/apache-nutch-1.18-bin.zip
--
svn:mime-type = application/octet-stream

Added: dev/nutch/1.18/apache-nutch-1.18-bin.zip.asc
==
--- dev/nutch/1.18/apache-nutch-1.18-bin.zip.asc (added)
+++ dev/nutch/1.18/apache-nutch-1.18-bin.zip.asc Thu Jan 21 00:44:46 2021
@@ -0,0 +1,16 @@
+-----BEGIN PGP SIGNATURE-----
+
+iQIzBAABCAAdFiEE23tRmRIcCKXI9AUrOkcX8Ei66/YFAmAHg6YACgkQOkcX8Ei6
+6/YUmxAAgckunK+2hpSpUzEQ2r2hnRvFt+jq8fqpBAPSjh6H9lnn7zBEK/aJHRgb
+zskF4ZtATfkCmxHC5JYYA3noOwvjz/cSbNEYCXF8bngBUW02CtiZEXAHSIr2aPVD
+4HuplylbdZ1ihlhSRiTKAzqA1f3LaGRR+Kpw4ag/eSrPquBeN+VSl+8ThvszJlwN
+btZdOFshOYYkV6dVgI17Qp2rYY/XwUG/crBTlOIV9HBASYXs2sxFpkuUIlI60N8m
+KYbVlxngFtCaNaBii76qh2mLCTD4SSN48XjY4cvD2bJOlCwdEGXRuAo+NB8auuTp
+qdqMG4/3upgpsxeCErq5fTbgqR7weyPQAx3kelw9T/rM86YyXRva9pVs0mk9JxHv
+yi5LcqrjnhD2xVa26vMQacfvVkBw8ev1a/Gahv8Xq1B1YzAn8YpTqsb1kC/nWKVe
+1fD5KZPwYDBCGI0/puwXin90Y0jZ/D4xuI0sP/M5ZZ8fQuYWV3JGReI6+vH9KGha
+x5jjfXMQ7k5BVFZA7DmirdW5IGfoHJLT7sRo0dTWMRTevlNC02TMp5jf62LbzZtW
+dW6Nw1DPGcTmdHR2Cob8zgPwRV8iDoM2yj0290zcw5h59JDqyhhg9yG7kt23TAQS
+xrlxTT76dTGrgtB3QsuR7uWf3xmNKFah9aeGdxb+j9cb2PWbZDY=
+=BHRm
+-----END PGP SIGNATURE-----

Added: dev/nutch/1.18/apache-nutch-1.18-bin.zip.md5
==
--- dev/nutch/1.18/apache-nutch-1.18-bin.zip.md5 (added)
+++ dev/nutch/1.18/apache-nutch-1.18-bin.zip.md5 Thu Jan 21 00:44:46 2021
@@ -0,0 +1 @@
+MD5(apache-nutch-1.18-bin.zip)= 4563aa7c3216078ede022d4f182f48be

Added: dev/nutch/1.18/apache-nutch-1.18-bin.zip.sha512
==
--- dev/nutch/1.18/apache-nutch-1.18-bin.zip.sha512 (added)
+++ dev/nutch/1.18/apache-nutch-1.18-bin.zip.sha512 Thu Jan 21 00:44:46 2021
@@ -0,0 +1 @@
+SHA512(apache-nutch-1.18-bin.zip)= be681ff067691d680669ca3fd84ca9f8c86d3d3ed04ab9c7b65b11eeb45c19d324820f24cd9f80db0d4b82034e9993c8412f2ac9a7d9943b262a03bb86f41595

Added: dev/nutch/1.18/apache-nutch-1.18-src.tar.gz
==
Binary file - no diff available.

Propchange: dev/nutch/1.18/apache-nutch-1.18-src.tar.gz
--
svn:mime-type = application/octet-stream

Added: dev/nutch/1.18/apache-nutch-1.18-src.tar.gz.asc

svn commit: r45520 [2/3] - /dev/nutch/1.18/

2021-01-20 Thread lewismc
ue in tika 
mimetype detection
+[NUTCH-2224] - Average bytes/second calculated incorrectly in fetcher
+[NUTCH-2225] - Parsed time calculated incorrectly
+[NUTCH-2228] - Plugin index-replace unit test broken on Java 8
+[NUTCH-2232] - DeduplicationJob should decode URL's before length is 
compared
+[NUTCH-2241] - Unstable Selenium plugin in Nutch. Fixed bugs and enhanced 
configuration
+[NUTCH-2256] - Inconsistent log level practice
+
+Improvement
+
+[NUTCH-1233] - Rely on Tika for outlink extraction
+[NUTCH-1712] - Use MultipleInputs in Injector to make it a single 
mapreduce job
+[NUTCH-2172] - index-more: document format of contenttype-mapping.txt
+[NUTCH-2178] - DeduplicationJob to optionally group on host or domain
+[NUTCH-2182] - Make reverseUrlDirs file dumper option hash the URL for 
consistency
+[NUTCH-2183] - Improvement to SegmentChecker for skipping non-segments 
present in segments directory
+[NUTCH-2187] - Change FileDumper SHAs to all uppercase
+[NUTCH-2195] - IndexingFilterChecker to optionally follow N redirects
+[NUTCH-2196] - IndexingFilterChecker to optionally normalize
+[NUTCH-2197] - Add solr5 solrcloud indexer support
+[NUTCH-2204] - Remove junit lib from runtime
+[NUTCH-2218] - Switch CrawlCompletion arg parsing to Commons CLI
+[NUTCH-2221] - Introduce db.ignore.internal.links to FetcherThread
+[NUTCH-2229] - Allow Jexl expressions on CrawlDatum's fixed attributes
+[NUTCH-2231] - Jexl support in generator job
+[NUTCH-2252] - Allow phantomjs as a browser for selenium options
+[NUTCH-2263] - Support for mingram and maxgram at Unigram Cosine 
Similarity Model
+
+New Feature
+
+[NUTCH-961] - Expose Tika's boilerpipe support
+[NUTCH-1325] - HostDB for Nutch
+[NUTCH-2144] - Plugin to override db.ignore.external to exempt interesting 
external domain URLs
+[NUTCH-2190] - Protocol normalizer
+[NUTCH-2191] - Add protocol-htmlunit
+[NUTCH-2194] - Run IndexingFilterChecker as simple Telnet server
+[NUTCH-2219] - Criteria order to be configurable in DeduplicationJob
+[NUTCH-2227] - RegexParseFilter
+[NUTCH-2245] - Developed the NGram Model on the existing Unigram Cosine 
Similarity Model
+
+Task
+
+[NUTCH-2201] - Remove loops program from webgraph package
+[NUTCH-2211] - Filter and normalizer checkers missing in bin/nutch
+[NUTCH-2220] - Rename db.* options used only by the linkdb to linkdb.*
+
+Nutch 1.11 Release 03/12/2015 (dd/mm/yyyy)
+Release Report: http://s.apache.org/nutch11
+
+* NUTCH-2176 Clean up of log4j.properties (markus)
+
+* NUTCH-2107 plugin.xml to validate against plugin.dtd (snagel)
+
+* NUTCH-2177 Generator produces only one partition even in distributed mode 
(jnioche, snagel)
+
+* NUTCH-2158 Upgrade to Tika 1.11 (jnioche, snagel)
+
+* NUTCH-2175 Typos in property descriptions in nutch-default.xml (Roannel 
Fernández Hernández via snagel)
+
+* NUTCH-2069 Ignore external links based on domain (jnioche)
+
+* NUTCH-2173 String.join in FileDumper breaks the build (joyce)
+
+* NUTCH-2166 Add reverse URL format to dump tool (joyce)
+
+* NUTCH-2157 Addressing Miredot REST API Warnings (Sujen Shah)
+
+* NUTCH-2165 FileDumper Util hard codes part-# folder name (joyce)
+
+* NUTCH-2167 Backport TableUtil from 2.x for URL reversing (joyce)
+
+* NUTCH-2160 Upgrade Selenium Java to 2.48.2 (lewismc, kwhitehall)
+
+* NUTCH-2120 Remove MapWritable from trunk codebase (lewismc)
+
+* NUTCH-1911 Improve DomainStatistics tool command line parsing (joyce)
+
+* NUTCH-2064 URLNormalizer basic to encode reserved chars and decode 
non-reserved chars (markus, snagel)
+
+* NUTCH-2159 Ensure that all WebApp files are copied into generated artifacts 
for 1.X Webapp (lewismc)
+
+* NUTCH-2154 Nutch REST API (DB) suffering NullPointerException (Aron Ahmadia, 
Sujen Shah via mattmann)
+
+* NUTCH-2150 Add protocolstats utility (Michael Joyce via mattmann)
+
+* NUTCH-2146 hashCode on the Outlink class (jorgelbg via mattmann)
+
+* NUTCH-2155 Create a "crawl completeness" utility (Michael Joyce via mattmann)
+
+* NUTCH-1988 Make nested output directory dump optional... again (Michael 
Joyce via lewismc)
+
+* NUTCH-1800 Documentation for Nutch 1.X and 2.X REST APIs (lewismc)
+
+* NUTCH-2149 REST endpoint to read Nutch sequence files (Sujen Shah)
+
+* NUTCH-2139 Basic plugin to index inlinks and outlinks (jorgelbg)
+
+* NUTCH-2128 Review and update mapred --> mapreduce config params in crawl 
script (lewismc)
+
+* NUTCH-2141 Change the InteractiveSelenium plugin handler Interface to return 
page content
+  (Balaji Gurumurthy via mattmann)
+
+* NUTCH-2129 Add protocol status tracking to crawl datum (Michael Joyce via 
mattmann)
+
+* NUTCH-2142 Nutch File Dump - FileNotFoundException (Invalid Argument) Error 
(Karanjeet Singh via mattmann)
+
+* NUTCH-2136 Implement a different version of Naive Bayes Parse Filter 
(Asitang Mishra)
+
+* NUTCH-2109 Create a brute fo

svn commit: r45520 [1/3] - /dev/nutch/1.18/

2021-01-20 Thread lewismc
Author: lewismc
Date: Thu Jan 21 00:44:46 2021
New Revision: 45520

Log:
Stage Apache Nutch 1.18 RC#1

Added:
dev/nutch/1.18/
dev/nutch/1.18/CHANGES.txt
dev/nutch/1.18/apache-nutch-1.18-bin.tar.gz   (with props)
dev/nutch/1.18/apache-nutch-1.18-bin.tar.gz.asc
dev/nutch/1.18/apache-nutch-1.18-bin.tar.gz.md5
dev/nutch/1.18/apache-nutch-1.18-bin.tar.gz.sha512
dev/nutch/1.18/apache-nutch-1.18-bin.zip   (with props)
dev/nutch/1.18/apache-nutch-1.18-bin.zip.asc
dev/nutch/1.18/apache-nutch-1.18-bin.zip.md5
dev/nutch/1.18/apache-nutch-1.18-bin.zip.sha512
dev/nutch/1.18/apache-nutch-1.18-src.tar.gz   (with props)
dev/nutch/1.18/apache-nutch-1.18-src.tar.gz.asc
dev/nutch/1.18/apache-nutch-1.18-src.tar.gz.md5
dev/nutch/1.18/apache-nutch-1.18-src.tar.gz.sha512
dev/nutch/1.18/apache-nutch-1.18-src.zip   (with props)
dev/nutch/1.18/apache-nutch-1.18-src.zip.asc
dev/nutch/1.18/apache-nutch-1.18-src.zip.md5
dev/nutch/1.18/apache-nutch-1.18-src.zip.sha512



[nutch] annotated tag release-1.18 updated (43f3550 -> a8ef299)

2021-01-19 Thread lewismc
This is an automated email from the ASF dual-hosted git repository.

lewismc pushed a change to annotated tag release-1.18
in repository https://gitbox.apache.org/repos/asf/nutch.git.


*** WARNING: tag release-1.18 was modified! ***

from 43f3550  (commit)
  to a8ef299  (tag)
 tagging 43f3550c1adef70a0acd9938737c5c3f899bc2be (commit)
 replaces release-1.13
  by Lewis John McGibbney
  on Tue Jan 19 15:36:39 2021 -0800

- Log -
Apache Nutch 1.18 RC#1 Tag
---


No new revisions were added by this update.

Summary of changes:



[nutch] branch branch-1.18 updated: Prepare for Nutch 1.18 release

2021-01-19 Thread lewismc
This is an automated email from the ASF dual-hosted git repository.

lewismc pushed a commit to branch branch-1.18
in repository https://gitbox.apache.org/repos/asf/nutch.git


The following commit(s) were added to refs/heads/branch-1.18 by this push:
 new 43f3550  Prepare for Nutch 1.18 release
43f3550 is described below

commit 43f3550c1adef70a0acd9938737c5c3f899bc2be
Author: Lewis John McGibbney 
AuthorDate: Tue Jan 19 15:33:48 2021 -0800

Prepare for Nutch 1.18 release
---
 build.xml| 25 -
 ivy/mvn.template | 12 +++-
 2 files changed, 27 insertions(+), 10 deletions(-)

diff --git a/build.xml b/build.xml
index 62ed5d1..1d71bc2 100644
--- a/build.xml
+++ b/build.xml
@@ -37,6 +37,8 @@
   
   
 
+  
+
   
 
   
@@ -311,8 +313,9 @@
   
 
   
-  
-
+  
+   
+
 
 
 
@@ -321,8 +324,9 @@
   
 
   
-  
-
+  
+   
+
 
 
 
@@ -332,8 +336,9 @@
   
 
   
-  
-
+  
+   
+
 
 
 
@@ -362,10 +367,12 @@
   
 
 
-
-  
+
+  
+  
+  
   
-  
+  
   

[nutch] branch branch-1.18 created (now e9f125c)

2021-01-14 Thread lewismc
This is an automated email from the ASF dual-hosted git repository.

lewismc pushed a change to branch branch-1.18
in repository https://gitbox.apache.org/repos/asf/nutch.git.


  at e9f125c  Prepare for Nutch 1.18 release

This branch includes the following new commits:

 new e9f125c  Prepare for Nutch 1.18 release

The 1 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.




[nutch] 01/01: Prepare for Nutch 1.18 release

2021-01-14 Thread lewismc
This is an automated email from the ASF dual-hosted git repository.

lewismc pushed a commit to branch branch-1.18
in repository https://gitbox.apache.org/repos/asf/nutch.git

commit e9f125c62ae71903187959351b0f72da29937749
Author: Lewis John McGibbney 
AuthorDate: Thu Jan 14 15:27:00 2021 -0800

Prepare for Nutch 1.18 release
---
 CHANGES.txt| 50 --
 NOTICE.txt |  2 +-
 conf/nutch-default.xml |  2 +-
 default.properties |  4 ++--
 src/bin/nutch  |  2 +-
 5 files changed, 53 insertions(+), 7 deletions(-)

diff --git a/CHANGES.txt b/CHANGES.txt
index e5c5984..0613585 100644
--- a/CHANGES.txt
+++ b/CHANGES.txt
@@ -1,10 +1,56 @@
 # Nutch Change Log
 
-Nutch 1.18 Development
+Nutch 1.18 Release 14/01/2021 (dd/mm/yyyy)
+Release Report: https://s.apache.org/lqara
 
 Breaking Changes
 
--  As part of NUTCH-2805, the plugin urlfilter-domainblacklist has been 
renamed to urlfilter-domaindenylist. And the fields required for the plugin 
urlfilter.domainblacklist.rules and urlfilter.domainblacklist.file has been 
replaced with urlfilter.domaindenylist.rules and urlfilter.domaindenylist.file 
respectively. See NUTCH-2802 for more details.
+- As part of NUTCH-2805, the plugin urlfilter-domainblacklist has been 
renamed to urlfilter-domaindenylist. And the fields required for the plugin 
urlfilter.domainblacklist.rules and urlfilter.domainblacklist.file has been 
replaced with urlfilter.domaindenylist.rules and urlfilter.domaindenylist.file 
respectively. See NUTCH-2802 for more details.
+
+Sub-task
+
+[NUTCH-2671] - Upgrade ant ivy library
+[NUTCH-2672] - Ant build erronously installs *-test.jar instead *.jar for 
target "nightly"
+[NUTCH-2805] - Rename plugin urlfilter-domainblacklist
+[NUTCH-2809] - Upgrade any23 plugin dependency to 2.4
+[NUTCH-2816] - Add Spotbugs target to ant build
+[NUTCH-2817] - Avoid check for equality of URL path and file part using 
==/!=
+[NUTCH-2829] - Fix ant target "clean-cache"
+
+Bug
+
+[NUTCH-2669] - Reliable solution for javax.ws packaging.type
+[NUTCH-2697] - Upgrade Ivy to fix the issue of an unset packaging.type 
property
+[NUTCH-2801] - RobotsRulesParser command-line checker to use 
http.robots.agents as fall-back
+[NUTCH-2810] - FreeGenerator to actually apply configured number of fetch 
lists
+[NUTCH-2813] - MoreIndexingFilter - can't parse erroneous date - 
2019-07-03T10:28:14
+[NUTCH-2814] - HttpDateFormat's internal time zone may change after 
parsing a date
+[NUTCH-2818] - Ant build: upgrade Apache Rat report task
+[NUTCH-2823] - IllegalStateException in IndexWriters.describe() when 
validating url param for SolrIndexer
+[NUTCH-2824] - urlnormalizer-basic to unescape percent-encoded host names
+
+Improvement
+
+[NUTCH-1190] - MoreIndexingFilter refactor: move data formats used to 
parse "lastModified" to a config file.
+[NUTCH-2582] - Set pool size of XML SAX parsers used for MIME detection in 
Tika 1.19
+[NUTCH-2730] - SitemapProcessor to treat sitemap URLs as Set instead of 
List
+[NUTCH-2782] - protocol-http / lib-http: support TLSv1.3
+[NUTCH-2796] - Upgrade to crawler-commons 1.1
+[NUTCH-2799] - Add .asf.yaml file
+[NUTCH-2833] - Upgrade to Tika 1.25
+[NUTCH-2835] - Upgrade commons-jexl from 2 --> 3
+[NUTCH-2836] - Upgrade various commons dependencies
+[NUTCH-2837] - Update multiple dependencies
+[NUTCH-2841] - Upgrade xercesImpl dependency
+
+Wish
+
+[NUTCH-2834] - Deduplication mode via command line in crawl script
+
+Task
+
+[NUTCH-2830] - Upgrade any23 to v2.4
+
 
 Nutch 1.17 Release 18/06/2020 (dd/mm/yyyy)
 Release Report: https://s.apache.org/ovhry
diff --git a/NOTICE.txt b/NOTICE.txt
index 71f29fa..1c9efd0 100644
--- a/NOTICE.txt
+++ b/NOTICE.txt
@@ -1,5 +1,5 @@
 Apache Nutch
-Copyright 2020 The Apache Software Foundation
+Copyright 2021 The Apache Software Foundation
 
 This product includes software developed by The Apache Software
 Foundation (http://www.apache.org/).
diff --git a/conf/nutch-default.xml b/conf/nutch-default.xml
index 6932eb5..df6916b 100644
--- a/conf/nutch-default.xml
+++ b/conf/nutch-default.xml
@@ -164,7 +164,7 @@
 
 
   http.agent.version
-  Nutch-1.18-SNAPSHOT
+  Nutch-1.18
   A version string to advertise in the User-Agent 
header.
 
diff --git a/default.properties b/default.properties
index e4b9619..fdb35b9 100644
--- a/default.properties
+++ b/default.properties
@@ -14,9 +14,9 @@
 # limitations under the License.
 
 name=apache-nutch
-version=1.18-SNAPSHOT
+version=1.18
 final.name=${name}-${version}
-year=2020
+year=2021
 
 basedir = ./
 src.dir = ./src/java
diff --git a/src/bin/nutch b/src/bin/nutch
index 7d0d8ee..c501ea5 100755
--- a/src/bin/nutch
+++ b/src/bin/nutch
@@ -60,7 +60,7 @@ done
 
 # if no args specified, show usage
 if [ $# = 0 ]; then
-  echo "nutch 1.18-SNAPSHOT"

[nutch] branch master updated: NUTCH-2841 Upgrade xercesImpl dependency (#563)

2021-01-13 Thread lewismc
This is an automated email from the ASF dual-hosted git repository.

lewismc pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git


The following commit(s) were added to refs/heads/master by this push:
 new 59c63c7  NUTCH-2841 Upgrade xercesImpl dependency (#563)
59c63c7 is described below

commit 59c63c7d8a13b0de1fd1da6aa4a1ab6e20fa478d
Author: Lewis John McGibbney 
AuthorDate: Wed Jan 13 10:56:07 2021 -0800

NUTCH-2841 Upgrade xercesImpl dependency (#563)

* NUTCH-2841 Upgrade xercesImpl dependency
---
 ivy/ivy.xml | 2 +-
 src/java/org/apache/nutch/tools/DmozParser.java | 3 +++
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/ivy/ivy.xml b/ivy/ivy.xml
index ad1e65f..3f1faf3 100644
--- a/ivy/ivy.xml
+++ b/ivy/ivy.xml
@@ -66,7 +66,7 @@

 

-   
+   
 

 
diff --git a/src/java/org/apache/nutch/tools/DmozParser.java 
b/src/java/org/apache/nutch/tools/DmozParser.java
index 63dbde8..a447646 100644
--- a/src/java/org/apache/nutch/tools/DmozParser.java
+++ b/src/java/org/apache/nutch/tools/DmozParser.java
@@ -276,8 +276,11 @@ public class DmozParser {
   throws IOException, SAXException, ParserConfigurationException {
 
 SAXParserFactory parserFactory = SAXParserFactory.newInstance();
+parserFactory.setFeature("http://xml.org/sax/features/external-general-entities", false);
+parserFactory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
 SAXParser parser = parserFactory.newSAXParser();
 XMLReader reader = parser.getXMLReader();
+reader.setFeature("http://xml.org/sax/features/external-general-entities", false);
 
 // Create our own processor to receive SAX events
 RDFProcessor rp = new RDFProcessor(reader, subsetDenom, includeAdult, skew,
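
The two factory features set in this commit can be tried in isolation. The following is a minimal sketch, not the DmozParser code itself: the class name HardenedSaxDemo and the helper methods are hypothetical, and it assumes the JDK's built-in SAX implementation, which recognises both feature URIs. With DOCTYPE declarations disallowed, a document that declares an entity should be rejected before any entity is expanded.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; only the two feature URIs come from the commit.
public class HardenedSaxDemo {

  /** Builds a SAX parser with the two features the commit sets on the factory. */
  static SAXParser newHardenedParser()
      throws ParserConfigurationException, SAXException {
    SAXParserFactory factory = SAXParserFactory.newInstance();
    // Do not resolve external general entities.
    factory.setFeature(
        "http://xml.org/sax/features/external-general-entities", false);
    // Treat any DOCTYPE declaration as a fatal error.
    factory.setFeature(
        "http://apache.org/xml/features/disallow-doctype-decl", true);
    return factory.newSAXParser();
  }

  /** Returns true if the document parses, false if the parser rejects it. */
  static boolean parses(String xml) {
    SAXParser parser;
    try {
      parser = newHardenedParser();
    } catch (ParserConfigurationException | SAXException e) {
      // A configuration failure is not a verdict on the document.
      throw new IllegalStateException("parser could not be configured", e);
    }
    try {
      parser.parse(
          new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
          new DefaultHandler());
      return true;
    } catch (SAXException e) {
      return false;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(parses("<rdf><topic/></rdf>"));
    System.out.println(
        parses("<!DOCTYPE rdf [<!ENTITY x \"y\">]><rdf>&x;</rdf>"));
  }
}
```

Configuration errors are kept separate from parse errors here so that an unsupported feature on some other parser implementation is not mistaken for a rejected document.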



[nutch] branch master updated: NUTCH-2837 Update multiple dependencies (#560)

2021-01-08 Thread lewismc
This is an automated email from the ASF dual-hosted git repository.

lewismc pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git


The following commit(s) were added to refs/heads/master by this push:
 new 7f0fdb1  NUTCH-2837 Update multiple dependencies (#560)
7f0fdb1 is described below

commit 7f0fdb15a339cae72fda9624f1260ee4869688ef
Author: Lewis John McGibbney 
AuthorDate: Fri Jan 8 10:01:38 2021 -0800

NUTCH-2837 Update multiple dependencies (#560)

* NUTCH-2837 Upgrade Slf4j dependencies

* NUTCH-2837 Update multiple dependencies
---
 ivy/ivy.xml | 30 +++---
 1 file changed, 15 insertions(+), 15 deletions(-)

diff --git a/ivy/ivy.xml b/ivy/ivy.xml
index 0aa1de4..ad1e65f 100644
--- a/ivy/ivy.xml
+++ b/ivy/ivy.xml
@@ -32,8 +32,8 @@

 

-   
-   
+   
+   
 


 
-   
+   
 
-   
+   
 

 
@@ -78,18 +78,18 @@


 
-   
-   
-   
-   
-   
-   
-   
-   
-   
+   
+   
+   
+   
+   
+   
+   
+   
+   
 

-   
+   



@@ -105,7 +105,7 @@

 

-   
+   






[nutch] branch master updated: NUTCH-2836 Upgrade various commons dependencies (#559)

2021-01-07 Thread lewismc
This is an automated email from the ASF dual-hosted git repository.

lewismc pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git


The following commit(s) were added to refs/heads/master by this push:
 new fbd53ba  NUTCH-2836 Upgrade various commons dependencies (#559)
fbd53ba is described below

commit fbd53ba16bc8dd751425757273996216ec80cd78
Author: Lewis John McGibbney 
AuthorDate: Thu Jan 7 20:41:37 2021 -0800

NUTCH-2836 Upgrade various commons dependencies (#559)
---
 ivy/ivy.xml | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/ivy/ivy.xml b/ivy/ivy.xml
index a20d8a6..0aa1de4 100644
--- a/ivy/ivy.xml
+++ b/ivy/ivy.xml
@@ -41,11 +41,11 @@


 
-   
-   
-   
-   
-   
+   
+   
+   
+   
+   


 



[nutch] branch master updated: Add possibility to setup deduplication group mode in crawl script (#557)

2020-12-17 Thread lewismc
This is an automated email from the ASF dual-hosted git repository.

lewismc pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git


The following commit(s) were added to refs/heads/master by this push:
 new 88a17f2  Add possibility to setup deduplication group mode in crawl 
script (#557)
88a17f2 is described below

commit 88a17f26b4160720bacb3ead1cad71ae24a559bc
Author: Jakob Berlin 
AuthorDate: Thu Dec 17 17:59:30 2020 +0100

Add possibility to setup deduplication group mode in crawl script (#557)
---
 src/bin/crawl | 16 +++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/src/bin/crawl b/src/bin/crawl
index 23a2940..db42218 100755
--- a/src/bin/crawl
+++ b/src/bin/crawl
@@ -48,6 +48,8 @@
 #   --time-limit-fetch  Number of minutes allocated to the 
fetching [default: 180]
 #   --num-threadsNumber of threads for fetching / 
sitemap processing [default: 50]
 #
+#   -dedup-groupDeduplication group method [default: 
none]
+#
 
 function __to_seconds() {
   NUMBER=$(echo $1 | tr -dc '0-9')
@@ -107,6 +109,7 @@ function __print_usage {
   echo -e "  \t\t\t\t\t  - never [default]"
   echo -e "  \t\t\t\t\t  - always (processing takes place in every iteration)"
   echo -e "  \t\t\t\t\t  - once (processing only takes place in the first 
iteration)"
+  echo -e "  -dedup-group \tDeduplication group method 
[default: none]"
 
   exit 1
 }
@@ -124,6 +127,7 @@ SIZE_FETCHLIST=5 # 25K x NUM_TASKS
 TIME_LIMIT_FETCH=180
 NUM_THREADS=50
 SITEMAPS_FROM_HOSTDB_FREQUENCY=never
+DEDUP_GROUP=none
 
 while [[ $# > 0 ]]
 do
@@ -177,6 +181,10 @@ do
 SITEMAPS_FROM_HOSTDB_FREQUENCY="${2}"
 shift 2
 ;;
+--dedup-group)
+DEDUP_GROUP="${2}"
+shift 2
+;;
 --hostdbupdate)
 HOSTDBUPDATE=true
 shift
@@ -197,6 +205,12 @@ if [[ ! "$SITEMAPS_FROM_HOSTDB_FREQUENCY" =~ 
^(never|always|once)$ ]]; then
   __print_usage
 fi
 
+if [[ ! "$DEDUP_GROUP" =~ ^(none|host|domain)$ ]]; then
+  echo "Error: --dedup-group  has to be one of none, host, domain."
+  echo -e ""
+  __print_usage
+fi
+
 if [[ $# != 2 ]]; then
   __print_usage
 fi
@@ -385,7 +399,7 @@ do
   __bin_nutch invertlinks "${commonOptions[@]}" "$CRAWL_PATH"/linkdb 
"$CRAWL_PATH"/segments/$SEGMENT -noNormalize -nofilter
 
   echo "Dedup on crawldb"
-  __bin_nutch dedup "${commonOptions[@]}" "$CRAWL_PATH"/crawldb
+  __bin_nutch dedup "${commonOptions[@]}" "$CRAWL_PATH"/crawldb -group 
"$DEDUP_GROUP"
 
   if $INDEXFLAG; then
   echo "Indexing $SEGMENT to index"
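
The option check added to bin/crawl accepts exactly three values and falls back to none when the option is absent. For readers more at home in Java than in Bash, the same check might be sketched as below. The class DedupGroupCheck and its method are hypothetical and are not part of Nutch; only the regular expression, the default and the error text are taken from the diff above.

```java
import java.util.regex.Pattern;

// Hypothetical illustration of the Bash test [[ "$DEDUP_GROUP" =~ ^(none|host|domain)$ ]].
public class DedupGroupCheck {

  // Same alternatives as in the crawl script.
  private static final Pattern ALLOWED =
      Pattern.compile("^(none|host|domain)$");

  /** Returns true only for the three group modes the script accepts. */
  static boolean isValidGroup(String value) {
    return value != null && ALLOWED.matcher(value).matches();
  }

  public static void main(String[] args) {
    // Default mirrors DEDUP_GROUP=none in the script.
    String group = args.length > 0 ? args[0] : "none";
    if (!isValidGroup(group)) {
      System.err.println(
          "Error: --dedup-group has to be one of none, host, domain.");
      System.exit(1);
    }
    System.out.println("dedup group: " + group);
  }
}
```

As in the script, the match is anchored at both ends, so values such as hosts or an empty string are rejected rather than partially matched.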



[nutch] branch master updated: NUTCH-2835 Upgrade commons-jexl from 2 --> 3 (#558)

2020-12-17 Thread lewismc
This is an automated email from the ASF dual-hosted git repository.

lewismc pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git


The following commit(s) were added to refs/heads/master by this push:
 new 8d8e08b  NUTCH-2835 Upgrade commons-jexl from 2 --> 3 (#558)
8d8e08b is described below

commit 8d8e08b354fd94fced548c0b73623a375bcc8b2b
Author: Lewis John McGibbney 
AuthorDate: Thu Dec 17 08:56:04 2020 -0800

NUTCH-2835 Upgrade commons-jexl from 2 --> 3 (#558)
---
 ivy/ivy.xml |  2 +-
 src/java/org/apache/nutch/crawl/CrawlDatum.java |  8 
 src/java/org/apache/nutch/crawl/CrawlDbReader.java  |  4 ++--
 src/java/org/apache/nutch/crawl/Generator.java  | 12 ++--
 src/java/org/apache/nutch/hostdb/ReadHostDb.java| 17 +++--
 src/java/org/apache/nutch/util/JexlUtil.java| 12 +---
 .../org/apache/nutch/exchange/jexl/JexlExchange.java|  8 
 .../apache/nutch/indexer/jexl/JexlIndexingFilter.java   | 10 +-
 8 files changed, 34 insertions(+), 39 deletions(-)

diff --git a/ivy/ivy.xml b/ivy/ivy.xml
index 16ed8a6..a20d8a6 100644
--- a/ivy/ivy.xml
+++ b/ivy/ivy.xml
@@ -46,7 +46,7 @@



-   
+   

 

diff --git a/src/java/org/apache/nutch/crawl/CrawlDatum.java 
b/src/java/org/apache/nutch/crawl/CrawlDatum.java
index e05d7fd..5159bdb 100644
--- a/src/java/org/apache/nutch/crawl/CrawlDatum.java
+++ b/src/java/org/apache/nutch/crawl/CrawlDatum.java
@@ -25,9 +25,9 @@ import java.util.HashSet;
 import java.util.Map;
 import java.util.Map.Entry;
 
-import org.apache.commons.jexl2.JexlContext;
-import org.apache.commons.jexl2.Expression;
-import org.apache.commons.jexl2.MapContext;
+import org.apache.commons.jexl3.JexlContext;
+import org.apache.commons.jexl3.JexlExpression;
+import org.apache.commons.jexl3.MapContext;
 import org.apache.hadoop.io.FloatWritable;
 import org.apache.hadoop.io.IntWritable;
 import org.apache.hadoop.io.Text;
@@ -542,7 +542,7 @@ public class CrawlDatum implements 
WritableComparable, Cloneable {
 }
   }
   
-  public boolean evaluate(Expression expr, String url) {
+  public boolean evaluate(JexlExpression expr, String url) {
 if (expr != null && url != null) {
   // Create a context and add data
   JexlContext jcontext = new MapContext();
diff --git a/src/java/org/apache/nutch/crawl/CrawlDbReader.java 
b/src/java/org/apache/nutch/crawl/CrawlDbReader.java
index 1bb8160..3af63d3 100644
--- a/src/java/org/apache/nutch/crawl/CrawlDbReader.java
+++ b/src/java/org/apache/nutch/crawl/CrawlDbReader.java
@@ -74,7 +74,7 @@ import org.apache.nutch.util.NutchJob;
 import org.apache.nutch.util.SegmentReaderUtil;
 import org.apache.nutch.util.StringUtil;
 import org.apache.nutch.util.TimingUtil;
-import org.apache.commons.jexl2.Expression;
+import org.apache.commons.jexl3.JexlExpression;
 
 import com.fasterxml.jackson.core.JsonGenerationException;
 import com.fasterxml.jackson.core.JsonGenerator;
@@ -864,7 +864,7 @@ public class CrawlDbReader extends AbstractChecker 
implements Closeable {
 Matcher matcher = null;
 String status = null;
 Integer retry = null;
-Expression expr = null;
+JexlExpression expr = null;
 float sample;
 
 @Override
diff --git a/src/java/org/apache/nutch/crawl/Generator.java 
b/src/java/org/apache/nutch/crawl/Generator.java
index 04c2ae8..c3f4469 100644
--- a/src/java/org/apache/nutch/crawl/Generator.java
+++ b/src/java/org/apache/nutch/crawl/Generator.java
@@ -34,9 +34,9 @@ import java.util.Random;
 import org.apache.hadoop.conf.Configurable;
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
-import org.apache.commons.jexl2.Expression;
-import org.apache.commons.jexl2.JexlContext;
-import org.apache.commons.jexl2.MapContext;
+import org.apache.commons.jexl3.JexlExpression;
+import org.apache.commons.jexl3.JexlContext;
+import org.apache.commons.jexl3.MapContext;
 import org.apache.hadoop.mapreduce.Counter;
 import org.apache.hadoop.mapreduce.Job;
 import org.apache.hadoop.mapreduce.Mapper;
@@ -182,7 +182,7 @@ public class Generator extends NutchTool implements Tool {
 private float scoreThreshold = 0f;
 private int intervalThreshold = -1;
 private byte restrictStatus = -1;
-private Expression expr = null;
+private JexlExpression expr = null;
 
 @Override
 public void setup(
@@ -306,8 +306,8 @@ public class Generator extends NutchTool implements Tool {
 private URLNormalizers normalizers;
 private static boolean normalise;
 private SequenceFile.Reader[] hostdbReaders = null;
-private Expression maxCountExpr = null;
-private Expression fetchDelayExpr = null;
+private JexlExpression maxCountExpr = null;
+private JexlExpression fetchDelayExpr = null;
 
 pu

[nutch] branch master updated: NUTCH-2809 Upgrade any23 plugin dependency to 2.4 (#553)

2020-11-17 Thread lewismc
This is an automated email from the ASF dual-hosted git repository.

lewismc pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git


The following commit(s) were added to refs/heads/master by this push:
 new 235af3c  NUTCH-2809 Upgrade any23 plugin dependency to 2.4 (#553)
235af3c is described below

commit 235af3c8ed547590dd83049d90ec0f86b78e5f7a
Author: Lewis John McGibbney 
AuthorDate: Tue Nov 17 19:10:18 2020 -0800

NUTCH-2809 Upgrade any23 plugin dependency to 2.4 (#553)

* NUTCH-2809 Upgrade any23 plugin dependency to 2.4
---
 .gitignore |   1 +
 src/plugin/any23/ivy.xml   |   2 +-
 src/plugin/any23/plugin.xml| 283 +++--
 .../apache/nutch/any23/TestAny23ParseFilter.java   |   4 +-
 4 files changed, 157 insertions(+), 133 deletions(-)

diff --git a/.gitignore b/.gitignore
index 02a74cf..249ca77 100644
--- a/.gitignore
+++ b/.gitignore
@@ -23,3 +23,4 @@ naivebayes-model
 .idea/
 *.iml
 *.swp
+csvindexwriter
diff --git a/src/plugin/any23/ivy.xml b/src/plugin/any23/ivy.xml
index 9a0aa34..d821b32 100644
--- a/src/plugin/any23/ivy.xml
+++ b/src/plugin/any23/ivy.xml
@@ -36,7 +36,7 @@
   
 
   
-
+
   
   
   
diff --git a/src/plugin/any23/plugin.xml b/src/plugin/any23/plugin.xml
index 71c5522..934709d 100644
--- a/src/plugin/any23/plugin.xml
+++ b/src/plugin/any23/plugin.xml
@@ -25,162 +25,185 @@
 [hunk body not reproduced: the XML element lines of plugin.xml did not survive archiving]
diff --git 
a/src/plugin/any23/src/test/org/apache/nutch/any23/TestAny23ParseFilter.java 
b/src/plugin/any23/src/test/org/apache/nutch/any23/TestAny23ParseFilter.java
index 3f0ace3..09c253f 100644
--- a/src/plugin/any23/src/test/org/apache/nutch/any23/TestAny23ParseFilter.java
+++ b/src/plugin/any23/src/test/org/apache/nutch/any23/TestAny23ParseFilter.java
@@ -49,9 +49,9 @@ public class TestAny23ParseFilter {
   
   private String file2 = "microdata_basic.html";
 
-  private static final int EXPECTED_TRIPLES_1 = 68;
+  private static final int EXPECTED_TRIPLES_1 = 79;
   
-  private static final int EXPECTED_TRIPLES_2 = 38;
+  private static final int EXPECTED_TRIPLES_2 = 40;
   
   @Before
   public void setUp() {



[nutch] branch branch-2.4 created (now 4944597)

2019-03-09 Thread lewismc
This is an automated email from the ASF dual-hosted git repository.

lewismc pushed a change to branch branch-2.4
in repository https://gitbox.apache.org/repos/asf/nutch.git.


  at 4944597  Prepare for Nutch 2.4 release candidate

This branch includes the following new commits:

 new 4944597  Prepare for Nutch 2.4 release candidate

The 1 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.




[nutch] 01/01: Prepare for Nutch 2.4 release candidate

2019-03-09 Thread lewismc
This is an automated email from the ASF dual-hosted git repository.

lewismc pushed a commit to branch branch-2.4
in repository https://gitbox.apache.org/repos/asf/nutch.git

commit 49445974a1f31d2e304c75e274aa6fd39afc95b9
Author: Lewis John McGibbney 
AuthorDate: Sat Mar 9 16:23:32 2019 -0800

Prepare for Nutch 2.4 release candidate
---
 CHANGES.txt| 108 -
 NOTICE.txt |   2 +-
 README.md  |   4 ++
 conf/nutch-default.xml |   2 +-
 default.properties |   4 +-
 5 files changed, 107 insertions(+), 13 deletions(-)

diff --git a/CHANGES.txt b/CHANGES.txt
index b7f1345..e27e358 100644
--- a/CHANGES.txt
+++ b/CHANGES.txt
@@ -1,14 +1,104 @@
 Nutch Change Log
 
-Nutch 2.4 Development
-
- * NUTCH-2256 Inconsistent log level (songwanging via snagel)
-
- * NUTCH-961 GitHub-92 Add the boilerpipe parsing adapted from NUTCH-961 
(Jeremie Bourseaux  via mattmann)
-
- * GitHub-94 Fix the issue of the bad timestamp. (Jeremie Bourseaux 
 via mattmann)
-
- * NUTCH-1314 Impose a limit on the length of outlink target urls (ferdy, 
lewismc, tejasp, Canan Girgin, Tien Nguyen Manh)
+Nutch 2.4 Release 09032018 (ddmm)
+Release Report - https://s.apache.org/bFfL
+
+Sub-task
+
+[NUTCH-2284] - Basic Authentication Support for REST API
+[NUTCH-2285] - Digest Authentication Support for REST API
+[NUTCH-2289] - SSL Support for REST API
+[NUTCH-2294] - Authorization Support for REST API
+[NUTCH-2301] - Create Tests for Security Layer of NutchServer
+
+Bug
+
+[NUTCH-2089] - Move Nutch 2.x to compile on JDK 8
+[NUTCH-2112] - Missing org.restlet.jee when building with gora-solr
+[NUTCH-] - re-fetch deletes all metadata except _csh_ and _rs_
+[NUTCH-2256] - Inconsistent log level practice
+[NUTCH-2259] - Nutch 2.x HBase Docker requires a logs folder to run 
exception free
+[NUTCH-2260] - JAVA_HOME and hbase-common dependency absent from hbase 
Docker image
+[NUTCH-2266] - Fix dead link in build.xml for javadoc
+[NUTCH-2269] - Clean not working after crawl
+[NUTCH-2282] - Incorrect content-type returned in 4 API calls
+[NUTCH-2283] - "Bad substitution" error when running cassandra docker 
scripts
+[NUTCH-2305] - generate.min.score doesn't work in 2.x
+[NUTCH-2314] - Use indexer-elastic2 Plugin for javadoc and eclipse Targets
+[NUTCH-2337] - urlnormalizer-basic to strip empty port
+[NUTCH-2346] - Check Types at Object Equality
+[NUTCH-2348] - Close GZIPInputStream
+[NUTCH-2349] - urlnormalizer-basic NPE for ill-formed URL "http:/"
+[NUTCH-2350] - Add Missing activeConfId Field to NutchStatus Object
+[NUTCH-2358] - HostInjectorJob doesn't work
+[NUTCH-2364] - http.agent.rotate: IllegalArgumentException / last element 
of agent names ignored
+[NUTCH-2388] - bin/crawl indexing only webpages containing batchID instead 
of all in 2.x
+[NUTCH-2393] - 2.x patch for MD5 duplication issue addressed in NUTCH-2391
+[NUTCH-2404] - Failed Jenkin Build #1588 error in unit test resolved
+[NUTCH-2405] - jsoup-extractor structure correction, typo fixed
+[NUTCH-2437] - gora mongodb mapping file error
+[NUTCH-2446] - URLFiltersCheck fix
+[NUTCH-2448] - Allow Sending an empty http.agent.version
+[NUTCH-2451] - protocol-ftp to resolve relative URL when following 
redirects
+[NUTCH-2469] - Documents not commited to solr in Sever mode
+[NUTCH-2475] - If and else-if branches has the same condition
+[NUTCH-2513] - ant eclipse target fails with "protocol switch unsafe"
+[NUTCH-2520] - Wrong Accept-Charset sent when http.accept.charset is not 
defined
+[NUTCH-2533] - Injector: NullPointerException if seed URL dir contains 
non-file entries
+[NUTCH-2536] - GeneratorReducer.count is a static variable
+[NUTCH-2548] - Compressed content skipped. Content of size 78 was 
truncated to 74
+[NUTCH-2581] - Caching of redirected robots.txt may overwrite correct 
robots.txt rules
+[NUTCH-2637] - Number of fetcher reducers is misconfigured when the arg 
not passed
+[NUTCH-2639] - bin/nutch fails to set native library path on Cygwin 
causing jobs to fail with UnsatisfiedLinkError
+[NUTCH-2640] - Typo: DbUpdaterJob: updatinging all
+[NUTCH-2641] - ClassCastException in webui
+[NUTCH-2642] - MoreIndexingFilter parses ISO 8601 UTC dates in local time 
zone
+
+New Feature
+
+[NUTCH-1741] - Support of Sitemaps in Nutch 2.x
+[NUTCH-2199] - Documentation for Nutch 2.X REST API
+[NUTCH-2238] - Indexer for Elasticsearch 2.x
+[NUTCH-2243] - Documentation for Nutch 2.X REST API
+[NUTCH-2344] - Authentication Support for Web GUI
+[NUTCH-2373] - Indexer for Hbase
+[NUTCH-2389] - Precise data parsing using Jsoup CSS selectors
+
+Improvement
+
+[NUTCH-1314] - Impose a limit on the length of outlink target urls
+[NUTCH-1678] - Remove dependency on o

[nutch] branch master updated: NUTCH-2698 Remove sonar build task from build.xml (#443)

2019-03-05 Thread lewismc
This is an automated email from the ASF dual-hosted git repository.

lewismc pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git


The following commit(s) were added to refs/heads/master by this push:
 new 8bdec5e  NUTCH-2698 Remove sonar build task from build.xml (#443)
8bdec5e is described below

commit 8bdec5e3ef77f816c616c978c775a0eb3b4a391a
Author: Lewis John McGibbney 
AuthorDate: Tue Mar 5 13:04:36 2019 -0800

NUTCH-2698 Remove sonar build task from build.xml (#443)
---
 build.xml | 26 --
 1 file changed, 26 deletions(-)

diff --git a/build.xml b/build.xml
index 65e8f3f..18f659a 100644
--- a/build.xml
+++ b/build.xml
@@ -999,32 +999,6 @@
 
   
 
-  
-  
-  
-
-  
-  
-
-
-  
-
-  
-  
-
-
-
-
-
-
-
-
-
-
-
-  
-
 
   
   



[nutch] branch master updated: NUTCH-2697: Upgrade Ivy to fix the issue of an unset packaging.type property. (#441)

2019-03-01 Thread lewismc
This is an automated email from the ASF dual-hosted git repository.

lewismc pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git


The following commit(s) were added to refs/heads/master by this push:
 new 0b0fcea  NUTCH-2697: Upgrade Ivy to fix the issue of an unset 
packaging.type property. (#441)
0b0fcea is described below

commit 0b0fcea1720cbe0722a18a2a29977e4bfec685bb
Author: Chris Gavin 
AuthorDate: Sat Mar 2 03:48:27 2019 +

NUTCH-2697: Upgrade Ivy to fix the issue of an unset packaging.type 
property. (#441)
---
 default.properties  | 2 +-
 ivy/ivysettings.xml | 8 
 2 files changed, 1 insertion(+), 9 deletions(-)

diff --git a/default.properties b/default.properties
index bb987d9..1423025 100644
--- a/default.properties
+++ b/default.properties
@@ -63,7 +63,7 @@ runtime.dir=./runtime
 runtime.deploy=${runtime.dir}/deploy
 runtime.local=${runtime.dir}/local
 
-ivy.version=2.4.0
+ivy.version=2.5.0-rc1
 ivy.dir=${basedir}/ivy
 ivy.file=${ivy.dir}/ivy.xml
 ivy.jar=${ivy.dir}/ivy-${ivy.version}.jar
diff --git a/ivy/ivysettings.xml b/ivy/ivysettings.xml
index a2dc700..d9b5044 100644
--- a/ivy/ivysettings.xml
+++ b/ivy/ivysettings.xml
@@ -38,14 +38,6 @@
 
value="[organisation]/[module]/[revision]/[module]-[revision](-[classifier])"/>
   
-  
-  
   
   
   



[nutch] branch master updated: NUTCH-2633 Fix deprecation warnings when building Nutch master branch under JDK 10.0.2+13 (#374)

2018-08-10 Thread lewismc
This is an automated email from the ASF dual-hosted git repository.

lewismc pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git


The following commit(s) were added to refs/heads/master by this push:
 new f02110f  NUTCH-2633 Fix deprecation warnings when building Nutch 
master branch under JDK 10.0.2+13 (#374)
f02110f is described below

commit f02110f42c53e77450835776cf41f22c23f030ec
Author: Lewis John McGibbney 
AuthorDate: Fri Aug 10 17:43:36 2018 -0700

NUTCH-2633 Fix deprecation warnings when building Nutch master branch under 
JDK 10.0.2+13 (#374)
---
 .../apache/nutch/crawl/AbstractFetchSchedule.java  |  0
 .../apache/nutch/crawl/AdaptiveFetchSchedule.java  |  0
 src/java/org/apache/nutch/crawl/CrawlDatum.java|  2 +-
 src/java/org/apache/nutch/crawl/CrawlDbMerger.java |  1 -
 src/java/org/apache/nutch/crawl/CrawlDbReader.java |  4 ---
 .../apache/nutch/crawl/DefaultFetchSchedule.java   |  0
 src/java/org/apache/nutch/crawl/FetchSchedule.java |  0
 .../apache/nutch/crawl/FetchScheduleFactory.java   |  2 +-
 .../nutch/crawl/MimeAdaptiveFetchSchedule.java |  2 +-
 .../org/apache/nutch/crawl/SignatureFactory.java   |  2 +-
 src/java/org/apache/nutch/fetcher/Fetcher.java |  2 +-
 src/java/org/apache/nutch/hostdb/ReadHostDb.java   |  4 +--
 .../org/apache/nutch/hostdb/ResolverThread.java|  1 +
 src/java/org/apache/nutch/indexer/CleaningJob.java |  2 ++
 src/java/org/apache/nutch/indexer/IndexWriter.java |  3 ++
 .../org/apache/nutch/indexer/IndexingFilters.java  |  8 -
 src/java/org/apache/nutch/plugin/Extension.java| 10 +--
 src/java/org/apache/nutch/plugin/Plugin.java   |  3 +-
 src/java/org/apache/nutch/protocol/Content.java|  0
 src/java/org/apache/nutch/protocol/Protocol.java   |  0
 .../apache/nutch/protocol/ProtocolException.java   |  0
 .../org/apache/nutch/protocol/ProtocolFactory.java |  6 
 .../org/apache/nutch/protocol/ProtocolStatus.java  | 34 +++---
 .../nutch/segment/ContentAsTextInputFormat.java|  1 +
 .../org/apache/nutch/segment/SegmentReader.java| 14 -
 .../org/apache/nutch/service/impl/LinkReader.java  | 22 ++
 .../org/apache/nutch/service/impl/NodeReader.java  | 22 ++
 .../service/impl/NutchServerPoolExecutor.java  |  1 +
 .../service/model/response/FetchNodeDbInfo.java|  4 +++
 .../apache/nutch/service/resources/DbResource.java |  3 ++
 src/java/org/apache/nutch/tools/Benchmark.java |  2 ++
 .../apache/nutch/tools/CommonCrawlDataDumper.java  |  2 +-
 .../apache/nutch/tools/CommonCrawlFormatWARC.java  |  2 --
 src/java/org/apache/nutch/tools/DmozParser.java| 15 ++
 src/java/org/apache/nutch/tools/FileDumper.java|  2 +-
 .../apache/nutch/tools/arc/ArcSegmentCreator.java  |  1 +
 .../org/apache/nutch/tools/warc/WARCExporter.java  |  1 -
 .../org/apache/nutch/util/AbstractChecker.java |  2 ++
 .../apache/nutch/util/CrawlCompletionStats.java|  6 ++--
 .../org/apache/nutch/util/EncodingDetector.java|  3 ++
 .../nutch/util/GenericWritableConfigurable.java|  2 +-
 .../apache/nutch/util/domain/DomainStatistics.java |  2 --
 .../apache/nutch/any23/TestAny23ParseFilter.java   | 13 -
 .../creativecommons/nutch/TestCCParseFilter.java   |  0
 .../apache/nutch/parse/feed/TestFeedParser.java| 10 +--
 .../nutch/indexer/basic/BasicIndexingFilter.java   |  6 
 .../nutch/indexer/geoip/GeoIPDocumentCreator.java  |  3 +-
 .../nutch/indexer/jexl/JexlIndexingFilter.java |  2 +-
 .../indexer/links/TestLinksIndexingFilter.java |  1 -
 .../nutch/indexer/replace/ReplaceIndexer.java  |  2 +-
 .../cloudsearch/CloudSearchIndexWriter.java|  1 +
 .../nutch/indexwriter/dummy/DummyIndexWriter.java  |  4 ---
 .../elasticrest/ElasticRestIndexWriter.java|  5 
 .../indexwriter/elastic/ElasticIndexWriter.java|  1 +
 .../elastic/TestElasticIndexWriter.java|  3 ++
 .../nutch/indexwriter/rabbit/RabbitDocument.java   |  2 ++
 .../indexer/filter/MimeTypeIndexingFilter.java |  1 +
 .../indexer/filter/MimeTypeIndexingFilterTest.java |  1 -
 .../org/apache/nutch/parse/html/HtmlParser.java|  1 +
 .../java/org/apache/nutch/parse/swf/SWFParser.java |  4 +--
 .../parse/tika/BoilerpipeExtractorRepository.java  |  2 +-
 .../org/apache/nutch/parse/tika/TikaParser.java|  2 +-
 .../apache/nutch/parse/tika/TestFeedParser.java|  7 -
 .../nutch/parsefilter/regex/RegexParseFilter.java  |  1 -
 .../parsefilter/regex/TestRegexParseFilter.java|  2 --
 .../org/apache/nutch/protocol/file/FileError.java  |  1 +
 .../apache/nutch/protocol/file/FileResponse.java   |  4 +--
 .../java/org/apache/nutch/protocol/ftp/Ftp.java|  1 +
 .../org/apache/nutch/protocol/ftp/FtpError.java|  1 +
 .../org/apache/nutch/protocol/ftp/FtpResponse.java |  8 ++---
 .../nutch/protocol/htmlunit/HttpResponse.java  |  2 ++
 .../java/org/apache/nutch/protocol/http/Http.java  |  0

[nutch] branch 2.x updated: NUTCH-2222 re-fetch deletes all metadata except _csh_ and _rs_

2018-08-01 Thread lewismc
This is an automated email from the ASF dual-hosted git repository.

lewismc pushed a commit to branch 2.x
in repository https://gitbox.apache.org/repos/asf/nutch.git


The following commit(s) were added to refs/heads/2.x by this push:
 new c43c2c8  NUTCH-2222 re-fetch deletes all metadata except _csh_ and _rs_
c43c2c8 is described below

commit c43c2c85874295ef94982694fc28c068d5447234
Author: Lewis John McGibbney 
AuthorDate: Wed Aug 1 11:26:04 2018 -0700

NUTCH-2222 re-fetch deletes all metadata except _csh_ and _rs_
---
 src/java/org/apache/nutch/fetcher/FetcherJob.java | 1 +
 1 file changed, 1 insertion(+)

diff --git a/src/java/org/apache/nutch/fetcher/FetcherJob.java 
b/src/java/org/apache/nutch/fetcher/FetcherJob.java
index 82e7a12..f4b97cb 100644
--- a/src/java/org/apache/nutch/fetcher/FetcherJob.java
+++ b/src/java/org/apache/nutch/fetcher/FetcherJob.java
@@ -75,6 +75,7 @@ public class FetcherJob extends NutchTool implements Tool {
 FIELDS.add(WebPage.Field.MARKERS);
 FIELDS.add(WebPage.Field.REPR_URL);
 FIELDS.add(WebPage.Field.FETCH_TIME);
+FIELDS.add(WebPage.Field.METADATA);
   }
 
   /**



[nutch] branch master updated: NUTCH-2539 (#300)

2018-04-10 Thread lewismc
This is an automated email from the ASF dual-hosted git repository.

lewismc pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git


The following commit(s) were added to refs/heads/master by this push:
 new 666022d  NUTCH-2539 (#300)
666022d is described below

commit 666022d67ff0e3694540e4b97369cb73f1dfa377
Author: Semyon <oked...@users.noreply.github.com>
AuthorDate: Wed Apr 11 00:53:40 2018 +0200

NUTCH-2539 (#300)

* Merge branch 'master', remote branch 'origin'

* Squashed commit of the following:

commit 68363b1bba07ac8b21f6418633dec3f554996703
Author: Semyon Semyonov <semyon.semyo...@mail.com>
Date:   Mon Mar 19 14:48:11 2018 +0100

added description to crawldb.url.normalizers

commit b53039e4b877fd52cac95d3df52133a0c914e4e1
Author: Semyon Semyonov <semyon.semyo...@mail.com>
Date:   Mon Mar 19 14:27:25 2018 +0100

misspelling in nutch-default crawldb.url.filters

commit 73e3f6493f2f5cdb3b5336ee61854a3754e4b051
Author: Semyon Semyonov <semyon.semyo...@mail.com>
Date:   Mon Mar 19 14:23:26 2018 +0100

db.url.filters and db.url.normalizers renamed to crawldb.* for the code 
match
---
 conf/nutch-default.xml | 14 ++
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/conf/nutch-default.xml b/conf/nutch-default.xml
index 20b8691..405b99f 100644
--- a/conf/nutch-default.xml
+++ b/conf/nutch-default.xml
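@@ -548,15 +548,21 @@
 </property>
 
 <property>
-  <name>db.url.normalizers</name>
+  <name>crawldb.url.normalizers</name>
   <value>false</value>
-  <description>Normalize urls when updating crawldb</description>
+  <description>
+   !Temporary, can be overwritten with the command line!
+   Normalize urls when updating crawldb
+  </description>
 </property>
 
 <property>
-  <name>db.url.filters</name>
+  <name>crawldb.url.filters</name>
   <value>false</value>
-  <description>Filter urls when updating crawldb</description>
+  <description>
+   !Temporary, can be overwritten with the command line!
+   Filter urls when updating crawldb
+  </description>
 </property>
 
 <property>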

-- 
To stop receiving notification emails like this one, please contact
lewi...@apache.org.


[nutch] 01/01: Merge pull request #309 from HansBrende/NUTCH-2550

2018-04-10 Thread lewismc
This is an automated email from the ASF dual-hosted git repository.

lewismc pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git

commit f612c4133444e7c765a6c690b5ee6c373ee12265
Merge: 8682b96 de19028
Author: Lewis John McGibbney <lewis.mcgibb...@gmail.com>
AuthorDate: Tue Apr 10 15:51:25 2018 -0700

Merge pull request #309 from HansBrende/NUTCH-2550

fix for NUTCH-2550 contributed by Hans Brende

 src/java/org/apache/nutch/fetcher/FetcherThread.java | 2 ++
 1 file changed, 2 insertions(+)



[nutch] branch master updated (8682b96 -> f612c41)

2018-04-10 Thread lewismc
This is an automated email from the ASF dual-hosted git repository.

lewismc pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git.


from 8682b96  Merge pull request #307 from Omkar20895/NUTCH-2518
 add de19028  fix for NUTCH-2550 contributed by Hans Brende
 new f612c41  Merge pull request #309 from HansBrende/NUTCH-2550

The 1 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:
 src/java/org/apache/nutch/fetcher/FetcherThread.java | 2 ++
 1 file changed, 2 insertions(+)



[nutch] 01/01: Merge pull request #306 from lewismc/NUTCH-2545

2018-04-02 Thread lewismc
This is an automated email from the ASF dual-hosted git repository.

lewismc pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git

commit 615331b81e04e3f50f766df442594f0436e51bca
Merge: 2934d43 d7e8a26
Author: Lewis John McGibbney <lewis.mcgibb...@gmail.com>
AuthorDate: Mon Apr 2 09:09:45 2018 -0700

Merge pull request #306 from lewismc/NUTCH-2545

NUTCH-2545 Upgrade to Any23 2.2

 ivy/ivysettings.xml|  12 -
 src/plugin/any23/howto_upgrade_any23.txt   |   8 +-
 src/plugin/any23/ivy.xml   |   3 +-
 src/plugin/any23/plugin.xml| 323 ++---
 .../org/apache/nutch/any23/Any23ParseFilter.java   |  29 +-
 .../apache/nutch/any23/TestAny23ParseFilter.java   |   9 +-
 6 files changed, 174 insertions(+), 210 deletions(-)



[nutch] branch master updated (2934d43 -> 615331b)

2018-04-02 Thread lewismc
This is an automated email from the ASF dual-hosted git repository.

lewismc pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git.


from 2934d43  Merge pull request #305 from 
sebastian-nagel/NUTCH-2447-ssl-handshake-alert
 add 5233a79  NUTCH-2545 Upgrade to Any23 2.2
 add d6ed255  ANY23-2545 remove previous syntax correction.
 add 40e92a5  NUTCH-2545 Revert syntax correction to original 
implementation, add commons-rdf-api dependency
 add d7e8a26  NUTCH-2545 Upgrade to Any23 2.2
 new 615331b  Merge pull request #306 from lewismc/NUTCH-2545

The 1 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:
 ivy/ivysettings.xml|  12 -
 src/plugin/any23/howto_upgrade_any23.txt   |   8 +-
 src/plugin/any23/ivy.xml   |   3 +-
 src/plugin/any23/plugin.xml| 323 ++---
 .../org/apache/nutch/any23/Any23ParseFilter.java   |  29 +-
 .../apache/nutch/any23/TestAny23ParseFilter.java   |   9 +-
 6 files changed, 174 insertions(+), 210 deletions(-)



[nutch] 01/01: Merge pull request #298 from benmvachon/NUTCH-2536

2018-03-27 Thread lewismc
This is an automated email from the ASF dual-hosted git repository.

lewismc pushed a commit to branch 2.x
in repository https://gitbox.apache.org/repos/asf/nutch.git

commit d48c67e2f1853cc1f2a7da1a04f6a22d524d5685
Merge: 4c72756 dcada64
Author: Lewis John McGibbney <lewis.mcgibb...@gmail.com>
AuthorDate: Tue Mar 27 12:04:24 2018 -0700

Merge pull request #298 from benmvachon/NUTCH-2536

NUTCH-2536 change GeneratorReducer.count field to non-static variable…

 src/java/org/apache/nutch/crawl/GeneratorReducer.java | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)



[nutch] branch 2.x updated (4c72756 -> d48c67e)

2018-03-27 Thread lewismc
This is an automated email from the ASF dual-hosted git repository.

lewismc pushed a change to branch 2.x
in repository https://gitbox.apache.org/repos/asf/nutch.git.


from 4c72756  NUTCH-2520 Use default value for Accept-Charset if 
http.accept.charset is undefined
 add dcada64  NUTCH-2536 change GeneratorReducer.count field to non-static 
variable for easier SDK experience
 new d48c67e  Merge pull request #298 from benmvachon/NUTCH-2536

The 1 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:
 src/java/org/apache/nutch/crawl/GeneratorReducer.java | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)



[nutch] branch master updated (31819b7 -> 7cb7abd)

2018-03-27 Thread lewismc
This is an automated email from the ASF dual-hosted git repository.

lewismc pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git.


from 31819b7  NUTCH-2523 UpdateHostDB blocks usage of plugins 
unintentionally (contributed by Yossi Tamari)
 add b834b81  NUTCH-2516 Hadoop imports use wildcards
 add eff0b86  NUTCH-2516 Hadoop imports use wildcards
 add 303fd19  NUTCH-2516 Hadoop imports use wildcards
 new 7cb7abd  Merge pull request #295 from lewismc/NUTCH-2516

The 1 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:
 .gitignore |   6 +
 ivy/ivy-2.4.0.jar  | Bin 1282424 -> 0 bytes
 src/java/org/apache/nutch/crawl/CrawlDatum.java|  28 ++-
 src/java/org/apache/nutch/crawl/CrawlDb.java   |  30 ++-
 src/java/org/apache/nutch/crawl/CrawlDbFilter.java |   1 -
 src/java/org/apache/nutch/crawl/CrawlDbMerger.java |  15 +-
 src/java/org/apache/nutch/crawl/CrawlDbReader.java |   4 -
 .../org/apache/nutch/crawl/CrawlDbReducer.java |   8 +-
 .../org/apache/nutch/crawl/DeduplicationJob.java   |   4 -
 src/java/org/apache/nutch/crawl/Generator.java |  31 ++-
 src/java/org/apache/nutch/crawl/Inlink.java|   8 +-
 src/java/org/apache/nutch/crawl/Inlinks.java   |  19 +-
 src/java/org/apache/nutch/crawl/LinkDbFilter.java  |   2 -
 src/java/org/apache/nutch/crawl/LinkDbMerger.java  |   1 -
 src/java/org/apache/nutch/crawl/LinkDbReader.java  |  19 +-
 .../org/apache/nutch/crawl/SignatureFactory.java   |   1 -
 .../org/apache/nutch/crawl/URLPartitioner.java |   3 +-
 src/java/org/apache/nutch/fetcher/FetchNodeDb.java |   1 -
 src/java/org/apache/nutch/fetcher/Fetcher.java |  28 ++-
 .../apache/nutch/fetcher/FetcherOutputFormat.java  |   5 -
 .../org/apache/nutch/fetcher/FetcherThread.java|   2 -
 .../apache/nutch/fetcher/FetcherThreadEvent.java   |   1 -
 src/java/org/apache/nutch/fetcher/QueueFeeder.java |   1 -
 src/java/org/apache/nutch/hostdb/HostDatum.java|   2 -
 src/java/org/apache/nutch/hostdb/ReadHostDb.java   |   5 -
 src/java/org/apache/nutch/hostdb/UpdateHostDb.java |   6 -
 .../apache/nutch/hostdb/UpdateHostDbMapper.java|   4 -
 .../apache/nutch/hostdb/UpdateHostDbReducer.java   |   3 -
 src/java/org/apache/nutch/indexer/CleaningJob.java |   3 -
 src/java/org/apache/nutch/indexer/IndexWriter.java |   1 -
 .../org/apache/nutch/indexer/IndexWriters.java |   1 -
 .../org/apache/nutch/indexer/IndexerMapReduce.java |   5 -
 .../apache/nutch/indexer/IndexerOutputFormat.java  |   1 -
 .../org/apache/nutch/indexer/IndexingFilter.java   |   2 -
 .../org/apache/nutch/indexer/IndexingFilters.java  |   1 -
 .../nutch/indexer/IndexingFiltersChecker.java  |   1 -
 src/java/org/apache/nutch/indexer/IndexingJob.java |   1 -
 src/java/org/apache/nutch/indexer/NutchField.java  |  17 +-
 .../org/apache/nutch/metadata/CreativeCommons.java |   6 +-
 .../org/apache/nutch/metadata/HttpHeaders.java |  18 +-
 .../org/apache/nutch/net/URLExemptionFilter.java   |   3 +-
 src/java/org/apache/nutch/net/URLFilter.java   |   2 -
 .../org/apache/nutch/net/URLFilterChecker.java |   7 -
 .../org/apache/nutch/net/URLNormalizerChecker.java |   7 -
 .../org/apache/nutch/net/protocols/Response.java   |   2 -
 .../org/apache/nutch/parse/HtmlParseFilter.java|   3 -
 src/java/org/apache/nutch/parse/ParseData.java |  16 +-
 src/java/org/apache/nutch/parse/ParseImpl.java |   7 +-
 .../org/apache/nutch/parse/ParseOutputFormat.java  |  19 +-
 .../org/apache/nutch/parse/ParsePluginList.java|   1 -
 .../org/apache/nutch/parse/ParsePluginsReader.java |   4 -
 src/java/org/apache/nutch/parse/ParseSegment.java  |  39 ++--
 src/java/org/apache/nutch/parse/ParseText.java |  24 +-
 src/java/org/apache/nutch/parse/ParseUtil.java |   2 -
 src/java/org/apache/nutch/parse/Parser.java|   2 -
 src/java/org/apache/nutch/parse/ParserFactory.java |   4 -
 src/java/org/apache/nutch/protocol/Content.java|   3 -
 src/java/org/apache/nutch/protocol/Protocol.java   |   2 -
 .../org/apache/nutch/protocol/ProtocolFactory.java |   6 +-
 .../apache/nutch/protocol/RobotRulesParser.java|   3 -
 .../apache/nutch/scoring/webgraph/LinkDumper.java  |   4 -
 .../apache/nutch/scoring/webgraph/LinkRank.java|   2 -
 .../apache/nutch/scoring/webgraph/NodeDumper.java  |   2 -
 .../apache/nutch/scoring/webgraph/NodeReader.java  |   1 -
 .../nutch/scoring/webgraph/ScoreUpdater.java   |   2 -
 .../apache/nutch/scoring/webgraph/WebGraph.java|   2 -
 .../org/apache/nutch/segment/SegmentReader.java|   9 +-
 src/java/org/apache/nutch/service/NutchReader.java |   1 -
 .../org/apache/nutch/service/impl/NodeReader.java  | 

[nutch] branch master updated (0e28af6 -> 8bf139d)

2018-03-14 Thread lewismc
This is an automated email from the ASF dual-hosted git repository.

lewismc pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git.


from 0e28af6  fixed hdfs file checks in crawl script
 add dc516b7  NUTCH-2517 mergesegs corrupts segment data
 new 8bf139d  Merge pull request #293 from lewismc/NUTCH-2517

The 1 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:
 src/java/org/apache/nutch/crawl/LinkDb.java| 130 ++---
 .../org/apache/nutch/segment/SegmentMerger.java| 201 ++---
 2 files changed, 163 insertions(+), 168 deletions(-)



[nutch] 01/01: Merge pull request #293 from lewismc/NUTCH-2517

2018-03-14 Thread lewismc
This is an automated email from the ASF dual-hosted git repository.

lewismc pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git

commit 8bf139d76ac3d9c7a557fb297b0947b8bc1d6065
Merge: 0e28af6 dc516b7
Author: Lewis John McGibbney <lewis.mcgibb...@gmail.com>
AuthorDate: Wed Mar 14 08:36:00 2018 -0700

Merge pull request #293 from lewismc/NUTCH-2517

NUTCH-2517 mergesegs corrupts segment data

 src/java/org/apache/nutch/crawl/LinkDb.java| 130 ++---
 .../org/apache/nutch/segment/SegmentMerger.java| 201 ++---
 2 files changed, 163 insertions(+), 168 deletions(-)



[nutch] branch master updated (a2f637e -> 54510e5)

2018-02-27 Thread lewismc
This is an automated email from the ASF dual-hosted git repository.

lewismc pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git.


from a2f637e  Merge pull request #284 from YossiTamari/master
 add c93d908  NUTCH-2375 Upgrading nutch to use org.apache.hadoop.mapreduce
 add fe5bfb4  Merge branch 'master' into NUTCH-2375
 add 405682e  Merge branch 'NUTCH-2375' of 
https://github.com/Omkar20895/nutch into NUTCH-2375
 new 54510e5  Merge pull request #221 from Omkar20895/NUTCH-2375

The 1 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:
 src/java/org/apache/nutch/crawl/CrawlDb.java   |  48 +-
 src/java/org/apache/nutch/crawl/CrawlDbFilter.java |  38 +-
 src/java/org/apache/nutch/crawl/CrawlDbMerger.java |  46 +-
 src/java/org/apache/nutch/crawl/CrawlDbReader.java | 315 
 .../org/apache/nutch/crawl/CrawlDbReducer.java |  44 +-
 .../org/apache/nutch/crawl/DeduplicationJob.java   | 121 ++-
 src/java/org/apache/nutch/crawl/Generator.java | 873 +++--
 src/java/org/apache/nutch/crawl/LinkDb.java| 226 +++---
 src/java/org/apache/nutch/crawl/LinkDbFilter.java  |  30 +-
 src/java/org/apache/nutch/crawl/LinkDbMerger.java  |  90 +--
 src/java/org/apache/nutch/crawl/LinkDbReader.java  |  52 +-
 .../nutch/crawl/MimeAdaptiveFetchSchedule.java |   2 +-
 .../org/apache/nutch/crawl/URLPartitioner.java |  15 +-
 src/java/org/apache/nutch/fetcher/FetchNode.java   |   2 +-
 src/java/org/apache/nutch/fetcher/FetchNodeDb.java |   2 +-
 src/java/org/apache/nutch/fetcher/Fetcher.java | 576 +++---
 .../apache/nutch/fetcher/FetcherOutputFormat.java  |  70 +-
 .../org/apache/nutch/fetcher/FetcherThread.java| 118 +--
 src/java/org/apache/nutch/fetcher/QueueFeeder.java |  26 +-
 src/java/org/apache/nutch/hostdb/HostDatum.java|   2 +-
 src/java/org/apache/nutch/hostdb/ReadHostDb.java   |   5 +-
 .../org/apache/nutch/hostdb/ResolverThread.java|  37 +-
 src/java/org/apache/nutch/hostdb/UpdateHostDb.java |  56 +-
 .../apache/nutch/hostdb/UpdateHostDbMapper.java|  50 +-
 .../apache/nutch/hostdb/UpdateHostDbReducer.java   |  52 +-
 src/java/org/apache/nutch/indexer/CleaningJob.java |  76 +-
 src/java/org/apache/nutch/indexer/IndexWriter.java |   5 +-
 .../org/apache/nutch/indexer/IndexWriters.java |   6 +-
 .../org/apache/nutch/indexer/IndexerMapReduce.java | 497 ++--
 .../apache/nutch/indexer/IndexerOutputFormat.java  |  22 +-
 .../nutch/indexer/IndexingFiltersChecker.java  |   6 +-
 src/java/org/apache/nutch/indexer/IndexingJob.java |  49 +-
 .../org/apache/nutch/net/URLExemptionFilters.java  |   2 +-
 src/java/org/apache/nutch/parse/ParseCallable.java |   2 +-
 .../org/apache/nutch/parse/ParseOutputFormat.java  | 117 ++-
 src/java/org/apache/nutch/parse/ParseSegment.java  | 207 ++---
 .../apache/nutch/scoring/webgraph/LinkDumper.java  | 164 ++--
 .../apache/nutch/scoring/webgraph/LinkRank.java| 484 ++--
 .../apache/nutch/scoring/webgraph/NodeDumper.java  | 317 
 .../apache/nutch/scoring/webgraph/NodeReader.java  |   7 +-
 .../nutch/scoring/webgraph/ScoreUpdater.java   | 146 ++--
 .../apache/nutch/scoring/webgraph/WebGraph.java| 656 +---
 .../nutch/segment/ContentAsTextInputFormat.java|  50 +-
 .../org/apache/nutch/segment/SegmentChecker.java   |   2 +-
 .../org/apache/nutch/segment/SegmentMerger.java| 587 +++---
 src/java/org/apache/nutch/segment/SegmentPart.java |   2 +-
 .../org/apache/nutch/segment/SegmentReader.java| 183 +++--
 .../org/apache/nutch/service/impl/JobFactory.java  |   2 +-
 .../nutch/service/model/request/JobConfig.java |   2 +-
 src/java/org/apache/nutch/tools/Benchmark.java |  10 +-
 src/java/org/apache/nutch/tools/FreeGenerator.java | 179 +++--
 .../org/apache/nutch/tools/arc/ArcInputFormat.java |  26 +-
 .../apache/nutch/tools/arc/ArcRecordReader.java|  22 +-
 .../apache/nutch/tools/arc/ArcSegmentCreator.java  | 466 +--
 .../org/apache/nutch/tools/warc/WARCExporter.java  | 296 +++
 src/java/org/apache/nutch/util/JexlUtil.java   |   2 +-
 src/java/org/apache/nutch/util/NutchJob.java   |  17 +-
 src/java/org/apache/nutch/util/NutchTool.java  |   2 +-
 .../util/{NutchJob.java => SegmentReaderUtil.java} |  25 +-
 .../nutch/webui/client/model/ConnectionStatus.java |   2 +-
 .../pages/components/ColorEnumLabelBuilder.java|   2 +-
 .../webui/pages/components/CpmIteratorAdapter.java |   2 +-
 .../apache/nutch/indexer/geoip/package-info.java   |   2 +-
 .../indexer/links/TestLinksIndexingFilter.java |   2 +-
 .../test/org/apache/nutch/parse/TestOutlinks.java  |   2 +-
 .../cloudsearch/CloudSearchIndexWriter.java|   9 +-
 .../nutch/indexwriter/dumm

[nutch] 01/01: Merge pull request #221 from Omkar20895/NUTCH-2375

2018-02-27 Thread lewismc
This is an automated email from the ASF dual-hosted git repository.

lewismc pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git

commit 54510e503f7da7301a59f5f0e5bf4509b37d35b4
Merge: a2f637e 405682e
Author: Lewis John McGibbney <lewis.mcgibb...@gmail.com>
AuthorDate: Tue Feb 27 14:02:02 2018 -0800

Merge pull request #221 from Omkar20895/NUTCH-2375

NUTCH-2375 Upgrading nutch to use org.apache.hadoop.mapreduce

 src/java/org/apache/nutch/crawl/CrawlDb.java   |  48 +-
 src/java/org/apache/nutch/crawl/CrawlDbFilter.java |  38 +-
 src/java/org/apache/nutch/crawl/CrawlDbMerger.java |  46 +-
 src/java/org/apache/nutch/crawl/CrawlDbReader.java | 315 
 .../org/apache/nutch/crawl/CrawlDbReducer.java |  44 +-
 .../org/apache/nutch/crawl/DeduplicationJob.java   | 121 ++-
 src/java/org/apache/nutch/crawl/Generator.java | 873 +++--
 src/java/org/apache/nutch/crawl/LinkDb.java| 226 +++---
 src/java/org/apache/nutch/crawl/LinkDbFilter.java  |  30 +-
 src/java/org/apache/nutch/crawl/LinkDbMerger.java  |  90 +--
 src/java/org/apache/nutch/crawl/LinkDbReader.java  |  52 +-
 .../nutch/crawl/MimeAdaptiveFetchSchedule.java |   2 +-
 .../org/apache/nutch/crawl/URLPartitioner.java |  15 +-
 src/java/org/apache/nutch/fetcher/FetchNode.java   |   2 +-
 src/java/org/apache/nutch/fetcher/FetchNodeDb.java |   2 +-
 src/java/org/apache/nutch/fetcher/Fetcher.java | 576 +++---
 .../apache/nutch/fetcher/FetcherOutputFormat.java  |  70 +-
 .../org/apache/nutch/fetcher/FetcherThread.java| 118 +--
 src/java/org/apache/nutch/fetcher/QueueFeeder.java |  26 +-
 src/java/org/apache/nutch/hostdb/HostDatum.java|   2 +-
 src/java/org/apache/nutch/hostdb/ReadHostDb.java   |   5 +-
 .../org/apache/nutch/hostdb/ResolverThread.java|  37 +-
 src/java/org/apache/nutch/hostdb/UpdateHostDb.java |  56 +-
 .../apache/nutch/hostdb/UpdateHostDbMapper.java|  50 +-
 .../apache/nutch/hostdb/UpdateHostDbReducer.java   |  52 +-
 src/java/org/apache/nutch/indexer/CleaningJob.java |  76 +-
 src/java/org/apache/nutch/indexer/IndexWriter.java |   5 +-
 .../org/apache/nutch/indexer/IndexWriters.java |   6 +-
 .../org/apache/nutch/indexer/IndexerMapReduce.java | 497 ++--
 .../apache/nutch/indexer/IndexerOutputFormat.java  |  22 +-
 .../nutch/indexer/IndexingFiltersChecker.java  |   6 +-
 src/java/org/apache/nutch/indexer/IndexingJob.java |  49 +-
 .../org/apache/nutch/net/URLExemptionFilters.java  |   2 +-
 src/java/org/apache/nutch/parse/ParseCallable.java |   2 +-
 .../org/apache/nutch/parse/ParseOutputFormat.java  | 117 ++-
 src/java/org/apache/nutch/parse/ParseSegment.java  | 207 ++---
 .../apache/nutch/scoring/webgraph/LinkDumper.java  | 164 ++--
 .../apache/nutch/scoring/webgraph/LinkRank.java| 484 ++--
 .../apache/nutch/scoring/webgraph/NodeDumper.java  | 317 
 .../apache/nutch/scoring/webgraph/NodeReader.java  |   7 +-
 .../nutch/scoring/webgraph/ScoreUpdater.java   | 146 ++--
 .../apache/nutch/scoring/webgraph/WebGraph.java| 656 +---
 .../nutch/segment/ContentAsTextInputFormat.java|  50 +-
 .../org/apache/nutch/segment/SegmentChecker.java   |   2 +-
 .../org/apache/nutch/segment/SegmentMerger.java| 587 +++---
 src/java/org/apache/nutch/segment/SegmentPart.java |   2 +-
 .../org/apache/nutch/segment/SegmentReader.java| 183 +++--
 .../org/apache/nutch/service/impl/JobFactory.java  |   2 +-
 .../nutch/service/model/request/JobConfig.java |   2 +-
 src/java/org/apache/nutch/tools/Benchmark.java |  10 +-
 src/java/org/apache/nutch/tools/FreeGenerator.java | 179 +++--
 .../org/apache/nutch/tools/arc/ArcInputFormat.java |  26 +-
 .../apache/nutch/tools/arc/ArcRecordReader.java|  22 +-
 .../apache/nutch/tools/arc/ArcSegmentCreator.java  | 466 +--
 .../org/apache/nutch/tools/warc/WARCExporter.java  | 296 +++
 src/java/org/apache/nutch/util/JexlUtil.java   |   2 +-
 src/java/org/apache/nutch/util/NutchJob.java   |  17 +-
 src/java/org/apache/nutch/util/NutchTool.java  |   2 +-
 .../util/{NutchJob.java => SegmentReaderUtil.java} |  25 +-
 .../nutch/webui/client/model/ConnectionStatus.java |   2 +-
 .../pages/components/ColorEnumLabelBuilder.java|   2 +-
 .../webui/pages/components/CpmIteratorAdapter.java |   2 +-
 .../apache/nutch/indexer/geoip/package-info.java   |   2 +-
 .../indexer/links/TestLinksIndexingFilter.java |   2 +-
 .../test/org/apache/nutch/parse/TestOutlinks.java  |   2 +-
 .../cloudsearch/CloudSearchIndexWriter.java|   9 +-
 .../nutch/indexwriter/dummy/DummyIndexWriter.java  |   5 +-
 .../elasticrest/ElasticRestIndexWriter.java|  32 +-
 .../indexwriter/elastic/ElasticConstants.java  |   2 +-
 .../indexwriter/elastic/ElasticIndexWriter.java|  17 +-
 .../elastic/TestElasticIndexWriter.java|  14 +-
 .../indexwriter/rabbit/RabbitIndexWriter.java  |   3 +-
 

[nutch] branch master updated (75d0166 -> a2f637e)

2018-02-07 Thread lewismc
This is an automated email from the ASF dual-hosted git repository.

lewismc pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git.


from 75d0166  Merge pull request #283 from smartive/NUTCH-2508
 add f18b327  NUTCH-2489: Dependency collision with lucene-analyzers-common 
in scoring-similarity plugin
 new a2f637e  Merge pull request #284 from YossiTamari/master

The 1 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:
 src/plugin/scoring-similarity/ivy.xml | 2 +-
 src/plugin/scoring-similarity/plugin.xml  | 4 ++--
 .../org/apache/nutch/scoring/similarity/util/LuceneAnalyzerUtil.java  | 2 +-
 .../org/apache/nutch/scoring/similarity/util/LuceneTokenizer.java | 2 +-
 4 files changed, 5 insertions(+), 5 deletions(-)



[nutch] branch master updated (2b66cda -> 75d0166)

2018-01-31 Thread lewismc
This is an automated email from the ASF dual-hosted git repository.

lewismc pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git.


from 2b66cda  NUTCH-2466
 add 4f82d8f  fix for NUTCH-2508 contributed by mfeltscher
 new 75d0166  Merge pull request #283 from smartive/NUTCH-2508

The 1 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:
 conf/nutch-default.xml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)


