(nutch) branch master updated: NUTCH-3041 Address confusing logging in o.a.n.net.URLExemptionFilters (#813)
This is an automated email from the ASF dual-hosted git repository. lewismc pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git The following commit(s) were added to refs/heads/master by this push: new 8abc78a65 NUTCH-3041 Address confusing logging in o.a.n.net.URLExemptionFilters (#813) 8abc78a65 is described below commit 8abc78a653eb7970def10031d732fb4c7aa0fb6f Author: Lewis John McGibbney AuthorDate: Wed May 15 20:07:15 2024 -0700 NUTCH-3041 Address confusing logging in o.a.n.net.URLExemptionFilters (#813) --- .../org/apache/nutch/net/URLExemptionFilters.java | 7 +-- src/plugin/urlfilter-ignoreexempt/README.md| 18 +++- .../urlfilter/ignoreexempt/ExemptionUrlFilter.java | 24 +- 3 files changed, 26 insertions(+), 23 deletions(-) diff --git a/src/java/org/apache/nutch/net/URLExemptionFilters.java b/src/java/org/apache/nutch/net/URLExemptionFilters.java index c730228e4..ed401053e 100644 --- a/src/java/org/apache/nutch/net/URLExemptionFilters.java +++ b/src/java/org/apache/nutch/net/URLExemptionFilters.java @@ -24,6 +24,7 @@ import org.slf4j.Logger; import org.slf4j.LoggerFactory; import java.lang.invoke.MethodHandles; +import java.util.Arrays; /** Creates and caches {@link URLExemptionFilter} implementing plugins. 
*/ public class URLExemptionFilters { @@ -44,8 +45,10 @@ public class URLExemptionFilters { throw new IllegalStateException(e); } } -LOG.info("Found {} extensions at point:'{}'", filters.length, -URLExemptionFilter.X_POINT_ID); +if (filters.length > 0) { + LOG.info("Found {} URLExemptionFilter implementations: '{}'", filters.length, +Arrays.toString(filters)); +} } /** diff --git a/src/plugin/urlfilter-ignoreexempt/README.md b/src/plugin/urlfilter-ignoreexempt/README.md index a8f932e75..374b29abd 100644 --- a/src/plugin/urlfilter-ignoreexempt/README.md +++ b/src/plugin/urlfilter-ignoreexempt/README.md @@ -17,8 +17,8 @@ urlfilter-ignoreexempt == - This plugin allows certain urls to be exempted when the external links are configured to be ignored. - This is useful when focused crawl is setup but some resources like static files are linked from CDNs (external domains). +This plugin allows certain urls to be exempted when the external links are configured to be ignored. +This is useful when focused crawl is setup but some resources like static files are linked from CDNs (external domains). # How to enable ? Add `urlfilter-ignoreexempt` value to `plugin.includes` property @@ -36,25 +36,21 @@ open `conf/db-ignore-external-exemptions.txt` and add the regex rules. ## Format : The format is same same as `regex-urlfilter.txt`. - Each non-comment, non-blank line contains a regular expression - prefixed by '+' or '-'. The first matching pattern in the file - determines whether a URL is exempted or ignored. If no pattern - matches, the URL is ignored. - +Each non-comment, non-blank line contains a regular expression +prefixed by '+' or '-'. The first matching pattern in the file +determines whether a URL is exempted or ignored. If no pattern +matches, the URL is ignored. 
## Example : - To exempt urls ending with image extensions, use this rule +To exempt urls ending with image extensions, use this rule `+(?i)\.(jpg|png|gif)$` - - ## Testing the Rules : After enabling the plugin and adding your rules to `conf/db-ignore-external-exemptions.txt`, run: `bin/nutch plugin urlfilter-ignoreexempt org.apache.nutch.urlfilter.ignoreexempt.ExemptionUrlFilter http://yoururl.here` - This should print `true` for urls which are accepted by configured rules. \ No newline at end of file diff --git a/src/plugin/urlfilter-ignoreexempt/src/java/org/apache/nutch/urlfilter/ignoreexempt/ExemptionUrlFilter.java b/src/plugin/urlfilter-ignoreexempt/src/java/org/apache/nutch/urlfilter/ignoreexempt/ExemptionUrlFilter.java index 96ca9b4ac..8028e3672 100644 --- a/src/plugin/urlfilter-ignoreexempt/src/java/org/apache/nutch/urlfilter/ignoreexempt/ExemptionUrlFilter.java +++ b/src/plugin/urlfilter-ignoreexempt/src/java/org/apache/nutch/urlfilter/ignoreexempt/ExemptionUrlFilter.java @@ -25,21 +25,25 @@ import java.io.Reader; import java.util.regex.Pattern; import java.util.List; - /** - * This implementation of {@link org.apache.nutch.net.URLExemptionFilter} uses regex configuration - * to check if URL is eligible for exemption from 'db.ignore.external'. - * When this filter is enabled, the external urls will be checked against configured sequence of regex rules. + * This implementation of {@link org.apache.nutch.net.URLExemptionFilter} + * uses regex configuration to check if URL is eligible for exemption from + * the db.ignore.external.links configuration property. + * When this filter is enabled, the external urls will be checked + * against confi
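The rule format described in the README above (a '+' or '-' prefix, first matching pattern wins, no match means the URL is ignored) can be sketched in a few lines of Java. This is a minimal illustration, not the plugin's code: the class name `ExemptionRulesSketch`, the method `isExempted`, and the example URLs are hypothetical, and treating lines that start with `#` as comments is an assumption carried over from the `regex-urlfilter.txt` convention.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical sketch of first-match '+'/'-' rule evaluation; not the plugin's own class. */
final class ExemptionRulesSketch {

  /** One parsed rule: '+' means exempt the URL, '-' means ignore it. */
  private static final class Rule {
    final boolean exempt;
    final Pattern pattern;

    Rule(boolean exempt, Pattern pattern) {
      this.exempt = exempt;
      this.pattern = pattern;
    }
  }

  private final List<Rule> rules = new ArrayList<>();

  ExemptionRulesSketch(List<String> lines) {
    for (String raw : lines) {
      String line = raw.trim();
      // Assumption: comment lines start with '#', as in regex-urlfilter.txt.
      if (line.isEmpty() || line.startsWith("#")) {
        continue;
      }
      char sign = line.charAt(0);
      if (sign != '+' && sign != '-') {
        throw new IllegalArgumentException("Rule must start with '+' or '-': " + line);
      }
      rules.add(new Rule(sign == '+', Pattern.compile(line.substring(1))));
    }
  }

  /** The first rule whose pattern is found in the URL decides; no match means not exempted. */
  boolean isExempted(String url) {
    for (Rule rule : rules) {
      if (rule.pattern.matcher(url).find()) {
        return rule.exempt;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    ExemptionRulesSketch rules = new ExemptionRulesSketch(Arrays.asList(
        "# exempt images served from external hosts",
        "+(?i)\\.(jpg|png|gif)$"));
    System.out.println(rules.isExempted("http://cdn.example.com/logo.PNG"));  // prints true
    System.out.println(rules.isExempted("http://cdn.example.com/page.html")); // prints false
  }
}
```

Because evaluation stops at the first match, rule order matters: a narrow '-' rule placed before a broad '+' rule carves out an exception, while the reverse order would never reach the '-' rule.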
(nutch) branch master updated: NUTCH-3054 Address deprecation of Node16 for all GitHub Actions (#817)
This is an automated email from the ASF dual-hosted git repository.

lewismc pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git

The following commit(s) were added to refs/heads/master by this push:
     new 7ac3ce28e NUTCH-3054 Address deprecation of Node16 for all GitHub Actions (#817)
7ac3ce28e is described below

commit 7ac3ce28e065fb5160f96ce7bce1ec840f87d0dc
Author: Lewis John McGibbney
AuthorDate: Tue Apr 30 07:35:39 2024 -0700

    NUTCH-3054 Address deprecation of Node16 for all GitHub Actions (#817)
---
 .github/workflows/master-build.yml | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/.github/workflows/master-build.yml b/.github/workflows/master-build.yml
index e0af58df0..db24168b9 100644
--- a/.github/workflows/master-build.yml
+++ b/.github/workflows/master-build.yml
@@ -30,9 +30,9 @@ jobs:
         os: [ubuntu-latest]
     runs-on: ${{ matrix.os }}
     steps:
-      - uses: actions/checkout@v4
+      - uses: actions/checkout@v4.1.4
       - name: Set up JDK ${{ matrix.java }}
-        uses: actions/setup-java@v3
+        uses: actions/setup-java@v4.2.1
         with:
           java-version: ${{ matrix.java }}
           distribution: 'temurin'
@@ -45,9 +45,9 @@ jobs:
         os: [ubuntu-latest]
     runs-on: ${{ matrix.os }}
     steps:
-      - uses: actions/checkout@v4
+      - uses: actions/checkout@v4.1.4
       - name: Set up JDK ${{ matrix.java }}
-        uses: actions/setup-java@v3
+        uses: actions/setup-java@v4.2.1
         with:
           java-version: ${{ matrix.java }}
           distribution: 'temurin'
@@ -68,9 +68,9 @@ jobs:
         os: [ubuntu-latest, macos-latest]
     runs-on: ${{ matrix.os }}
     steps:
-      - uses: actions/checkout@v4
+      - uses: actions/checkout@v4.1.4
       - name: Set up JDK ${{ matrix.java }}
-        uses: actions/setup-java@v3
+        uses: actions/setup-java@v4.2.1
         with:
           java-version: ${{ matrix.java }}
           distribution: 'temurin'
(nutch) branch master updated: Boostrap Nutch 1.21 development drive.
This is an automated email from the ASF dual-hosted git repository.

lewismc pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git

The following commit(s) were added to refs/heads/master by this push:
     new 817af69d4 Boostrap Nutch 1.21 development drive.
817af69d4 is described below

commit 817af69d451609d725fc7fb040bc32f1fa0052bc
Author: Lewis John McGibbney
AuthorDate: Sun Apr 28 17:34:10 2024 -0700

    Boostrap Nutch 1.21 development drive.
---
 conf/nutch-default.xml | 2 +-
 default.properties     | 4 ++--
 src/bin/nutch          | 2 +-
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/conf/nutch-default.xml b/conf/nutch-default.xml
index edcaeb569..c00d9776b 100644
--- a/conf/nutch-default.xml
+++ b/conf/nutch-default.xml
@@ -203,7 +203,7 @@
   <name>http.agent.version</name>
-  <value>Nutch-1.20-SNAPSHOT</value>
+  <value>Nutch-1.21-SNAPSHOT</value>
   <description>A version string to advertise in the User-Agent header.</description>

diff --git a/default.properties b/default.properties
index 385e53e57..47041f465 100644
--- a/default.properties
+++ b/default.properties
@@ -14,9 +14,9 @@
 # limitations under the License.
 
 name=apache-nutch
-version=1.20-SNAPSHOT
+version=1.21-SNAPSHOT
 final.name=${name}-${version}
-year=2022
+year=2024
 
 basedir = ./
 src.dir = ./src/java

diff --git a/src/bin/nutch b/src/bin/nutch
index 561c79e77..b3e0a256b 100755
--- a/src/bin/nutch
+++ b/src/bin/nutch
@@ -61,7 +61,7 @@ done
 
 # if no args specified, show usage
 if [ $# = 0 ]; then
-  echo "nutch 1.20-SNAPSHOT"
+  echo "nutch 1.21-SNAPSHOT"
   echo "Usage: nutch COMMAND [-Dproperty=value]... [command-specific args]..."
   echo "where COMMAND is one of:"
   echo "  readdb            read / dump crawl db"
(nutch) branch master updated: Add GitHub CI badge to README
This is an automated email from the ASF dual-hosted git repository.

lewismc pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git

The following commit(s) were added to refs/heads/master by this push:
     new c0b94614c Add GitHub CI badge to README
c0b94614c is described below

commit c0b94614ccf88cf1c55980bebd93bec357a31cac
Author: Lewis John McGibbney
AuthorDate: Sun Apr 28 10:23:32 2024 -0700

    Add GitHub CI badge to README
---
 README.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/README.md b/README.md
index e05f56ccd..28acfe8c7 100644
--- a/README.md
+++ b/README.md
@@ -1,6 +1,8 @@
 Apache Nutch README
 ===
 
+[![master pull request ci](https://github.com/apache/nutch/actions/workflows/master-build.yml/badge.svg)](https://github.com/apache/nutch/actions/workflows/master-build.yml)
+
 <img src="https://nutch.apache.org/assets/img/nutch_logo_tm.png" align="right" width="300" />
 
 For the latest information about Nutch, please visit our website at:
svn commit: r68753 - in /release/nutch: 1.19/ 1.20/apache-nutch-1.20-bin.tar.gz.sha512 1.20/apache-nutch-1.20-bin.zip.sha512 1.20/apache-nutch-1.20-src.tar.gz.sha512 1.20/apache-nutch-1.20-src.zip.sha
Author: lewismc Date: Thu Apr 25 04:27:39 2024 New Revision: 68753 Log: Cleanup older Nutch release distributions and add sha512sums for 1.20 release. Added: release/nutch/1.20/apache-nutch-1.20-bin.tar.gz.sha512 release/nutch/1.20/apache-nutch-1.20-bin.zip.sha512 release/nutch/1.20/apache-nutch-1.20-src.tar.gz.sha512 release/nutch/1.20/apache-nutch-1.20-src.zip.sha512 Removed: release/nutch/1.19/ release/nutch/2.4/ Added: release/nutch/1.20/apache-nutch-1.20-bin.tar.gz.sha512 == --- release/nutch/1.20/apache-nutch-1.20-bin.tar.gz.sha512 (added) +++ release/nutch/1.20/apache-nutch-1.20-bin.tar.gz.sha512 Thu Apr 25 04:27:39 2024 @@ -0,0 +1 @@ +871dc0a8cbfc61daf84ea08ce6987ffa4cfcec4e24d388ffeffd49e983426ba8dd218bc2cb4eba45e65cfe0e43ae72fad99e70850b83154ca3e86803c6bd1c01 apache-nutch-1.20-bin.tar.gz Added: release/nutch/1.20/apache-nutch-1.20-bin.zip.sha512 == --- release/nutch/1.20/apache-nutch-1.20-bin.zip.sha512 (added) +++ release/nutch/1.20/apache-nutch-1.20-bin.zip.sha512 Thu Apr 25 04:27:39 2024 @@ -0,0 +1 @@ +b37761be4a5464d60ef97c2515944757a33e093d844415c6f0f1f2e0a81076e473cf58879f1e58d499c169b39d74f10a2936eb24d3250bc216ecf167bdaa4f8e apache-nutch-1.20-bin.zip Added: release/nutch/1.20/apache-nutch-1.20-src.tar.gz.sha512 == --- release/nutch/1.20/apache-nutch-1.20-src.tar.gz.sha512 (added) +++ release/nutch/1.20/apache-nutch-1.20-src.tar.gz.sha512 Thu Apr 25 04:27:39 2024 @@ -0,0 +1 @@ +dfd70c95f6eba5a9c843639433f77c0651e12d9075541330fa5d159b4698192a968d670ea14275a6560707ac22d79ab2bcbfe339ce7d6f51a2f52d90209e5de3 apache-nutch-1.20-src.tar.gz Added: release/nutch/1.20/apache-nutch-1.20-src.zip.sha512 == --- release/nutch/1.20/apache-nutch-1.20-src.zip.sha512 (added) +++ release/nutch/1.20/apache-nutch-1.20-src.zip.sha512 Thu Apr 25 04:27:39 2024 @@ -0,0 +1 @@ +c4407accbcfc1bf67ea0f7121d3d726988c31e2bb90631ec892caf98aeebc946a2c72b303d42fdef020206da4509437ea3dbb5761e46fe541b81f39d4923c5ed apache-nutch-1.20-src.zip
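Each `.sha512` file added in the commit above holds a single line: a hex SHA-512 digest followed by the artifact name. A downloaded artifact could be checked against such a line roughly as sketched below. This is only an illustration under stated assumptions: the class `Sha512CheckSketch` and its methods are hypothetical, and the demo hashes a short in-memory byte array rather than a real release tarball, which would more sensibly be streamed from disk.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical sketch: compare data against a "<hex digest> <file name>" checksum line. */
final class Sha512CheckSketch {

  /** Returns the lower-case hex SHA-512 digest of the given bytes. */
  static String sha512Hex(byte[] data) {
    try {
      MessageDigest md = MessageDigest.getInstance("SHA-512");
      StringBuilder hex = new StringBuilder();
      for (byte b : md.digest(data)) {
        hex.append(String.format("%02x", b));
      }
      return hex.toString();
    } catch (NoSuchAlgorithmException e) {
      // SHA-512 is a required algorithm on every Java platform, so this is not expected.
      throw new IllegalStateException(e);
    }
  }

  /** Takes the first whitespace-separated token of the line as the expected digest. */
  static boolean matches(String checksumLine, byte[] data) {
    String expected = checksumLine.trim().split("\\s+")[0];
    return expected.equalsIgnoreCase(sha512Hex(data));
  }

  public static void main(String[] args) {
    // Stand-in bytes; a real check would read the downloaded artifact instead.
    byte[] payload = "stand-in for artifact bytes".getBytes(StandardCharsets.UTF_8);
    String line = sha512Hex(payload) + "  example-artifact.tar.gz";
    System.out.println(matches(line, payload));     // same bytes as the digest was built from
    System.out.println(matches(line, new byte[0])); // different bytes
  }
}
```

A checksum only detects corruption or a mismatched download; verifying who produced the artifact relies on the detached `.asc` signatures staged alongside the release.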
svn commit: r68752 - /dev/nutch/1.20/ /release/nutch/1.20/
Author: lewismc Date: Thu Apr 25 02:23:27 2024 New Revision: 68752 Log: Release Apache Nutch 1.20 Added: release/nutch/1.20/ - copied from r68751, dev/nutch/1.20/ Removed: dev/nutch/1.20/
svn commit: r68410 [1/3] - /dev/nutch/1.20/
Author: lewismc Date: Tue Apr 9 20:44:40 2024 New Revision: 68410 Log: Stage Apache Nutch 1.20 RC#1 Added: dev/nutch/1.20/ dev/nutch/1.20/CHANGES.md dev/nutch/1.20/apache-nutch-1.20-bin.tar.gz (with props) dev/nutch/1.20/apache-nutch-1.20-bin.tar.gz.asc dev/nutch/1.20/apache-nutch-1.20-bin.zip (with props) dev/nutch/1.20/apache-nutch-1.20-bin.zip.asc dev/nutch/1.20/apache-nutch-1.20-src.tar.gz (with props) dev/nutch/1.20/apache-nutch-1.20-src.tar.gz.asc dev/nutch/1.20/apache-nutch-1.20-src.zip (with props) dev/nutch/1.20/apache-nutch-1.20-src.zip.asc
svn commit: r68410 [2/3] - /dev/nutch/1.20/
(snagel) + +* NUTCH-2177 Generator produces only one partition even in distributed mode (jnioche, snagel) + +* NUTCH-2158 Upgrade to Tika 1.11 (jnioche, snagel) + +* NUTCH-2175 Typos in property descriptions in nutch-default.xml (Roannel Fernández Hernández via snagel) + +* NUTCH-2069 Ignore external links based on domain (jnioche) + +* NUTCH-2173 String.join in FileDumper breaks the build (joyce) + +* NUTCH-2166 Add reverse URL format to dump tool (joyce) + +* NUTCH-2157 Addressing Miredot REST API Warnings (Sujen Shah) + +* NUTCH-2165 FileDumper Util hard codes part-# folder name (joyce) + +* NUTCH-2167 Backport TableUtil from 2.x for URL reversing (joyce) + +* NUTCH-2160 Upgrade Selenium Java to 2.48.2 (lewismc, kwhitehall) + +* NUTCH-2120 Remove MapWritable from trunk codebase (lewismc) + +* NUTCH-1911 Improve DomainStatistics tool command line parsing (joyce) + +* NUTCH-2064 URLNormalizer basic to encode reserved chars and decode non-reserved chars (markus, snagel) + +* NUTCH-2159 Ensure that all WebApp files are copied into generated artifacts for 1.X Webapp (lewismc) + +* NUTCH-2154 Nutch REST API (DB) suffering NullPointerException (Aron Ahmadia, Sujen Shah via mattmann) +
svn commit: r68410 [3/3] - /dev/nutch/1.20/
Added: dev/nutch/1.20/apache-nutch-1.20-bin.tar.gz == Binary file - no diff available. Propchange: dev/nutch/1.20/apache-nutch-1.20-bin.tar.gz -- svn:mime-type = application/octet-stream Added: dev/nutch/1.20/apache-nutch-1.20-bin.tar.gz.asc == --- dev/nutch/1.20/apache-nutch-1.20-bin.tar.gz.asc (added) +++ dev/nutch/1.20/apache-nutch-1.20-bin.tar.gz.asc Tue Apr 9 20:44:40 2024 @@ -0,0 +1,16 @@ +-BEGIN PGP SIGNATURE- + +iQIzBAABCAAdFiEE23tRmRIcCKXI9AUrOkcX8Ei66/YFAmYVpOQACgkQOkcX8Ei6 +6/buExAAwPh4uHBMGPvVUBLztSm5Ze+ZeRjHsxARVmiglyFUCKo9n1ZySHTaoqlW +3f1I7c79dqrVZyqKMY9O5BjdA5K0w7scz3klHNOdrUc5Zal8GSY52sbOXq+CLka0 +fEYz3H3BMfB1eDn8F+dtFcYgfKqatVf+sFbvLdzfeorLzURZha/07WsGiXAtc629 +dOuNb9mweE5+BlEaeIm3ypYww294KZEvtQstouuvdal86Gm94KCenVb989CofQLb +RHamuxjmVDOtb22G+PqCEFfPWZ3HSz9eOqzqn133glR88soWwG468MxzLAJZXpDU +uB05ENvozkcIngj/emSZFy7Y1sY81VH0ErLxbxZDCIssxpVnOwI6N+5Un00T/nMz +VbUeXv1Zq9XY2SHDZr9AP8wiWre4ae5wp2NAMVD2zlcTVo66jbDEiNSCzKmK/pPe +gdexcS47lXQjCCYYe6rnUO8T5wEAeVn2Ctp+1mdjfDamN7liNExzvPtoUg07uDyx +TM48F+5Es1c9wYC3nVyUvqadfKWFnCqfPIPogEeNTH5mwWTAtaXCcPcib+GxoCd+ +k5x5BEmB6wyQbmTKLjSVdDI6DL+suO4MtlIw1/2yHnj4uMPnAvABnG8uBKp2sCMc +3GlQWJ5FiadkXASf6bbCv5+2iQof1BhRGJAu5PvYjRGEASG3IhM= +=dpeR +-END PGP SIGNATURE- Added: dev/nutch/1.20/apache-nutch-1.20-bin.zip == Binary file - no diff available. 
Propchange: dev/nutch/1.20/apache-nutch-1.20-bin.zip -- svn:mime-type = application/octet-stream Added: dev/nutch/1.20/apache-nutch-1.20-bin.zip.asc == --- dev/nutch/1.20/apache-nutch-1.20-bin.zip.asc (added) +++ dev/nutch/1.20/apache-nutch-1.20-bin.zip.asc Tue Apr 9 20:44:40 2024 @@ -0,0 +1,16 @@ +-BEGIN PGP SIGNATURE- + +iQIzBAABCAAdFiEE23tRmRIcCKXI9AUrOkcX8Ei66/YFAmYVpUwACgkQOkcX8Ei6 +6/aj4RAAqeXW9QsddsFuxVu2el37aZhV4HOsGsCX66G/wxz5nj5s34O41IKxTPrv +SJ0XRoekQ304uGYziAzDtDQUyXfAFo7gpF3w5TgK+5f8Mz8piPiW80uIMZYaUgXV +kAr6dYlbLPtcbyzspxCBHFZlHPf0MC6YtnaHPFq5B9LBjLl3nE+u1HkCUlHjWm84 +dQqijPyaiFyYGhsuU4/xaAJcgluUNcQlmAcY6125vOtMGKJqHdTVU/rZvJ30Ym0V +/k92t6+CgU4y8a/JyOToNFRD0f+3aGGNQUXKZIvAenzNIugv5wlubxF/CRht+J5L +0bU48GcZjboNknKBc8tMewBwhHpAGAL5O5AS92j8naWUrZ1Wkur1y3EL7wiS39xJ +fI0BRrTNcVapOoUnoQuXtxpoqRjiBmC2sEP9nH9T5dHNZaDljOielB4gi+1SGYYR +DXiIpe6i/bMjMEO14At3ACwIoXknLo/gPQKUaIGQUTb+rlrFbZWVByZvcO826Az6 +0eEllycEzdvLpn0wv03zJhz9KwzJJCFJ4jgip/LIN5UXFHhUjzWykdJ2HUxHXq3v +1zjee9o3/K0UqUn07d/rIG3pNdteja4PDo0AmLt2l/B8Pfi0pnZj9LjbL5DIWcNp +oe41Ew6RFL7hjRZV2HwwBSmYCHNUSoL5HCR9dk10PcQFrH6phW0= +=nfnj +-END PGP SIGNATURE- Added: dev/nutch/1.20/apache-nutch-1.20-src.tar.gz == Binary file - no diff available. 
Propchange: dev/nutch/1.20/apache-nutch-1.20-src.tar.gz -- svn:mime-type = application/octet-stream Added: dev/nutch/1.20/apache-nutch-1.20-src.tar.gz.asc == --- dev/nutch/1.20/apache-nutch-1.20-src.tar.gz.asc (added) +++ dev/nutch/1.20/apache-nutch-1.20-src.tar.gz.asc Tue Apr 9 20:44:40 2024 @@ -0,0 +1,16 @@ +-BEGIN PGP SIGNATURE- + +iQIzBAABCAAdFiEE23tRmRIcCKXI9AUrOkcX8Ei66/YFAmYVpPwACgkQOkcX8Ei6 +6/ZUGhAAjocHBJYQynpMuU+Geai8TC2sVBGUt33VuDPG5fHVnq5Y/QiwK3B/AL0u +DtQdcajwnym3QMYBq0ZzzjOqXtE0B0Awwsz14KQYt+43AMpakLsVXBysZDXOTTcm +yrSc3IJEYvxlDQg0DA9uU4qpw5AHcEP3gzQ5tqA8X9V0EWejf82+KRjpJmKwJi1j +hS1rIdY0cCd15Ibo+jCf7PMSWZqYcEUdivy9+h1Zm+hV5mv49TMm4Js+fsNQrFyh +2dS5EZSvommodgP4hjKCpW7EkNRcl20ZmlVntLNhULTEXDd8CCpweg/7iSNo0hD/ +MWS2YMtY2zf2lnid217YNhSG1a2LprZ3sqmMtEcM0/F8PsOrA1p1klsuTz6+S2FO +ei89JdVQvOJbh6PdeaNkQqBTnc06seNQLTF+6iLtCPVQ3mojFJhqgnaMWP3W20A+ +ZElNLRe0Jw//5aX19YZilRoxAwA3aAxXSXIeNk9TukiRPOqvevxORDoXy3INosYj +/8HrSESOXsZyCIyOQzHExYNDQA/SkH8BisxY9aVDDmJyaKTXgWAaraLVn1+/6thX +zGhT3M349+bSrfR4BiMO7Cg3r0VcMgUkcfIUPfZtpLtOIV9bs+rGrxWlujor1vC6 +eS3hfSjMbQHLR3UuLMFRhWIAiunXAMHqnrRwWK20vOy5LiJo70I= +=lrhO +-END PGP SIGNATURE- Added: dev/nutch/1.20/apache-nutch-1.20-src.zip == Binary file - no diff available. Propchange: dev/nutch/1.20/apache-nutch-1.20-src.zip -- svn:mime-type = application/octet-stream Added: dev/nutch/1.20/apache-nutch-1.20-src.zip.asc == --- dev/nutch/1.20/apache-nutch-1.20-src.zip.asc (added) +++
(nutch) annotated tag release-1.20 updated (a2cb6aa5d -> 6510cb241)
This is an automated email from the ASF dual-hosted git repository. lewismc pushed a change to annotated tag release-1.20 in repository https://gitbox.apache.org/repos/asf/nutch.git *** WARNING: tag release-1.20 was modified! *** from a2cb6aa5d (commit) to 6510cb241 (tag) tagging a2cb6aa5d3e90b7249e47323f2fa4cbf2aa9fa27 (commit) replaces release-1.13 by Lewis John McGibbney on Tue Apr 9 09:44:29 2024 -0700 - Log - Apache Nutch 1.20 RC#1 Tag --- No new revisions were added by this update. Summary of changes:
(nutch) branch branch-1.20 updated: Prepare Nutch 1.20 release candidate
This is an automated email from the ASF dual-hosted git repository. lewismc pushed a commit to branch branch-1.20 in repository https://gitbox.apache.org/repos/asf/nutch.git The following commit(s) were added to refs/heads/branch-1.20 by this push: new a2cb6aa5d Prepare Nutch 1.20 release candidate a2cb6aa5d is described below commit a2cb6aa5d3e90b7249e47323f2fa4cbf2aa9fa27 Author: Lewis John McGibbney AuthorDate: Tue Apr 9 09:23:24 2024 -0700 Prepare Nutch 1.20 release candidate --- ivy/mvn.template | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/ivy/mvn.template b/ivy/mvn.template index fafc79f83..43ecfbd6a 100644 --- a/ivy/mvn.template +++ b/ivy/mvn.template @@ -45,7 +45,7 @@ https://github.com/apache/nutch.git - 2 + maven2 https://repo.maven.apache.org/maven2/
(nutch) branch branch-1.20 created (now f141a398c)
This is an automated email from the ASF dual-hosted git repository. lewismc pushed a change to branch branch-1.20 in repository https://gitbox.apache.org/repos/asf/nutch.git at f141a398c Prepare Nutch 1.20 release candidate This branch includes the following new commits: new f141a398c Prepare Nutch 1.20 release candidate The 1 revisions listed above as "new" are entirely new to this repository and will be described in separate emails. The revisions listed as "add" were already present in the repository and have only been added to this reference.
(nutch) 01/01: Prepare Nutch 1.20 release candidate
This is an automated email from the ASF dual-hosted git repository. lewismc pushed a commit to branch branch-1.20 in repository https://gitbox.apache.org/repos/asf/nutch.git commit f141a398c1c0c4e2a1861cd2928fff6a58f53b1f Author: Lewis John McGibbney AuthorDate: Tue Apr 9 09:16:40 2024 -0700 Prepare Nutch 1.20 release candidate --- .gitignore | 2 + CHANGES.md | 157 + conf/nutch-default.xml | 2 +- default.properties | 4 +- src/bin/nutch | 2 +- 5 files changed, 163 insertions(+), 4 deletions(-) diff --git a/.gitignore b/.gitignore index 8c521aa68..972a7cfcb 100644 --- a/.gitignore +++ b/.gitignore @@ -26,3 +26,5 @@ lib/spotbugs-* ivy/dependency-check-ant/* .gradle* ivy/apache-rat-* +ivy/maven-ant-tasks-* +pom.xml diff --git a/CHANGES.md b/CHANGES.md index adea4478f..0e9a0cf45 100644 --- a/CHANGES.md +++ b/CHANGES.md @@ -1,5 +1,162 @@ # Nutch Change Log + +Nutch 1.20 Release 09/04/2024 (dd/mm/) +Release Report: https://s.apache.org/ovjf3 + +Sub-task + + +[NUTCH-2596] - Upgrade from org.mortbay.jetty to org.eclipse.jetty + +[NUTCH-2852] - Method invokes System.exit(...) 9 bugs + +[NUTCH-2972] - Javadoc build fails using JDK 17 + +[NUTCH-3007] - Fix impossible casts + + + +Bug + + +[NUTCH-2634] - Some links marked as nofollow are followed anyway. 
+ +[NUTCH-2820] - Review sample files used in any23 unit tests + +[NUTCH-2924] - Generate maxCount expr evaluated only once + +[NUTCH-2937] - parse-tika: review dependency exclusions and avoid dependency conflicts in distributed mode + +[NUTCH-2973] - Single domain names (eg https://localnet) cant be crawled - filtering fails + +[NUTCH-2974] - Ant build fails with Unparseable date on certain platforms + +[NUTCH-2979] - Upgrade Commons Text to 1.10.0 + +[NUTCH-2982] - Generator: parameter for URL normalization not passed forward + +[NUTCH-2985] - Disable plugin urlfilter-validator by default + +[NUTCH-2992] - Fetcher: always block fetch queues when exceptions threshold is reached + +[NUTCH-3000] - protocol-selenium returns only the body,strips off the head/ element + +[NUTCH-3001] - protocol-selenium requires Content-Type header + +[NUTCH-3002] - Protocol-okhttp HttpResponse: HTTP header metadata lookup should be case-insensitive + +[NUTCH-3008] - indexer-elastic: downgrade to ES 7.10.2 to address licensing issues + +[NUTCH-3012] - SegmentReader when dumping with option -recode: NPE on unparsed documents + +[NUTCH-3027] - Trivial resource leak patch in DomainSuffixes.java + +[NUTCH-3035] - Update license and notice file for release of 1.20 + + + +New Feature + + +[NUTCH-2832] - Create tutorial on sending Nutch logs to Elasticsearch + +[NUTCH-2888] - Selenium Protocol: Support for Selenium 4 + +[NUTCH-2920] - Implement a indexer-opensearch plugin + +[NUTCH-2991] - Support HTTP/S Header Authorization for Solr connections + +[NUTCH-3029] - Host specific max. and min. 
intervals in adaptive scheduler + + + +Improvement + + +[NUTCH-2853] - bin/nutch: remove deprecated commands solrindex, solrdedup, solrclean + +[NUTCH-2883] - Provide means to run server as a persistent service in Docker container + +[NUTCH-2897] - Do not supress deprecated API warnings + +[NUTCH-2961] - Upgrade dependencies of parsefilter-naivebayes + +[NUTCH-2980] - Upgrade Selenium Java to 4.7.2 + +[NUTCH-2983] - nutch-default.xml improvements + +[NUTCH-2990] - HttpRobotRulesParser to follow 5 redirects as specified by RFC 9309 + +[NUTCH-2993] - ScoringDepth plugin to skip depth check based on URL Pattern + +[NUTCH-2995] - Upgrade to crawler-commons 1.4 + +[NUTCH-2996] - Use new SimpleRobotRulesParser API entry point (crawler-commons 1.4) + +[NUTCH-2997] - Add Override annotations where applicable + +[NUTCH-3004] - Avoid NPE in HttpResponse + +[NUTCH-3005] - Upgrade selenium as needed + +[NUTCH-3009] - Upgrade to Hadoop 3.3.6 + +[NUTCH-3010] - Injector: count unique number of injected URLs + +[NUTCH-3011] - HttpRobotRulesParser: handle HTTP 429 Too Many Requests same as server errors (HTTP 5xx) + +[NUTCH-3013] - Employ commons-lang3s StopWatch to simplify timing logic + +[NUTCH-3014] - Standardize Job names + +[NUTCH-3015] - Add more CI steps to GitHub master-build.yml + +[NUTCH-3017] - Allow fast-urlfilter to load from HDFS/S3 and support gzipped input + +[NUTCH-3025] - urlfilter-fast to filter based on the length of the URL + +[NUTCH-3031] - ProtocolFactory host mapper to support domains
(nutch) branch master updated: NUTCH-3038 Address issues discovered during 1.20 release management dryrun (#811)
This is an automated email from the ASF dual-hosted git repository. lewismc pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git The following commit(s) were added to refs/heads/master by this push: new 271f92e11 NUTCH-3038 Address issues discovered during 1.20 release management dryrun (#811) 271f92e11 is described below commit 271f92e11c39b7a3583cfcd8d664262cfac59674 Author: Lewis John McGibbney AuthorDate: Mon Apr 8 16:21:13 2024 -0700 NUTCH-3038 Address issues discovered during 1.20 release management dryrun (#811) --- CHANGES.txt => CHANGES.md | 0 build.xml | 6 +++--- docker/Dockerfile | 2 +- docker/README.md | 3 +-- ivy/mvn.template | 37 +++-- 5 files changed, 8 insertions(+), 40 deletions(-) diff --git a/CHANGES.txt b/CHANGES.md similarity index 100% rename from CHANGES.txt rename to CHANGES.md diff --git a/build.xml b/build.xml index 49187d3ba..845bdfce8 100644 --- a/build.xml +++ b/build.xml @@ -329,7 +329,7 @@ - + @@ -340,7 +340,7 @@ - + @@ -352,7 +352,7 @@ - + diff --git a/docker/Dockerfile b/docker/Dockerfile index fb93fe98a..2eb218bad 100644 --- a/docker/Dockerfile +++ b/docker/Dockerfile @@ -22,7 +22,7 @@ # 2 == Same as mode 1 with addition of Nutch WebApp ARG BUILD_MODE=0 -FROM alpine:3.13 AS base +FROM alpine:3.19 AS base ARG SERVER_PORT=8081 ARG SERVER_HOST=0.0.0.0 diff --git a/docker/README.md b/docker/README.md index c8330bf9b..80e1a1d6d 100644 --- a/docker/README.md +++ b/docker/README.md @@ -3,7 +3,6 @@ ![Docker Pulls](https://img.shields.io/docker/pulls/apache/nutch?style=for-the-badge) ![Docker Image Size (latest by date)](https://img.shields.io/docker/image-size/apache/nutch?style=for-the-badge) ![Docker Image Version (latest semver)](https://img.shields.io/docker/v/apache/nutch?style=for-the-badge) -![MicroBadger Layers](https://img.shields.io/microbadger/layers/apache/nutch?style=for-the-badge) ![Docker Stars](https://img.shields.io/docker/stars/apache/nutch?style=for-the-badge) ![Docker Automated 
build](https://img.shields.io/docker/automated/apache/nutch?style=for-the-badge) @@ -25,7 +24,7 @@ Current configuration of this image consists of components: ## Base Image -* [alpine:3.13](https://hub.docker.com/_/alpine/) +* [alpine:3.19](https://hub.docker.com/_/alpine/tags) ## Tips diff --git a/ivy/mvn.template b/ivy/mvn.template index b38b37f6d..fafc79f83 100644 --- a/ivy/mvn.template +++ b/ivy/mvn.template @@ -22,7 +22,7 @@ org.apache apache -23 +31 ${ivy.pom.groupId} ${ivy.pom.artifactId} @@ -45,12 +45,7 @@ https://github.com/apache/nutch.git - - - miredot - MireDot Releases - http://nexus.qmino.com/content/repositories/miredot - + 2 maven2 https://repo.maven.apache.org/maven2/ @@ -128,7 +123,7 @@ org.apache.maven.plugins maven-compiler-plugin - 3.8.1 + 3.13.0 11 11 @@ -136,31 +131,5 @@ - - -com.qmino -miredot-plugin -2.4.0 - - - - restdoc - - - - - cHJvamVjdHxvcmcuYXBhY2hlLm51dGNoLm51dGNofDIwMTktMTAtMzB8dHJ1ZXwtMSNNQ3dDRkJMb0FjM283ME1YRERRMkFJemY1QmxZUjAwK0FoUkJVMlJrVi81RlBQc25zMUZ2S2g0Q29weGFxZz09 - - - jax-rs - - - - - - - - -
(nutch) branch branch-1.20 deleted (was 9cfe3d7f9)
This is an automated email from the ASF dual-hosted git repository. lewismc pushed a change to branch branch-1.20 in repository https://gitbox.apache.org/repos/asf/nutch.git was 9cfe3d7f9 Prepare for Nutch 1.20 release This change permanently discards the following revisions: discard 9cfe3d7f9 Prepare for Nutch 1.20 release
(nutch) 01/01: Prepare for Nutch 1.20 release
This is an automated email from the ASF dual-hosted git repository. lewismc pushed a commit to branch branch-1.20 in repository https://gitbox.apache.org/repos/asf/nutch.git commit 9cfe3d7f9bf46a71f5473d7afb1dfc71f7ff2c1b Author: Lewis John McGibbney AuthorDate: Fri Apr 5 19:33:51 2024 -0700 Prepare for Nutch 1.20 release --- CHANGES.txt| 150 + conf/nutch-default.xml | 2 +- default.properties | 4 +- src/bin/nutch | 2 +- 4 files changed, 154 insertions(+), 4 deletions(-) diff --git a/CHANGES.txt b/CHANGES.txt index adea4478f..6b032d798 100644 --- a/CHANGES.txt +++ b/CHANGES.txt @@ -1,5 +1,155 @@ # Nutch Change Log +Nutch 1.20 Release 05/04/2024 (dd/mm/) +Release Report: https://s.apache.org/arvtl + +Release Notes - Nutch - Version 1.20 + +Sub-task + + +[NUTCH-2596] - Upgrade from org.mortbay.jetty to org.eclipse.jetty + +[NUTCH-2852] - Method invokes System.exit(...) 9 bugs + +[NUTCH-2972] - Javadoc build fails using JDK 17 + +[NUTCH-3007] - Fix impossible casts + + + +Bug + + +[NUTCH-2634] - Some links marked as nofollow are followed anyway. 
+ +[NUTCH-2820] - Review sample files used in any23 unit tests + +[NUTCH-2924] - Generate maxCount expr evaluated only once + +[NUTCH-2973] - Single domain names (eg https://localnet) cant be crawled - filtering fails + +[NUTCH-2974] - Ant build fails with Unparseable date on certain platforms + +[NUTCH-2979] - Upgrade Commons Text to 1.10.0 + +[NUTCH-2982] - Generator: parameter for URL normalization not passed forward + +[NUTCH-2985] - Disable plugin urlfilter-validator by default + +[NUTCH-2992] - Fetcher: always block fetch queues when exceptions threshold is reached + +[NUTCH-3000] - protocol-selenium returns only the body,strips off the head/ element + +[NUTCH-3001] - protocol-selenium requires Content-Type header + +[NUTCH-3002] - Protocol-okhttp HttpResponse: HTTP header metadata lookup should be case-insensitive + +[NUTCH-3008] - indexer-elastic: downgrade to ES 7.10.2 to address licensing issues + +[NUTCH-3012] - SegmentReader when dumping with option -recode: NPE on unparsed documents + +[NUTCH-3027] - Trivial resource leak patch in DomainSuffixes.java + +[NUTCH-3035] - Update license and notice file for release of 1.20 + + + +New Feature + + +[NUTCH-2832] - Create tutorial on sending Nutch logs to Elasticsearch + +[NUTCH-2888] - Selenium Protocol: Support for Selenium 4 + +[NUTCH-2920] - Implement a indexer-opensearch plugin + +[NUTCH-2991] - Support HTTP/S Header Authorization for Solr connections + +[NUTCH-3029] - Host specific max. and min. 
intervals in adaptive scheduler + + + +Improvement + + +[NUTCH-2853] - bin/nutch: remove deprecated commands solrindex, solrdedup, solrclean + +[NUTCH-2883] - Provide means to run server as a persistent service in Docker container + +[NUTCH-2897] - Do not supress deprecated API warnings + +[NUTCH-2961] - Upgrade dependencies of parsefilter-naivebayes + +[NUTCH-2980] - Upgrade Selenium Java to 4.7.2 + +[NUTCH-2983] - nutch-default.xml improvements + +[NUTCH-2990] - HttpRobotRulesParser to follow 5 redirects as specified by RFC 9309 + +[NUTCH-2993] - ScoringDepth plugin to skip depth check based on URL Pattern + +[NUTCH-2995] - Upgrade to crawler-commons 1.4 + +[NUTCH-2996] - Use new SimpleRobotRulesParser API entry point (crawler-commons 1.4) + +[NUTCH-2997] - Add Override annotations where applicable + +[NUTCH-3004] - Avoid NPE in HttpResponse + +[NUTCH-3009] - Upgrade to Hadoop 3.3.6 + +[NUTCH-3010] - Injector: count unique number of injected URLs + +[NUTCH-3011] - HttpRobotRulesParser: handle HTTP 429 Too Many Requests same as server errors (HTTP 5xx) + +[NUTCH-3013] - Employ commons-lang3s StopWatch to simplify timing logic + +[NUTCH-3014] - Standardize Job names + +[NUTCH-3015] - Add more CI steps to GitHub master-build.yml + +[NUTCH-3017] - Allow fast-urlfilter to load from HDFS/S3 and support gzipped input + +[NUTCH-3025] - urlfilter-fast to filter based on the length of the URL + +[NUTCH-3031] - ProtocolFactory host mapper to support domains + +[NUTCH-3032] - Indexing plugin as an adapter for end users own POJO instances + +[NUTCH-3036] - Upgrade org.seleniumhq.selenium:selenium-java dependency in lib-selenium + + + +Task + + +[NUTCH-2959] - Upgrade to Apache Tika 2.9.0 + +[NUTCH-2977] - Support for showing dependency tree + +[NUTCH-2978] - Move to slf4j2 and remove
(nutch) branch branch-1.20 created (now 9cfe3d7f9)
This is an automated email from the ASF dual-hosted git repository.
lewismc pushed a change to branch branch-1.20 in repository https://gitbox.apache.org/repos/asf/nutch.git
  at 9cfe3d7f9 Prepare for Nutch 1.20 release
This branch includes the following new commits:
  new 9cfe3d7f9 Prepare for Nutch 1.20 release
The 1 revisions listed above as "new" are entirely new to this repository and will be described in separate emails. The revisions listed as "add" were already present in the repository and have only been added to this reference.
(nutch) branch master updated: NUTCH-3032 Code for an ArbitraryIndexingFilter to index values resolved by user POJO code at index time (#810)
This is an automated email from the ASF dual-hosted git repository. lewismc pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git The following commit(s) were added to refs/heads/master by this push: new c9e2f4ed6 NUTCH-3032 Code for an ArbitraryIndexingFilter to index values resolved by user POJO code at index time (#810) c9e2f4ed6 is described below commit c9e2f4ed693014e9dcb9d6f68ae918e0c0eedd26 Author: Joe Gilvary AuthorDate: Thu Apr 4 12:06:19 2024 -0400 NUTCH-3032 Code for an ArbitraryIndexingFilter to index values resolved by user POJO code at index time (#810) --- build.xml | 4 + conf/nutch-default.xml | 66 + src/plugin/build.xml | 3 + src/plugin/index-arbitrary/build.xml | 22 ++ src/plugin/index-arbitrary/ivy.xml | 39 +++ src/plugin/index-arbitrary/plugin.xml | 42 +++ .../indexer/arbitrary/ArbitraryIndexingFilter.java | 286 + .../nutch/indexer/arbitrary/package-info.java | 23 ++ .../org/apache/nutch/indexer/arbitrary/Echo.java | 40 +++ .../apache/nutch/indexer/arbitrary/Multiplier.java | 47 .../arbitrary/TestArbitraryIndexingFilter.java | 222 11 files changed, 794 insertions(+) diff --git a/build.xml b/build.xml index 0a18682f8..49187d3ba 100644 --- a/build.xml +++ b/build.xml @@ -203,6 +203,7 @@ + @@ -646,6 +647,7 @@ + @@ -1173,6 +1175,8 @@ + + diff --git a/conf/nutch-default.xml b/conf/nutch-default.xml index 8b24f092a..edcaeb569 100644 --- a/conf/nutch-default.xml +++ b/conf/nutch-default.xml @@ -2252,6 +2252,72 @@ CAUTION: Set the parser.timeout to -1 or a bigger value than 30, when using this + + + index.arbitrary.function.count + + The count of arbitrary additions/edits to the document. +Specify the remaining properties (fieldName, className, constructorArgs, +methodName, and methodArgs) independently in this file by appending a +dot (.) 
followed by integer numerals (beginning with '0') to the property +names, e.g.: + +index.arbitrary.fieldName.0 +for the field to add/set with the first arbitrary addition or: + +index.arbitrary.className.3 +for the POJO class name to use in setting the fourth arbitrary addition. + + + + + index.arbitrary.fieldName.0 + + The name of the field to add to the document with the value +returned from the custom POJO. + + + + index.arbitrary.className.0 + + The fully qualified name of the POJO class that will supply +values for the new field. + + + + index.arbitrary.constructorArgs.0 + + The values (as strings) to pass into the POJO constructor. +The POJO must accept a String representation of the NutchDocument's URL +as the first parameter in the constructor. The values you specify here +will populate the constructor arguments 1,..,n-1 where n=the count of +arguments to the constructor. Argument #0 will be the NutchDocument's URL. + + + + + index.arbitrary.methodName.0 + + The name of the method to invoke on the instance of your custom +class in order to determine the value to add to the document. + + + + index.arbitrary.methodArgs.0 + + The values (as strings) to pass into the named method on the POJO +instance. Unlike the constructor args, there is no required argument that this +method in the POJO must accept, i.e., the Arbitrary Indexer doesn't supply any +arguments taken from the NutchDocument values by default. + + + + index.arbitrary.overwrite.0 + Whether to overwrite any existing value in the doc for +for fieldName. 
Default is false if not specified in config + + + metatags.names diff --git a/src/plugin/build.xml b/src/plugin/build.xml index 34688ed56..498259a95 100755 --- a/src/plugin/build.xml +++ b/src/plugin/build.xml @@ -40,6 +40,7 @@ + @@ -117,6 +118,7 @@ + @@ -179,6 +181,7 @@ + diff --git a/src/plugin/index-arbitrary/build.xml b/src/plugin/index-arbitrary/build.xml new file mode 100644 index 0..818020c84 --- /dev/null +++ b/src/plugin/index-arbitrary/build.xml @@ -0,0 +1,22 @@ + + + + + + + diff --git a/src/plugin/index-arbitrary/ivy.xml b/src/plugin/index-arbitrary/ivy.xml new file mode 100644 index 0..9feb1e1b4 --- /dev/null +++ b/src/plugin/index-arbitrary/ivy.xml @@ -0,0 +1,39 @@ + + + + + +https://nutch.apache.org/"/> + +Apache Nutch + + + + + + + + + + + + + + + + diff --git a/src/plugin/index-arbitrary/plugin.xml b/src/plugin/index-arbitrary/plugin.xml
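The `index.arbitrary.*` property descriptions above can be illustrated with a small sketch. It is not taken from the commit: `HostLabel`, `ArbitraryPojoSketch`, and `resolve` are hypothetical names, and the reflective lookup only assumes that every configured argument is handed over as a `String`, as the descriptions suggest. How `ArbitraryIndexingFilter` actually instantiates the class is not shown in this excerpt.

```java
import java.lang.reflect.Constructor;
import java.lang.reflect.Method;
import java.net.URI;
import java.util.Arrays;

/** Hypothetical driver: mimics, under stated assumptions, how a configured POJO could be used. */
class ArbitraryPojoSketch {

  /**
   * Builds the POJO with the document URL as constructor argument #0,
   * followed by the configured constructorArgs, then invokes the named
   * method with the configured methodArgs and returns its result.
   */
  static Object resolve(Class<?> pojoClass, String url, String[] constructorArgs,
      String methodName, String[] methodArgs) {
    try {
      String[] ctorValues = new String[constructorArgs.length + 1];
      ctorValues[0] = url; // argument #0 is always the URL
      System.arraycopy(constructorArgs, 0, ctorValues, 1, constructorArgs.length);

      Constructor<?> ctor = pojoClass.getDeclaredConstructor(stringTypes(ctorValues.length));
      Object instance = ctor.newInstance((Object[]) ctorValues);

      Method method = pojoClass.getDeclaredMethod(methodName, stringTypes(methodArgs.length));
      return method.invoke(instance, (Object[]) methodArgs);
    } catch (ReflectiveOperationException e) {
      throw new IllegalStateException("Could not resolve value from " + pojoClass, e);
    }
  }

  /** Assumption: all parameters are Strings, matching "values (as strings)" in the config text. */
  private static Class<?>[] stringTypes(int n) {
    Class<?>[] types = new Class<?>[n];
    Arrays.fill(types, String.class);
    return types;
  }

  public static void main(String[] args) {
    Object value = resolve(HostLabel.class, "https://nutch.apache.org/downloads",
        new String[] {"site"}, "label", new String[] {"v1"});
    System.out.println(value); // prints "site:nutch.apache.org:v1"
  }
}

/** Hypothetical user POJO: the first constructor parameter receives the document URL. */
class HostLabel {
  private final String host;
  private final String prefix;

  public HostLabel(String url, String prefix) {
    String parsedHost = URI.create(url).getHost();
    this.host = (parsedHost == null) ? "unknown" : parsedHost;
    this.prefix = prefix;
  }

  /** The returned value is what would be written to the configured field. */
  public String label(String suffix) {
    return prefix + ":" + host + ":" + suffix;
  }
}
```

In configuration terms, this sketch would correspond to `index.arbitrary.className.0` naming the POJO class, `index.arbitrary.constructorArgs.0` supplying `site`, `index.arbitrary.methodName.0` set to `label`, and `index.arbitrary.methodArgs.0` supplying `v1`.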
(nutch) branch master updated (5a95bc653 -> 1563396d9)
This is an automated email from the ASF dual-hosted git repository.
lewismc pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git
  from 5a95bc653 NUTCH-3035 Update license and notice file for release of 1.20 (#808)
  add 1563396d9 NUTCH-3036 Upgrade org.seleniumhq.selenium:selenium-java dependency i… (#807)
No new revisions were added by this update.
Summary of changes:
 README.md | 1 +
 .../nutch/protocol/htmlunit/HtmlUnitWebDriver.java | 27 ++--
 .../apache/nutch/protocol/http/api/HttpBase.java | 63 -
 src/plugin/lib-selenium/README.md | 2 +-
 src/plugin/lib-selenium/howto_upgrade_selenium.md | 34 +++--
 src/plugin/lib-selenium/ivy.xml | 2 +-
 src/plugin/lib-selenium/plugin.xml | 147 ++---
 .../nutch/protocol/selenium/HttpWebClient.java | 82 +---
 src/plugin/protocol-interactiveselenium/README.md | 4 +-
 src/plugin/protocol-selenium/README.md | 2 +-
 .../org/apache/nutch/protocol/selenium/Http.java | 16 +--
 11 files changed, 144 insertions(+), 236 deletions(-)
(nutch) branch master updated (3905a8df7 -> 5a95bc653)
This is an automated email from the ASF dual-hosted git repository.
lewismc pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git
  from 3905a8df7 NUTCH-3037 Upgrade org.apache.kafka:kafka_2.12: to v3.7.0 (#809)
  add 5a95bc653 NUTCH-3035 Update license and notice file for release of 1.20 (#808)
No new revisions were added by this update.
Summary of changes:
 LICENSE-binary | 193 +++--
 NOTICE-binary | 667 +
 2 files changed, 416 insertions(+), 444 deletions(-)
(nutch) branch master updated (367988dfd -> 3905a8df7)
This is an automated email from the ASF dual-hosted git repository.
lewismc pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git
  from 367988dfd NUTCH-3008 indexer-elastic: downgrade to ES 7.10.2 to address licensing issues
  add 3905a8df7 NUTCH-3037 Upgrade org.apache.kafka:kafka_2.12: to v3.7.0 (#809)
No new revisions were added by this update.
Summary of changes:
 src/plugin/indexer-kafka/ivy.xml | 4 +-
 src/plugin/indexer-kafka/plugin.xml | 73 +
 2 files changed, 59 insertions(+), 18 deletions(-)
(nutch) branch master updated: NUTCH-3033 Upgrade Ivy to v2.5.2 (#803)
This is an automated email from the ASF dual-hosted git repository. lewismc pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git The following commit(s) were added to refs/heads/master by this push: new 4f62dec0f NUTCH-3033 Upgrade Ivy to v2.5.2 (#803) 4f62dec0f is described below commit 4f62dec0f3001a8d41b236913346669ac7968133 Author: Lewis John McGibbney AuthorDate: Wed Mar 13 07:42:58 2024 -0700 NUTCH-3033 Upgrade Ivy to v2.5.2 (#803) --- .gitignore | 5 + build.xml | 2 +- default.properties | 2 +- ivy/ivy.xml | 4 +++- ivy/ivysettings.xml | 4 ++-- src/plugin/build-plugin.xml | 4 ++-- src/plugin/creativecommons/ivy.xml | 4 +++- src/plugin/exchange-jexl/ivy.xml| 4 +++- src/plugin/feed/ivy.xml | 4 +++- src/plugin/headings/ivy.xml | 4 +++- src/plugin/index-anchor/ivy.xml | 4 +++- src/plugin/index-basic/ivy.xml | 4 +++- src/plugin/index-geoip/ivy.xml | 4 +++- src/plugin/index-jexl-filter/ivy.xml| 4 +++- src/plugin/index-links/ivy.xml | 4 +++- src/plugin/index-metadata/ivy.xml | 4 +++- src/plugin/index-more/ivy.xml | 4 +++- src/plugin/index-replace/ivy.xml| 4 +++- src/plugin/index-static/ivy.xml | 4 +++- src/plugin/indexer-cloudsearch/ivy.xml | 4 +++- src/plugin/indexer-csv/ivy.xml | 4 +++- src/plugin/indexer-dummy/ivy.xml| 4 +++- src/plugin/indexer-elastic/ivy.xml | 4 +++- src/plugin/indexer-kafka/ivy.xml| 4 +++- src/plugin/indexer-opensearch-1x/ivy.xml| 4 +++- src/plugin/indexer-rabbit/ivy.xml | 4 +++- src/plugin/indexer-solr/ivy.xml | 4 +++- src/plugin/language-identifier/ivy.xml | 4 +++- src/plugin/lib-htmlunit/build-ivy.xml | 2 +- src/plugin/lib-htmlunit/ivy.xml | 4 +++- src/plugin/lib-http/ivy.xml | 4 +++- src/plugin/lib-nekohtml/ivy.xml | 4 +++- src/plugin/lib-rabbitmq/ivy.xml | 4 +++- src/plugin/lib-regex-filter/ivy.xml | 4 +++- src/plugin/lib-selenium/ivy.xml | 4 +++- src/plugin/lib-xml/ivy.xml | 4 +++- src/plugin/microformats-reltag/ivy.xml | 4 +++- src/plugin/mimetype-filter/ivy.xml | 4 +++- 
src/plugin/nutch-extensionpoints/ivy.xml| 4 +++- src/plugin/parse-ext/ivy.xml| 4 +++- src/plugin/parse-html/ivy.xml | 4 +++- src/plugin/parse-js/ivy.xml | 4 +++- src/plugin/parse-metatags/ivy.xml | 4 +++- src/plugin/parse-tika/ivy.xml | 4 +++- src/plugin/parse-zip/ivy.xml| 4 +++- src/plugin/parsefilter-debug/ivy.xml| 4 +++- src/plugin/parsefilter-naivebayes/ivy.xml | 4 +++- src/plugin/parsefilter-regex/ivy.xml| 4 +++- src/plugin/protocol-file/ivy.xml| 4 +++- src/plugin/protocol-foo/ivy.xml | 4 +++- src/plugin/protocol-ftp/ivy.xml | 4 +++- src/plugin/protocol-htmlunit/ivy.xml| 4 +++- src/plugin/protocol-http/ivy.xml| 4 +++- src/plugin/protocol-httpclient/ivy.xml | 4 +++- src/plugin/protocol-interactiveselenium/ivy.xml | 4 +++- src/plugin/protocol-okhttp/ivy.xml | 4 +++- src/plugin/protocol-selenium/ivy.xml| 4 +++- src/plugin/publish-rabbitmq/ivy.xml | 4 +++- src/plugin/scoring-depth/ivy.xml| 4 +++- src/plugin/scoring-link/ivy.xml | 4 +++- src/plugin/scoring-metadata/ivy.xml | 4 +++- src/plugin/scoring-opic/ivy.xml | 4 +++- src/plugin/scoring-orphan/ivy.xml | 4 +++- src/plugin/scoring-similarity/ivy.xml | 4 +++- src/plugin/subcollection/ivy.xml| 4 +++- src/plugin/tld/ivy.xml | 4 +++- src/plugin/urlfilter-automaton/ivy.xml | 4 +++- src/plugin/urlfilter-domain/ivy.xml | 4 +++- src/plugin/urlfilter-domaindenylist/ivy.xml | 4 +++- src/plugin/urlfilter-fast/ivy.xml | 4 +++- src/plugin/urlfilter-ignoreexempt/ivy.xml | 4 +++- src/plugin/urlfilter-prefix/ivy.xml | 4 +++- src/plugin/urlfilter-regex/ivy.xml | 4 +++- src/plugin/urlfilter-suffix/ivy.xml | 4 +++- src/plugin/urlfilter-validator/ivy.xml | 4 +++- src/plugin/urlmeta/ivy.xml | 4 +++- src/plugin/urlnormalizer-ajax/ivy.xml | 4 +++- src/plugin/urlnormalizer-basic
(nutch) branch master updated: Update Dockerfile / JAVA_HOME - 2nd try (#805)
This is an automated email from the ASF dual-hosted git repository. lewismc pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git The following commit(s) were added to refs/heads/master by this push: new 42b55f6a9 Update Dockerfile / JAVA_HOME - 2nd try (#805) 42b55f6a9 is described below commit 42b55f6a9b369d8e7f6b93735107abb187f65c39 Author: Jakob Berlin AuthorDate: Wed Mar 13 06:11:30 2024 +0100 Update Dockerfile / JAVA_HOME - 2nd try (#805) * Nutch 1.19 release - update current year in API docs etc. - update version number - add changes / release notes - update links to Hadoop API docs * Update Dockerfile / JAVA_HOME Alpine is using ash shell by default which results in an not set JAVA_HOME environment variable * Update Dockerfile Remove empty line at the end - Co-authored-by: Sebastian Nagel --- docker/Dockerfile | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/docker/Dockerfile b/docker/Dockerfile index cffa00a95..fb93fe98a 100644 --- a/docker/Dockerfile +++ b/docker/Dockerfile @@ -46,6 +46,7 @@ RUN apk --no-cache add apache-ant bash git openjdk11 supervisor # Establish environment variables RUN echo 'export JAVA_HOME=/usr/lib/jvm/java-11-openjdk' >> $HOME/.bashrc +RUN echo 'export JAVA_HOME=/usr/lib/jvm/java-11-openjdk' >> $HOME/.ashrc ENV JAVA_HOME='/usr/lib/jvm/java-11-openjdk' ENV NUTCH_HOME='/root/nutch_source/runtime/local' @@ -112,4 +113,4 @@ EXPOSE $WEBAPP_PORT ENTRYPOINT [ "supervisord", "--nodaemon", "--configuration", "/etc/supervisord.conf" ] FROM branch-version-$BUILD_MODE AS final -RUN echo "Successfully built image, see https://s.apache.org/m5933 for guidance on running a container instance." \ No newline at end of file +RUN echo "Successfully built image, see https://s.apache.org/m5933 for guidance on running a container instance."
(nutch) branch revert-801-patch-2 deleted (was 54394b9ed)
This is an automated email from the ASF dual-hosted git repository.
lewismc pushed a change to branch revert-801-patch-2 in repository https://gitbox.apache.org/repos/asf/nutch.git
  was 54394b9ed Revert "Update Dockerfile / JAVA_HOME (#801)"
The revisions that were on this branch are still contained in other references; therefore, this change does not discard any commits from the repository.
(nutch) branch branch-1.19 updated: Revert "Update Dockerfile / JAVA_HOME (#801)" (#804)
This is an automated email from the ASF dual-hosted git repository. lewismc pushed a commit to branch branch-1.19 in repository https://gitbox.apache.org/repos/asf/nutch.git The following commit(s) were added to refs/heads/branch-1.19 by this push: new 19bfd00bb Revert "Update Dockerfile / JAVA_HOME (#801)" (#804) 19bfd00bb is described below commit 19bfd00bbce1298a956c646798200df5ae89fb71 Author: Lewis John McGibbney AuthorDate: Mon Mar 11 13:25:01 2024 -0700 Revert "Update Dockerfile / JAVA_HOME (#801)" (#804) This reverts commit 0b04db65ad32634aa1a63a191a404c52a5d29e46. --- docker/Dockerfile | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/docker/Dockerfile b/docker/Dockerfile index ea734bd06..29ead46ba 100644 --- a/docker/Dockerfile +++ b/docker/Dockerfile @@ -23,8 +23,6 @@ RUN apk update RUN apk --no-cache add apache-ant bash git openjdk11 RUN echo 'export JAVA_HOME=/usr/lib/jvm/java-11-openjdk' >> $HOME/.bashrc -RUN echo 'export JAVA_HOME=/usr/lib/jvm/java-11-openjdk' >> $HOME/.ashrc -ENV JAVA_HOME=/usr/lib/jvm/java-11-openjdk env NUTCH_HOME='/root/nutch_source/runtime/local' # Checkout and build the Nutch master branch (1.x) @@ -36,4 +34,4 @@ RUN git clone https://github.com/apache/nutch.git nutch_source && \ # Create symlinks for runtime/local/bin/nutch and runtime/local/bin/crawl RUN ln -sf $NUTCH_HOME/bin/nutch /usr/local/bin/ -RUN ln -sf $NUTCH_HOME/bin/crawl /usr/local/bin/ +RUN ln -sf $NUTCH_HOME/bin/crawl /usr/local/bin/ \ No newline at end of file
(nutch) 01/01: Revert "Update Dockerfile / JAVA_HOME (#801)"
This is an automated email from the ASF dual-hosted git repository. lewismc pushed a commit to branch revert-801-patch-2 in repository https://gitbox.apache.org/repos/asf/nutch.git commit 54394b9ed860eda9e60fc3e469534cf447dd0518 Author: Lewis John McGibbney AuthorDate: Mon Mar 11 13:24:44 2024 -0700 Revert "Update Dockerfile / JAVA_HOME (#801)" This reverts commit 0b04db65ad32634aa1a63a191a404c52a5d29e46. --- docker/Dockerfile | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/docker/Dockerfile b/docker/Dockerfile index ea734bd06..29ead46ba 100644 --- a/docker/Dockerfile +++ b/docker/Dockerfile @@ -23,8 +23,6 @@ RUN apk update RUN apk --no-cache add apache-ant bash git openjdk11 RUN echo 'export JAVA_HOME=/usr/lib/jvm/java-11-openjdk' >> $HOME/.bashrc -RUN echo 'export JAVA_HOME=/usr/lib/jvm/java-11-openjdk' >> $HOME/.ashrc -ENV JAVA_HOME=/usr/lib/jvm/java-11-openjdk env NUTCH_HOME='/root/nutch_source/runtime/local' # Checkout and build the Nutch master branch (1.x) @@ -36,4 +34,4 @@ RUN git clone https://github.com/apache/nutch.git nutch_source && \ # Create symlinks for runtime/local/bin/nutch and runtime/local/bin/crawl RUN ln -sf $NUTCH_HOME/bin/nutch /usr/local/bin/ -RUN ln -sf $NUTCH_HOME/bin/crawl /usr/local/bin/ +RUN ln -sf $NUTCH_HOME/bin/crawl /usr/local/bin/ \ No newline at end of file
(nutch) branch revert-801-patch-2 created (now 54394b9ed)
This is an automated email from the ASF dual-hosted git repository.
lewismc pushed a change to branch revert-801-patch-2 in repository https://gitbox.apache.org/repos/asf/nutch.git
  at 54394b9ed Revert "Update Dockerfile / JAVA_HOME (#801)"
This branch includes the following new commits:
  new 54394b9ed Revert "Update Dockerfile / JAVA_HOME (#801)"
The 1 revisions listed above as "new" are entirely new to this repository and will be described in separate emails. The revisions listed as "add" were already present in the repository and have only been added to this reference.
(nutch) branch branch-1.19 updated: Update Dockerfile / JAVA_HOME (#801)
This is an automated email from the ASF dual-hosted git repository. lewismc pushed a commit to branch branch-1.19 in repository https://gitbox.apache.org/repos/asf/nutch.git The following commit(s) were added to refs/heads/branch-1.19 by this push: new 0b04db65a Update Dockerfile / JAVA_HOME (#801) 0b04db65a is described below commit 0b04db65ad32634aa1a63a191a404c52a5d29e46 Author: Jakob Berlin AuthorDate: Mon Mar 11 21:24:21 2024 +0100 Update Dockerfile / JAVA_HOME (#801) Alpine is using ash shell by default which results in an not set JAVA_HOME environment variable --- docker/Dockerfile | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/docker/Dockerfile b/docker/Dockerfile index 29ead46ba..ea734bd06 100644 --- a/docker/Dockerfile +++ b/docker/Dockerfile @@ -23,6 +23,8 @@ RUN apk update RUN apk --no-cache add apache-ant bash git openjdk11 RUN echo 'export JAVA_HOME=/usr/lib/jvm/java-11-openjdk' >> $HOME/.bashrc +RUN echo 'export JAVA_HOME=/usr/lib/jvm/java-11-openjdk' >> $HOME/.ashrc +ENV JAVA_HOME=/usr/lib/jvm/java-11-openjdk env NUTCH_HOME='/root/nutch_source/runtime/local' # Checkout and build the Nutch master branch (1.x) @@ -34,4 +36,4 @@ RUN git clone https://github.com/apache/nutch.git nutch_source && \ # Create symlinks for runtime/local/bin/nutch and runtime/local/bin/crawl RUN ln -sf $NUTCH_HOME/bin/nutch /usr/local/bin/ -RUN ln -sf $NUTCH_HOME/bin/crawl /usr/local/bin/ \ No newline at end of file +RUN ln -sf $NUTCH_HOME/bin/crawl /usr/local/bin/
(nutch) branch master updated: NUTCH-3024 Remove flaky 'dependency check' target (#795)
This is an automated email from the ASF dual-hosted git repository. lewismc pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git The following commit(s) were added to refs/heads/master by this push: new 85fea6e46 NUTCH-3024 Remove flaky 'dependency check' target (#795) 85fea6e46 is described below commit 85fea6e46475cb74c61c13193fff008a7e7e6a37 Author: Lewis John McGibbney AuthorDate: Fri Nov 24 12:33:50 2023 -0800 NUTCH-3024 Remove flaky 'dependency check' target (#795) --- .github/workflows/dependency-check.yml | 37 -- build.xml | 47 -- 2 files changed, 84 deletions(-) diff --git a/.github/workflows/dependency-check.yml b/.github/workflows/dependency-check.yml deleted file mode 100644 index f07f746a0..0 --- a/.github/workflows/dependency-check.yml +++ /dev/null @@ -1,37 +0,0 @@ -# Licensed to the Apache Software Foundation (ASF) under one or more -# contributor license agreements. See the NOTICE file distributed with -# this work for additional information regarding copyright ownership. -# The ASF licenses this file to You under the Apache License, Version 2.0 -# (the "License"); you may not use this file except in compliance with -# the License. You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
- -name: master pr build - -on: - schedule: -- cron: '0 0 * * *' # every day at midnight - -jobs: - dependency-check: -strategy: - matrix: -java: ['11'] -os: [ubuntu-latest] -runs-on: ${{ matrix.os }} -steps: - - uses: actions/checkout@v4 - - name: Set up JDK ${{ matrix.java }} -uses: actions/setup-java@v3 -with: - java-version: ${{ matrix.java }} - distribution: 'temurin' - - name: Dependency check -run: ant clean dependency-check -buildfile build.xml diff --git a/build.xml b/build.xml index dd9797302..70c8e8a9e 100644 --- a/build.xml +++ b/build.xml @@ -38,10 +38,6 @@ - - - - @@ -615,49 +611,6 @@ - - - - - - - -https://github.com/jeremylong/DependencyCheck/releases/download/v${dependency-check-ant.version}/dependency-check-ant-${dependency-check-ant.version}-release.zip; - dest="${ivy.dir}/dependency-check-ant-${dependency-check-ant.version}-release.zip" usetimestamp="false" /> - - - - - - - - - - - - - - - - - - - - - - - - - - - -
(nutch) branch master updated: NUTCH-3014 Standardize Job names (#789)
This is an automated email from the ASF dual-hosted git repository. lewismc pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git The following commit(s) were added to refs/heads/master by this push: new bbf086726 NUTCH-3014 Standardize Job names (#789) bbf086726 is described below commit bbf0867263ed1764c56fe7794c17942d0e8bf1c4 Author: Lewis John McGibbney AuthorDate: Thu Nov 2 20:36:43 2023 -0700 NUTCH-3014 Standardize Job names (#789) --- src/java/org/apache/nutch/crawl/CrawlDb.java | 3 +- src/java/org/apache/nutch/crawl/CrawlDbMerger.java | 3 +- src/java/org/apache/nutch/crawl/CrawlDbReader.java | 20 + .../org/apache/nutch/crawl/DeduplicationJob.java | 3 +- src/java/org/apache/nutch/crawl/Generator.java | 13 - src/java/org/apache/nutch/crawl/Injector.java | 2 +- src/java/org/apache/nutch/crawl/LinkDb.java| 3 +- src/java/org/apache/nutch/crawl/LinkDbMerger.java | 3 +- src/java/org/apache/nutch/crawl/LinkDbReader.java | 3 +- src/java/org/apache/nutch/fetcher/Fetcher.java | 2 +- src/java/org/apache/nutch/hostdb/ReadHostDb.java | 3 +- src/java/org/apache/nutch/hostdb/UpdateHostDb.java | 3 +- src/java/org/apache/nutch/indexer/CleaningJob.java | 4 +-- src/java/org/apache/nutch/indexer/IndexingJob.java | 3 +- src/java/org/apache/nutch/parse/ParseSegment.java | 3 +- .../apache/nutch/scoring/webgraph/LinkDumper.java | 6 ++-- .../apache/nutch/scoring/webgraph/LinkRank.java| 15 -- .../apache/nutch/scoring/webgraph/NodeDumper.java | 3 +- .../nutch/scoring/webgraph/ScoreUpdater.java | 3 +- .../apache/nutch/scoring/webgraph/WebGraph.java| 9 ++ .../org/apache/nutch/segment/SegmentMerger.java| 3 +- .../org/apache/nutch/segment/SegmentReader.java| 3 +- src/java/org/apache/nutch/tools/FreeGenerator.java | 2 +- .../apache/nutch/tools/arc/ArcSegmentCreator.java | 9 ++ .../org/apache/nutch/tools/warc/WARCExporter.java | 3 +- .../apache/nutch/util/CrawlCompletionStats.java| 6 ++-- src/java/org/apache/nutch/util/NutchJob.java | 4 --- 
.../nutch/util/ProtocolStatusStatistics.java | 2 +- .../org/apache/nutch/util/SitemapProcessor.java| 34 ++ .../apache/nutch/util/domain/DomainStatistics.java | 10 +++ .../org/apache/nutch/crawl/TestCrawlDbFilter.java | 3 +- .../org/apache/nutch/plugin/TestPluginSystem.java | 5 ++-- 32 files changed, 74 insertions(+), 117 deletions(-) diff --git a/src/java/org/apache/nutch/crawl/CrawlDb.java b/src/java/org/apache/nutch/crawl/CrawlDb.java index 16394832b..2b609c0a6 100644 --- a/src/java/org/apache/nutch/crawl/CrawlDb.java +++ b/src/java/org/apache/nutch/crawl/CrawlDb.java @@ -165,8 +165,7 @@ public class CrawlDb extends NutchTool implements Tool { Path newCrawlDb = new Path(crawlDb, Integer.toString(new Random() .nextInt(Integer.MAX_VALUE))); -Job job = NutchJob.getInstance(config); -job.setJobName("crawldb " + crawlDb); +Job job = Job.getInstance(config, "Nutch CrawlDb: " + crawlDb); Path current = new Path(crawlDb, CURRENT_NAME); if (current.getFileSystem(job.getConfiguration()).exists(current)) { diff --git a/src/java/org/apache/nutch/crawl/CrawlDbMerger.java b/src/java/org/apache/nutch/crawl/CrawlDbMerger.java index 1bf7243d3..6ee4b43cd 100644 --- a/src/java/org/apache/nutch/crawl/CrawlDbMerger.java +++ b/src/java/org/apache/nutch/crawl/CrawlDbMerger.java @@ -165,9 +165,8 @@ public class CrawlDbMerger extends Configured implements Tool { Path newCrawlDb = new Path(output, "merge-" + Integer.toString(new Random().nextInt(Integer.MAX_VALUE))); -Job job = NutchJob.getInstance(conf); +Job job = Job.getInstance(conf, "Nutch CrawlDbMerger: " + output); conf = job.getConfiguration(); -job.setJobName("crawldb merge " + output); job.setInputFormatClass(SequenceFileInputFormat.class); diff --git a/src/java/org/apache/nutch/crawl/CrawlDbReader.java b/src/java/org/apache/nutch/crawl/CrawlDbReader.java index bd3e6f38d..29e8efe17 100644 --- a/src/java/org/apache/nutch/crawl/CrawlDbReader.java +++ b/src/java/org/apache/nutch/crawl/CrawlDbReader.java @@ -564,9 +564,8 @@ public 
class CrawlDbReader extends AbstractChecker implements Closeable { throws IOException, InterruptedException, ClassNotFoundException { Path tmpFolder = new Path(crawlDb, "stat_tmp" + System.currentTimeMillis()); -Job job = NutchJob.getInstance(config); +Job job = Job.getInstance(config, "Nutch CrawlDbReader: " + crawlDb); config = job.getConfiguration(); -job.setJobName("stats " + crawlDb); config.setBoolean("db.reader.stats.sort", sort); FileInputFormat.addInputPath(job, new Path(crawlDb, CrawlDb.CURRENT_NAME)); @@ -812,7 +811,7 @@ public class CrawlDbReader exte
(nutch) branch master updated: NUTCH-3015 Add more CI steps to GitHub master-build.yml (#790)
This is an automated email from the ASF dual-hosted git repository. lewismc pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git The following commit(s) were added to refs/heads/master by this push: new 792ed2891 NUTCH-3015 Add more CI steps to GitHub master-build.yml (#790) 792ed2891 is described below commit 792ed28914f4beb2fb8b8ce28eebe17196c92af1 Author: Lewis John McGibbney AuthorDate: Fri Oct 27 15:04:22 2023 -0700 NUTCH-3015 Add more CI steps to GitHub master-build.yml (#790) --- .../{master-build.yml => dependency-check.yml} | 25 - .github/workflows/master-build.yml | 64 +- .gitignore | 1 + build.xml | 52 +++--- .../dependency-check-suppressions.xml | 5 -- src/java/overview.html | 16 ++ .../creativecommons/conf/crawl-urlfilter.txt | 15 + src/plugin/creativecommons/conf/nutch-site.xml | 16 ++ src/plugin/creativecommons/data/anchor.html| 16 ++ src/plugin/creativecommons/data/rdf.html | 16 ++ src/plugin/creativecommons/data/rel.html | 16 ++ src/plugin/creativecommons/ivy.xml | 1 - src/plugin/exchange-jexl/README.md | 17 ++ src/plugin/exchange-jexl/ivy.xml | 1 - src/plugin/feed/ivy.xml| 1 - src/plugin/headings/ivy.xml| 1 - src/plugin/index-anchor/ivy.xml| 1 - src/plugin/index-basic/ivy.xml | 1 - src/plugin/index-geoip/ivy.xml | 1 - src/plugin/index-geoip/plugin.xml | 1 + src/plugin/index-jexl-filter/ivy.xml | 1 - src/plugin/index-links/README.md | 17 ++ src/plugin/index-links/ivy.xml | 1 - src/plugin/index-metadata/ivy.xml | 1 - src/plugin/index-more/ivy.xml | 1 - src/plugin/index-replace/ivy.xml | 1 - .../index-replace/sample/testIndexReplace.html | 16 ++ src/plugin/index-static/ivy.xml| 1 - src/plugin/indexer-cloudsearch/README.md | 17 ++ src/plugin/indexer-cloudsearch/createCSDomain.sh | 15 + src/plugin/indexer-csv/README.md | 17 ++ src/plugin/indexer-csv/ivy.xml | 1 - src/plugin/indexer-dummy/README.md | 17 ++ src/plugin/indexer-dummy/ivy.xml | 1 - src/plugin/indexer-elastic/README.md | 17 ++ .../{howto_upgrade_es.txt 
=> howto_upgrade_es.md} | 17 ++ src/plugin/indexer-kafka/ivy.xml | 1 - src/plugin/indexer-opensearch-1x/README.md | 17 ++ ..._opensearch.txt => howto_upgrade_opensearch.md} | 17 ++ src/plugin/indexer-rabbit/README.md| 17 ++ src/plugin/indexer-rabbit/ivy.xml | 1 - src/plugin/indexer-solr/README.md | 17 ++ ...owto_upgrade_solr.txt => howto_upgrade_solr.md} | 17 ++ src/plugin/indexer-solr/ivy.xml| 25 + src/plugin/indexer-solr/plugin.xml | 26 + src/plugin/language-identifier/ivy.xml | 1 - src/plugin/lib-htmlunit/ivy.xml| 1 - src/plugin/lib-http/ivy.xml| 1 - src/plugin/lib-nekohtml/ivy.xml| 1 - src/plugin/lib-rabbitmq/ivy.xml| 1 - src/plugin/lib-regex-filter/ivy.xml| 1 - src/plugin/lib-selenium/README.md | 17 ++ .../howto_upgrade_selenium.md} | 42 +- src/plugin/lib-selenium/howto_upgrade_selenium.txt | 15 - src/plugin/lib-selenium/ivy.xml| 1 - src/plugin/lib-xml/ivy.xml | 1 - src/plugin/microformats-reltag/ivy.xml | 1 - src/plugin/mimetype-filter/ivy.xml | 1 - src/plugin/nutch-extensionpoints/ivy.xml | 1 - src/plugin/parse-ext/command | 15 + src/plugin/parse-ext/ivy.xml | 1 - src/plugin/parse-html/ivy.xml | 1 - src/plugin/parse-js/ivy.xml| 1 - .../parse-js/sample/parse_embedded_js_test.html| 16 ++ src/plugin/parse-js/sample/parse_pure_js_test.js | 15 + src/plugin/parse-metatags/ivy.xml | 1 - src/plugin/parse-metatags/sample/testMetatags.html | 16 ++ .../sample/testMultivalueMetatags.html | 16 ++ ...owto_upgrade_tika.txt => howto_upgrade_tika.md} | 17 ++ src/plugin/parse-tika/ivy.xml | 1 - src/plugin/parse-tika/sample/nutch.html| 16 ++ src/plugin/pa
[nutch] branch master updated: NUTCH-3013 Employ commons-lang3's StopWatch to simplify timing logic (#788)
This is an automated email from the ASF dual-hosted git repository. lewismc pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git The following commit(s) were added to refs/heads/master by this push: new 8431dcfe5 NUTCH-3013 Employ commons-lang3's StopWatch to simplify timing logic (#788) 8431dcfe5 is described below commit 8431dcfe52f5395a0fd9e3c00db009dbb2bcf6f5 Author: Lewis John McGibbney AuthorDate: Sat Oct 21 11:09:31 2023 -0700 NUTCH-3013 Employ commons-lang3's StopWatch to simplify timing logic (#788) --- .github/workflows/master-build.yml | 1 - .gitignore | 1 + src/java/org/apache/nutch/crawl/CrawlDb.java | 19 + src/java/org/apache/nutch/crawl/CrawlDbMerger.java | 16 +++ .../org/apache/nutch/crawl/DeduplicationJob.java | 16 +++ src/java/org/apache/nutch/crawl/Generator.java | 17 +++ src/java/org/apache/nutch/crawl/Injector.java | 16 +++ src/java/org/apache/nutch/crawl/LinkDb.java| 15 +++--- src/java/org/apache/nutch/crawl/LinkDbMerger.java | 16 +++ src/java/org/apache/nutch/crawl/LinkDbReader.java | 24 ++ src/java/org/apache/nutch/fetcher/Fetcher.java | 17 +++ src/java/org/apache/nutch/hostdb/ReadHostDb.java | 15 +++--- src/java/org/apache/nutch/hostdb/UpdateHostDb.java | 16 +++ src/java/org/apache/nutch/indexer/CleaningJob.java | 16 +++ src/java/org/apache/nutch/indexer/IndexingJob.java | 16 +++ src/java/org/apache/nutch/parse/ParseSegment.java | 21 --- .../apache/nutch/scoring/webgraph/LinkDumper.java | 17 +++ .../apache/nutch/scoring/webgraph/LinkRank.java| 16 +++ .../apache/nutch/scoring/webgraph/NodeDumper.java | 16 +++ .../nutch/scoring/webgraph/ScoreUpdater.java | 16 +++ .../apache/nutch/scoring/webgraph/WebGraph.java| 24 ++ src/java/org/apache/nutch/tools/FreeGenerator.java | 16 +++ .../apache/nutch/tools/arc/ArcSegmentCreator.java | 16 +++ .../org/apache/nutch/tools/warc/WARCExporter.java | 15 +++--- .../apache/nutch/util/CrawlCompletionStats.java| 15 +++--- .../nutch/util/ProtocolStatusStatistics.java | 19 - 
.../org/apache/nutch/util/SitemapProcessor.java| 12 +++ .../apache/nutch/util/domain/DomainStatistics.java | 16 +++ .../urlfilter/api/RegexURLFilterBaseTest.java | 11 +- .../regex/TestRegexURLNormalizer.java | 8 ++-- 30 files changed, 234 insertions(+), 225 deletions(-) diff --git a/.github/workflows/master-build.yml b/.github/workflows/master-build.yml index e3ed11c86..ba1d470ec 100644 --- a/.github/workflows/master-build.yml +++ b/.github/workflows/master-build.yml @@ -22,7 +22,6 @@ on: branches: [ master ] pull_request: branches: [ master ] - jobs: build: diff --git a/.gitignore b/.gitignore index 0612a99c2..b46690852 100644 --- a/.gitignore +++ b/.gitignore @@ -27,3 +27,4 @@ naivebayes-model csvindexwriter lib/spotbugs-* ivy/dependency-check-ant/* +.gradle* diff --git a/src/java/org/apache/nutch/crawl/CrawlDb.java b/src/java/org/apache/nutch/crawl/CrawlDb.java index 3819bb3a0..16394832b 100644 --- a/src/java/org/apache/nutch/crawl/CrawlDb.java +++ b/src/java/org/apache/nutch/crawl/CrawlDb.java @@ -19,14 +19,15 @@ package org.apache.nutch.crawl; import java.io.File; import java.io.IOException; import java.lang.invoke.MethodHandles; -import java.text.SimpleDateFormat; import java.util.ArrayList; import java.util.Arrays; import java.util.HashMap; import java.util.HashSet; import java.util.Map; import java.util.Random; +import java.util.concurrent.TimeUnit; +import org.apache.commons.lang3.time.StopWatch; import org.slf4j.Logger; import org.slf4j.LoggerFactory; import org.apache.hadoop.conf.Configuration; @@ -49,7 +50,6 @@ import org.apache.nutch.util.LockUtil; import org.apache.nutch.util.NutchConfiguration; import org.apache.nutch.util.NutchJob; import org.apache.nutch.util.NutchTool; -import org.apache.nutch.util.TimingUtil; /** * This class takes the output of the fetcher and updates the crawldb @@ -85,10 +85,11 @@ public class CrawlDb extends NutchTool implements Tool { public void update(Path crawlDb, Path[] segments, boolean normalize, boolean filter, 
boolean additionsAllowed, boolean force) throws IOException, InterruptedException, ClassNotFoundException { -Path lock = lock(getConf(), crawlDb, force); -SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss"); -long start = System.currentTimeMillis(); +StopWatch stopWatch = new StopWatch(); +stopWatch.start(); + +Path lock = lock(getConf(), crawlDb, force);
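The hunk above is cut off right after `stopWatch.start()`, so how the commit stops the watch and reports the elapsed time is not visible here. As a rough sketch only, assuming commons-lang3 (3.5 or later) is on the classpath, the timing pattern could look like the following. The class name `StopWatchSketch` and the helper `timeMillis` are hypothetical and do not come from the Nutch sources.

```java
import java.util.concurrent.TimeUnit;

import org.apache.commons.lang3.time.StopWatch;

// Hypothetical illustration class, not part of Nutch.
public class StopWatchSketch {

  /** Runs the given work and returns the elapsed wall-clock time in milliseconds. */
  static long timeMillis(Runnable work) {
    StopWatch stopWatch = new StopWatch();
    stopWatch.start();
    try {
      work.run();
    } finally {
      // stop() is valid here because start() was called just above
      stopWatch.stop();
    }
    return stopWatch.getTime(TimeUnit.MILLISECONDS);
  }

  public static void main(String[] args) {
    long elapsed = timeMillis(() -> {
      try {
        Thread.sleep(50); // stand-in for the real job
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
    System.out.println("finished, elapsed: " + elapsed + " ms");
  }
}
```

Compared with the removed `System.currentTimeMillis()` bookkeeping, the watch object carries the start time itself, so no separate `start` variable or date formatter is needed.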
[nutch] branch master updated: NUTCH-2936 Early registration of URL stream handlers provided by plugins may fail Hadoop jobs running in distributed mode (#726)
This is an automated email from the ASF dual-hosted git repository. lewismc pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git The following commit(s) were added to refs/heads/master by this push: new 02dca3b6d NUTCH-2936 Early registration of URL stream handlers provided by plugins may fail Hadoop jobs running in distributed mode (#726) 02dca3b6d is described below commit 02dca3b6d097af0f8fa76ce17f0a33267964bf19 Author: Lewis John McGibbney AuthorDate: Fri May 20 11:04:22 2022 -0700 NUTCH-2936 Early registration of URL stream handlers provided by plugins may fail Hadoop jobs running in distributed mode (#726) * NUTCH-2936 Early registration of URL stream handlers provided by plugins may fail Hadoop jobs running in distributed mode --- src/java/org/apache/nutch/parse/ParserChecker.java | 48 +++ src/java/org/apache/nutch/plugin/Extension.java| 12 +- .../org/apache/nutch/plugin/ExtensionPoint.java| 18 +-- .../apache/nutch/plugin/PluginManifestParser.java | 3 +- .../org/apache/nutch/plugin/PluginRepository.java | 57 + .../nutch/plugin/URLStreamHandlerFactory.java | 13 +- .../apache/nutch/protocol/http/api/HttpBase.java | 140 +++-- src/plugin/protocol-foo/plugin.xml | 2 +- src/plugin/protocol-okhttp/ivy.xml | 4 +- src/plugin/protocol-okhttp/plugin.xml | 12 +- .../org/apache/nutch/protocol/okhttp/OkHttp.java | 48 +++ .../nutch/protocol/okhttp/OkHttpResponse.java | 26 ++-- 12 files changed, 195 insertions(+), 188 deletions(-) diff --git a/src/java/org/apache/nutch/parse/ParserChecker.java b/src/java/org/apache/nutch/parse/ParserChecker.java index 6c82a516b..5da023fdc 100644 --- a/src/java/org/apache/nutch/parse/ParserChecker.java +++ b/src/java/org/apache/nutch/parse/ParserChecker.java @@ -114,15 +114,15 @@ public class ParserChecker extends AbstractChecker { int numConsumed; for (int i = 0; i < args.length; i++) { if (args[i].equals("-normalize")) { -normalizers = new URLNormalizers(getConf(), URLNormalizers.SCOPE_DEFAULT); 
+this.normalizers = new URLNormalizers(getConf(), URLNormalizers.SCOPE_DEFAULT); } else if (args[i].equals("-followRedirects")) { -followRedirects = true; +this.followRedirects = true; } else if (args[i].equals("-checkRobotsTxt")) { -checkRobotsTxt = true; +this.checkRobotsTxt = true; } else if (args[i].equals("-forceAs")) { -forceAsContentType = args[++i]; +this.forceAsContentType = args[++i]; } else if (args[i].equals("-dumpText")) { -dumpText = true; +this.dumpText = true; } else if (args[i].equals("-md")) { String k = null, v = null; String nextOne = args[++i]; @@ -132,7 +132,7 @@ public class ParserChecker extends AbstractChecker { v = nextOne.substring(firstEquals + 1); } else k = nextOne; -metadata.put(k, v); +this.metadata.put(k, v); } else if ((numConsumed = super.parseArgs(args, i)) > 0) { i += numConsumed - 1; } else if (i != args.length - 1) { @@ -144,7 +144,7 @@ public class ParserChecker extends AbstractChecker { } } -scfilters = new ScoringFilters(getConf()); +this.scfilters = new ScoringFilters(getConf()); if (url != null) { return super.processSingle(url); @@ -155,25 +155,25 @@ public class ParserChecker extends AbstractChecker { } protected int process(String url, StringBuilder output) throws Exception { -if (normalizers != null) { - url = normalizers.normalize(url, URLNormalizers.SCOPE_DEFAULT); +if (this.normalizers != null) { + url = this.normalizers.normalize(url, URLNormalizers.SCOPE_DEFAULT); } LOG.info("fetching: " + url); CrawlDatum datum = new CrawlDatum(); -Iterator iter = metadata.keySet().iterator(); +Iterator iter = this.metadata.keySet().iterator(); while (iter.hasNext()) { String key = iter.next(); - String value = metadata.get(key); + String value = this.metadata.get(key); if (value == null) value = ""; datum.getMetaData().put(new Text(key), new Text(value)); } int maxRedirects = getConf().getInt("http.redirect.max", 3); -if (followRedirects) { +if (this.followRedirects) { if (maxRedirects == 0) { LOG.info("Following max. 
3 redirects (ignored http.redirect.max == 0)"); maxRedirects = 3; @@ -183,30 +183,30 @@ public class ParserChecker extends AbstractChecker { } ProtocolOutput protocolOutput = getProtocolOutput(url, datum, -checkRobotsTxt); +this.checkRobotsTxt); Text turl = new Text(url); // Follo
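The `-md` branch in the ParserChecker hunk above splits its argument at the first `=` into a key and a value, and later replaces a missing value with an empty string. A minimal standalone sketch of that parsing step, using only the JDK, might look like this. The class `MetadataArgSketch` and the method `parseMetadata` are hypothetical names; the lines of the real method that the diff does not show are assumed, not quoted.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration class, not part of Nutch.
public class MetadataArgSketch {

  /** Collects every "-md key=value" (or bare "-md key") pair found in args. */
  static Map<String, String> parseMetadata(String[] args) {
    Map<String, String> metadata = new HashMap<>();
    for (int i = 0; i < args.length; i++) {
      if (args[i].equals("-md") && i + 1 < args.length) {
        String nextOne = args[++i];
        int firstEquals = nextOne.indexOf('=');
        if (firstEquals != -1) {
          // split at the first '=' only, so the value may itself contain '='
          metadata.put(nextOne.substring(0, firstEquals),
              nextOne.substring(firstEquals + 1));
        } else {
          // bare key: store null, callers can substitute "" later
          metadata.put(nextOne, null);
        }
      }
    }
    return metadata;
  }

  public static void main(String[] args) {
    Map<String, String> md =
        parseMetadata(new String[] {"-md", "lang=en", "-md", "flag"});
    System.out.println(md.get("lang"));
    System.out.println(md.containsKey("flag"));
  }
}
```

Splitting at the first `=` rather than the last keeps values such as query strings intact.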
[nutch] branch master updated: NUTCH-2919 Upgrade to Tika 2.2.1 and Any23 2.6 (#717)
This is an automated email from the ASF dual-hosted git repository. lewismc pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git The following commit(s) were added to refs/heads/master by this push: new 847e19d NUTCH-2919 Upgrade to Tika 2.2.1 and Any23 2.6 (#717) 847e19d is described below commit 847e19d984503d333fd8fdd430fe347dd370dc4c Author: Lewis John McGibbney AuthorDate: Sat Jan 15 15:24:21 2022 -0800 NUTCH-2919 Upgrade to Tika 2.2.1 and Any23 2.6 (#717) * NUTCH-2919 Upgrade to Tika 2.2.1 and Any23 2.6 --- ivy/ivy.xml| 2 +- src/plugin/any23/ivy.xml | 2 +- src/plugin/any23/plugin.xml| 272 ++--- .../apache/nutch/any23/Any23IndexingFilter.java| 2 +- .../org/apache/nutch/any23/Any23ParseFilter.java | 35 +-- src/plugin/build-plugin.xml| 3 +- src/plugin/language-identifier/ivy.xml | 2 +- src/plugin/language-identifier/plugin.xml | 6 +- src/plugin/parse-tika/howto_upgrade_tika.txt | 5 +- src/plugin/parse-tika/ivy.xml | 2 +- src/plugin/parse-tika/plugin.xml | 70 +++--- .../org/apache/nutch/parse/tika/TestRTFParser.java | 4 +- 12 files changed, 192 insertions(+), 213 deletions(-) diff --git a/ivy/ivy.xml b/ivy/ivy.xml index 8d154bf..34e298f 100644 --- a/ivy/ivy.xml +++ b/ivy/ivy.xml @@ -63,7 +63,7 @@ diff --git a/src/plugin/any23/ivy.xml b/src/plugin/any23/ivy.xml index a5a0077..7220a25 100644 --- a/src/plugin/any23/ivy.xml +++ b/src/plugin/any23/ivy.xml @@ -36,7 +36,7 @@ diff --git a/src/plugin/any23/plugin.xml b/src/plugin/any23/plugin.xml index cc941b2..40a42c7 100644 --- a/src/plugin/any23/plugin.xml +++ b/src/plugin/any23/plugin.xml @@ -26,194 +26,168 @@ diff --git a/src/plugin/any23/src/java/org/apache/nutch/any23/Any23IndexingFilter.java b/src/plugin/any23/src/java/org/apache/nutch/any23/Any23IndexingFilter.java index c0f1d6f..09dc32e 100644 --- a/src/plugin/any23/src/java/org/apache/nutch/any23/Any23IndexingFilter.java +++ b/src/plugin/any23/src/java/org/apache/nutch/any23/Any23IndexingFilter.java @@ -106,7 +106,7 @@ 
public class Any23IndexingFilter implements IndexingFilter
[nutch] branch master updated: NUTCH-2429 Fix Plugin System to allow protocol plugins to bundle their URLStreamHandlers (#720)
This is an automated email from the ASF dual-hosted git repository. lewismc pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git The following commit(s) were added to refs/heads/master by this push: new e76d69f NUTCH-2429 Fix Plugin System to allow protocol plugins to bundle their URLStreamHandlers (#720) e76d69f is described below commit e76d69fe13902fd2f3a98660dd2bac52c2ea568c Author: Lewis John McGibbney AuthorDate: Fri Jan 7 20:07:54 2022 -0800 NUTCH-2429 Fix Plugin System to allow protocol plugins to bundle their URLStreamHandlers (#720) * NUTCH-2429 Fix Plugin System to allow protocol plugins to bundle their URLStreamHandlers Co-authored-by: Hiran Chaudhuri --- build.xml | 1 + src/java/org/apache/nutch/crawl/CrawlDbReader.java | 43 ++-- src/java/org/apache/nutch/parse/ParserChecker.java | 5 + .../apache/nutch/plugin/PluginManifestParser.java | 66 +++--- .../org/apache/nutch/plugin/PluginRepository.java | 244 +++-- .../nutch/plugin/URLStreamHandlerFactory.java | 115 ++ .../apache/nutch/util/CrawlCompletionStats.java| 40 ++-- src/java/org/apache/nutch/util/NutchJob.java | 12 +- src/java/org/apache/nutch/util/NutchTool.java | 9 + .../org/apache/nutch/util/SitemapProcessor.java| 10 +- .../apache/nutch/util/domain/DomainStatistics.java | 20 +- .../apache/nutch/any23/Any23IndexingFilter.java| 2 +- .../org/apache/nutch/any23/Any23ParseFilter.java | 2 +- src/plugin/build.xml | 2 + .../nutch/indexwriter/csv/CSVIndexWriter.java | 2 +- .../indexwriter/rabbit/RabbitIndexWriter.java | 2 +- src/plugin/protocol-foo/build.xml | 22 ++ src/plugin/protocol-foo/ivy.xml| 41 src/plugin/protocol-foo/plugin.xml | 48 .../java/org/apache/nutch/protocol/foo/Foo.java| 141 .../org/apache/nutch/protocol/foo/Handler.java | 28 +++ 21 files changed, 696 insertions(+), 159 deletions(-) diff --git a/build.xml b/build.xml index ecef1e7..2c0eef0 100644 --- a/build.xml +++ b/build.xml @@ -1272,6 +1272,7 @@ + diff --git 
a/src/java/org/apache/nutch/crawl/CrawlDbReader.java b/src/java/org/apache/nutch/crawl/CrawlDbReader.java index 2a20a56..f31210a 100644 --- a/src/java/org/apache/nutch/crawl/CrawlDbReader.java +++ b/src/java/org/apache/nutch/crawl/CrawlDbReader.java @@ -16,11 +16,12 @@ */ package org.apache.nutch.crawl; +import java.io.Closeable; import java.io.DataOutputStream; import java.io.File; import java.io.IOException; -import java.io.Closeable; import java.lang.invoke.MethodHandles; +import java.net.MalformedURLException; import java.net.URL; import java.nio.ByteBuffer; import java.util.ArrayList; @@ -32,16 +33,11 @@ import java.util.List; import java.util.Map; import java.util.Map.Entry; import java.util.Random; +import java.util.TreeMap; import java.util.regex.Matcher; import java.util.regex.Pattern; -import java.util.TreeMap; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import com.tdunning.math.stats.MergingDigest; -import com.tdunning.math.stats.TDigest; +import org.apache.commons.jexl3.JexlScript; import org.apache.hadoop.conf.Configuration; import org.apache.hadoop.fs.FileStatus; import org.apache.hadoop.fs.FileSystem; @@ -55,18 +51,18 @@ import org.apache.hadoop.io.Text; import org.apache.hadoop.io.Writable; import org.apache.hadoop.mapreduce.Job; import org.apache.hadoop.mapreduce.Mapper; +import org.apache.hadoop.mapreduce.RecordWriter; import org.apache.hadoop.mapreduce.Reducer; -import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat; -import org.apache.hadoop.mapreduce.lib.output.MapFileOutputFormat; +import org.apache.hadoop.mapreduce.TaskAttemptContext; +import org.apache.hadoop.mapreduce.lib.input.FileInputFormat; +import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat; import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat; +import org.apache.hadoop.mapreduce.lib.output.MapFileOutputFormat; +import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat; import 
org.apache.hadoop.mapreduce.lib.output.TextOutputFormat; -import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat; -import org.apache.hadoop.mapreduce.lib.input.FileInputFormat; import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner; -import org.apache.hadoop.mapreduce.RecordWriter; -import org.apache.hadoop.mapreduce.TaskAttemptContext; -import org.apache.hadoop.util.ToolRunner; import org.apache.hadoop.util.StringUtils; +import org.apache.hadoop.util.ToolRunner; import org.apache.nutch.util.AbstractChecker; import org.apache.nutch.util.JexlUtil; import org.apache.nutch.util.NutchConfiguration; @@ -74,7 +70,8 @@ import
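NUTCH-2429 concerns protocol plugins that ship their own `URLStreamHandler`, with `protocol-foo` added as a sample plugin. The body of that plugin's `Handler` class is not shown in this message, so the following is only a JDK-level sketch of the underlying mechanism: without a registered handler, `new URL("foo://...")` fails with a `MalformedURLException` for the unknown protocol, while passing a handler explicitly lets the URL be created and opened. The names `FooHandlerSketch`, `FooHandler` and `fooUrl` are hypothetical, and the canned response is an assumption made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class, not the Nutch protocol-foo plugin.
public class FooHandlerSketch {

  /** Handler for a made-up "foo" scheme that serves a canned response. */
  static class FooHandler extends URLStreamHandler {
    @Override
    protected URLConnection openConnection(URL u) {
      return new URLConnection(u) {
        @Override
        public void connect() {
          connected = true; // nothing to open for the canned response
        }

        @Override
        public InputStream getInputStream() {
          return new ByteArrayInputStream(
              ("hello from " + url).getBytes(StandardCharsets.UTF_8));
        }
      };
    }
  }

  /** Builds a URL bound to FooHandler without touching the JVM-wide factory. */
  static URL fooUrl(String spec) throws MalformedURLException {
    // This constructor is deprecated in recent JDKs but still available.
    return new URL(null, spec, new FooHandler());
  }

  public static void main(String[] args) throws IOException {
    URL url = fooUrl("foo://example.org/a");
    try (InputStream in = url.openStream()) {
      System.out.println(new String(in.readAllBytes(), StandardCharsets.UTF_8));
    }
  }
}
```

Binding the handler per URL avoids `URL.setURLStreamHandlerFactory`, which may be called only once per JVM. That one-shot restriction is a plausible reason why handler registration needs care inside Hadoop jobs, though this message does not spell out the reasoning.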
[nutch] branch master updated: NUTCH-2449 Replace Tika LanguageIdentifier in language-identifier (#716)
This is an automated email from the ASF dual-hosted git repository. lewismc pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git The following commit(s) were added to refs/heads/master by this push: new a9b50a7 NUTCH-2449 Replace Tika LanguageIdentifier in language-identifier (#716) a9b50a7 is described below commit a9b50a7c7e0ab83865883bf87f2c98f1ce354388 Author: Lewis John McGibbney AuthorDate: Fri Dec 17 20:11:01 2021 -0800 NUTCH-2449 Replace Tika LanguageIdentifier in language-identifier (#716) --- src/plugin/language-identifier/build-ivy.xml | 47 src/plugin/language-identifier/build.xml | 4 +-- 2 files changed, 49 insertions(+), 2 deletions(-) diff --git a/src/plugin/language-identifier/build-ivy.xml b/src/plugin/language-identifier/build-ivy.xml new file mode 100644 index 000..c735501 --- /dev/null +++ b/src/plugin/language-identifier/build-ivy.xml @@ -0,0 +1,47 @@ diff --git a/src/plugin/language-identifier/build.xml b/src/plugin/language-identifier/build.xml index 668075e..4efb786 100644 --- a/src/plugin/language-identifier/build.xml +++ b/src/plugin/language-identifier/build.xml @@ -20,9 +20,9 @@ -Copying language profiles +Copying language mappings (language codes to names) Copying test files
[nutch-site] branch main updated (b720870 -> 198d962)
This is an automated email from the ASF dual-hosted git repository. lewismc pushed a change to branch main in repository https://gitbox.apache.org/repos/asf/nutch-site.git. from b720870 Add .asf.yaml file to Nutch website add 335dac0 Add public directory to SCM new 198d962 Remove public directory from main branch The 1 revisions listed above as "new" are entirely new to this repository and will be described in separate emails. The revisions listed as "add" were already present in the repository and have only been added to this reference. Summary of changes: .gitignore | 1 - 1 file changed, 1 deletion(-)
[nutch-site] branch main updated (819de2a -> b720870)
This is an automated email from the ASF dual-hosted git repository. lewismc pushed a change to branch main in repository https://gitbox.apache.org/repos/asf/nutch-site.git. from 819de2a Initial code import add b720870 Add .asf.yaml file to Nutch website No new revisions were added by this update. Summary of changes: .asf.yaml | 34 ++ 1 file changed, 34 insertions(+) create mode 100644 .asf.yaml
[nutch-site] branch asf-site created (now b720870)
This is an automated email from the ASF dual-hosted git repository. lewismc pushed a change to branch asf-site in repository https://gitbox.apache.org/repos/asf/nutch-site.git. at b720870 Add .asf.yaml file to Nutch website This branch includes the following new commits: new b720870 Add .asf.yaml file to Nutch website The 1 revisions listed above as "new" are entirely new to this repository and will be described in separate emails. The revisions listed as "add" were already present in the repository and have only been added to this reference.
[nutch-site] 01/01: Add .asf.yaml file to Nutch website
This is an automated email from the ASF dual-hosted git repository. lewismc pushed a commit to branch asf-site in repository https://gitbox.apache.org/repos/asf/nutch-site.git commit b720870ebcc1a21abf3e9add6a2170560c423836 Author: Lewis John McGibbney AuthorDate: Wed Nov 24 08:41:06 2021 -0800 Add .asf.yaml file to Nutch website --- .asf.yaml | 34 ++ 1 file changed, 34 insertions(+) diff --git a/.asf.yaml b/.asf.yaml new file mode 100644 index 000..0cc84e6 --- /dev/null +++ b/.asf.yaml @@ -0,0 +1,34 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +# https://cwiki.apache.org/confluence/display/INFRA/.asf.yaml+features+for+git+repositories + +github: + description: "Apache Nutch Website" + homepage: https://nutch.apache.org/ + labels: +- apache +- nutch +- hugo + + enabled_merge_buttons: +squash: true +merge: false +rebase: false + +publish: + whoami: asf-site \ No newline at end of file
[nutch-site] branch main updated: Remove broken site
This is an automated email from the ASF dual-hosted git repository. lewismc pushed a commit to branch main in repository https://gitbox.apache.org/repos/asf/nutch-site.git The following commit(s) were added to refs/heads/main by this push: new a2bcc5c Remove broken site a2bcc5c is described below commit a2bcc5cf4ec05ed921506b702ff39befbdba4a39 Author: Lewis John McGibbney AuthorDate: Tue Nov 23 20:14:53 2021 -0800 Remove broken site --- .gitmodules | 3 - README.md | 64 -- archetypes/default.md | 6 -- config.toml | 55 public/categories/index.xml | 10 --- public/css/index.css| 86 --- public/css/navbar.css | 53 public/img/IMG_0292.png | Bin 15728243 -> 0 bytes public/img/IMG_0295.png | Bin 16275273 -> 0 bytes public/img/server_rack.jpg | Bin 214612 -> 0 bytes public/img/wave.png | Bin 4789 -> 0 bytes public/index.html | 198 public/index.xml| 20 - public/posts/index.xml | 20 - public/sitemap.xml | 28 --- public/tags/index.xml | 10 --- themes/SimpleIntro | 1 - 17 files changed, 554 deletions(-) diff --git a/.gitmodules b/.gitmodules index 6378a22..e69de29 100644 --- a/.gitmodules +++ b/.gitmodules @@ -1,3 +0,0 @@ -[submodule "themes/SimpleIntro"] - path = themes/SimpleIntro - url = https://github.com/gangjun06/SimpleIntro diff --git a/README.md b/README.md deleted file mode 100644 index b5e146f..000 --- a/README.md +++ /dev/null @@ -1,64 +0,0 @@ -Nutch Website -= - -https://nutch.apache.org/assets/img/nutch_logo_tm.png; align="right" width="300" /> - -This repository contains the website source code for the [Apache Nutch](https://nutch.apache.org) project. - -# Tooling - -The Website is built using [Hugo](https://gohugo.io/) a popular open-source static website generation framework. - -# Prerequisites -* [Install Hugo](https://gohugo.io/getting-started/installing/) - -# Local Build and Deploy - -```bash -$ hugo server -... 
-Start building sites … - - | EN +- - Pages| 10 - Paginator pages | 0 - Non-page files | 0 - Static files | 10 - Processed images | 0 - Aliases | 0 - Sitemaps | 1 - Cleaned | 0 - -Built in 107 ms -Watching for changes in /path/to/nutch_site/{archetypes,content,data,layouts,static,themes} -Watching for config changes in /path/to/nutch_site/config.toml -Environment: "development" -Serving pages from memory -Running in Fast Render Mode. For full rebuilds on change: hugo server --disableFastRender -Web Server is available at http://localhost:1313/ (bind address 127.0.0.1) -Press Ctrl+C to stop -``` - -# Contributing - -To contribute a patch, follow these instructions (note that installing -[Hub](https://hub.github.com/) is not strictly required, but is recommended). - -``` -0. Download and install hub.github.com -1. File JIRA issue for your fix at https://issues.apache.org/jira/projects/NUTCH/issues -- you will get issue id NUTCH-xxx where xxx is the issue ID. -2. git clone https://github.com/apache/nutch-site.git -3. cd nutch-site -4. git checkout -b NUTCH-xxx -5. edit files -6. git status (make sure it shows what files you expected to edit) -7. git add -8. git commit -m “fix for NUTCH-xxx contributed by ” -9. git fork -10. git push -u NUTCH-xxx -11. 
git pull-request -``` - -# License diff --git a/archetypes/default.md b/archetypes/default.md deleted file mode 100644 index 00e77bd..000 --- a/archetypes/default.md +++ /dev/null @@ -1,6 +0,0 @@ -title: "{{ replace .Name "-" " " | title }}" -date: {{ .Date }} -draft: true - diff --git a/config.toml b/config.toml deleted file mode 100644 index 80101b7..000 --- a/config.toml +++ /dev/null @@ -1,55 +0,0 @@ -baseURL = "http://nutch.apache.org/" -languageCode = "en-us" -theme = "SimpleIntro" -#title = "Apache Nutch" -publishDir = "public" - -[params] -mainbg = "./img/IMG_0295.png" -pagebg = "../img/background.jpg" -name = "Apache Nutch" -mainTitle = "Apache Nutch" -mainText = "Highly extensible, highly scalable, production-ready Web crawler" - -[menus] -[[menu.main]] -identifier = "about" -name = "About" -url = "#about" -[[menu.main]] -identifier = "community" -name = "Community" -url = "/community" -[[menu.main]] -identifier = "development" -name = "Development" -url = "/development" -[[menu.main]] -identifier = &quo
[nutch] branch master updated: quick IntelliJ IDEA setup docs added (#698)
This is an automated email from the ASF dual-hosted git repository. lewismc pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git The following commit(s) were added to refs/heads/master by this push: new b9a4856 quick IntelliJ IDEA setup docs added (#698) b9a4856 is described below commit b9a4856ac172f64659682d3e2e7437b780516f73 Author: Abu Sufyan AuthorDate: Tue Oct 19 21:35:49 2021 +0600 quick IntelliJ IDEA setup docs added (#698) Co-authored-by: Abu Sufian --- README.md | 10 +- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/README.md b/README.md index dccdb02..307ead3 100644 --- a/README.md +++ b/README.md @@ -48,7 +48,15 @@ ant eclipse and follow the instructions in [Importing existing projects](https://help.eclipse.org/2019-06/topic/org.eclipse.platform.doc.user/tasks/tasks-importproject.htm). -IntelliJ IDEA users can also import Eclipse projects using the ["Eclipser" plugin](https://www.tutorialspoint.com/intellij_idea/intellij_idea_migrating_from_eclipse.htm)https://plugins.jetbrains.com/plugin/7153-eclipser), see also [Importing Eclipse Projects into IntelliJ IDEA](https://www.jetbrains.com/help/idea/migrating-from-eclipse-to-intellij-idea.html#migratingEclipseProject). +For Intellij IDEA, first install the [IvyIDEA Plugin](https://plugins.jetbrains.com/plugin/3612-ivyidea). then run ```ant eclipse```. + +Then open the project in IntelliJ. You may see popups like "Ant build scripts found", "Frameworks detected - IvyIDEA Framework detected". Just follow the simple steps in these dialogs. + +You must [configure the nutch-site.xml](https://cwiki.apache.org/confluence/display/NUTCH/RunNutchInEclipse) before running. Make sure, you've added ```http.agent.name``` and ```plugin.folders``` properties. The plugin.folders normally points to ```/build/plugins```. + +Now create a Java Application Configuration, choose org.apache.nutch.crawl.Injector, add two paths as arguments. 
First one is the crawldb directory, second one is the URL directory where, the injector can read urls. Now run your configuration. + +If we still see the ```No plugins found on paths of property plugin.folders="plugins"```, update the plugin.folders in the nutch-default.xml, this is a quick fix, but should not be used. Export Control
[nutch] branch master updated: fireant upgrade dependency elasticsearch-rest-high-level-client in src/plugin/indexer-elastic/ivy.xml from 7.11.1 to 7.13.2 (#688)
This is an automated email from the ASF dual-hosted git repository. lewismc pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git The following commit(s) were added to refs/heads/master by this push: new 004b62d fireant upgrade dependency elasticsearch-rest-high-level-client in src/plugin/indexer-elastic/ivy.xml from 7.11.1 to 7.13.2 (#688) 004b62d is described below commit 004b62dedb8fd25fc3ae278b1647d7d2826f509e Author: Lewis John McGibbney AuthorDate: Fri Sep 17 19:42:25 2021 -0700 fireant upgrade dependency elasticsearch-rest-high-level-client in src/plugin/indexer-elastic/ivy.xml from 7.11.1 to 7.13.2 (#688) * fireant upgrade dependency elasticsearch-rest-high-level-client in src/plugin/indexer-elastic/ivy.xml from 7.11.1 to 7.13.2 * fireant upgrade dependency elasticsearch-rest-high-level-client in src/plugin/indexer-elastic/ivy.xml from 7.11.1 to 7.13.2 --- src/plugin/indexer-elastic/ivy.xml| 18 ++--- src/plugin/indexer-elastic/plugin.xml | 136 -- 2 files changed, 72 insertions(+), 82 deletions(-) diff --git a/src/plugin/indexer-elastic/ivy.xml b/src/plugin/indexer-elastic/ivy.xml index 3da98e3..9ee8e1c 100644 --- a/src/plugin/indexer-elastic/ivy.xml +++ b/src/plugin/indexer-elastic/ivy.xml @@ -1,6 +1,5 @@ -https://nutch.apache.org/"/> +https://nutch.apache.org/" /> Apache Nutch @@ -44,4 +42,4 @@ \ No newline at end of file diff --git a/src/plugin/indexer-elastic/plugin.xml b/src/plugin/indexer-elastic/plugin.xml index 1e41b7e..387a3ac 100644 --- a/src/plugin/indexer-elastic/plugin.xml +++ b/src/plugin/indexer-elastic/plugin.xml @@ -1,84 +1,76 @@ \ No newline at end of file
[nutch-site] branch main updated: Attempt to implement single page templating.
This is an automated email from the ASF dual-hosted git repository. lewismc pushed a commit to branch main in repository https://gitbox.apache.org/repos/asf/nutch-site.git The following commit(s) were added to refs/heads/main by this push: new 69525d8 Attempt to implement single page templating. 69525d8 is described below commit 69525d8ce6a4a7ff4fa92a1d30cc680090be9811 Author: Lewis John McGibbney AuthorDate: Fri Aug 27 15:41:56 2021 -0700 Attempt to implement single page templating. --- config.toml | 33 content/community/mailing_lists/index.md | 0 content/community/people/index.md| 0 content/community/robots/index.md| 0 content/development/scm/index.md | 0 content/index.md | 0 content/javadoc/inddex.md| 0 content/posts/_index.md | 0 8 files changed, 25 insertions(+), 8 deletions(-) diff --git a/config.toml b/config.toml index 454ef27..80101b7 100644 --- a/config.toml +++ b/config.toml @@ -6,6 +6,7 @@ publishDir = "public" [params] mainbg = "./img/IMG_0295.png" +pagebg = "../img/background.jpg" name = "Apache Nutch" mainTitle = "Apache Nutch" mainText = "Highly extensible, highly scalable, production-ready Web crawler" @@ -18,21 +19,37 @@ publishDir = "public" [[menu.main]] identifier = "community" name = "Community" -[[menu.main]] -identifier = "reporting" -name = "Board Reporting" -parent = "community" -url = "https://whimsy.apache.org/board/minutes/Nutch.html; -weight = 1 +url = "/community" [[menu.main]] identifier = "development" name = "Development" -url = "#development" +url = "/development" [[menu.main]] identifier = "documentation" name = "Documentation" -url = "#documentation" +url = "/documentation" [[menu.main]] identifier = "downloads" name = "Downloads" +url = "/downloads" +[[menu.single]] +identifier = "home" +name = "Home" +url = "/" +weight = 20 +[[menu.single]] +identifier = "community" +name = "Community" +url = "/community" +[[menu.single]] +identifier = "development" +name = "Development" +url = "/development" +[[menu.single]] +identifier = 
"documentation" +name = "Documentation" +url = "/documentation" +[[menu.single]] +identifier = "downloads" +name = "Downloads" url = "/downloads" \ No newline at end of file diff --git a/content/community/mailing_lists/index.md b/content/community/mailing_lists/index.md deleted file mode 100644 index e69de29..000 diff --git a/content/community/people/index.md b/content/community/people/index.md deleted file mode 100644 index e69de29..000 diff --git a/content/community/robots/index.md b/content/community/robots/index.md deleted file mode 100644 index e69de29..000 diff --git a/content/development/scm/index.md b/content/development/scm/index.md deleted file mode 100644 index e69de29..000 diff --git a/content/index.md b/content/index.md deleted file mode 100644 index e69de29..000 diff --git a/content/javadoc/inddex.md b/content/javadoc/inddex.md deleted file mode 100644 index e69de29..000 diff --git a/content/posts/_index.md b/content/posts/_index.md deleted file mode 100644 index e69de29..000
[nutch-site] branch main created (now ae6f9f2)
This is an automated email from the ASF dual-hosted git repository. lewismc pushed a change to branch main in repository https://gitbox.apache.org/repos/asf/nutch-site.git. at ae6f9f2 NUTCH-2826 Migrate Nutch Site from Apache CMS to Hugo This branch includes the following new commits: new ae6f9f2 NUTCH-2826 Migrate Nutch Site from Apache CMS to Hugo The 1 revisions listed above as "new" are entirely new to this repository and will be described in separate emails. The revisions listed as "add" were already present in the repository and have only been added to this reference.
[nutch-site] 01/01: NUTCH-2826 Migrate Nutch Site from Apache CMS to Hugo
This is an automated email from the ASF dual-hosted git repository. lewismc pushed a commit to branch main in repository https://gitbox.apache.org/repos/asf/nutch-site.git commit ae6f9f2cf51e7e6e84500cbcd8751ce99daa0ce3 Author: Lewis John McGibbney AuthorDate: Thu Aug 26 22:40:50 2021 -0700 NUTCH-2826 Migrate Nutch Site from Apache CMS to Hugo --- .gitmodules | 3 + README.md| 64 ++ archetypes/default.md| 6 + config.toml | 38 ++ content/community/mailing_lists/index.md | 0 content/community/people/index.md| 0 content/community/robots/index.md| 0 content/development/scm/index.md | 0 content/index.md | 0 content/javadoc/inddex.md| 0 content/posts/_index.md | 0 public/categories/index.xml | 10 ++ public/css/index.css | 86 ++ public/css/navbar.css| 53 + public/img/IMG_0292.png | Bin 0 -> 15728243 bytes public/img/IMG_0295.png | Bin 0 -> 16275273 bytes public/img/server_rack.jpg | Bin 0 -> 214612 bytes public/img/wave.png | Bin 0 -> 4789 bytes public/index.html| 198 +++ public/index.xml | 20 public/posts/index.xml | 20 public/sitemap.xml | 28 + public/tags/index.xml| 10 ++ themes/SimpleIntro | 1 + 24 files changed, 537 insertions(+) diff --git a/.gitmodules b/.gitmodules new file mode 100644 index 000..6378a22 --- /dev/null +++ b/.gitmodules @@ -0,0 +1,3 @@ +[submodule "themes/SimpleIntro"] + path = themes/SimpleIntro + url = https://github.com/gangjun06/SimpleIntro diff --git a/README.md b/README.md new file mode 100644 index 000..b5e146f --- /dev/null +++ b/README.md @@ -0,0 +1,64 @@ +Nutch Website += + +https://nutch.apache.org/assets/img/nutch_logo_tm.png; align="right" width="300" /> + +This repository contains the website source code for the [Apache Nutch](https://nutch.apache.org) project. + +# Tooling + +The Website is built using [Hugo](https://gohugo.io/) a popular open-source static website generation framework. + +# Prerequisites +* [Install Hugo](https://gohugo.io/getting-started/installing/) + +# Local Build and Deploy + +```bash +$ hugo server +... 
+Start building sites … + + | EN +---+- + Pages| 10 + Paginator pages | 0 + Non-page files | 0 + Static files | 10 + Processed images | 0 + Aliases | 0 + Sitemaps | 1 + Cleaned | 0 + +Built in 107 ms +Watching for changes in /path/to/nutch_site/{archetypes,content,data,layouts,static,themes} +Watching for config changes in /path/to/nutch_site/config.toml +Environment: "development" +Serving pages from memory +Running in Fast Render Mode. For full rebuilds on change: hugo server --disableFastRender +Web Server is available at http://localhost:1313/ (bind address 127.0.0.1) +Press Ctrl+C to stop +``` + +# Contributing + +To contribute a patch, follow these instructions (note that installing +[Hub](https://hub.github.com/) is not strictly required, but is recommended). + +``` +0. Download and install hub.github.com +1. File JIRA issue for your fix at https://issues.apache.org/jira/projects/NUTCH/issues +- you will get issue id NUTCH-xxx where xxx is the issue ID. +2. git clone https://github.com/apache/nutch-site.git +3. cd nutch-site +4. git checkout -b NUTCH-xxx +5. edit files +6. git status (make sure it shows what files you expected to edit) +7. git add +8. git commit -m “fix for NUTCH-xxx contributed by ” +9. git fork +10. git push -u NUTCH-xxx +11. 
git pull-request +``` + +# License diff --git a/archetypes/default.md b/archetypes/default.md new file mode 100644 index 000..00e77bd --- /dev/null +++ b/archetypes/default.md @@ -0,0 +1,6 @@ +--- +title: "{{ replace .Name "-" " " | title }}" +date: {{ .Date }} +draft: true +--- + diff --git a/config.toml b/config.toml new file mode 100644 index 000..454ef27 --- /dev/null +++ b/config.toml @@ -0,0 +1,38 @@ +baseURL = "http://nutch.apache.org/; +languageCode = "en-us" +theme = "SimpleIntro" +#title = "Apache Nutch" +publishDir = "public" + +[params] +mainbg = "./img/IMG_0295.png" +name = "Apache Nutch" +mainTitle = "Apache Nutch" +mainText = "Highly extensible, highly scalable, production-ready Web crawler" + +[menus] +[[menu.main]] +identifier = "about" +name = "About"
[nutch] branch master updated: NUTCH-2885 Upgrade to Log4j2 (#692)
This is an automated email from the ASF dual-hosted git repository. lewismc pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git The following commit(s) were added to refs/heads/master by this push: new e4b7be9 NUTCH-2885 Upgrade to Log4j2 (#692) e4b7be9 is described below commit e4b7be9bc30935211c3e7e302788e488b811 Author: Lewis John McGibbney AuthorDate: Wed Aug 4 10:00:56 2021 -0700 NUTCH-2885 Upgrade to Log4j2 (#692) * NUTCH-2885 Upgrade to Log4j2 --- conf/log4j.properties | 123 -- conf/log4j2.xml | 51 + ivy/ivy.xml | 13 ++ 3 files changed, 56 insertions(+), 131 deletions(-) diff --git a/conf/log4j.properties b/conf/log4j.properties deleted file mode 100644 index 7b010cb..000 --- a/conf/log4j.properties +++ /dev/null @@ -1,123 +0,0 @@ -# Licensed to the Apache Software Foundation (ASF) under one or more -# contributor license agreements. See the NOTICE file distributed with -# this work for additional information regarding copyright ownership. -# The ASF licenses this file to You under the Apache License, Version 2.0 -# (the "License"); you may not use this file except in compliance with -# the License. You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -# Define some default values that can be overridden by system properties -hadoop.log.dir=. 
-hadoop.log.file=hadoop.log - -# RootLogger - DailyRollingFileAppender -log4j.rootLogger=INFO,DRFA - -# Logging Threshold -log4j.threshold=ALL - -#special logging requirements for some commandline tools -log4j.logger.org.apache.nutch.crawl.CrawlDb=INFO,cmdstdout -log4j.logger.org.apache.nutch.crawl.CrawlDbMerger=INFO,cmdstdout -log4j.logger.org.apache.nutch.crawl.CrawlDbReader=INFO,cmdstdout -log4j.logger.org.apache.nutch.crawl.Crawl=INFO,cmdstdout -log4j.logger.org.apache.nutch.crawl.DeduplicationJob=INFO,cmdstdout -log4j.logger.org.apache.nutch.crawl.Generator=INFO,cmdstdout -log4j.logger.org.apache.nutch.crawl.Injector=INFO,cmdstdout -log4j.logger.org.apache.nutch.crawl.LinkDb=INFO,cmdstdout -log4j.logger.org.apache.nutch.crawl.LinkDbMerger=INFO,cmdstdout -log4j.logger.org.apache.nutch.crawl.LinkDbReader=INFO,cmdstdout -log4j.logger.org.apache.nutch.fetcher.Fetcher=INFO,cmdstdout -log4j.logger.org.apache.nutch.fetcher.FetcherItem=INFO,cmdstdout -log4j.logger.org.apache.nutch.fetcher.FetcherItemQueue=INFO,cmdstdout -log4j.logger.org.apache.nutch.fetcher.FetcherItemQueues=INFO,cmdstdout -log4j.logger.org.apache.nutch.fetcher.FetcherThread=INFO,cmdstdout -log4j.logger.org.apache.nutch.fetcher.QueueFeeder=INFO,cmdstdout -log4j.logger.org.apache.nutch.hostdb.UpdateHostDb=INFO,cmdstdout -log4j.logger.org.apache.nutch.hostdb.ReadHostDb=INFO,cmdstdout -log4j.logger.org.apache.nutch.indexer.IndexingFiltersChecker=INFO,cmdstdout -log4j.logger.org.apache.nutch.indexer.IndexingJob=INFO,cmdstdout -log4j.logger.org.apache.nutch.indexer.IndexerOutputFormat=INFO,cmdstdout -log4j.logger.org.apache.nutch.indexwriter.solr.SolrIndexWriter=INFO,cmdstdout -log4j.logger.org.apache.nutch.indexwriter.solr.SolrUtils=INFO,cmdstdout -log4j.logger.org.apache.nutch.exchange.Exchanges=INFO,cmdstdout -log4j.logger.org.apache.nutch.parse.ParserChecker=INFO,cmdstdout -log4j.logger.org.apache.nutch.parse.ParseSegment=INFO,cmdstdout -log4j.logger.org.apache.nutch.plugin.PluginRepository=WARN 
-log4j.logger.org.apache.nutch.protocol.RobotRulesParser=INFO,cmdstdout -log4j.logger.org.apache.nutch.scoring.webgraph.LinkRank=INFO,cmdstdout -log4j.logger.org.apache.nutch.scoring.webgraph.Loops=INFO,cmdstdout -log4j.logger.org.apache.nutch.scoring.webgraph.ScoreUpdater=INFO,cmdstdout -log4j.logger.org.apache.nutch.scoring.webgraph.WebGraph=INFO,cmdstdout -log4j.logger.org.apache.nutch.scoring.webgraph.NodeDumper=INFO,cmdstdout -log4j.logger.org.apache.nutch.segment.SegmentChecker=INFO,cmdstdout -log4j.logger.org.apache.nutch.segment.SegmentMerger=INFO,cmdstdout -log4j.logger.org.apache.nutch.segment.SegmentReader=INFO,cmdstdout -log4j.logger.org.apache.nutch.service.NutchServer=INFO,cmdstdout -log4j.logger.org.apache.nutch.tools.FreeGenerator=INFO,cmdstdout -log4j.logger.org.apache.nutch.util.domain.DomainStatistics=INFO,cmdstdout -log4j.logger.org.apache.nutch.util.SitemapProcessor=INFO,cmdstdout -log4j.logger.org.apache.nutch.webui.NutchUiServer=INFO,cmdstdout - -log4j.logger.org.apache.nutch=INFO -log4j.logger.org.apache.hadoop=WARN -# log mapreduce job messages and counters -log4j.logger.org.apache.hadoop.mapreduce.Job=INFO - -# -# Daily R
[nutch-webapp] branch master updated: Add missing files
This is an automated email from the ASF dual-hosted git repository. lewismc pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch-webapp.git The following commit(s) were added to refs/heads/master by this push: new 93e7b23 Add missing files 93e7b23 is described below commit 93e7b23812cbfabdd4fca87fd01b3f82c64a4057 Author: Lewis John McGibbney AuthorDate: Tue Jul 13 20:35:53 2021 -0700 Add missing files --- .asf.yaml | 16 ++ .github/pull_request_template.md | 13 ++ .github/workflows/master-build.yml | 41 + KEYS | 364 + NOTICE.txt | 13 ++ 5 files changed, 447 insertions(+) diff --git a/.asf.yaml b/.asf.yaml new file mode 100644 index 000..aa9a939 --- /dev/null +++ b/.asf.yaml @@ -0,0 +1,16 @@ +github: + description: "Apache Nutch is an extensible and scalable web crawler" + homepage: https://nutch.apache.org/ + labels: +- web-crawler +- crawling +- java +- nutch +- hadoop +- apache + +notifications: + commits: commits@nutch.apache.org + issues: d...@nutch.apache.org + pullrequests: d...@nutch.apache.org + jira_options: link label comment diff --git a/.github/pull_request_template.md b/.github/pull_request_template.md new file mode 100644 index 000..d1f9c54 --- /dev/null +++ b/.github/pull_request_template.md @@ -0,0 +1,13 @@ +Thanks for your contribution to [Apache Nutch](https://nutch.apache.org/)! Your help is appreciated! + +Before opening the pull request, please verify that +* there is an open issue on the [Nutch issue tracker](https://issues.apache.org/jira/projects/NUTCH) which describes the problem or the improvement. We cannot accept pull requests without an issue because the change wouldn't be listed in the release notes. 
+* the issue ID (`NUTCH-`) + - is referenced in the title of the pull request + - and placed in front of your commit messages surrounded by square brackets (`[NUTCH-] Issue or pull request title`) +* commits are squashed into a single one (or few commits for larger changes) +* Java source code follows [Nutch Eclipse Code Formatting rules](https://github.com/apache/nutch/blob/master/eclipse-codeformat.xml) +* Nutch is successfully built and unit tests pass by running `mvn clean install javadoc:aggregate` +* there should be no conflicts when merging the pull request branch into the *recent* master branch. If there are conflicts, please try to rebase the pull request branch on top of a freshly pulled master branch. + +We will be able to faster integrate your pull request if these conditions are met. If you have any questions how to fix your problem or about using Nutch in general, please sign up for the [Nutch mailing list](https://nutch.apache.org/mailing_lists.html). Thanks! diff --git a/.github/workflows/master-build.yml b/.github/workflows/master-build.yml new file mode 100644 index 000..c1a409c --- /dev/null +++ b/.github/workflows/master-build.yml @@ -0,0 +1,41 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+# + +name: master pr build + +on: + push: +branches: [ master ] + pull_request: +branches: [ master ] + + +jobs: + build: +runs-on: ubuntu-latest +strategy: + matrix: +java: [ '11' ] + +steps: + - uses: actions/checkout@v2 + - name: Set up JDK ${{ matrix.java }} +uses: actions/setup-java@v1 +with: + java-version: ${{ matrix.java }} + - name: Build with Maven +run: mvn clean install javadoc:aggregate diff --git a/KEYS b/KEYS new file mode 100644 index 000..a1331f9 --- /dev/null +++ b/KEYS @@ -0,0 +1,364 @@ +This file contains the PGP keys of various developers. +Please don't use them for email unless you have to. Their main +purpose is code signing. + +Examples of importing this file in your keystore: + gpg --import KEYS.txt + (need pgp and other examples here) + +Examples of adding your key to this file: + pgp -kxa and append it to this file. + (pgpk -ll && pgpk -xa ) >> this file. + (gpg --list-sigs + && gpg --armor --expor
[nutch-webapp] branch master created (now da3c282)
This is an automated email from the ASF dual-hosted git repository.

lewismc pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/nutch-webapp.git.

      at da3c282  Move Nutch WebApp to separate repository

This branch includes the following new commits:

     new da3c282  Move Nutch WebApp to separate repository

The 1 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails. The revisions
listed as "add" were already present in the repository and have only
been added to this reference.
[nutch] branch master updated: fireant upgrade dependency httpcore in ivy/ivy.xml from 4.4.9 to 4.4.14 (#681)
This is an automated email from the ASF dual-hosted git repository.

lewismc pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git

The following commit(s) were added to refs/heads/master by this push:
     new 53ed506  fireant upgrade dependency httpcore in ivy/ivy.xml from 4.4.9 to 4.4.14 (#681)
53ed506 is described below

commit 53ed50626b371d163033015b4f8c87167393c33d
Author: Lewis John McGibbney
AuthorDate: Wed Jun 30 23:07:39 2021 -0700

    fireant upgrade dependency httpcore in ivy/ivy.xml from 4.4.9 to 4.4.14 (#681)
---
 ivy/ivy.xml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/ivy/ivy.xml b/ivy/ivy.xml
index e05c81c..2781c6c 100644
--- a/ivy/ivy.xml
+++ b/ivy/ivy.xml
@@ -143,7 +143,7 @@
-
+
[nutch] branch master updated: NUTCH-2882 Configure NutchUiServer for DEPLOYMENT and improve logging (#690)
This is an automated email from the ASF dual-hosted git repository. lewismc pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git The following commit(s) were added to refs/heads/master by this push: new d6875e1 NUTCH-2882 Configure NutchUiServer for DEPLOYMENT and improve logging (#690) d6875e1 is described below commit d6875e13515328b759e204a6b4bba8725f2ea7c2 Author: Lewis John McGibbney AuthorDate: Mon Jun 28 09:12:27 2021 -0700 NUTCH-2882 Configure NutchUiServer for DEPLOYMENT and improve logging (#690) --- conf/log4j.properties | 2 ++ src/java/org/apache/nutch/webui/NutchUiApplication.java | 6 ++ 2 files changed, 8 insertions(+) diff --git a/conf/log4j.properties b/conf/log4j.properties index 67311d1..7b010cb 100644 --- a/conf/log4j.properties +++ b/conf/log4j.properties @@ -60,9 +60,11 @@ log4j.logger.org.apache.nutch.scoring.webgraph.NodeDumper=INFO,cmdstdout log4j.logger.org.apache.nutch.segment.SegmentChecker=INFO,cmdstdout log4j.logger.org.apache.nutch.segment.SegmentMerger=INFO,cmdstdout log4j.logger.org.apache.nutch.segment.SegmentReader=INFO,cmdstdout +log4j.logger.org.apache.nutch.service.NutchServer=INFO,cmdstdout log4j.logger.org.apache.nutch.tools.FreeGenerator=INFO,cmdstdout log4j.logger.org.apache.nutch.util.domain.DomainStatistics=INFO,cmdstdout log4j.logger.org.apache.nutch.util.SitemapProcessor=INFO,cmdstdout +log4j.logger.org.apache.nutch.webui.NutchUiServer=INFO,cmdstdout log4j.logger.org.apache.nutch=INFO log4j.logger.org.apache.hadoop=WARN diff --git a/src/java/org/apache/nutch/webui/NutchUiApplication.java b/src/java/org/apache/nutch/webui/NutchUiApplication.java index 67ac281..fc08874 100644 --- a/src/java/org/apache/nutch/webui/NutchUiApplication.java +++ b/src/java/org/apache/nutch/webui/NutchUiApplication.java @@ -18,6 +18,7 @@ package org.apache.nutch.webui; import org.apache.nutch.webui.pages.DashboardPage; import org.apache.nutch.webui.pages.assets.NutchUiCssReference; +import 
org.apache.wicket.RuntimeConfigurationType; import org.apache.wicket.markup.html.WebPage; import org.apache.wicket.protocol.http.WebApplication; import org.apache.wicket.spring.injection.annot.SpringComponentInjector; @@ -61,6 +62,11 @@ public class NutchUiApplication extends WebApplication implements new SpringComponentInjector(this, context)); } + @Override + public RuntimeConfigurationType getConfigurationType() { +return RuntimeConfigurationType.DEPLOYMENT; + } + private void configureTheme(BootstrapSettings settings) { Theme theme = new Theme(THEME_NAME, BootstrapCssReference.instance(), FontAwesomeCssReference.instance(), NutchUiCssReference.instance());
[nutch] branch master updated: NUTCH-2881 bug in 'nutch' symlink in docker container (#689)
This is an automated email from the ASF dual-hosted git repository.

lewismc pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git

The following commit(s) were added to refs/heads/master by this push:
     new 08de742  NUTCH-2881 bug in 'nutch' symlink in docker container (#689)
08de742 is described below

commit 08de74266b2e502d6915831a6e19fea21b099e28
Author: Lewis John McGibbney
AuthorDate: Sat Jun 26 19:04:59 2021 -0700

    NUTCH-2881 bug in 'nutch' symlink in docker container (#689)

    * NUTCH-2881 bug in 'nutch' symlink in docker container
---
 docker/Dockerfile | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/docker/Dockerfile b/docker/Dockerfile
index 0f06894..29ead46 100644
--- a/docker/Dockerfile
+++ b/docker/Dockerfile
@@ -23,6 +23,7 @@ RUN apk update
 RUN apk --no-cache add apache-ant bash git openjdk11
 
 RUN echo 'export JAVA_HOME=/usr/lib/jvm/java-11-openjdk' >> $HOME/.bashrc
+env NUTCH_HOME='/root/nutch_source/runtime/local'
 
 # Checkout and build the Nutch master branch (1.x)
 RUN git clone https://github.com/apache/nutch.git nutch_source && \
@@ -31,5 +32,6 @@ RUN git clone https://github.com/apache/nutch.git nutch_source && \
   rm -rf build/ && \
   rm -rf /root/.ivy2/
 
-# Convenience symlink to Nutch runtime local
-RUN ln -s nutch_source/runtime/local $HOME/nutch
+# Create symlinks for runtime/local/bin/nutch and runtime/local/bin/crawl
+RUN ln -sf $NUTCH_HOME/bin/nutch /usr/local/bin/
+RUN ln -sf $NUTCH_HOME/bin/crawl /usr/local/bin/
\ No newline at end of file
[nutch] branch master updated: fireant upgrade dependency junit in ivy/ivy.xml from 4.13.1 to 4.13.2 (#666)
This is an automated email from the ASF dual-hosted git repository. lewismc pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git The following commit(s) were added to refs/heads/master by this push: new 9c8ae8e fireant upgrade dependency junit in ivy/ivy.xml from 4.13.1 to 4.13.2 (#666) 9c8ae8e is described below commit 9c8ae8e9ad9c0e4b9a9b8cbe53b5021c5485762b Author: Lewis John McGibbney AuthorDate: Sun Jun 13 19:57:30 2021 -0700 fireant upgrade dependency junit in ivy/ivy.xml from 4.13.1 to 4.13.2 (#666) --- ivy/ivy.xml | 94 - 1 file changed, 49 insertions(+), 45 deletions(-) diff --git a/ivy/ivy.xml b/ivy/ivy.xml index 00d67eb..e05c81c 100644 --- a/ivy/ivy.xml +++ b/ivy/ivy.xml @@ -1,20 +1,24 @@ - - - - -http://ant.apache.org/ivy/maven;> + + + + +http://ant.apache.org/ivy/maven; version="1.0"> - https://www.apache.org/licenses/LICENSE-2.0.txt; /> + https://www.apache.org/licenses/LICENSE-2.0.txt; /> https://nutch.apache.org/; /> https://nutch.apache.org/;>Nutch is an open source web-search software. It builds on Hadoop, Tika and Solr, adding web-specifics, @@ -46,7 +50,7 @@ - + @@ -58,14 +62,14 @@ - - - + + + - + @@ -78,36 +82,36 @@ - - - - - - - - - + + + + + + + + + - - - + + + - + - - + + - + - + @@ -130,7 +134,7 @@ - + @@ -138,18 +142,18 @@ - - - + + + - + - + - + \ No newline at end of file
[nutch] branch master updated: NUTCH-2864 Upgrade Dockerfile to use JDK 11 (#647)
This is an automated email from the ASF dual-hosted git repository. lewismc pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git The following commit(s) were added to refs/heads/master by this push: new cc8d76a NUTCH-2864 Upgrade Dockerfile to use JDK 11 (#647) cc8d76a is described below commit cc8d76afe4f86691008b5673b182bb0e54a59710 Author: Lewis John McGibbney AuthorDate: Thu Jun 3 13:15:03 2021 -0700 NUTCH-2864 Upgrade Dockerfile to use JDK 11 (#647) * NUTCH-2864 Upgrade Dockerfile to use JDK 11 --- docker/Dockerfile | 16 +--- docker/README.md | 9 - 2 files changed, 17 insertions(+), 8 deletions(-) diff --git a/docker/Dockerfile b/docker/Dockerfile index 3077d1a..0f06894 100644 --- a/docker/Dockerfile +++ b/docker/Dockerfile @@ -13,21 +13,23 @@ # See the License for the specific language governing permissions and # limitations under the License. -FROM ubuntu:18.04 +FROM alpine:3.13 MAINTAINER Apache Nutch Committers WORKDIR /root/ - # Install dependencies -RUN apt update -RUN apt install -y ant git openjdk-8-jdk-headless +RUN apk update +RUN apk --no-cache add apache-ant bash git openjdk11 -# Set up JAVA_HOME -RUN echo 'export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64' >> $HOME/.bashrc +RUN echo 'export JAVA_HOME=/usr/lib/jvm/java-11-openjdk' >> $HOME/.bashrc # Checkout and build the Nutch master branch (1.x) -RUN git clone https://github.com/apache/nutch.git nutch_source && cd nutch_source && ant runtime +RUN git clone https://github.com/apache/nutch.git nutch_source && \ + cd nutch_source && \ + ant runtime && \ + rm -rf build/ && \ + rm -rf /root/.ivy2/ # Convenience symlink to Nutch runtime local RUN ln -s nutch_source/runtime/local $HOME/nutch diff --git a/docker/README.md b/docker/README.md index 58a3b5e..2ac88cc 100644 --- a/docker/README.md +++ b/docker/README.md @@ -1,5 +1,12 @@ # Nutch Dockerfile # +![Docker Pulls](https://img.shields.io/docker/pulls/apache/nutch?style=for-the-badge) +![Docker Image Size 
(latest by date)](https://img.shields.io/docker/image-size/apache/nutch?style=for-the-badge) +![Docker Image Version (latest semver)](https://img.shields.io/docker/v/apache/nutch?style=for-the-badge) +![MicroBadger Layers](https://img.shields.io/microbadger/layers/apache/nutch?style=for-the-badge) +![Docker Stars](https://img.shields.io/docker/stars/apache/nutch?style=for-the-badge) +![Docker Automated build](https://img.shields.io/docker/automated/apache/nutch?style=for-the-badge) + Get up and running quickly with Nutch on Docker. ## What is Nutch? @@ -18,7 +25,7 @@ Current configuration of this image consists of components: ## Base Image -* [ubuntu:18.04](https://hub.docker.com/_/ubuntu/) +* [alpine:3.13](https://hub.docker.com/_/alpine/) ## Tips
[nutch] branch master updated: NUTCH-2855 Update org.elasticsearch.client (#577)
This is an automated email from the ASF dual-hosted git repository. lewismc pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git The following commit(s) were added to refs/heads/master by this push: new 2837039 NUTCH-2855 Update org.elasticsearch.client (#577) 2837039 is described below commit 2837039b9c5b52a88c2029a5e29c81cecd8953f3 Author: Lewis John McGibbney AuthorDate: Thu Apr 1 08:56:43 2021 -0700 NUTCH-2855 Update org.elasticsearch.client (#577) * NUTCH-2855 Update org.elasticsearch.client --- src/plugin/indexer-elastic/ivy.xml| 2 +- src/plugin/indexer-elastic/plugin.xml | 79 ++- 2 files changed, 41 insertions(+), 40 deletions(-) diff --git a/src/plugin/indexer-elastic/ivy.xml b/src/plugin/indexer-elastic/ivy.xml index 4b8d4a7..3da98e3 100644 --- a/src/plugin/indexer-elastic/ivy.xml +++ b/src/plugin/indexer-elastic/ivy.xml @@ -36,7 +36,7 @@ - + diff --git a/src/plugin/indexer-elastic/plugin.xml b/src/plugin/indexer-elastic/plugin.xml index 45ac61e..1e41b7e 100644 --- a/src/plugin/indexer-elastic/plugin.xml +++ b/src/plugin/indexer-elastic/plugin.xml @@ -22,49 +22,50 @@ - - - - - - - - - - + + + + + + + + + + + - - - - - - - - - + + + + + + + + + - + - - - - - - - - - - - - - - - - - - + + + + + + + + + + + + + + + + + + @@ -80,4 +81,4 @@ class="org.apache.nutch.indexwriter.elastic.ElasticIndexWriter" /> - \ No newline at end of file +
[nutch] branch master updated: NUTCH-2857 Upgrade from JDK1.8 --> JDK11 (#573)
This is an automated email from the ASF dual-hosted git repository. lewismc pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git The following commit(s) were added to refs/heads/master by this push: new b91fae5 NUTCH-2857 Upgrade from JDK1.8 --> JDK11 (#573) b91fae5 is described below commit b91fae53e7de1d4c240ba91c951024441f2ea01f Author: Lewis John McGibbney AuthorDate: Sun Mar 21 08:30:41 2021 -0700 NUTCH-2857 Upgrade from JDK1.8 --> JDK11 (#573) * NUTCH-2857 Upgrade from JDK1.8 --> JDK11 --- .github/workflows/master-build.yml | 2 +- default.properties | 4 ++-- ivy/mvn.template | 4 ++-- .../org/apache/nutch/indexer/IndexWriterParams.java| 6 +++--- src/java/org/apache/nutch/metadata/MetaWrapper.java| 2 +- src/java/org/apache/nutch/net/URLNormalizers.java | 4 ++-- src/java/org/apache/nutch/parse/ParserChecker.java | 18 +- .../org/apache/nutch/segment/SegmentMergeFilter.java | 2 +- .../org/apache/nutch/segment/SegmentMergeFilters.java | 4 ++-- .../net/urlnormalizer/regex/RegexURLNormalizer.java| 2 +- 10 files changed, 24 insertions(+), 24 deletions(-) diff --git a/.github/workflows/master-build.yml b/.github/workflows/master-build.yml index 7e74840..e3ed11c 100644 --- a/.github/workflows/master-build.yml +++ b/.github/workflows/master-build.yml @@ -29,7 +29,7 @@ jobs: runs-on: ubuntu-latest strategy: matrix: -java: [ '1.8' ] +java: [ '11' ] steps: - uses: actions/checkout@v2 diff --git a/default.properties b/default.properties index f250904..cf82c84 100644 --- a/default.properties +++ b/default.properties @@ -43,7 +43,7 @@ test.junit.output.format = plain # Proxy Host and Port to use for building JavaDoc javadoc.proxy.host=-J-DproxyHost= javadoc.proxy.port=-J-DproxyPort= -javadoc.link.java=https://docs.oracle.com/javase/8/docs/api/ +javadoc.link.java=https://docs.oracle.com/en/java/javase/11/docs/api/ javadoc.link.hadoop=https://hadoop.apache.org/docs/r3.1.3/api/ 
#javadoc.link.lucene.core=https://lucene.apache.org/core/8_4_1/core/ #javadoc.link.lucene.analyzers-common=https://lucene.apache.org/core/8_4_1/analyzers-common/ @@ -57,7 +57,7 @@ bin.dist.version.dir=${dist.dir}/${final.name}-bin javac.debug=on javac.optimize=on javac.deprecation=on -javac.version=1.8 +javac.version=11 runtime.dir=./runtime runtime.deploy=${runtime.dir}/deploy diff --git a/ivy/mvn.template b/ivy/mvn.template index edfb550..b38b37f 100644 --- a/ivy/mvn.template +++ b/ivy/mvn.template @@ -130,8 +130,8 @@ maven-compiler-plugin 3.8.1 -1.8 -1.8 +11 +11 diff --git a/src/java/org/apache/nutch/indexer/IndexWriterParams.java b/src/java/org/apache/nutch/indexer/IndexWriterParams.java index e7b3152..52cc4f9 100644 --- a/src/java/org/apache/nutch/indexer/IndexWriterParams.java +++ b/src/java/org/apache/nutch/indexer/IndexWriterParams.java @@ -24,10 +24,10 @@ import java.util.Map; public class IndexWriterParams extends HashMap { /** - * Constructs a new HashMap with the same mappings as the - * specified Map. The HashMap is created with + * Constructs a new HashMap with the same mappings as the + * specified Map. The HashMap is created with * default load factor (0.75) and an initial capacity sufficient to - * hold the mappings in the specified Map. + * hold the mappings in the specified Map. * * @param m the map whose mappings are to be placed in this map * @throws NullPointerException if the specified map is null diff --git a/src/java/org/apache/nutch/metadata/MetaWrapper.java b/src/java/org/apache/nutch/metadata/MetaWrapper.java index a58253c..2547734 100644 --- a/src/java/org/apache/nutch/metadata/MetaWrapper.java +++ b/src/java/org/apache/nutch/metadata/MetaWrapper.java @@ -26,7 +26,7 @@ import org.apache.nutch.crawl.NutchWritable; /** * This is a simple decorator that adds metadata to any Writable-s that can be - * serialized by NutchWritable. This is useful when data needs to be + * serialized by {@link NutchWritable}. 
This is useful when data needs to be * temporarily enriched during processing, but this temporary metadata doesn't * need to be permanently stored after the job is done. * diff --git a/src/java/org/apache/nutch/net/URLNormalizers.java b/src/java/org/apache/nutch/net/URLNormalizers.java index 4ec904d..bf947f7 100644 --- a/src/java/org/apache/nutch/net/URLNormalizers.java +++ b/src/java/org/apache/nutch/net/URLNormalizers.java @@ -42,7 +42,7 @@ import org.apache.nutch.util.ObjectCache; * This class uses a "chained filter" pattern to run defined normalizers. * Different lists of normalizers may be defined for different "scopes", or * contexts wher
[nutch] branch master updated: NUTCH-2850 Method ignores exceptional return value (#570)
This is an automated email from the ASF dual-hosted git repository.

lewismc pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git

The following commit(s) were added to refs/heads/master by this push:
     new 2724578  NUTCH-2850 Method ignores exceptional return value (#570)
2724578 is described below

commit 2724578ab41cb9e8098975bddbde7df2085b1c61
Author: Lewis John McGibbney
AuthorDate: Thu Feb 18 07:21:43 2021 -0800

    NUTCH-2850 Method ignores exceptional return value (#570)
---
 src/java/org/apache/nutch/tools/FileDumper.java | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/src/java/org/apache/nutch/tools/FileDumper.java b/src/java/org/apache/nutch/tools/FileDumper.java
index 4e7338e..65c7dca 100644
--- a/src/java/org/apache/nutch/tools/FileDumper.java
+++ b/src/java/org/apache/nutch/tools/FileDumper.java
@@ -234,7 +234,9 @@ public class FileDumper {
               File fullOutputDir = new File(org.apache.commons.lang3.StringUtils.join(Arrays.copyOf(splitPath, splitPath.length - 1), "/"));
               if (!fullOutputDir.exists()) {
-                fullOutputDir.mkdirs();
+                if(!fullOutputDir.mkdirs());
+                  throw new Exception("Unable to create: [" +
+                      fullOutputDir.getAbsolutePath() + "]");
               }
             } else {
               outputFullPath = String.format("%s/%s", fullDir, DumpFileUtil.createFileName(md5Ofurl, baseName, extension));
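
The lines added to FileDumper in the NUTCH-2850 diff check the return value of `mkdirs()`. As written in the diff, however, the `if` statement ends with a semicolon, so the `throw` that follows would run regardless of whether the directory was created. Below is a minimal sketch of what the check presumably intends. The helper `ensureDirectory` and the class `EnsureDir` are hypothetical names used only for illustration and are not part of Nutch, and the sketch throws `IOException` rather than the generic `Exception` used in the commit.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

// Hypothetical helper class for illustration; not part of Nutch.
final class EnsureDir {

  /**
   * Creates the directory (including missing parents) if it does not exist.
   * Throws only when the directory is still absent afterwards.
   */
  static void ensureDirectory(File dir) throws IOException {
    // mkdirs() returns false when the directory already exists, so check
    // isDirectory() before and after to separate "already there" (or
    // created concurrently) from a genuine failure.
    if (!dir.isDirectory() && !dir.mkdirs() && !dir.isDirectory()) {
      throw new IOException("Unable to create: [" + dir.getAbsolutePath() + "]");
    }
  }

  public static void main(String[] args) throws IOException {
    File base = Files.createTempDirectory("dump").toFile();
    File nested = new File(base, "a/b/c");
    ensureDirectory(nested); // creates a/b/c under the temp directory
    ensureDirectory(nested); // second call is a no-op, no exception
    System.out.println(nested.isDirectory());
  }
}
```

Keeping the `throw` inside a braced block makes the condition and its consequence one visible unit, which avoids the stray-semicolon problem by construction.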
[nutch] branch master updated: NUTCH-2851 Random object created and used only once (#571)
This is an automated email from the ASF dual-hosted git repository. lewismc pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git The following commit(s) were added to refs/heads/master by this push: new 5250d62 NUTCH-2851 Random object created and used only once (#571) 5250d62 is described below commit 5250d62986468b23509a82d2aaa32bdc11cf02a8 Author: Lewis John McGibbney AuthorDate: Thu Feb 18 07:20:59 2021 -0800 NUTCH-2851 Random object created and used only once (#571) --- src/java/org/apache/nutch/crawl/Generator.java | 5 +++-- src/java/org/apache/nutch/indexer/IndexingJob.java | 4 +++- src/java/org/apache/nutch/segment/SegmentReader.java | 5 - src/java/org/apache/nutch/tools/DmozParser.java | 5 - 4 files changed, 14 insertions(+), 5 deletions(-) diff --git a/src/java/org/apache/nutch/crawl/Generator.java b/src/java/org/apache/nutch/crawl/Generator.java index dcba9bf..00eb18f 100644 --- a/src/java/org/apache/nutch/crawl/Generator.java +++ b/src/java/org/apache/nutch/crawl/Generator.java @@ -35,7 +35,6 @@ import org.apache.hadoop.conf.Configurable; import org.slf4j.Logger; import org.slf4j.LoggerFactory; import org.apache.commons.jexl3.JexlExpression; -import org.antlr.v4.parse.ANTLRParser.throwsSpec_return; import org.apache.commons.jexl3.JexlContext; import org.apache.commons.jexl3.MapContext; import org.apache.hadoop.mapreduce.Counter; @@ -90,6 +89,8 @@ import org.apache.nutch.util.URLUtil; **/ public class Generator extends NutchTool implements Tool { + private static final Random RANDOM = new Random(); + protected static final Logger LOG = LoggerFactory .getLogger(MethodHandles.lookup().lookupClass()); @@ -1013,7 +1014,7 @@ public class Generator extends NutchTool implements Tool { Job job = NutchJob.getInstance(getConf()); job.setJobName("generate: partition " + segment); Configuration conf = job.getConfiguration(); -conf.setInt("partition.url.seed", new Random().nextInt()); +conf.setInt("partition.url.seed", 
RANDOM.nextInt()); FileInputFormat.addInputPath(job, inputDir); job.setInputFormatClass(SequenceFileInputFormat.class); diff --git a/src/java/org/apache/nutch/indexer/IndexingJob.java b/src/java/org/apache/nutch/indexer/IndexingJob.java index 0966276..0fe29a7 100644 --- a/src/java/org/apache/nutch/indexer/IndexingJob.java +++ b/src/java/org/apache/nutch/indexer/IndexingJob.java @@ -54,6 +54,8 @@ import org.slf4j.LoggerFactory; public class IndexingJob extends NutchTool implements Tool { + private static final Random RANDOM = new Random(); + private static final Logger LOG = LoggerFactory .getLogger(MethodHandles.lookup().lookupClass()); @@ -136,7 +138,7 @@ public class IndexingJob extends NutchTool implements Tool { job.setReduceSpeculativeExecution(false); final Path tmp = new Path("tmp_" + System.currentTimeMillis() + "-" -+ new Random().nextInt()); ++ RANDOM.nextInt()); FileOutputFormat.setOutputPath(job, tmp); try { diff --git a/src/java/org/apache/nutch/segment/SegmentReader.java b/src/java/org/apache/nutch/segment/SegmentReader.java index 284daed..2f2fefd 100644 --- a/src/java/org/apache/nutch/segment/SegmentReader.java +++ b/src/java/org/apache/nutch/segment/SegmentReader.java @@ -35,6 +35,7 @@ import java.util.HashMap; import java.util.Iterator; import java.util.List; import java.util.Map; +import java.util.Random; import org.slf4j.Logger; import org.slf4j.LoggerFactory; @@ -74,6 +75,8 @@ import org.apache.nutch.util.SegmentReaderUtil; /** Dump the content of a segment. 
*/ public class SegmentReader extends Configured implements Tool { + private static final Random RANDOM = new Random(); + private static final Logger LOG = LoggerFactory .getLogger(MethodHandles.lookup().lookupClass()); @@ -220,7 +223,7 @@ public class SegmentReader extends Configured implements Tool { job.setJarByClass(SegmentReader.class); Path tempDir = new Path(conf.get("hadoop.tmp.dir", "/tmp") + "/segread-" -+ new java.util.Random().nextInt()); ++ RANDOM.nextInt()); FileSystem fs = tempDir.getFileSystem(conf); fs.delete(tempDir, true); diff --git a/src/java/org/apache/nutch/tools/DmozParser.java b/src/java/org/apache/nutch/tools/DmozParser.java index b68facb..8db4817 100644 --- a/src/java/org/apache/nutch/tools/DmozParser.java +++ b/src/java/org/apache/nutch/tools/DmozParser.java @@ -54,6 +54,9 @@ import org.apache.nutch.util.NutchConfiguration; * RDF into a flat file of URLs to be injected. */ public class DmozParser { + + private static final Random RANDOM = new Random(); + private static final Logger LOG = LoggerFactory .getLogger(MethodHandles.lookup().lookupClass()); @@ -134,7 +137,7 @@ public class DmozParser { this.includeAdult = incl
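
The NUTCH-2851 diff replaces one-off `new Random()` calls with a single `private static final Random RANDOM` per class. A minimal sketch of that pattern follows, modelled loosely on the temporary-path naming seen in the IndexingJob hunk. `TempNames` and `tempName` are illustrative names and are not Nutch code.

```java
import java.util.Random;

// Illustrative class; the names here are assumptions, not Nutch APIs.
final class TempNames {

  // One shared generator for the class instead of a new Random per call.
  // java.util.Random is thread-safe, so sharing it is safe, although
  // heavily concurrent code may prefer ThreadLocalRandom.
  private static final Random RANDOM = new Random();

  /** Builds a name of the form prefix + current millis + "-" + random int. */
  static String tempName(String prefix) {
    return prefix + System.currentTimeMillis() + "-" + RANDOM.nextInt();
  }

  public static void main(String[] args) {
    System.out.println(tempName("tmp_"));
  }
}
```

Reusing one instance avoids allocating and seeding a generator for a single draw, which is the pattern the issue title describes.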
[nutch] branch master updated: NUTCH-2849 Replace remaining package.html files with package-info.java (#569)
This is an automated email from the ASF dual-hosted git repository. lewismc pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git The following commit(s) were added to refs/heads/master by this push: new 2fae4cd NUTCH-2849 Replace remaining package.html files with package-info.java (#569) 2fae4cd is described below commit 2fae4cde67a05cf1fa9ecdd6b6bd5307c0e46fe7 Author: Lewis John McGibbney AuthorDate: Tue Feb 16 10:40:00 2021 -0800 NUTCH-2849 Replace remaining package.html files with package-info.java (#569) --- build.xml | 7 +++- .../org/apache/nutch/crawl/package-info.java} | 8 ++-- src/java/org/apache/nutch/crawl/package.html | 5 --- .../org/apache/nutch/fetcher/package-info.java}| 8 ++-- src/java/org/apache/nutch/fetcher/package.html | 5 --- .../org/apache/nutch/indexer/package-info.java}| 16 --- src/java/org/apache/nutch/indexer/package.html | 10 - .../org/apache/nutch/metadata/package-info.java} | 11 ++--- src/java/org/apache/nutch/metadata/package.html| 6 --- src/java/org/apache/nutch/plugin/package-info.java | 42 +++ src/java/org/apache/nutch/plugin/package.html | 40 -- .../apache/nutch/util/domain/package-info.java}| 17 +--- src/java/org/apache/nutch/util/domain/package.html | 14 --- .../org/creativecommons/nutch/package-info.java} | 8 ++-- .../java/org/creativecommons/nutch/package.html| 5 --- .../apache/nutch/indexer/anchor/package-info.java} | 8 ++-- .../org/apache/nutch/indexer/anchor/package.html | 5 --- .../apache/nutch/indexer/basic/package-info.java} | 10 ++--- .../org/apache/nutch/indexer/basic/package.html| 5 --- .../apache/nutch/indexer/more/package-info.java} | 11 ++--- .../org/apache/nutch/indexer/more/package.html | 6 --- .../nutch/indexer/staticfield/package-info.java} | 12 +++--- .../apache/nutch/indexer/staticfield/package.html | 5 --- .../apache/nutch/analysis/lang/package-info.java} | 13 +++--- .../org/apache/nutch/analysis/lang/package.html| 6 --- .../nutch/protocol/http/api/package-info.java} | 
11 ++--- .../apache/nutch/protocol/http/api/package.html| 6 --- .../nutch/microformats/reltag/package-info.java} | 11 ++--- .../apache/nutch/microformats/reltag/package.html | 8 .../org/apache/nutch/parse/html/package-info.java} | 11 ++--- .../java/org/apache/nutch/parse/html/package.html | 5 --- .../apache/nutch/protocol/file/package-info.java} | 8 ++-- .../org/apache/nutch/protocol/file/package.html| 5 --- .../apache/nutch/protocol/ftp/package-info.java} | 8 ++-- .../org/apache/nutch/protocol/ftp/package.html | 5 --- .../htmlunit/{package.html => package-info.java} | 8 ++-- .../apache/nutch/protocol/http/package-info.java} | 8 ++-- .../org/apache/nutch/protocol/http/package.html| 5 --- .../nutch/protocol/httpclient/package-info.java} | 15 --- .../apache/nutch/protocol/httpclient/package.html | 9 .../interactiveselenium/package-info.java} | 8 ++-- .../protocol/interactiveselenium/package.html | 5 --- .../nutch/protocol/selenium/package-info.java} | 8 ++-- .../apache/nutch/protocol/selenium/package.html| 5 --- .../nutch/scoring/metadata/package-info.java | 32 ++ .../org/apache/nutch/scoring/metadata/package.html | 33 --- .../org/apache/nutch/collection/package-info.java | 49 ++ .../java/org/apache/nutch/collection/package.html | 36 .../apache/nutch/indexer/tld/package-info.java}| 8 ++-- .../java/org/apache/nutch/indexer/tld/package.html | 5 --- .../apache/nutch/scoring/tld/package-info.java}| 8 ++-- .../java/org/apache/nutch/scoring/tld/package.html | 5 --- .../nutch/urlfilter/automaton/package-info.java} | 12 +++--- .../apache/nutch/urlfilter/automaton/package.html | 9 .../nutch/urlfilter/prefix/package-info.java} | 8 ++-- .../org/apache/nutch/urlfilter/prefix/package.html | 5 --- .../nutch/urlfilter/regex/package-info.java} | 10 ++--- .../org/apache/nutch/urlfilter/regex/package.html | 5 --- .../nutch/urlfilter/validator/package-info.java} | 14 --- .../apache/nutch/urlfilter/validator/package.html | 9 .../nutch/indexer/urlmeta/package-info.java} | 16 --- 
.../org/apache/nutch/indexer/urlmeta/package.html | 12 -- .../nutch/scoring/urlmeta/package-info.java} | 15 --- .../org/apache/nutch/scoring/urlmeta/package.html | 11 - 64 files changed, 292 insertions(+), 442 deletions(-) diff --git a/build.xml b/build.xml index ec003c3..dcb7b94 100644 --- a/build.xml +++ b/build.xml @@ -186,6 +186,7 @@ doctitle="${name} ${version} API" bottom="Copyright &amp;copy; ${year} The Apache Software Foundation" failonerror="true
[nutch] branch master updated: NUTCH-2840 Fix 'report-vulnerabilities' ant target in build.xml (#561)
This is an automated email from the ASF dual-hosted git repository. lewismc pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git The following commit(s) were added to refs/heads/master by this push: new 66bb62a NUTCH-2840 Fix 'report-vulnerabilities' ant target in build.xml (#561) 66bb62a is described below commit 66bb62a589ac2651771bf61b62786991e65539f8 Author: Lewis John McGibbney AuthorDate: Sun Jan 31 16:06:52 2021 -0800 NUTCH-2840 Fix 'report-vulnerabilities' ant target in build.xml (#561) * NUTCH-2840 Fix 'report-vulnerabilities' ant target in build.xml --- .gitignore | 2 ++ build.xml | 46 ++--- ivy/dependency-check-ant/lib/.gitignore | 19 ++ 3 files changed, 52 insertions(+), 15 deletions(-) diff --git a/.gitignore b/.gitignore index 6d96644..0612a99 100644 --- a/.gitignore +++ b/.gitignore @@ -25,3 +25,5 @@ naivebayes-model *.iml *.swp csvindexwriter +lib/spotbugs-* +ivy/dependency-check-ant/* diff --git a/build.xml b/build.xml index 882a54a..02a7cdd 100644 --- a/build.xml +++ b/build.xml @@ -37,9 +37,11 @@ - + + + - + @@ -646,24 +648,38 @@ - - - - - - - + + + + + + +https://github.com/jeremylong/DependencyCheck/releases/download/v${dependency-check-ant.version}/dependency-check-ant-${dependency-check-ant.version}-release.zip; + dest="${ivy.dir}/dependency-check-ant-${dependency-check-ant.version}-release.zip" usetimestamp="false" /> + + + + + + + + + + - - - - + + + + + - + diff --git a/ivy/dependency-check-ant/lib/.gitignore b/ivy/dependency-check-ant/lib/.gitignore new file mode 100644 index 000..e2dec72 --- /dev/null +++ b/ivy/dependency-check-ant/lib/.gitignore @@ -0,0 +1,19 @@ +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. 
+# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# Ignore everything in this directory +* +# Except this file +!.gitignore
[nutch] branch master updated: NUTCH-2819 Move spotbugs "installation" directory to avoid that spotbugs is shipped in Nutch runtime (#565)
This is an automated email from the ASF dual-hosted git repository. lewismc pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git The following commit(s) were added to refs/heads/master by this push: new cc0da7e NUTCH-2819 Move spotbugs "installation" directory to avoid that spotbugs is shipped in Nutch runtime (#565) cc0da7e is described below commit cc0da7e860723f7b8e89429a8f1f11551ecf118f Author: Sebastian Nagel AuthorDate: Mon Feb 1 01:05:27 2021 +0100 NUTCH-2819 Move spotbugs "installation" directory to avoid that spotbugs is shipped in Nutch runtime (#565) - install spotbugs into to ivy/spotbugs-x.x.x/ - upgrade to Spotbugs 4.2.0 - move task definition into spotbugs target, otherwise running download/installation and bug spotting together fails --- .gitignore | 1 + build.xml | 19 +-- 2 files changed, 10 insertions(+), 10 deletions(-) diff --git a/.gitignore b/.gitignore index 249ca77..6d96644 100644 --- a/.gitignore +++ b/.gitignore @@ -11,6 +11,7 @@ ivy/ivy-2.3.0.jar ivy/ivy-2.4.0.jar ivy/ivy-2.5.0-rc1.jar ivy/ivy-2.5.0.jar +ivy/spotbugs-*/ naivebayes-model .naivebayes-model.crc .gitconfig diff --git a/build.xml b/build.xml index 68a0f44..882a54a 100644 --- a/build.xml +++ b/build.xml @@ -41,8 +41,8 @@ - - + + @@ -1066,20 +1066,19 @@ https://github.com/spotbugs/spotbugs/releases/download/${spotbugs.version}/spotbugs-${spotbugs.version}.tgz " - dest="${basedir}/lib/spotbugs-${spotbugs.version}.tgz" usetimestamp="false" /> + dest="${ivy.dir}/spotbugs-${spotbugs.version}.tgz" usetimestamp="false" /> - + - + - - +
[nutch] branch master updated: Prepare for Nutch 1.19-SNAPSHOT development
This is an automated email from the ASF dual-hosted git repository. lewismc pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git The following commit(s) were added to refs/heads/master by this push: new ebf348c Prepare for Nutch 1.19-SNAPSHOT development ebf348c is described below commit ebf348cc6ec88a15ca0243c12fe18c31157ede89 Author: Lewis John McGibbney AuthorDate: Mon Jan 25 20:05:00 2021 -0800 Prepare for Nutch 1.19-SNAPSHOT development --- CHANGES.txt| 49 +++-- NOTICE.txt | 2 +- build.xml | 25 - conf/nutch-default.xml | 2 +- default.properties | 4 ++-- ivy/mvn.template | 12 +++- src/bin/nutch | 2 +- 7 files changed, 79 insertions(+), 17 deletions(-) diff --git a/CHANGES.txt b/CHANGES.txt index e5c5984..9946bc9 100644 --- a/CHANGES.txt +++ b/CHANGES.txt @@ -1,10 +1,55 @@ # Nutch Change Log -Nutch 1.18 Development +Nutch 1.18 Release 14/01/2021 (dd/mm/) +Release Report: https://s.apache.org/lqara Breaking Changes -- As part of NUTCH-2805, the plugin urlfilter-domainblacklist has been renamed to urlfilter-domaindenylist. And the fields required for the plugin urlfilter.domainblacklist.rules and urlfilter.domainblacklist.file has been replaced with urlfilter.domaindenylist.rules and urlfilter.domaindenylist.file respectively. See NUTCH-2802 for more details. +- As part of NUTCH-2805, the plugin urlfilter-domainblacklist has been renamed to urlfilter-domaindenylist. And the fields required for the plugin urlfilter.domainblacklist.rules and urlfilter.domainblacklist.file has been replaced with urlfilter.domaindenylist.rules and urlfilter.domaindenylist.file respectively. See NUTCH-2802 for more details. 
+ +Sub-task + +[NUTCH-2671] - Upgrade ant ivy library +[NUTCH-2672] - Ant build erronously installs *-test.jar instead *.jar for target "nightly" +[NUTCH-2805] - Rename plugin urlfilter-domainblacklist +[NUTCH-2809] - Upgrade any23 plugin dependency to 2.4 +[NUTCH-2816] - Add Spotbugs target to ant build +[NUTCH-2817] - Avoid check for equality of URL path and file part using ==/!= +[NUTCH-2829] - Fix ant target "clean-cache" + +Bug + +[NUTCH-2669] - Reliable solution for javax.ws packaging.type +[NUTCH-2697] - Upgrade Ivy to fix the issue of an unset packaging.type property +[NUTCH-2801] - RobotsRulesParser command-line checker to use http.robots.agents as fall-back +[NUTCH-2810] - FreeGenerator to actually apply configured number of fetch lists +[NUTCH-2813] - MoreIndexingFilter - can't parse erroneous date - 2019-07-03T10:28:14 +[NUTCH-2814] - HttpDateFormat's internal time zone may change after parsing a date +[NUTCH-2818] - Ant build: upgrade Apache Rat report task +[NUTCH-2823] - IllegalStateException in IndexWriters.describe() when validating url param for SolrIndexer +[NUTCH-2824] - urlnormalizer-basic to unescape percent-encoded host names + +Improvement + +[NUTCH-1190] - MoreIndexingFilter refactor: move data formats used to parse "lastModified" to a config file. 
+[NUTCH-2582] - Set pool size of XML SAX parsers used for MIME detection in Tika 1.19 +[NUTCH-2730] - SitemapProcessor to treat sitemap URLs as Set instead of List +[NUTCH-2782] - protocol-http / lib-http: support TLSv1.3 +[NUTCH-2796] - Upgrade to crawler-commons 1.1 +[NUTCH-2799] - Add .asf.yaml file +[NUTCH-2833] - Upgrade to Tika 1.25 +[NUTCH-2835] - Upgrade commons-jexl from 2 --> 3 +[NUTCH-2836] - Upgrade various commons dependencies +[NUTCH-2837] - Update multiple dependencies +[NUTCH-2841] - Upgrade xercesImpl dependency + +Wish + +[NUTCH-2834] - Deduplication mode via command line in crawl script + +Task + +[NUTCH-2830] - Upgrade any23 to v2.4 Nutch 1.17 Release 18/06/2020 (dd/mm/) Release Report: https://s.apache.org/ovhry diff --git a/NOTICE.txt b/NOTICE.txt index 71f29fa..1c9efd0 100644 --- a/NOTICE.txt +++ b/NOTICE.txt @@ -1,5 +1,5 @@ Apache Nutch -Copyright 2020 The Apache Software Foundation +Copyright 2021 The Apache Software Foundation This product includes software developed by The Apache Software Foundation (http://www.apache.org/). diff --git a/build.xml b/build.xml index 62ed5d1..68a0f44 100644 --- a/build.xml +++ b/build.xml @@ -37,6 +37,8 @@ + + @@ -311,8 +313,9 @@ - - + + + @@ -321,8 +324,9 @@ - - + + + @@ -332,8 +336,9 @@ - - + + + @@ -362,10 +367,12 @@ - - + + + + - +
svn commit: r45580 - /release/nutch/1.17/
Author: lewismc Date: Sun Jan 24 21:09:56 2021 New Revision: 45580 Log: Remove Nutch 1.17 from release area Removed: release/nutch/1.17/
svn commit: r1885887 - in /nutch/cms_site/trunk/content: ./ apidocs/apidocs-1.18/ apidocs/apidocs-1.18/org/ apidocs/apidocs-1.18/org/apache/ apidocs/apidocs-1.18/org/apache/nutch/ apidocs/apidocs-1.18
Author: lewismc Date: Sun Jan 24 20:45:18 2021 New Revision: 1885887 URL: http://svn.apache.org/viewvc?rev=1885887&view=rev Log: Update Nutch CMR website for 1.18 [This commit notification would consist of 260 parts, which exceeds the limit of 50 ones, so it was shortened to the summary.]
svn commit: r45570 - in /release/nutch/1.18: apache-nutch-1.18-bin.zip.md5 apache-nutch-1.18-src.tar.gz.md5 apache-nutch-1.18-src.zip.md5
Author: lewismc Date: Sun Jan 24 20:04:00 2021 New Revision: 45570 Log: Remove Nutch 1.18 .md5 artifacts Removed: release/nutch/1.18/apache-nutch-1.18-bin.zip.md5 release/nutch/1.18/apache-nutch-1.18-src.tar.gz.md5 release/nutch/1.18/apache-nutch-1.18-src.zip.md5
svn commit: r45569 - /release/nutch/1.18/apache-nutch-1.18-bin.tar.gz.md5
Author: lewismc Date: Sun Jan 24 20:02:57 2021 New Revision: 45569 Log: Remove .md5 Removed: release/nutch/1.18/apache-nutch-1.18-bin.tar.gz.md5
svn commit: r45568 - /dev/nutch/1.18/ /release/nutch/1.18/
Author: lewismc Date: Sun Jan 24 20:02:19 2021 New Revision: 45568 Log: Publish Nutch 1.18 to release area. Added: release/nutch/1.18/ - copied from r45567, dev/nutch/1.18/ Removed: dev/nutch/1.18/
svn commit: r45520 [3/3] - /dev/nutch/1.18/
Added: dev/nutch/1.18/apache-nutch-1.18-bin.tar.gz == Binary file - no diff available. Propchange: dev/nutch/1.18/apache-nutch-1.18-bin.tar.gz -- svn:mime-type = application/octet-stream Added: dev/nutch/1.18/apache-nutch-1.18-bin.tar.gz.asc == --- dev/nutch/1.18/apache-nutch-1.18-bin.tar.gz.asc (added) +++ dev/nutch/1.18/apache-nutch-1.18-bin.tar.gz.asc Thu Jan 21 00:44:46 2021 @@ -0,0 +1,16 @@ +-BEGIN PGP SIGNATURE- + +iQIzBAABCAAdFiEE23tRmRIcCKXI9AUrOkcX8Ei66/YFAmAHg5oACgkQOkcX8Ei6 +6/bhThAAu2KbUKiuaqomM4M+Kl9QLLfKU5fqwl5ffQ9I4ZOC/yWqaqpJbOriPNvX +2t4hDpTEKFA6yJTE1DggxxTLugsCSNYapRQc1ZBCf2gcPEoGEbDdMIxDyvZsZeQ0 +/XSDqP+OOFbX/Ggpl6MsJjO+1dM1Xn/QpRkAG65aW4rP2b0xR0gO3Uv9yonld1Fr +jrGbarItalZmKhuvlWQKidOYmpXeuIs1rQ0MBHgWbFVpgo/cLxbNRSk71nIrZKia +CAWMVQx43CuukqvjSwBTbrb04lI3I2F6PMC8pIiQPcXhCi8oHSrZ11I2TOaw4LnC +0WGN0qgQb/fJuI8nqCfOqaJY254r+Gy01BPO+boDH5XdcQy/OhlTm4smKaFOmACv +KoY0Y/lpf3eWumn51saMGjzpkYRTGB/p8zkEOmYfIUoLDT8MdMDfTZzkxn7lYiw/ +eGJvv6hD+pPksvNQdIFa3yydTEVsWST+z2jvsUK2gI8TwUUlp9JR63NMNijg4sN9 +JW64TjAopuWQrciuq1mGcTAGK7b/uxdmGk4NSX76cHFbRu6J8FOPIBnY2IF3rx03 +30UPu9c9SI0dokzsTLNNSxnXmN5LGawZ1tmqi7SiL6kOARKWljNzu06C5ZslYPLZ +eHJLzne64g6FuvkslIotkoEPVZ/fS3UetHh14jSr5QzYZ7KAyN8= +=sEcc +-END PGP SIGNATURE- Added: dev/nutch/1.18/apache-nutch-1.18-bin.tar.gz.md5 == --- dev/nutch/1.18/apache-nutch-1.18-bin.tar.gz.md5 (added) +++ dev/nutch/1.18/apache-nutch-1.18-bin.tar.gz.md5 Thu Jan 21 00:44:46 2021 @@ -0,0 +1 @@ +MD5(apache-nutch-1.18-bin.tar.gz)= 6f024cd88ee098340f0667125ad0578c Added: dev/nutch/1.18/apache-nutch-1.18-bin.tar.gz.sha512 == --- dev/nutch/1.18/apache-nutch-1.18-bin.tar.gz.sha512 (added) +++ dev/nutch/1.18/apache-nutch-1.18-bin.tar.gz.sha512 Thu Jan 21 00:44:46 2021 @@ -0,0 +1 @@ +SHA512(apache-nutch-1.18-bin.tar.gz)= f52d97a98c1facfb4586344068be8d57ff961abcfa8f4b416bde8d568fb5c44e78f9ecb5e4afa067b1162f01599fa52401d5eb31812f01c18e4fae5229968ff0 Added: dev/nutch/1.18/apache-nutch-1.18-bin.zip == Binary file - no diff available. 
Propchange: dev/nutch/1.18/apache-nutch-1.18-bin.zip -- svn:mime-type = application/octet-stream Added: dev/nutch/1.18/apache-nutch-1.18-bin.zip.asc == --- dev/nutch/1.18/apache-nutch-1.18-bin.zip.asc (added) +++ dev/nutch/1.18/apache-nutch-1.18-bin.zip.asc Thu Jan 21 00:44:46 2021 @@ -0,0 +1,16 @@ +-BEGIN PGP SIGNATURE- + +iQIzBAABCAAdFiEE23tRmRIcCKXI9AUrOkcX8Ei66/YFAmAHg6YACgkQOkcX8Ei6 +6/YUmxAAgckunK+2hpSpUzEQ2r2hnRvFt+jq8fqpBAPSjh6H9lnn7zBEK/aJHRgb +zskF4ZtATfkCmxHC5JYYA3noOwvjz/cSbNEYCXF8bngBUW02CtiZEXAHSIr2aPVD +4HuplylbdZ1ihlhSRiTKAzqA1f3LaGRR+Kpw4ag/eSrPquBeN+VSl+8ThvszJlwN +btZdOFshOYYkV6dVgI17Qp2rYY/XwUG/crBTlOIV9HBASYXs2sxFpkuUIlI60N8m +KYbVlxngFtCaNaBii76qh2mLCTD4SSN48XjY4cvD2bJOlCwdEGXRuAo+NB8auuTp +qdqMG4/3upgpsxeCErq5fTbgqR7weyPQAx3kelw9T/rM86YyXRva9pVs0mk9JxHv +yi5LcqrjnhD2xVa26vMQacfvVkBw8ev1a/Gahv8Xq1B1YzAn8YpTqsb1kC/nWKVe +1fD5KZPwYDBCGI0/puwXin90Y0jZ/D4xuI0sP/M5ZZ8fQuYWV3JGReI6+vH9KGha +x5jjfXMQ7k5BVFZA7DmirdW5IGfoHJLT7sRo0dTWMRTevlNC02TMp5jf62LbzZtW +dW6Nw1DPGcTmdHR2Cob8zgPwRV8iDoM2yj0290zcw5h59JDqyhhg9yG7kt23TAQS +xrlxTT76dTGrgtB3QsuR7uWf3xmNKFah9aeGdxb+j9cb2PWbZDY= +=BHRm +-END PGP SIGNATURE- Added: dev/nutch/1.18/apache-nutch-1.18-bin.zip.md5 == --- dev/nutch/1.18/apache-nutch-1.18-bin.zip.md5 (added) +++ dev/nutch/1.18/apache-nutch-1.18-bin.zip.md5 Thu Jan 21 00:44:46 2021 @@ -0,0 +1 @@ +MD5(apache-nutch-1.18-bin.zip)= 4563aa7c3216078ede022d4f182f48be Added: dev/nutch/1.18/apache-nutch-1.18-bin.zip.sha512 == --- dev/nutch/1.18/apache-nutch-1.18-bin.zip.sha512 (added) +++ dev/nutch/1.18/apache-nutch-1.18-bin.zip.sha512 Thu Jan 21 00:44:46 2021 @@ -0,0 +1 @@ +SHA512(apache-nutch-1.18-bin.zip)= be681ff067691d680669ca3fd84ca9f8c86d3d3ed04ab9c7b65b11eeb45c19d324820f24cd9f80db0d4b82034e9993c8412f2ac9a7d9943b262a03bb86f41595 Added: dev/nutch/1.18/apache-nutch-1.18-src.tar.gz == Binary file - no diff available. 
Propchange: dev/nutch/1.18/apache-nutch-1.18-src.tar.gz -- svn:mime-type = application/octet-stream Added: dev/nutch/1.18/apache-nutch-1.18-src.tar.gz.asc
svn commit: r45520 [2/3] - /dev/nutch/1.18/
ue in tika mimetype detection +[NUTCH-2224] - Average bytes/second calculated incorrectly in fetcher +[NUTCH-2225] - Parsed time calculated incorrectly +[NUTCH-2228] - Plugin index-replace unit test broken on Java 8 +[NUTCH-2232] - DeduplicationJob should decode URL's before length is compared +[NUTCH-2241] - Unstable Selenium plugin in Nutch. Fixed bugs and enhanced configuration +[NUTCH-2256] - Inconsistent log level practice + +Improvement + +[NUTCH-1233] - Rely on Tika for outlink extraction +[NUTCH-1712] - Use MultipleInputs in Injector to make it a single mapreduce job +[NUTCH-2172] - index-more: document format of contenttype-mapping.txt +[NUTCH-2178] - DeduplicationJob to optionally group on host or domain +[NUTCH-2182] - Make reverseUrlDirs file dumper option hash the URL for consistency +[NUTCH-2183] - Improvement to SegmentChecker for skipping non-segments present in segments directory +[NUTCH-2187] - Change FileDumper SHAs to all uppercase +[NUTCH-2195] - IndexingFilterChecker to optionally follow N redirects +[NUTCH-2196] - IndexingFilterChecker to optionally normalize +[NUTCH-2197] - Add solr5 solrcloud indexer support +[NUTCH-2204] - Remove junit lib from runtime +[NUTCH-2218] - Switch CrawlCompletion arg parsing to Commons CLI +[NUTCH-2221] - Introduce db.ignore.internal.links to FetcherThread +[NUTCH-2229] - Allow Jexl expressions on CrawlDatum's fixed attributes +[NUTCH-2231] - Jexl support in generator job +[NUTCH-2252] - Allow phantomjs as a browser for selenium options +[NUTCH-2263] - Support for mingram and maxgram at Unigram Cosine Similarity Model + +New Feature + +[NUTCH-961] - Expose Tika's boilerpipe support +[NUTCH-1325] - HostDB for Nutch +[NUTCH-2144] - Plugin to override db.ignore.external to exempt interesting external domain URLs +[NUTCH-2190] - Protocol normalizer +[NUTCH-2191] - Add protocol-htmlunit +[NUTCH-2194] - Run IndexingFilterChecker as simple Telnet server +[NUTCH-2219] - Criteria order to be configurable in 
DeduplicationJob +[NUTCH-2227] - RegexParseFilter +[NUTCH-2245] - Developed the NGram Model on the existing Unigram Cosine Similarity Model + +Task + +[NUTCH-2201] - Remove loops program from webgraph package +[NUTCH-2211] - Filter and normalizer checkers missing in bin/nutch +[NUTCH-2220] - Rename db.* options used only by the linkdb to linkdb.* + +Nutch 1.11 Release 03/12/2015 (dd/mm/) +Release Report: http://s.apache.org/nutch11 + +* NUTCH-2176 Clean up of log4j.properties (markus) + +* NUTCH-2107 plugin.xml to validate against plugin.dtd (snagel) + +* NUTCH-2177 Generator produces only one partition even in distributed mode (jnioche, snagel) + +* NUTCH-2158 Upgrade to Tika 1.11 (jnioche, snagel) + +* NUTCH-2175 Typos in property descriptions in nutch-default.xml (Roannel Fernández Hernández via snagel) + +* NUTCH-2069 Ignore external links based on domain (jnioche) + +* NUTCH-2173 String.join in FileDumper breaks the build (joyce) + +* NUTCH-2166 Add reverse URL format to dump tool (joyce) + +* NUTCH-2157 Addressing Miredot REST API Warnings (Sujen Shah) + +* NUTCH-2165 FileDumper Util hard codes part-# folder name (joyce) + +* NUTCH-2167 Backport TableUtil from 2.x for URL reversing (joyce) + +* NUTCH-2160 Upgrade Selenium Java to 2.48.2 (lewismc, kwhitehall) + +* NUTCH-2120 Remove MapWritable from trunk codebase (lewismc) + +* NUTCH-1911 Improve DomainStatistics tool command line parsing (joyce) + +* NUTCH-2064 URLNormalizer basic to encode reserved chars and decode non-reserved chars (markus, snagel) + +* NUTCH-2159 Ensure that all WebApp files are copied into generated artifacts for 1.X Webapp (lewismc) + +* NUTCH-2154 Nutch REST API (DB) suffering NullPointerException (Aron Ahmadia, Sujen Shah via mattmann) + +* NUTCH-2150 Add protocolstats utility (Michael Joyce via mattmann) + +* NUTCH-2146 hashCode on the Outlink class (jorgelbg via mattmann) + +* NUTCH-2155 Create a "crawl completeness" utility (Michael Joyce via mattmann) + +* NUTCH-1988 Make nested 
output directory dump optional... again (Michael Joyce via lewismc) + +* NUTCH-1800 Documentation for Nutch 1.X and 2.X REST APIs (lewismc) + +* NUTCH-2149 REST endpoint to read Nutch sequence files (Sujen Shah) + +* NUTCH-2139 Basic plugin to index inlinks and outlinks (jorgelbg) + +* NUTCH-2128 Review and update mapred --> mapreduce config params in crawl script (lewismc) + +* NUTCH-2141 Change the InteractiveSelenium plugin handler Interface to return page content + (Balaji Gurumurthy via mattmann) + +* NUTCH-2129 Add protocol status tracking to crawl datum (Michael Joyce via mattmann) + +* NUTCH-2142 Nutch File Dump - FileNotFoundException (Invalid Argument) Error (Karanjeet Singh via mattmann) + +* NUTCH-2136 Implement a different version of Naive Bayes Parse Filter (Asitang Mishra) + +* NUTCH-2109 Create a brute fo
svn commit: r45520 [1/3] - /dev/nutch/1.18/
Author: lewismc Date: Thu Jan 21 00:44:46 2021 New Revision: 45520 Log: Stage Apache Nutch 1.18 RC#1 Added: dev/nutch/1.18/ dev/nutch/1.18/CHANGES.txt dev/nutch/1.18/apache-nutch-1.18-bin.tar.gz (with props) dev/nutch/1.18/apache-nutch-1.18-bin.tar.gz.asc dev/nutch/1.18/apache-nutch-1.18-bin.tar.gz.md5 dev/nutch/1.18/apache-nutch-1.18-bin.tar.gz.sha512 dev/nutch/1.18/apache-nutch-1.18-bin.zip (with props) dev/nutch/1.18/apache-nutch-1.18-bin.zip.asc dev/nutch/1.18/apache-nutch-1.18-bin.zip.md5 dev/nutch/1.18/apache-nutch-1.18-bin.zip.sha512 dev/nutch/1.18/apache-nutch-1.18-src.tar.gz (with props) dev/nutch/1.18/apache-nutch-1.18-src.tar.gz.asc dev/nutch/1.18/apache-nutch-1.18-src.tar.gz.md5 dev/nutch/1.18/apache-nutch-1.18-src.tar.gz.sha512 dev/nutch/1.18/apache-nutch-1.18-src.zip (with props) dev/nutch/1.18/apache-nutch-1.18-src.zip.asc dev/nutch/1.18/apache-nutch-1.18-src.zip.md5 dev/nutch/1.18/apache-nutch-1.18-src.zip.sha512
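Note on the staged artifacts: each archive is accompanied by a `.sha512` file holding its hex digest. A minimal sketch of how such a digest could be recomputed and compared is shown below; the class and method names (`Sha512Example`, `sha512Hex`, `matches`) are hypothetical, and the sketch hashes an in-memory byte array rather than a real release file.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical illustration; not part of the Nutch release tooling.
class Sha512Example {

  // Returns the lowercase hex SHA-512 digest of the given bytes.
  static String sha512Hex(byte[] data) {
    try {
      MessageDigest md = MessageDigest.getInstance("SHA-512");
      byte[] digest = md.digest(data);
      StringBuilder hex = new StringBuilder(digest.length * 2);
      for (byte b : digest) {
        hex.append(String.format("%02x", b));
      }
      return hex.toString();
    } catch (NoSuchAlgorithmException e) {
      // SHA-512 is a required algorithm on every Java platform.
      throw new IllegalStateException(e);
    }
  }

  // Compares a computed digest with a published one, ignoring case and padding.
  static boolean matches(byte[] data, String publishedHex) {
    return sha512Hex(data).equalsIgnoreCase(publishedHex.trim());
  }

  public static void main(String[] args) {
    System.out.println(sha512Hex("abc".getBytes(StandardCharsets.UTF_8)));
  }
}
```

For a real artifact, the bytes would come from the downloaded file (for example via `java.nio.file.Files.readAllBytes`), and the published value is the hex string after `SHA512(...)=` in the matching `.sha512` file.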
[nutch] annotated tag release-1.18 updated (43f3550 -> a8ef299)
This is an automated email from the ASF dual-hosted git repository. lewismc pushed a change to annotated tag release-1.18 in repository https://gitbox.apache.org/repos/asf/nutch.git. *** WARNING: tag release-1.18 was modified! *** from 43f3550 (commit) to a8ef299 (tag) tagging 43f3550c1adef70a0acd9938737c5c3f899bc2be (commit) replaces release-1.13 by Lewis John McGibbney on Tue Jan 19 15:36:39 2021 -0800 - Log - Apache Nutch 1.18 RC#1 Tag --- No new revisions were added by this update. Summary of changes:
[nutch] branch branch-1.18 updated: Prepare for Nutch 1.18 release
This is an automated email from the ASF dual-hosted git repository. lewismc pushed a commit to branch branch-1.18 in repository https://gitbox.apache.org/repos/asf/nutch.git The following commit(s) were added to refs/heads/branch-1.18 by this push: new 43f3550 Prepare for Nutch 1.18 release 43f3550 is described below commit 43f3550c1adef70a0acd9938737c5c3f899bc2be Author: Lewis John McGibbney AuthorDate: Tue Jan 19 15:33:48 2021 -0800 Prepare for Nutch 1.18 release --- build.xml| 25 - ivy/mvn.template | 12 +++- 2 files changed, 27 insertions(+), 10 deletions(-) diff --git a/build.xml b/build.xml index 62ed5d1..1d71bc2 100644 --- a/build.xml +++ b/build.xml @@ -37,6 +37,8 @@ + + @@ -311,8 +313,9 @@ - - + + + @@ -321,8 +324,9 @@ - - + + + @@ -332,8 +336,9 @@ - - + + + @@ -362,10 +367,12 @@ - - + + + + - +
[nutch] branch branch-1.18 created (now e9f125c)
This is an automated email from the ASF dual-hosted git repository. lewismc pushed a change to branch branch-1.18 in repository https://gitbox.apache.org/repos/asf/nutch.git. at e9f125c Prepare for Nutch 1.18 release This branch includes the following new commits: new e9f125c Prepare for Nutch 1.18 release The 1 revisions listed above as "new" are entirely new to this repository and will be described in separate emails. The revisions listed as "add" were already present in the repository and have only been added to this reference.
[nutch] 01/01: Prepare for Nutch 1.18 release
This is an automated email from the ASF dual-hosted git repository. lewismc pushed a commit to branch branch-1.18 in repository https://gitbox.apache.org/repos/asf/nutch.git commit e9f125c62ae71903187959351b0f72da29937749 Author: Lewis John McGibbney AuthorDate: Thu Jan 14 15:27:00 2021 -0800 Prepare for Nutch 1.18 release --- CHANGES.txt| 50 -- NOTICE.txt | 2 +- conf/nutch-default.xml | 2 +- default.properties | 4 ++-- src/bin/nutch | 2 +- 5 files changed, 53 insertions(+), 7 deletions(-) diff --git a/CHANGES.txt b/CHANGES.txt index e5c5984..0613585 100644 --- a/CHANGES.txt +++ b/CHANGES.txt @@ -1,10 +1,56 @@ # Nutch Change Log -Nutch 1.18 Development +Nutch 1.18 Release 14/01/2021 (dd/mm/) +Release Report: https://s.apache.org/lqara Breaking Changes -- As part of NUTCH-2805, the plugin urlfilter-domainblacklist has been renamed to urlfilter-domaindenylist. And the fields required for the plugin urlfilter.domainblacklist.rules and urlfilter.domainblacklist.file has been replaced with urlfilter.domaindenylist.rules and urlfilter.domaindenylist.file respectively. See NUTCH-2802 for more details. +- As part of NUTCH-2805, the plugin urlfilter-domainblacklist has been renamed to urlfilter-domaindenylist. And the fields required for the plugin urlfilter.domainblacklist.rules and urlfilter.domainblacklist.file has been replaced with urlfilter.domaindenylist.rules and urlfilter.domaindenylist.file respectively. See NUTCH-2802 for more details. 
+ +Sub-task + +[NUTCH-2671] - Upgrade ant ivy library +[NUTCH-2672] - Ant build erronously installs *-test.jar instead *.jar for target "nightly" +[NUTCH-2805] - Rename plugin urlfilter-domainblacklist +[NUTCH-2809] - Upgrade any23 plugin dependency to 2.4 +[NUTCH-2816] - Add Spotbugs target to ant build +[NUTCH-2817] - Avoid check for equality of URL path and file part using ==/!= +[NUTCH-2829] - Fix ant target "clean-cache" + +Bug + +[NUTCH-2669] - Reliable solution for javax.ws packaging.type +[NUTCH-2697] - Upgrade Ivy to fix the issue of an unset packaging.type property +[NUTCH-2801] - RobotsRulesParser command-line checker to use http.robots.agents as fall-back +[NUTCH-2810] - FreeGenerator to actually apply configured number of fetch lists +[NUTCH-2813] - MoreIndexingFilter - can't parse erroneous date - 2019-07-03T10:28:14 +[NUTCH-2814] - HttpDateFormat's internal time zone may change after parsing a date +[NUTCH-2818] - Ant build: upgrade Apache Rat report task +[NUTCH-2823] - IllegalStateException in IndexWriters.describe() when validating url param for SolrIndexer +[NUTCH-2824] - urlnormalizer-basic to unescape percent-encoded host names + +Improvement + +[NUTCH-1190] - MoreIndexingFilter refactor: move data formats used to parse "lastModified" to a config file. 
+[NUTCH-2582] - Set pool size of XML SAX parsers used for MIME detection in Tika 1.19 +[NUTCH-2730] - SitemapProcessor to treat sitemap URLs as Set instead of List +[NUTCH-2782] - protocol-http / lib-http: support TLSv1.3 +[NUTCH-2796] - Upgrade to crawler-commons 1.1 +[NUTCH-2799] - Add .asf.yaml file +[NUTCH-2833] - Upgrade to Tika 1.25 +[NUTCH-2835] - Upgrade commons-jexl from 2 --> 3 +[NUTCH-2836] - Upgrade various commons dependencies +[NUTCH-2837] - Update multiple dependencies +[NUTCH-2841] - Upgrade xercesImpl dependency + +Wish + +[NUTCH-2834] - Deduplication mode via command line in crawl script + +Task + +[NUTCH-2830] - Upgrade any23 to v2.4 + Nutch 1.17 Release 18/06/2020 (dd/mm/) Release Report: https://s.apache.org/ovhry diff --git a/NOTICE.txt b/NOTICE.txt index 71f29fa..1c9efd0 100644 --- a/NOTICE.txt +++ b/NOTICE.txt @@ -1,5 +1,5 @@ Apache Nutch -Copyright 2020 The Apache Software Foundation +Copyright 2021 The Apache Software Foundation This product includes software developed by The Apache Software Foundation (http://www.apache.org/). diff --git a/conf/nutch-default.xml b/conf/nutch-default.xml index 6932eb5..df6916b 100644 --- a/conf/nutch-default.xml +++ b/conf/nutch-default.xml @@ -164,7 +164,7 @@ http.agent.version - Nutch-1.18-SNAPSHOT + Nutch-1.18 A version string to advertise in the User-Agent header. diff --git a/default.properties b/default.properties index e4b9619..fdb35b9 100644 --- a/default.properties +++ b/default.properties @@ -14,9 +14,9 @@ # limitations under the License. name=apache-nutch -version=1.18-SNAPSHOT +version=1.18 final.name=${name}-${version} -year=2020 +year=2021 basedir = ./ src.dir = ./src/java diff --git a/src/bin/nutch b/src/bin/nutch index 7d0d8ee..c501ea5 100755 --- a/src/bin/nutch +++ b/src/bin/nutch @@ -60,7 +60,7 @@ done # if no args specified, show usage if [ $# = 0 ]; then - echo "nutch 1.18-SNAPSHOT"
[nutch] branch master updated: NUTCH-2841 Upgrade xercesImpl dependency (#563)
This is an automated email from the ASF dual-hosted git repository. lewismc pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git The following commit(s) were added to refs/heads/master by this push: new 59c63c7 NUTCH-2841 Upgrade xercesImpl dependency (#563) 59c63c7 is described below commit 59c63c7d8a13b0de1fd1da6aa4a1ab6e20fa478d Author: Lewis John McGibbney AuthorDate: Wed Jan 13 10:56:07 2021 -0800 NUTCH-2841 Upgrade xercesImpl dependency (#563) * NUTCH-2841 Upgrade xercesImpl dependency --- ivy/ivy.xml | 2 +- src/java/org/apache/nutch/tools/DmozParser.java | 3 +++ 2 files changed, 4 insertions(+), 1 deletion(-) diff --git a/ivy/ivy.xml b/ivy/ivy.xml index ad1e65f..3f1faf3 100644 --- a/ivy/ivy.xml +++ b/ivy/ivy.xml @@ -66,7 +66,7 @@ - + diff --git a/src/java/org/apache/nutch/tools/DmozParser.java b/src/java/org/apache/nutch/tools/DmozParser.java index 63dbde8..a447646 100644 --- a/src/java/org/apache/nutch/tools/DmozParser.java +++ b/src/java/org/apache/nutch/tools/DmozParser.java @@ -276,8 +276,11 @@ public class DmozParser { throws IOException, SAXException, ParserConfigurationException { SAXParserFactory parserFactory = SAXParserFactory.newInstance(); + parserFactory.setFeature("http://xml.org/sax/features/external-general-entities", false); + parserFactory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true); SAXParser parser = parserFactory.newSAXParser(); XMLReader reader = parser.getXMLReader(); +reader.setFeature("http://xml.org/sax/features/external-general-entities", false); // Create our own processor to receive SAX events RDFProcessor rp = new RDFProcessor(reader, subsetDenom, includeAdult, skew,
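Note on the DmozParser change above: the two `setFeature` calls switch off external general entities and reject any DOCTYPE declaration before the parser is created. A minimal standalone sketch of that hardening is given below, assuming the JDK's built-in SAX parser, which recognizes both feature URIs; the class and method names (`HardenedSaxExample`, `newHardenedFactory`, `parses`) are hypothetical and not taken from Nutch.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration; not a Nutch class.
class HardenedSaxExample {

  // Creates a factory with the same two features set as in the diff above.
  static SAXParserFactory newHardenedFactory() {
    try {
      SAXParserFactory factory = SAXParserFactory.newInstance();
      factory.setFeature("http://xml.org/sax/features/external-general-entities", false);
      factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
      return factory;
    } catch (ParserConfigurationException | SAXException e) {
      throw new IllegalStateException("XML parser does not support the requested features", e);
    }
  }

  // Returns true if the document parses, false if the parser rejects it.
  static boolean parses(String xml) {
    try {
      SAXParser parser = newHardenedFactory().newSAXParser();
      parser.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
          new DefaultHandler());
      return true;
    } catch (SAXException e) {
      return false; // e.g. a DOCTYPE declaration was encountered
    } catch (ParserConfigurationException | IOException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(parses("<a/>"));
    System.out.println(parses("<!DOCTYPE a [<!ENTITY x \"y\">]><a>&x;</a>"));
  }
}
```

With `disallow-doctype-decl` enabled, any input carrying a DOCTYPE is rejected outright, which is why the second document in `main` should fail to parse while the first succeeds.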
[nutch] branch master updated: NUTCH-2837 Update multiple dependencies (#560)
This is an automated email from the ASF dual-hosted git repository. lewismc pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git The following commit(s) were added to refs/heads/master by this push: new 7f0fdb1 NUTCH-2837 Update multiple dependencies (#560) 7f0fdb1 is described below commit 7f0fdb15a339cae72fda9624f1260ee4869688ef Author: Lewis John McGibbney AuthorDate: Fri Jan 8 10:01:38 2021 -0800 NUTCH-2837 Update multiple dependencies (#560) * NUTCH-2837 Upgrade Slf4j dependencies * NUTCH-2837 Update multiple dependencies --- ivy/ivy.xml | 30 +++--- 1 file changed, 15 insertions(+), 15 deletions(-) diff --git a/ivy/ivy.xml b/ivy/ivy.xml index 0aa1de4..ad1e65f 100644 --- a/ivy/ivy.xml +++ b/ivy/ivy.xml @@ -32,8 +32,8 @@ - - + + - + - + @@ -78,18 +78,18 @@ - - - - - - - - - + + + + + + + + + - + @@ -105,7 +105,7 @@ - +
[nutch] branch master updated: NUTCH-2836 Upgrade various commons dependencies (#559)
This is an automated email from the ASF dual-hosted git repository. lewismc pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git The following commit(s) were added to refs/heads/master by this push: new fbd53ba NUTCH-2836 Upgrade various commons dependencies (#559) fbd53ba is described below commit fbd53ba16bc8dd751425757273996216ec80cd78 Author: Lewis John McGibbney AuthorDate: Thu Jan 7 20:41:37 2021 -0800 NUTCH-2836 Upgrade various commons dependencies (#559) --- ivy/ivy.xml | 10 +- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/ivy/ivy.xml b/ivy/ivy.xml index a20d8a6..0aa1de4 100644 --- a/ivy/ivy.xml +++ b/ivy/ivy.xml @@ -41,11 +41,11 @@ - - - - - + + + + +
[nutch] branch master updated: Add possibility to setup deduplication group mode in crawl script (#557)
This is an automated email from the ASF dual-hosted git repository. lewismc pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git The following commit(s) were added to refs/heads/master by this push: new 88a17f2 Add possibility to setup deduplication group mode in crawl script (#557) 88a17f2 is described below commit 88a17f26b4160720bacb3ead1cad71ae24a559bc Author: Jakob Berlin AuthorDate: Thu Dec 17 17:59:30 2020 +0100 Add possibility to setup deduplication group mode in crawl script (#557) --- src/bin/crawl | 16 +++- 1 file changed, 15 insertions(+), 1 deletion(-) diff --git a/src/bin/crawl b/src/bin/crawl index 23a2940..db42218 100755 --- a/src/bin/crawl +++ b/src/bin/crawl @@ -48,6 +48,8 @@ # --time-limit-fetch Number of minutes allocated to the fetching [default: 180] # --num-threads Number of threads for fetching / sitemap processing [default: 50] # +# -dedup-group Deduplication group method [default: none] +# function __to_seconds() { NUMBER=$(echo $1 | tr -dc '0-9') @@ -107,6 +109,7 @@ function __print_usage { echo -e " \t\t\t\t\t - never [default]" echo -e " \t\t\t\t\t - always (processing takes place in every iteration)" echo -e " \t\t\t\t\t - once (processing only takes place in the first iteration)" + echo -e " -dedup-group \tDeduplication group method [default: none]" exit 1 } @@ -124,6 +127,7 @@ SIZE_FETCHLIST=50000 # 25K x NUM_TASKS TIME_LIMIT_FETCH=180 NUM_THREADS=50 SITEMAPS_FROM_HOSTDB_FREQUENCY=never +DEDUP_GROUP=none while [[ $# > 0 ]] do @@ -177,6 +181,10 @@ do SITEMAPS_FROM_HOSTDB_FREQUENCY="${2}" shift 2 ;; +--dedup-group) +DEDUP_GROUP="${2}" +shift 2 +;; --hostdbupdate) HOSTDBUPDATE=true shift @@ -197,6 +205,12 @@ if [[ ! "$SITEMAPS_FROM_HOSTDB_FREQUENCY" =~ ^(never|always|once)$ ]]; then __print_usage fi +if [[ ! "$DEDUP_GROUP" =~ ^(none|host|domain)$ ]]; then + echo "Error: --dedup-group has to be one of none, host, domain."
+ echo -e "" + __print_usage +fi + if [[ $# != 2 ]]; then __print_usage fi @@ -385,7 +399,7 @@ do __bin_nutch invertlinks "${commonOptions[@]}" "$CRAWL_PATH"/linkdb "$CRAWL_PATH"/segments/$SEGMENT -noNormalize -nofilter echo "Dedup on crawldb" - __bin_nutch dedup "${commonOptions[@]}" "$CRAWL_PATH"/crawldb + __bin_nutch dedup "${commonOptions[@]}" "$CRAWL_PATH"/crawldb -group "$DEDUP_GROUP" if $INDEXFLAG; then echo "Indexing $SEGMENT to index"
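The option handling in the diff above checks --dedup-group against a fixed set of values (none, host, domain), defaults to none, and then hands the value to the dedup step as -group. The same logic can be sketched in Java. The class and method names here are hypothetical and exist only for illustration; the real implementation is the shell code in src/bin/crawl, and the common options that the script also passes are left out.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper mirroring the shell check in src/bin/crawl; not part of Nutch.
class DedupGroupOption {

  // Same alternatives as the shell regex ^(none|host|domain)$
  private static final Pattern ALLOWED = Pattern.compile("^(none|host|domain)$");

  // Default taken from the diff: DEDUP_GROUP=none
  static final String DEFAULT = "none";

  /** Returns the group to use, falling back to the default when the option was not given. */
  static String validate(String value) {
    String group = (value == null) ? DEFAULT : value;
    if (!ALLOWED.matcher(group).matches()) {
      throw new IllegalArgumentException(
          "--dedup-group has to be one of none, host, domain.");
    }
    return group;
  }

  /** Builds the dedup arguments: the crawldb path followed by -group and the mode. */
  static List<String> dedupArgs(String crawlPath, String group) {
    return Arrays.asList("dedup", crawlPath + "/crawldb", "-group", validate(group));
  }

  public static void main(String[] args) {
    System.out.println(dedupArgs("crawl", null));
    System.out.println(dedupArgs("crawl", "host"));
  }
}
```

Validating before the crawl loop starts, as the script does, means a mistyped value fails immediately instead of after the first fetch cycle.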
[nutch] branch master updated: NUTCH-2835 Upgrade commons-jexl from 2 --> 3 (#558)
This is an automated email from the ASF dual-hosted git repository. lewismc pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git The following commit(s) were added to refs/heads/master by this push: new 8d8e08b NUTCH-2835 Upgrade commons-jexl from 2 --> 3 (#558) 8d8e08b is described below commit 8d8e08b354fd94fced548c0b73623a375bcc8b2b Author: Lewis John McGibbney AuthorDate: Thu Dec 17 08:56:04 2020 -0800 NUTCH-2835 Upgrade commons-jexl from 2 --> 3 (#558) --- ivy/ivy.xml | 2 +- src/java/org/apache/nutch/crawl/CrawlDatum.java | 8 src/java/org/apache/nutch/crawl/CrawlDbReader.java | 4 ++-- src/java/org/apache/nutch/crawl/Generator.java | 12 ++-- src/java/org/apache/nutch/hostdb/ReadHostDb.java| 17 +++-- src/java/org/apache/nutch/util/JexlUtil.java| 12 +--- .../org/apache/nutch/exchange/jexl/JexlExchange.java| 8 .../apache/nutch/indexer/jexl/JexlIndexingFilter.java | 10 +- 8 files changed, 34 insertions(+), 39 deletions(-) diff --git a/ivy/ivy.xml b/ivy/ivy.xml index 16ed8a6..a20d8a6 100644 --- a/ivy/ivy.xml +++ b/ivy/ivy.xml @@ -46,7 +46,7 @@ - + diff --git a/src/java/org/apache/nutch/crawl/CrawlDatum.java b/src/java/org/apache/nutch/crawl/CrawlDatum.java index e05d7fd..5159bdb 100644 --- a/src/java/org/apache/nutch/crawl/CrawlDatum.java +++ b/src/java/org/apache/nutch/crawl/CrawlDatum.java @@ -25,9 +25,9 @@ import java.util.HashSet; import java.util.Map; import java.util.Map.Entry; -import org.apache.commons.jexl2.JexlContext; -import org.apache.commons.jexl2.Expression; -import org.apache.commons.jexl2.MapContext; +import org.apache.commons.jexl3.JexlContext; +import org.apache.commons.jexl3.JexlExpression; +import org.apache.commons.jexl3.MapContext; import org.apache.hadoop.io.FloatWritable; import org.apache.hadoop.io.IntWritable; import org.apache.hadoop.io.Text; @@ -542,7 +542,7 @@ public class CrawlDatum implements WritableComparable, Cloneable { } } - public boolean evaluate(Expression expr, String url) { + public 
boolean evaluate(JexlExpression expr, String url) { if (expr != null && url != null) { // Create a context and add data JexlContext jcontext = new MapContext(); diff --git a/src/java/org/apache/nutch/crawl/CrawlDbReader.java b/src/java/org/apache/nutch/crawl/CrawlDbReader.java index 1bb8160..3af63d3 100644 --- a/src/java/org/apache/nutch/crawl/CrawlDbReader.java +++ b/src/java/org/apache/nutch/crawl/CrawlDbReader.java @@ -74,7 +74,7 @@ import org.apache.nutch.util.NutchJob; import org.apache.nutch.util.SegmentReaderUtil; import org.apache.nutch.util.StringUtil; import org.apache.nutch.util.TimingUtil; -import org.apache.commons.jexl2.Expression; +import org.apache.commons.jexl3.JexlExpression; import com.fasterxml.jackson.core.JsonGenerationException; import com.fasterxml.jackson.core.JsonGenerator; @@ -864,7 +864,7 @@ public class CrawlDbReader extends AbstractChecker implements Closeable { Matcher matcher = null; String status = null; Integer retry = null; -Expression expr = null; +JexlExpression expr = null; float sample; @Override diff --git a/src/java/org/apache/nutch/crawl/Generator.java b/src/java/org/apache/nutch/crawl/Generator.java index 04c2ae8..c3f4469 100644 --- a/src/java/org/apache/nutch/crawl/Generator.java +++ b/src/java/org/apache/nutch/crawl/Generator.java @@ -34,9 +34,9 @@ import java.util.Random; import org.apache.hadoop.conf.Configurable; import org.slf4j.Logger; import org.slf4j.LoggerFactory; -import org.apache.commons.jexl2.Expression; -import org.apache.commons.jexl2.JexlContext; -import org.apache.commons.jexl2.MapContext; +import org.apache.commons.jexl3.JexlExpression; +import org.apache.commons.jexl3.JexlContext; +import org.apache.commons.jexl3.MapContext; import org.apache.hadoop.mapreduce.Counter; import org.apache.hadoop.mapreduce.Job; import org.apache.hadoop.mapreduce.Mapper; @@ -182,7 +182,7 @@ public class Generator extends NutchTool implements Tool { private float scoreThreshold = 0f; private int intervalThreshold = -1; 
private byte restrictStatus = -1; -private Expression expr = null; +private JexlExpression expr = null; @Override public void setup( @@ -306,8 +306,8 @@ public class Generator extends NutchTool implements Tool { private URLNormalizers normalizers; private static boolean normalise; private SequenceFile.Reader[] hostdbReaders = null; -private Expression maxCountExpr = null; -private Expression fetchDelayExpr = null; +private JexlExpression maxCountExpr = null; +private JexlExpression fetchDelayExpr = null; pu
[nutch] branch master updated: NUTCH-2809 Upgrade any23 plugin dependency to 2.4 (#553)
This is an automated email from the ASF dual-hosted git repository. lewismc pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git The following commit(s) were added to refs/heads/master by this push: new 235af3c NUTCH-2809 Upgrade any23 plugin dependency to 2.4 (#553) 235af3c is described below commit 235af3c8ed547590dd83049d90ec0f86b78e5f7a Author: Lewis John McGibbney AuthorDate: Tue Nov 17 19:10:18 2020 -0800 NUTCH-2809 Upgrade any23 plugin dependency to 2.4 (#553) * NUTCH-2809 Upgrade any23 plugin dependency to 2.4 --- .gitignore | 1 + src/plugin/any23/ivy.xml | 2 +- src/plugin/any23/plugin.xml| 283 +++-- .../apache/nutch/any23/TestAny23ParseFilter.java | 4 +- 4 files changed, 157 insertions(+), 133 deletions(-) diff --git a/.gitignore b/.gitignore index 02a74cf..249ca77 100644 --- a/.gitignore +++ b/.gitignore @@ -23,3 +23,4 @@ naivebayes-model .idea/ *.iml *.swp +csvindexwriter diff --git a/src/plugin/any23/ivy.xml b/src/plugin/any23/ivy.xml index 9a0aa34..d821b32 100644 --- a/src/plugin/any23/ivy.xml +++ b/src/plugin/any23/ivy.xml @@ -36,7 +36,7 @@ - + diff --git a/src/plugin/any23/plugin.xml b/src/plugin/any23/plugin.xml index 71c5522..934709d 100644 --- a/src/plugin/any23/plugin.xml +++ b/src/plugin/any23/plugin.xml @@ -25,162 +25,185 @@ - - - - - - - - - - - - + + + + + + + + + + + + + + + - - + + - - - + + + + - - - + + + - - - - - + + + + + + + - - - + + + + + + - - - - - - - - - - - - + + + + + + + + + + - - - - - - - - + + + + + + + + + + + + + + + + - + + + - - - - - - - + + + + + + + - - - - - - + + + + - + + - - + + - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + - - - - - - + + + + + + + + - - - + + + + + - - - - - + + + + + + + diff --git a/src/plugin/any23/src/test/org/apache/nutch/any23/TestAny23ParseFilter.java 
b/src/plugin/any23/src/test/org/apache/nutch/any23/TestAny23ParseFilter.java index 3f0ace3..09c253f 100644 --- a/src/plugin/any23/src/test/org/apache/nutch/any23/TestAny23ParseFilter.java +++ b/src/plugin/any23/src/test/org/apache/nutch/any23/TestAny23ParseFilter.java @@ -49,9 +49,9 @@ public class TestAny23ParseFilter { private String file2 = "microdata_basic.html"; - private static final int EXPECTED_TRIPLES_1 = 68; + private static final int EXPECTED_TRIPLES_1 = 79; - private static final int EXPECTED_TRIPLES_2 = 38; + private static final int EXPECTED_TRIPLES_2 = 40; @Before public void setUp() {
[nutch] branch branch-2.4 created (now 4944597)
This is an automated email from the ASF dual-hosted git repository. lewismc pushed a change to branch branch-2.4 in repository https://gitbox.apache.org/repos/asf/nutch.git. at 4944597 Prepare for Nutch 2.4 release candidate This branch includes the following new commits: new 4944597 Prepare for Nutch 2.4 release candidate The 1 revisions listed above as "new" are entirely new to this repository and will be described in separate emails. The revisions listed as "add" were already present in the repository and have only been added to this reference.
[nutch] 01/01: Prepare for Nutch 2.4 release candidate
This is an automated email from the ASF dual-hosted git repository. lewismc pushed a commit to branch branch-2.4 in repository https://gitbox.apache.org/repos/asf/nutch.git commit 49445974a1f31d2e304c75e274aa6fd39afc95b9 Author: Lewis John McGibbney AuthorDate: Sat Mar 9 16:23:32 2019 -0800 Prepare for Nutch 2.4 release candidate --- CHANGES.txt| 108 - NOTICE.txt | 2 +- README.md | 4 ++ conf/nutch-default.xml | 2 +- default.properties | 4 +- 5 files changed, 107 insertions(+), 13 deletions(-) diff --git a/CHANGES.txt b/CHANGES.txt index b7f1345..e27e358 100644 --- a/CHANGES.txt +++ b/CHANGES.txt @@ -1,14 +1,104 @@ Nutch Change Log -Nutch 2.4 Development - - * NUTCH-2256 Inconsistent log level (songwanging via snagel) - - * NUTCH-961 GitHub-92 Add the boilerpipe parsing adapted from NUTCH-961 (Jeremie Bourseaux via mattmann) - - * GitHub-94 Fix the issue of the bad timestamp. (Jeremie Bourseaux via mattmann) - - * NUTCH-1314 Impose a limit on the length of outlink target urls (ferdy, lewismc, tejasp, Canan Girgin, Tien Nguyen Manh) +Nutch 2.4 Release 09032018 (ddmmyyyy) +Release Report - https://s.apache.org/bFfL + +Sub-task + +[NUTCH-2284] - Basic Authentication Support for REST API +[NUTCH-2285] - Digest Authentication Support for REST API +[NUTCH-2289] - SSL Support for REST API +[NUTCH-2294] - Authorization Support for REST API +[NUTCH-2301] - Create Tests for Security Layer of NutchServer + +Bug + +[NUTCH-2089] - Move Nutch 2.x to compile on JDK 8 +[NUTCH-2112] - Missing org.restlet.jee when building with gora-solr +[NUTCH-2222] - re-fetch deletes all metadata except _csh_ and _rs_ +[NUTCH-2256] - Inconsistent log level practice +[NUTCH-2259] - Nutch 2.x HBase Docker requires a logs folder to run exception free +[NUTCH-2260] - JAVA_HOME and hbase-common dependency absent from hbase Docker image +[NUTCH-2266] - Fix dead link in build.xml for javadoc +[NUTCH-2269] - Clean not working after crawl +[NUTCH-2282] - Incorrect content-type returned in 4 API calls +[NUTCH-2283]
- "Bad substitution" error when running cassandra docker scripts +[NUTCH-2305] - generate.min.score doesn't work in 2.x +[NUTCH-2314] - Use indexer-elastic2 Plugin for javadoc and eclipse Targets +[NUTCH-2337] - urlnormalizer-basic to strip empty port +[NUTCH-2346] - Check Types at Object Equality +[NUTCH-2348] - Close GZIPInputStream +[NUTCH-2349] - urlnormalizer-basic NPE for ill-formed URL "http:/" +[NUTCH-2350] - Add Missing activeConfId Field to NutchStatus Object +[NUTCH-2358] - HostInjectorJob doesn't work +[NUTCH-2364] - http.agent.rotate: IllegalArgumentException / last element of agent names ignored +[NUTCH-2388] - bin/crawl indexing only webpages containing batchID instead of all in 2.x +[NUTCH-2393] - 2.x patch for MD5 duplication issue addressed in NUTCH-2391 +[NUTCH-2404] - Failed Jenkin Build #1588 error in unit test resolved +[NUTCH-2405] - jsoup-extractor structure correction, typo fixed +[NUTCH-2437] - gora mongodb mapping file error +[NUTCH-2446] - URLFiltersCheck fix +[NUTCH-2448] - Allow Sending an empty http.agent.version +[NUTCH-2451] - protocol-ftp to resolve relative URL when following redirects +[NUTCH-2469] - Documents not commited to solr in Sever mode +[NUTCH-2475] - If and else-if branches has the same condition +[NUTCH-2513] - ant eclipse target fails with "protocol switch unsafe" +[NUTCH-2520] - Wrong Accept-Charset sent when http.accept.charset is not defined +[NUTCH-2533] - Injector: NullPointerException if seed URL dir contains non-file entries +[NUTCH-2536] - GeneratorReducer.count is a static variable +[NUTCH-2548] - Compressed content skipped. 
Content of size 78 was truncated to 74 +[NUTCH-2581] - Caching of redirected robots.txt may overwrite correct robots.txt rules +[NUTCH-2637] - Number of fetcher reducers is misconfigured when the arg not passed +[NUTCH-2639] - bin/nutch fails to set native library path on Cygwin causing jobs to fail with UnsatisfiedLinkError +[NUTCH-2640] - Typo: DbUpdaterJob: updatinging all +[NUTCH-2641] - ClassCastException in webui +[NUTCH-2642] - MoreIndexingFilter parses ISO 8601 UTC dates in local time zone + +New Feature + +[NUTCH-1741] - Support of Sitemaps in Nutch 2.x +[NUTCH-2199] - Documentation for Nutch 2.X REST API +[NUTCH-2238] - Indexer for Elasticsearch 2.x +[NUTCH-2243] - Documentation for Nutch 2.X REST API +[NUTCH-2344] - Authentication Support for Web GUI +[NUTCH-2373] - Indexer for Hbase +[NUTCH-2389] - Precise data parsing using Jsoup CSS selectors + +Improvement + +[NUTCH-1314] - Impose a limit on the length of outlink target urls +[NUTCH-1678] - Remove dependency on o
[nutch] branch master updated: NUTCH-2698 Remove sonar build task from build.xml (#443)
This is an automated email from the ASF dual-hosted git repository. lewismc pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git The following commit(s) were added to refs/heads/master by this push: new 8bdec5e NUTCH-2698 Remove sonar build task from build.xml (#443) 8bdec5e is described below commit 8bdec5e3ef77f816c616c978c775a0eb3b4a391a Author: Lewis John McGibbney AuthorDate: Tue Mar 5 13:04:36 2019 -0800 NUTCH-2698 Remove sonar build task from build.xml (#443) --- build.xml | 26 -- 1 file changed, 26 deletions(-) diff --git a/build.xml b/build.xml index 65e8f3f..18f659a 100644 --- a/build.xml +++ b/build.xml @@ -999,32 +999,6 @@ - - - - - - - - - - - - - - - - - - - - - - - - -
[nutch] branch master updated: NUTCH-2697: Upgrade Ivy to fix the issue of an unset packaging.type property. (#441)
This is an automated email from the ASF dual-hosted git repository. lewismc pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git The following commit(s) were added to refs/heads/master by this push: new 0b0fcea NUTCH-2697: Upgrade Ivy to fix the issue of an unset packaging.type property. (#441) 0b0fcea is described below commit 0b0fcea1720cbe0722a18a2a29977e4bfec685bb Author: Chris Gavin AuthorDate: Sat Mar 2 03:48:27 2019 +0000 NUTCH-2697: Upgrade Ivy to fix the issue of an unset packaging.type property. (#441) --- default.properties | 2 +- ivy/ivysettings.xml | 8 2 files changed, 1 insertion(+), 9 deletions(-) diff --git a/default.properties b/default.properties index bb987d9..1423025 100644 --- a/default.properties +++ b/default.properties @@ -63,7 +63,7 @@ runtime.dir=./runtime runtime.deploy=${runtime.dir}/deploy runtime.local=${runtime.dir}/local -ivy.version=2.4.0 +ivy.version=2.5.0-rc1 ivy.dir=${basedir}/ivy ivy.file=${ivy.dir}/ivy.xml ivy.jar=${ivy.dir}/ivy-${ivy.version}.jar diff --git a/ivy/ivysettings.xml b/ivy/ivysettings.xml index a2dc700..d9b5044 100644 --- a/ivy/ivysettings.xml +++ b/ivy/ivysettings.xml @@ -38,14 +38,6 @@ value="[organisation]/[module]/[revision]/[module]-[revision](-[classifier])"/> - -
[nutch] branch master updated: NUTCH-2633 Fix deprecation warnings when building Nutch master branch under JDK 10.0.2+13 (#374)
This is an automated email from the ASF dual-hosted git repository. lewismc pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git The following commit(s) were added to refs/heads/master by this push: new f02110f NUTCH-2633 Fix deprecation warnings when building Nutch master branch under JDK 10.0.2+13 (#374) f02110f is described below commit f02110f42c53e77450835776cf41f22c23f030ec Author: Lewis John McGibbney AuthorDate: Fri Aug 10 17:43:36 2018 -0700 NUTCH-2633 Fix deprecation warnings when building Nutch master branch under JDK 10.0.2+13 (#374) --- .../apache/nutch/crawl/AbstractFetchSchedule.java | 0 .../apache/nutch/crawl/AdaptiveFetchSchedule.java | 0 src/java/org/apache/nutch/crawl/CrawlDatum.java| 2 +- src/java/org/apache/nutch/crawl/CrawlDbMerger.java | 1 - src/java/org/apache/nutch/crawl/CrawlDbReader.java | 4 --- .../apache/nutch/crawl/DefaultFetchSchedule.java | 0 src/java/org/apache/nutch/crawl/FetchSchedule.java | 0 .../apache/nutch/crawl/FetchScheduleFactory.java | 2 +- .../nutch/crawl/MimeAdaptiveFetchSchedule.java | 2 +- .../org/apache/nutch/crawl/SignatureFactory.java | 2 +- src/java/org/apache/nutch/fetcher/Fetcher.java | 2 +- src/java/org/apache/nutch/hostdb/ReadHostDb.java | 4 +-- .../org/apache/nutch/hostdb/ResolverThread.java| 1 + src/java/org/apache/nutch/indexer/CleaningJob.java | 2 ++ src/java/org/apache/nutch/indexer/IndexWriter.java | 3 ++ .../org/apache/nutch/indexer/IndexingFilters.java | 8 - src/java/org/apache/nutch/plugin/Extension.java| 10 +-- src/java/org/apache/nutch/plugin/Plugin.java | 3 +- src/java/org/apache/nutch/protocol/Content.java| 0 src/java/org/apache/nutch/protocol/Protocol.java | 0 .../apache/nutch/protocol/ProtocolException.java | 0 .../org/apache/nutch/protocol/ProtocolFactory.java | 6 .../org/apache/nutch/protocol/ProtocolStatus.java | 34 +++--- .../nutch/segment/ContentAsTextInputFormat.java| 1 + .../org/apache/nutch/segment/SegmentReader.java| 14 - 
.../org/apache/nutch/service/impl/LinkReader.java | 22 ++ .../org/apache/nutch/service/impl/NodeReader.java | 22 ++ .../service/impl/NutchServerPoolExecutor.java | 1 + .../service/model/response/FetchNodeDbInfo.java| 4 +++ .../apache/nutch/service/resources/DbResource.java | 3 ++ src/java/org/apache/nutch/tools/Benchmark.java | 2 ++ .../apache/nutch/tools/CommonCrawlDataDumper.java | 2 +- .../apache/nutch/tools/CommonCrawlFormatWARC.java | 2 -- src/java/org/apache/nutch/tools/DmozParser.java| 15 ++ src/java/org/apache/nutch/tools/FileDumper.java| 2 +- .../apache/nutch/tools/arc/ArcSegmentCreator.java | 1 + .../org/apache/nutch/tools/warc/WARCExporter.java | 1 - .../org/apache/nutch/util/AbstractChecker.java | 2 ++ .../apache/nutch/util/CrawlCompletionStats.java| 6 ++-- .../org/apache/nutch/util/EncodingDetector.java| 3 ++ .../nutch/util/GenericWritableConfigurable.java| 2 +- .../apache/nutch/util/domain/DomainStatistics.java | 2 -- .../apache/nutch/any23/TestAny23ParseFilter.java | 13 - .../creativecommons/nutch/TestCCParseFilter.java | 0 .../apache/nutch/parse/feed/TestFeedParser.java| 10 +-- .../nutch/indexer/basic/BasicIndexingFilter.java | 6 .../nutch/indexer/geoip/GeoIPDocumentCreator.java | 3 +- .../nutch/indexer/jexl/JexlIndexingFilter.java | 2 +- .../indexer/links/TestLinksIndexingFilter.java | 1 - .../nutch/indexer/replace/ReplaceIndexer.java | 2 +- .../cloudsearch/CloudSearchIndexWriter.java| 1 + .../nutch/indexwriter/dummy/DummyIndexWriter.java | 4 --- .../elasticrest/ElasticRestIndexWriter.java| 5 .../indexwriter/elastic/ElasticIndexWriter.java| 1 + .../elastic/TestElasticIndexWriter.java| 3 ++ .../nutch/indexwriter/rabbit/RabbitDocument.java | 2 ++ .../indexer/filter/MimeTypeIndexingFilter.java | 1 + .../indexer/filter/MimeTypeIndexingFilterTest.java | 1 - .../org/apache/nutch/parse/html/HtmlParser.java| 1 + .../java/org/apache/nutch/parse/swf/SWFParser.java | 4 +-- .../parse/tika/BoilerpipeExtractorRepository.java | 2 +- 
.../org/apache/nutch/parse/tika/TikaParser.java| 2 +- .../apache/nutch/parse/tika/TestFeedParser.java| 7 - .../nutch/parsefilter/regex/RegexParseFilter.java | 1 - .../parsefilter/regex/TestRegexParseFilter.java| 2 -- .../org/apache/nutch/protocol/file/FileError.java | 1 + .../apache/nutch/protocol/file/FileResponse.java | 4 +-- .../java/org/apache/nutch/protocol/ftp/Ftp.java| 1 + .../org/apache/nutch/protocol/ftp/FtpError.java| 1 + .../org/apache/nutch/protocol/ftp/FtpResponse.java | 8 ++--- .../nutch/protocol/htmlunit/HttpResponse.java | 2 ++ .../java/org/apache/nutch/protocol/http/Http.java | 0
[nutch] branch 2.x updated: NUTCH-2222 re-fetch deletes all metadata except _csh_ and _rs_
This is an automated email from the ASF dual-hosted git repository. lewismc pushed a commit to branch 2.x in repository https://gitbox.apache.org/repos/asf/nutch.git The following commit(s) were added to refs/heads/2.x by this push: new c43c2c8 NUTCH-2222 re-fetch deletes all metadata except _csh_ and _rs_ c43c2c8 is described below commit c43c2c85874295ef94982694fc28c068d5447234 Author: Lewis John McGibbney AuthorDate: Wed Aug 1 11:26:04 2018 -0700 NUTCH-2222 re-fetch deletes all metadata except _csh_ and _rs_ --- src/java/org/apache/nutch/fetcher/FetcherJob.java | 1 + 1 file changed, 1 insertion(+) diff --git a/src/java/org/apache/nutch/fetcher/FetcherJob.java b/src/java/org/apache/nutch/fetcher/FetcherJob.java index 82e7a12..f4b97cb 100644 --- a/src/java/org/apache/nutch/fetcher/FetcherJob.java +++ b/src/java/org/apache/nutch/fetcher/FetcherJob.java @@ -75,6 +75,7 @@ public class FetcherJob extends NutchTool implements Tool { FIELDS.add(WebPage.Field.MARKERS); FIELDS.add(WebPage.Field.REPR_URL); FIELDS.add(WebPage.Field.FETCH_TIME); +FIELDS.add(WebPage.Field.METADATA); } /**
[nutch] branch master updated: NUTCH-2539 (#300)
This is an automated email from the ASF dual-hosted git repository. lewismc pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git The following commit(s) were added to refs/heads/master by this push: new 666022d NUTCH-2539 (#300) 666022d is described below commit 666022d67ff0e3694540e4b97369cb73f1dfa377 Author: Semyon <oked...@users.noreply.github.com> AuthorDate: Wed Apr 11 00:53:40 2018 +0200 NUTCH-2539 (#300) * Merge branch 'master', remote branch 'origin' * Squashed commit of the following: commit 68363b1bba07ac8b21f6418633dec3f554996703 Author: Semyon Semyonov <semyon.semyo...@mail.com> Date: Mon Mar 19 14:48:11 2018 +0100 added description to crawldb.url.normalizers commit b53039e4b877fd52cac95d3df52133a0c914e4e1 Author: Semyon Semyonov <semyon.semyo...@mail.com> Date: Mon Mar 19 14:27:25 2018 +0100 misspelling in nutch-default crawldb.url.filters commit 73e3f6493f2f5cdb3b5336ee61854a3754e4b051 Author: Semyon Semyonov <semyon.semyo...@mail.com> Date: Mon Mar 19 14:23:26 2018 +0100 db.url.filters and db.url.normalizers renamed to crawldb.* for the code match --- conf/nutch-default.xml | 14 ++ 1 file changed, 10 insertions(+), 4 deletions(-) diff --git a/conf/nutch-default.xml b/conf/nutch-default.xml index 20b8691..405b99f 100644 --- a/conf/nutch-default.xml +++ b/conf/nutch-default.xml @@ -548,15 +548,21 @@ -db.url.normalizers +crawldb.url.normalizers false -Normalize urls when updating crawldb + + !Temporary, can be overwritten with the command line! + Normalize urls when updating crawldb + -db.url.filters +crawldb.url.filters false -Filter urls when updating crawldb + + !Temporary, can be overwritten with the command line! + Filter urls when updating crawldb + -- To stop receiving notification emails like this one, please contact lewi...@apache.org.
[nutch] 01/01: Merge pull request #309 from HansBrende/NUTCH-2550
This is an automated email from the ASF dual-hosted git repository. lewismc pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git commit f612c4133444e7c765a6c690b5ee6c373ee12265 Merge: 8682b96 de19028 Author: Lewis John McGibbney <lewis.mcgibb...@gmail.com> AuthorDate: Tue Apr 10 15:51:25 2018 -0700 Merge pull request #309 from HansBrende/NUTCH-2550 fix for NUTCH-2550 contributed by Hans Brende src/java/org/apache/nutch/fetcher/FetcherThread.java | 2 ++ 1 file changed, 2 insertions(+)
[nutch] branch master updated (8682b96 -> f612c41)
This is an automated email from the ASF dual-hosted git repository. lewismc pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git. from 8682b96 Merge pull request #307 from Omkar20895/NUTCH-2518 add de19028 fix for NUTCH-2550 contributed by Hans Brende new f612c41 Merge pull request #309 from HansBrende/NUTCH-2550 The 1 revisions listed above as "new" are entirely new to this repository and will be described in separate emails. The revisions listed as "add" were already present in the repository and have only been added to this reference. Summary of changes: src/java/org/apache/nutch/fetcher/FetcherThread.java | 2 ++ 1 file changed, 2 insertions(+)
[nutch] 01/01: Merge pull request #306 from lewismc/NUTCH-2545
This is an automated email from the ASF dual-hosted git repository. lewismc pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git commit 615331b81e04e3f50f766df442594f0436e51bca Merge: 2934d43 d7e8a26 Author: Lewis John McGibbney <lewis.mcgibb...@gmail.com> AuthorDate: Mon Apr 2 09:09:45 2018 -0700 Merge pull request #306 from lewismc/NUTCH-2545 NUTCH-2545 Upgrade to Any23 2.2 ivy/ivysettings.xml| 12 - src/plugin/any23/howto_upgrade_any23.txt | 8 +- src/plugin/any23/ivy.xml | 3 +- src/plugin/any23/plugin.xml| 323 ++--- .../org/apache/nutch/any23/Any23ParseFilter.java | 29 +- .../apache/nutch/any23/TestAny23ParseFilter.java | 9 +- 6 files changed, 174 insertions(+), 210 deletions(-)
[nutch] branch master updated (2934d43 -> 615331b)
This is an automated email from the ASF dual-hosted git repository. lewismc pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git. from 2934d43 Merge pull request #305 from sebastian-nagel/NUTCH-2447-ssl-handshake-alert add 5233a79 NUTCH-2545 Upgrade to Any23 2.2 add d6ed255 ANY23-2545 remove previous syntax correction. add 40e92a5 NUTCH-2545 Revert syntax correction to original implementation, add commons-rdf-api dependency add d7e8a26 NUTCH-2545 Upgrade to Any23 2.2 new 615331b Merge pull request #306 from lewismc/NUTCH-2545 The 1 revisions listed above as "new" are entirely new to this repository and will be described in separate emails. The revisions listed as "add" were already present in the repository and have only been added to this reference. Summary of changes: ivy/ivysettings.xml| 12 - src/plugin/any23/howto_upgrade_any23.txt | 8 +- src/plugin/any23/ivy.xml | 3 +- src/plugin/any23/plugin.xml| 323 ++--- .../org/apache/nutch/any23/Any23ParseFilter.java | 29 +- .../apache/nutch/any23/TestAny23ParseFilter.java | 9 +- 6 files changed, 174 insertions(+), 210 deletions(-)
[nutch] 01/01: Merge pull request #298 from benmvachon/NUTCH-2536
This is an automated email from the ASF dual-hosted git repository. lewismc pushed a commit to branch 2.x in repository https://gitbox.apache.org/repos/asf/nutch.git commit d48c67e2f1853cc1f2a7da1a04f6a22d524d5685 Merge: 4c72756 dcada64 Author: Lewis John McGibbney <lewis.mcgibb...@gmail.com> AuthorDate: Tue Mar 27 12:04:24 2018 -0700 Merge pull request #298 from benmvachon/NUTCH-2536 NUTCH-2536 change GeneratorReducer.count field to non-static variable… src/java/org/apache/nutch/crawl/GeneratorReducer.java | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
[nutch] branch 2.x updated (4c72756 -> d48c67e)
This is an automated email from the ASF dual-hosted git repository. lewismc pushed a change to branch 2.x in repository https://gitbox.apache.org/repos/asf/nutch.git. from 4c72756 NUTCH-2520 Use default value for Accept-Charset if http.accept.charset is undefined add dcada64 NUTCH-2536 change GeneratorReducer.count field to non-static variable for easier SDK experience new d48c67e Merge pull request #298 from benmvachon/NUTCH-2536 The 1 revisions listed above as "new" are entirely new to this repository and will be described in separate emails. The revisions listed as "add" were already present in the repository and have only been added to this reference. Summary of changes: src/java/org/apache/nutch/crawl/GeneratorReducer.java | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
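NUTCH-2536 changes GeneratorReducer.count from a static field to an instance field. The commit message gives only "easier SDK experience" as the reason. The sketch below is a rough illustration of the difference, assuming the concern is that several reducer instances may live in one JVM (for example when Nutch is embedded as a library). The class names are made up and do not mirror GeneratorReducer's actual code.

```java
// Hypothetical classes for illustration only; they do not mirror GeneratorReducer's code.
class CounterScopeDemo {

  /** Counter declared static: one value shared by every instance in the JVM. */
  static class SharedCountReducer {
    static int count = 0;

    void reduce() {
      count++;
    }

    int count() {
      return count;
    }
  }

  /** Counter declared as an instance field: each reducer keeps its own value. */
  static class OwnCountReducer {
    int count = 0;

    void reduce() {
      count++;
    }

    int count() {
      return count;
    }
  }

  public static void main(String[] args) {
    SharedCountReducer a = new SharedCountReducer();
    SharedCountReducer b = new SharedCountReducer();
    a.reduce();
    a.reduce();
    // b never called reduce(), yet it observes a's increments.
    System.out.println("static field seen by b: " + b.count());

    OwnCountReducer c = new OwnCountReducer();
    OwnCountReducer d = new OwnCountReducer();
    c.reduce();
    c.reduce();
    // d is unaffected by c.
    System.out.println("instance field seen by d: " + d.count());
  }
}
```

With the instance field, a second job started in the same process begins counting from zero without any explicit reset.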
[nutch] branch master updated (31819b7 -> 7cb7abd)
This is an automated email from the ASF dual-hosted git repository.

lewismc pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git.

    from 31819b7 NUTCH-2523 UpdateHostDB blocks usage of plugins unintentionally (contributed by Yossi Tamari)
     add b834b81 NUTCH-2516 Hadoop imports use wildcards
     add eff0b86 NUTCH-2516 Hadoop imports use wildcards
     add 303fd19 NUTCH-2516 Hadoop imports use wildcards
     new 7cb7abd Merge pull request #295 from lewismc/NUTCH-2516

The 1 revisions listed above as "new" are entirely new to this repository and will be described in separate emails. The revisions listed as "add" were already present in the repository and have only been added to this reference.

Summary of changes:
 .gitignore | 6 +
 ivy/ivy-2.4.0.jar | Bin 1282424 -> 0 bytes
 src/java/org/apache/nutch/crawl/CrawlDatum.java | 28 ++-
 src/java/org/apache/nutch/crawl/CrawlDb.java | 30 ++-
 src/java/org/apache/nutch/crawl/CrawlDbFilter.java | 1 -
 src/java/org/apache/nutch/crawl/CrawlDbMerger.java | 15 +-
 src/java/org/apache/nutch/crawl/CrawlDbReader.java | 4 -
 .../org/apache/nutch/crawl/CrawlDbReducer.java | 8 +-
 .../org/apache/nutch/crawl/DeduplicationJob.java | 4 -
 src/java/org/apache/nutch/crawl/Generator.java | 31 ++-
 src/java/org/apache/nutch/crawl/Inlink.java | 8 +-
 src/java/org/apache/nutch/crawl/Inlinks.java | 19 +-
 src/java/org/apache/nutch/crawl/LinkDbFilter.java | 2 -
 src/java/org/apache/nutch/crawl/LinkDbMerger.java | 1 -
 src/java/org/apache/nutch/crawl/LinkDbReader.java | 19 +-
 .../org/apache/nutch/crawl/SignatureFactory.java | 1 -
 .../org/apache/nutch/crawl/URLPartitioner.java | 3 +-
 src/java/org/apache/nutch/fetcher/FetchNodeDb.java | 1 -
 src/java/org/apache/nutch/fetcher/Fetcher.java | 28 ++-
 .../apache/nutch/fetcher/FetcherOutputFormat.java | 5 -
 .../org/apache/nutch/fetcher/FetcherThread.java | 2 -
 .../apache/nutch/fetcher/FetcherThreadEvent.java | 1 -
 src/java/org/apache/nutch/fetcher/QueueFeeder.java | 1 -
 src/java/org/apache/nutch/hostdb/HostDatum.java | 2 -
 src/java/org/apache/nutch/hostdb/ReadHostDb.java | 5 -
 src/java/org/apache/nutch/hostdb/UpdateHostDb.java | 6 -
 .../apache/nutch/hostdb/UpdateHostDbMapper.java | 4 -
 .../apache/nutch/hostdb/UpdateHostDbReducer.java | 3 -
 src/java/org/apache/nutch/indexer/CleaningJob.java | 3 -
 src/java/org/apache/nutch/indexer/IndexWriter.java | 1 -
 .../org/apache/nutch/indexer/IndexWriters.java | 1 -
 .../org/apache/nutch/indexer/IndexerMapReduce.java | 5 -
 .../apache/nutch/indexer/IndexerOutputFormat.java | 1 -
 .../org/apache/nutch/indexer/IndexingFilter.java | 2 -
 .../org/apache/nutch/indexer/IndexingFilters.java | 1 -
 .../nutch/indexer/IndexingFiltersChecker.java | 1 -
 src/java/org/apache/nutch/indexer/IndexingJob.java | 1 -
 src/java/org/apache/nutch/indexer/NutchField.java | 17 +-
 .../org/apache/nutch/metadata/CreativeCommons.java | 6 +-
 .../org/apache/nutch/metadata/HttpHeaders.java | 18 +-
 .../org/apache/nutch/net/URLExemptionFilter.java | 3 +-
 src/java/org/apache/nutch/net/URLFilter.java | 2 -
 .../org/apache/nutch/net/URLFilterChecker.java | 7 -
 .../org/apache/nutch/net/URLNormalizerChecker.java | 7 -
 .../org/apache/nutch/net/protocols/Response.java | 2 -
 .../org/apache/nutch/parse/HtmlParseFilter.java | 3 -
 src/java/org/apache/nutch/parse/ParseData.java | 16 +-
 src/java/org/apache/nutch/parse/ParseImpl.java | 7 +-
 .../org/apache/nutch/parse/ParseOutputFormat.java | 19 +-
 .../org/apache/nutch/parse/ParsePluginList.java | 1 -
 .../org/apache/nutch/parse/ParsePluginsReader.java | 4 -
 src/java/org/apache/nutch/parse/ParseSegment.java | 39 ++--
 src/java/org/apache/nutch/parse/ParseText.java | 24 +-
 src/java/org/apache/nutch/parse/ParseUtil.java | 2 -
 src/java/org/apache/nutch/parse/Parser.java | 2 -
 src/java/org/apache/nutch/parse/ParserFactory.java | 4 -
 src/java/org/apache/nutch/protocol/Content.java | 3 -
 src/java/org/apache/nutch/protocol/Protocol.java | 2 -
 .../org/apache/nutch/protocol/ProtocolFactory.java | 6 +-
 .../apache/nutch/protocol/RobotRulesParser.java | 3 -
 .../apache/nutch/scoring/webgraph/LinkDumper.java | 4 -
 .../apache/nutch/scoring/webgraph/LinkRank.java | 2 -
 .../apache/nutch/scoring/webgraph/NodeDumper.java | 2 -
 .../apache/nutch/scoring/webgraph/NodeReader.java | 1 -
 .../nutch/scoring/webgraph/ScoreUpdater.java | 2 -
 .../apache/nutch/scoring/webgraph/WebGraph.java | 2 -
 .../org/apache/nutch/segment/SegmentReader.java | 9 +-
 src/java/org/apache/nutch/service/NutchReader.java | 1 -
 .../org/apache/nutch/service/impl/NodeReader.java |
[nutch] branch master updated (0e28af6 -> 8bf139d)
This is an automated email from the ASF dual-hosted git repository.

lewismc pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git.

    from 0e28af6 fixed hdfs file checks in crawl script
     add dc516b7 NUTCH-2517 mergesegs corrupts segment data
     new 8bf139d Merge pull request #293 from lewismc/NUTCH-2517

The 1 revisions listed above as "new" are entirely new to this repository and will be described in separate emails. The revisions listed as "add" were already present in the repository and have only been added to this reference.

Summary of changes:
 src/java/org/apache/nutch/crawl/LinkDb.java | 130 ++---
 .../org/apache/nutch/segment/SegmentMerger.java | 201 ++---
 2 files changed, 163 insertions(+), 168 deletions(-)
[nutch] 01/01: Merge pull request #293 from lewismc/NUTCH-2517
This is an automated email from the ASF dual-hosted git repository.

lewismc pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git

commit 8bf139d76ac3d9c7a557fb297b0947b8bc1d6065
Merge: 0e28af6 dc516b7
Author: Lewis John McGibbney <lewis.mcgibb...@gmail.com>
AuthorDate: Wed Mar 14 08:36:00 2018 -0700

    Merge pull request #293 from lewismc/NUTCH-2517

    NUTCH-2517 mergesegs corrupts segment data

 src/java/org/apache/nutch/crawl/LinkDb.java | 130 ++---
 .../org/apache/nutch/segment/SegmentMerger.java | 201 ++---
 2 files changed, 163 insertions(+), 168 deletions(-)
[nutch] branch master updated (a2f637e -> 54510e5)
This is an automated email from the ASF dual-hosted git repository.

lewismc pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git.

    from a2f637e Merge pull request #284 from YossiTamari/master
     add c93d908 NUTCH-2375 Upgrading nutch to use org.apache.hadoop.mapreduce
     add fe5bfb4 Merge branch 'master' into NUTCH-2375
     add 405682e Merge branch 'NUTCH-2375' of https://github.com/Omkar20895/nutch into NUTCH-2375
     new 54510e5 Merge pull request #221 from Omkar20895/NUTCH-2375

The 1 revisions listed above as "new" are entirely new to this repository and will be described in separate emails. The revisions listed as "add" were already present in the repository and have only been added to this reference.

Summary of changes:
 src/java/org/apache/nutch/crawl/CrawlDb.java | 48 +-
 src/java/org/apache/nutch/crawl/CrawlDbFilter.java | 38 +-
 src/java/org/apache/nutch/crawl/CrawlDbMerger.java | 46 +-
 src/java/org/apache/nutch/crawl/CrawlDbReader.java | 315
 .../org/apache/nutch/crawl/CrawlDbReducer.java | 44 +-
 .../org/apache/nutch/crawl/DeduplicationJob.java | 121 ++-
 src/java/org/apache/nutch/crawl/Generator.java | 873 +++--
 src/java/org/apache/nutch/crawl/LinkDb.java | 226 +++---
 src/java/org/apache/nutch/crawl/LinkDbFilter.java | 30 +-
 src/java/org/apache/nutch/crawl/LinkDbMerger.java | 90 +--
 src/java/org/apache/nutch/crawl/LinkDbReader.java | 52 +-
 .../nutch/crawl/MimeAdaptiveFetchSchedule.java | 2 +-
 .../org/apache/nutch/crawl/URLPartitioner.java | 15 +-
 src/java/org/apache/nutch/fetcher/FetchNode.java | 2 +-
 src/java/org/apache/nutch/fetcher/FetchNodeDb.java | 2 +-
 src/java/org/apache/nutch/fetcher/Fetcher.java | 576 +++---
 .../apache/nutch/fetcher/FetcherOutputFormat.java | 70 +-
 .../org/apache/nutch/fetcher/FetcherThread.java | 118 +--
 src/java/org/apache/nutch/fetcher/QueueFeeder.java | 26 +-
 src/java/org/apache/nutch/hostdb/HostDatum.java | 2 +-
 src/java/org/apache/nutch/hostdb/ReadHostDb.java | 5 +-
 .../org/apache/nutch/hostdb/ResolverThread.java | 37 +-
 src/java/org/apache/nutch/hostdb/UpdateHostDb.java | 56 +-
 .../apache/nutch/hostdb/UpdateHostDbMapper.java | 50 +-
 .../apache/nutch/hostdb/UpdateHostDbReducer.java | 52 +-
 src/java/org/apache/nutch/indexer/CleaningJob.java | 76 +-
 src/java/org/apache/nutch/indexer/IndexWriter.java | 5 +-
 .../org/apache/nutch/indexer/IndexWriters.java | 6 +-
 .../org/apache/nutch/indexer/IndexerMapReduce.java | 497 ++--
 .../apache/nutch/indexer/IndexerOutputFormat.java | 22 +-
 .../nutch/indexer/IndexingFiltersChecker.java | 6 +-
 src/java/org/apache/nutch/indexer/IndexingJob.java | 49 +-
 .../org/apache/nutch/net/URLExemptionFilters.java | 2 +-
 src/java/org/apache/nutch/parse/ParseCallable.java | 2 +-
 .../org/apache/nutch/parse/ParseOutputFormat.java | 117 ++-
 src/java/org/apache/nutch/parse/ParseSegment.java | 207 ++---
 .../apache/nutch/scoring/webgraph/LinkDumper.java | 164 ++--
 .../apache/nutch/scoring/webgraph/LinkRank.java | 484 ++--
 .../apache/nutch/scoring/webgraph/NodeDumper.java | 317
 .../apache/nutch/scoring/webgraph/NodeReader.java | 7 +-
 .../nutch/scoring/webgraph/ScoreUpdater.java | 146 ++--
 .../apache/nutch/scoring/webgraph/WebGraph.java | 656 +---
 .../nutch/segment/ContentAsTextInputFormat.java | 50 +-
 .../org/apache/nutch/segment/SegmentChecker.java | 2 +-
 .../org/apache/nutch/segment/SegmentMerger.java | 587 +++---
 src/java/org/apache/nutch/segment/SegmentPart.java | 2 +-
 .../org/apache/nutch/segment/SegmentReader.java | 183 +++--
 .../org/apache/nutch/service/impl/JobFactory.java | 2 +-
 .../nutch/service/model/request/JobConfig.java | 2 +-
 src/java/org/apache/nutch/tools/Benchmark.java | 10 +-
 src/java/org/apache/nutch/tools/FreeGenerator.java | 179 +++--
 .../org/apache/nutch/tools/arc/ArcInputFormat.java | 26 +-
 .../apache/nutch/tools/arc/ArcRecordReader.java | 22 +-
 .../apache/nutch/tools/arc/ArcSegmentCreator.java | 466 +--
 .../org/apache/nutch/tools/warc/WARCExporter.java | 296 +++
 src/java/org/apache/nutch/util/JexlUtil.java | 2 +-
 src/java/org/apache/nutch/util/NutchJob.java | 17 +-
 src/java/org/apache/nutch/util/NutchTool.java | 2 +-
 .../util/{NutchJob.java => SegmentReaderUtil.java} | 25 +-
 .../nutch/webui/client/model/ConnectionStatus.java | 2 +-
 .../pages/components/ColorEnumLabelBuilder.java | 2 +-
 .../webui/pages/components/CpmIteratorAdapter.java | 2 +-
 .../apache/nutch/indexer/geoip/package-info.java | 2 +-
 .../indexer/links/TestLinksIndexingFilter.java | 2 +-
 .../test/org/apache/nutch/parse/TestOutlinks.java | 2 +-
 .../cloudsearch/CloudSearchIndexWriter.java | 9 +-
 .../nutch/indexwriter/dumm
[nutch] 01/01: Merge pull request #221 from Omkar20895/NUTCH-2375
This is an automated email from the ASF dual-hosted git repository.

lewismc pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git

commit 54510e503f7da7301a59f5f0e5bf4509b37d35b4
Merge: a2f637e 405682e
Author: Lewis John McGibbney <lewis.mcgibb...@gmail.com>
AuthorDate: Tue Feb 27 14:02:02 2018 -0800

    Merge pull request #221 from Omkar20895/NUTCH-2375

    NUTCH-2375 Upgrading nutch to use org.apache.hadoop.mapreduce

 src/java/org/apache/nutch/crawl/CrawlDb.java | 48 +-
 src/java/org/apache/nutch/crawl/CrawlDbFilter.java | 38 +-
 src/java/org/apache/nutch/crawl/CrawlDbMerger.java | 46 +-
 src/java/org/apache/nutch/crawl/CrawlDbReader.java | 315
 .../org/apache/nutch/crawl/CrawlDbReducer.java | 44 +-
 .../org/apache/nutch/crawl/DeduplicationJob.java | 121 ++-
 src/java/org/apache/nutch/crawl/Generator.java | 873 +++--
 src/java/org/apache/nutch/crawl/LinkDb.java | 226 +++---
 src/java/org/apache/nutch/crawl/LinkDbFilter.java | 30 +-
 src/java/org/apache/nutch/crawl/LinkDbMerger.java | 90 +--
 src/java/org/apache/nutch/crawl/LinkDbReader.java | 52 +-
 .../nutch/crawl/MimeAdaptiveFetchSchedule.java | 2 +-
 .../org/apache/nutch/crawl/URLPartitioner.java | 15 +-
 src/java/org/apache/nutch/fetcher/FetchNode.java | 2 +-
 src/java/org/apache/nutch/fetcher/FetchNodeDb.java | 2 +-
 src/java/org/apache/nutch/fetcher/Fetcher.java | 576 +++---
 .../apache/nutch/fetcher/FetcherOutputFormat.java | 70 +-
 .../org/apache/nutch/fetcher/FetcherThread.java | 118 +--
 src/java/org/apache/nutch/fetcher/QueueFeeder.java | 26 +-
 src/java/org/apache/nutch/hostdb/HostDatum.java | 2 +-
 src/java/org/apache/nutch/hostdb/ReadHostDb.java | 5 +-
 .../org/apache/nutch/hostdb/ResolverThread.java | 37 +-
 src/java/org/apache/nutch/hostdb/UpdateHostDb.java | 56 +-
 .../apache/nutch/hostdb/UpdateHostDbMapper.java | 50 +-
 .../apache/nutch/hostdb/UpdateHostDbReducer.java | 52 +-
 src/java/org/apache/nutch/indexer/CleaningJob.java | 76 +-
 src/java/org/apache/nutch/indexer/IndexWriter.java | 5 +-
 .../org/apache/nutch/indexer/IndexWriters.java | 6 +-
 .../org/apache/nutch/indexer/IndexerMapReduce.java | 497 ++--
 .../apache/nutch/indexer/IndexerOutputFormat.java | 22 +-
 .../nutch/indexer/IndexingFiltersChecker.java | 6 +-
 src/java/org/apache/nutch/indexer/IndexingJob.java | 49 +-
 .../org/apache/nutch/net/URLExemptionFilters.java | 2 +-
 src/java/org/apache/nutch/parse/ParseCallable.java | 2 +-
 .../org/apache/nutch/parse/ParseOutputFormat.java | 117 ++-
 src/java/org/apache/nutch/parse/ParseSegment.java | 207 ++---
 .../apache/nutch/scoring/webgraph/LinkDumper.java | 164 ++--
 .../apache/nutch/scoring/webgraph/LinkRank.java | 484 ++--
 .../apache/nutch/scoring/webgraph/NodeDumper.java | 317
 .../apache/nutch/scoring/webgraph/NodeReader.java | 7 +-
 .../nutch/scoring/webgraph/ScoreUpdater.java | 146 ++--
 .../apache/nutch/scoring/webgraph/WebGraph.java | 656 +---
 .../nutch/segment/ContentAsTextInputFormat.java | 50 +-
 .../org/apache/nutch/segment/SegmentChecker.java | 2 +-
 .../org/apache/nutch/segment/SegmentMerger.java | 587 +++---
 src/java/org/apache/nutch/segment/SegmentPart.java | 2 +-
 .../org/apache/nutch/segment/SegmentReader.java | 183 +++--
 .../org/apache/nutch/service/impl/JobFactory.java | 2 +-
 .../nutch/service/model/request/JobConfig.java | 2 +-
 src/java/org/apache/nutch/tools/Benchmark.java | 10 +-
 src/java/org/apache/nutch/tools/FreeGenerator.java | 179 +++--
 .../org/apache/nutch/tools/arc/ArcInputFormat.java | 26 +-
 .../apache/nutch/tools/arc/ArcRecordReader.java | 22 +-
 .../apache/nutch/tools/arc/ArcSegmentCreator.java | 466 +--
 .../org/apache/nutch/tools/warc/WARCExporter.java | 296 +++
 src/java/org/apache/nutch/util/JexlUtil.java | 2 +-
 src/java/org/apache/nutch/util/NutchJob.java | 17 +-
 src/java/org/apache/nutch/util/NutchTool.java | 2 +-
 .../util/{NutchJob.java => SegmentReaderUtil.java} | 25 +-
 .../nutch/webui/client/model/ConnectionStatus.java | 2 +-
 .../pages/components/ColorEnumLabelBuilder.java | 2 +-
 .../webui/pages/components/CpmIteratorAdapter.java | 2 +-
 .../apache/nutch/indexer/geoip/package-info.java | 2 +-
 .../indexer/links/TestLinksIndexingFilter.java | 2 +-
 .../test/org/apache/nutch/parse/TestOutlinks.java | 2 +-
 .../cloudsearch/CloudSearchIndexWriter.java | 9 +-
 .../nutch/indexwriter/dummy/DummyIndexWriter.java | 5 +-
 .../elasticrest/ElasticRestIndexWriter.java | 32 +-
 .../indexwriter/elastic/ElasticConstants.java | 2 +-
 .../indexwriter/elastic/ElasticIndexWriter.java | 17 +-
 .../elastic/TestElasticIndexWriter.java | 14 +-
 .../indexwriter/rabbit/RabbitIndexWriter.java | 3 +-
[nutch] branch master updated (75d0166 -> a2f637e)
This is an automated email from the ASF dual-hosted git repository.

lewismc pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git.

    from 75d0166 Merge pull request #283 from smartive/NUTCH-2508
     add f18b327 NUTCH-2489: Dependency collision with lucene-analyzers-common in scoring-similarity plugin
     new a2f637e Merge pull request #284 from YossiTamari/master

The 1 revisions listed above as "new" are entirely new to this repository and will be described in separate emails. The revisions listed as "add" were already present in the repository and have only been added to this reference.

Summary of changes:
 src/plugin/scoring-similarity/ivy.xml | 2 +-
 src/plugin/scoring-similarity/plugin.xml | 4 ++--
 .../org/apache/nutch/scoring/similarity/util/LuceneAnalyzerUtil.java | 2 +-
 .../org/apache/nutch/scoring/similarity/util/LuceneTokenizer.java | 2 +-
 4 files changed, 5 insertions(+), 5 deletions(-)
[nutch] branch master updated (2b66cda -> 75d0166)
This is an automated email from the ASF dual-hosted git repository.

lewismc pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git.

    from 2b66cda NUTCH-2466
     add 4f82d8f fix for NUTCH-2508 contributed by mfeltscher
     new 75d0166 Merge pull request #283 from smartive/NUTCH-2508

The 1 revisions listed above as "new" are entirely new to this repository and will be described in separate emails. The revisions listed as "add" were already present in the repository and have only been added to this reference.

Summary of changes:
 conf/nutch-default.xml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)