This is an automated email from the ASF dual-hosted git repository. snagel pushed a commit to branch branch-1.17 in repository https://gitbox.apache.org/repos/asf/nutch.git
commit 77fa56e34ccd4ecf35f14111a4a3a0e2912e7f29 Author: Sebastian Nagel <sna...@apache.org> AuthorDate: Wed Jun 17 23:10:35 2020 +0200 Nutch 1.16 release - update current year in API docs etc. - update version number - add changes / release notes --- CHANGES.txt | 82 +++++++++++++++++++++++++++++++++++++++++++++++++- NOTICE.txt | 2 +- conf/nutch-default.xml | 2 +- default.properties | 4 +-- src/bin/nutch | 2 +- 5 files changed, 86 insertions(+), 6 deletions(-) diff --git a/CHANGES.txt b/CHANGES.txt index 3f26a8d..dcdc6e2 100644 --- a/CHANGES.txt +++ b/CHANGES.txt @@ -1,6 +1,86 @@ # Nutch Change Log -Nutch 1.17 Development +Nutch 1.17 Release 18/06/2020 (dd/mm/yyyy) +Release Report: https://s.apache.org/ovhry + +Bug + + [NUTCH-1559] - parse-metatags duplicates extracted metatags + [NUTCH-2379] - crawl script dedup's crawldb update is slow + [NUTCH-2419] - Some URL filters and normalizers do not respect command-line override for rule file + [NUTCH-2507] - NutchTutorial wiki pages as a lot of outdated command line calls when it starts with the solr interaction + [NUTCH-2511] - SitemapProcessor limited by http.content.limit + [NUTCH-2525] - Metadata indexer cannot handle uppercase parse metadata + [NUTCH-2567] - parse-metatags writes all meta tags twice + [NUTCH-2720] - ROBOTS metatag ignored when capitalized + [NUTCH-2745] - Solr schema.xml not shipped in binary release + [NUTCH-2748] - Fetch status gone (redirect exceeded) not to overwrite existing items in CrawlDb + [NUTCH-2751] - nutch clean does not work with secured solr cloud + [NUTCH-2753] - Add -listen option to command-line help of CrawlDbReader and LinkDbReader + [NUTCH-2754] - fetcher.max.crawl.delay ignored if exceeding 5 min. / 300 sec. + [NUTCH-2760] - protocol-okhttp: properly record HTTP version in request message header + [NUTCH-2761] - ivy jar fails to download + [NUTCH-2763] - protocol-okhttp (store.http.headers): add whitespace in status line after status code also when message is empty + [NUTCH-2770] - Subcollection logic allows empty string as a whitelist value, thus matching every incoming document. + [NUTCH-2778] - indexer-elastic to properly log errors + [NUTCH-2787] - CrawlDb JSON dump does not export metadata primitive data types correctly + [NUTCH-2789] - Documentation: update links to point to cwiki + [NUTCH-2790] - CSVIndexWriter does not escape leading quotes properly + [NUTCH-2791] - domainstats, protocolstats and crawlcomplete do not handle GCS URLs + +New Feature + + [NUTCH-1863] - Add JSON format dump output to readdb command + +Improvement + + [NUTCH-1194] - Generator: CrawlDB lock should be released earlier + [NUTCH-2002] - ParserChecker and IndexingFiltersChecker to check robots.txt + [NUTCH-2184] - Enable IndexingJob to function with no crawldb + [NUTCH-2495] - Use -deleteGone instead of clean job in crawler script while indexing + [NUTCH-2496] - Speed up link inversion step in crawling script + [NUTCH-2501] - allow to set Java heap size when using crawl script in distributed mode + [NUTCH-2649] - Optionally skip TLS/SSL certificate validation for protocol-selenium and protocol-htmlunit + [NUTCH-2733] - protocol-okhttp: add support for Brotli compression (Content-Encoding) + [NUTCH-2739] - indexer-elastic: Upgrade ES and migrate to REST client + [NUTCH-2743] - Add list of Nutch properties (nutch-default.xml) to documentation + [NUTCH-2746] - Basic URL normalizer to normalize Unicode domain names + [NUTCH-2747] - Replace remaining o.a.commons.logging by org.slf4j + [NUTCH-2750] - Improve CrawlDbReader & LinkDbReader reader handling + [NUTCH-2752] - indexer-solr: Upgrade to latest Solr version + [NUTCH-2755] - Remove obsolete plugin indexer-elastic-rest + [NUTCH-2757] - indexer-elastic: add authentication options + [NUTCH-2758] - Add plugin READMEs to binary release packages + [NUTCH-2759] - bin/crawl: Rename option --num-slaves + [NUTCH-2762] - Replace http:// URLs by https:// (build files and documentation) + [NUTCH-2767] - Fetcher to stop filling queues skipped due to repeated exceptions + [NUTCH-2768] - FetcherThread: unnecessary usage of class casts + [NUTCH-2772] - Debugging parse filter to show serialized DOM tree + [NUTCH-2773] - SegmentReader (-dump or -get): show HTML content as UTF-8 + [NUTCH-2774] - Annotate methods implementing the Hadoop API by @Override + [NUTCH-2775] - Fetcher to guarantee minimum delay even if robots.txt defines shorter Crawl-delay + [NUTCH-2776] - Fetcher to temporarily deduplicate followed redirects + [NUTCH-2777] - Upgrade to Hadoop 3.1 + [NUTCH-2779] - Upgrade to Tika 1.24.1 + [NUTCH-2780] - Upgrade index-solr to use Solr 8.5.1 + [NUTCH-2781] - Increase default Java heap size + [NUTCH-2783] - Use (more) parametrized logging + [NUTCH-2784] - Add tool to list Nutch and Hadoop properties + [NUTCH-2785] - FreeGenerator: command-line option to define number of generated fetch lists + [NUTCH-2788] - ParseData: improve presentation of Metadata in method toString() + [NUTCH-2794] - Add additional ciphers to HTTP base's default cipher suite + +Test + + [NUTCH-1945] - Test for XLSX parser + +Task + + [NUTCH-2434] - Add methods to reset parameters HTMLMetaTags + +Sub-task + + [NUTCH-2735] - Update the indexer-solr documentation about the schema.xml usage Nutch 1.16 Release 02/10/2019 (dd/mm/yyyy) diff --git a/NOTICE.txt b/NOTICE.txt index 5b46045..71f29fa 100644 --- a/NOTICE.txt +++ b/NOTICE.txt @@ -1,5 +1,5 @@ Apache Nutch -Copyright 2019 The Apache Software Foundation +Copyright 2020 The Apache Software Foundation This product includes software developed by The Apache Software Foundation (http://www.apache.org/). diff --git a/conf/nutch-default.xml b/conf/nutch-default.xml index 23af74b..b7c9570 100644 --- a/conf/nutch-default.xml +++ b/conf/nutch-default.xml @@ -164,7 +164,7 @@ <property> <name>http.agent.version</name> - <value>Nutch-1.17-SNAPSHOT</value> + <value>Nutch-1.17</value> <description>A version string to advertise in the User-Agent header.</description> </property> diff --git a/default.properties b/default.properties index 4181800..960f788 100644 --- a/default.properties +++ b/default.properties @@ -14,9 +14,9 @@ # limitations under the License. name=apache-nutch -version=1.17-SNAPSHOT +version=1.17 final.name=${name}-${version} -year=2019 +year=2020 basedir = ./ src.dir = ./src/java diff --git a/src/bin/nutch b/src/bin/nutch index 244d812..57bf970 100755 --- a/src/bin/nutch +++ b/src/bin/nutch @@ -60,7 +60,7 @@ done # if no args specified, show usage if [ $# = 0 ]; then - echo "nutch 1.17-SNAPSHOT" + echo "nutch 1.17" echo "Usage: nutch COMMAND [-Dproperty=value]... [command-specific args]..." echo "where COMMAND is one of:" echo " readdb read / dump crawl db"