(nutch) branch master updated: NUTCH-3029

2024-03-14 Thread markus
This is an automated email from the ASF dual-hosted git repository. markus pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git The following commit(s) were added to refs/heads/master by this push: new 98902236d NUTCH-3029 98902236d is described

(nutch) branch master updated: NUTCH-3029 Host specific max. and min. intervals in adaptive scheduler

2024-03-13 Thread markus
This is an automated email from the ASF dual-hosted git repository. markus pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git The following commit(s) were added to refs/heads/master by this push: new a8ec17ca8 NUTCH-3029 Host specific max. and min

(nutch) branch master updated: NUTCH-3029 Host specific max. and min. intervals in adaptive scheduler

2024-03-13 Thread markus
This is an automated email from the ASF dual-hosted git repository. markus pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git The following commit(s) were added to refs/heads/master by this push: new 84cda2abd NUTCH-3029 Host specific max. and min

(nutch) branch master updated: NUTCH-3029 Host specific max. and min. intervals in adaptive scheduler

2024-03-13 Thread markus
This is an automated email from the ASF dual-hosted git repository. markus pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git The following commit(s) were added to refs/heads/master by this push: new 5ba50c0c6 NUTCH-3029 Host specific max. and min

(nutch) branch master updated: NUTCH-3029 Host specific max. and min. intervals in adaptive scheduler

2024-03-13 Thread markus
This is an automated email from the ASF dual-hosted git repository. markus pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git The following commit(s) were added to refs/heads/master by this push: new 4642c30c2 NUTCH-3029 Host specific max. and min

(nutch) branch master updated: NUTCH-3030 Use system default cipher suites instead of hard-coded set

2024-03-13 Thread markus
This is an automated email from the ASF dual-hosted git repository. markus pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git The following commit(s) were added to refs/heads/master by this push: new 551c50b1c NUTCH-3030 Use system default cipher

(nutch) branch master updated: NUTCH-3031 ProtocolFactory host mapper to support domains

2024-03-12 Thread markus
This is an automated email from the ASF dual-hosted git repository. markus pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git The following commit(s) were added to refs/heads/master by this push: new c390dfc8b NUTCH-3031 ProtocolFactory host mapper

(nutch) branch master updated: NUTCH-3027 Trivial resource leak patch in DomainSuffixes.java

2024-01-19 Thread markus
This is an automated email from the ASF dual-hosted git repository. markus pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git The following commit(s) were added to refs/heads/master by this push: new d95e1a79d NUTCH-3027 Trivial resource leak patch

[nutch] branch master updated: NUTCH-2924 Generate maxCount expr evaluated only once

2022-12-12 Thread markus
This is an automated email from the ASF dual-hosted git repository. markus pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git The following commit(s) were added to refs/heads/master by this push: new 7d3900450 NUTCH-2924 Generate maxCount expr

[nutch] branch master updated: NUTCH-2977

2022-12-07 Thread markus
This is an automated email from the ASF dual-hosted git repository. markus pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git The following commit(s) were added to refs/heads/master by this push: new d806aa450 NUTCH-2977 d806aa450 is described

[nutch] branch master updated: NUTCH-2794 Add additional ciphers to HTTP base's default cipher suite

2020-06-17 Thread markus
This is an automated email from the ASF dual-hosted git repository. markus pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git The following commit(s) were added to refs/heads/master by this push: new 1c2e411 NUTCH-2794 Add additional ciphers

[nutch] branch master updated: NUTCH-2612 Support for sitemap processing by hostname

2019-09-09 Thread markus
This is an automated email from the ASF dual-hosted git repository. markus pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git The following commit(s) were added to refs/heads/master by this push: new 9dbb4be NUTCH-2612 Support for sitemap

[nutch] branch master updated: NUTCH-2725 Plugin lib-http to support per-host configurable cookies

2019-07-29 Thread markus
This is an automated email from the ASF dual-hosted git repository. markus pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git The following commit(s) were added to refs/heads/master by this push: new 54f73bf NUTCH-2725 Plugin lib-http to support

[nutch] branch master updated: NUTCH-2724 Metadata indexer not to emit empty values

2019-07-15 Thread markus
This is an automated email from the ASF dual-hosted git repository. markus pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git The following commit(s) were added to refs/heads/master by this push: new a67c9be NUTCH-2724 Metadata indexer not to emit

[nutch] branch master updated: NUTCH-2723 Indexer Solr not to decode URLs before deletion

2019-07-12 Thread markus
This is an automated email from the ASF dual-hosted git repository. markus pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git The following commit(s) were added to refs/heads/master by this push: new 5150c44 NUTCH-2723 Indexer Solr not to decode

[nutch] branch master updated: NUTCH-2703 parse-tika: Boilerpipe should not run for non-(X)HTML pages

2019-04-11 Thread markus
This is an automated email from the ASF dual-hosted git repository. markus pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git The following commit(s) were added to refs/heads/master by this push: new 7e6eabb NUTCH-2703 parse-tika: Boilerpipe

[nutch] branch master updated: NUTCH-2692 Removing previously accidentally added file

2019-02-22 Thread markus
This is an automated email from the ASF dual-hosted git repository. markus pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git The following commit(s) were added to refs/heads/master by this push: new f7fdca3 NUTCH-2692 Removing previously

[nutch] 02/03: NUTCH-2692 Subcollection to support case-insensitive white and black lists

2019-02-22 Thread markus
This is an automated email from the ASF dual-hosted git repository. markus pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git commit 3fa2f4a7efac598258eb01a4387b5fde43c1a813 Author: Markus Jelsma AuthorDate: Fri Feb 22 16:46:42 2019 +0100 NUTCH

[nutch] 01/03: NUTCH-2692 Subcollection to support case-insensitive white and black lists

2019-02-22 Thread markus
This is an automated email from the ASF dual-hosted git repository. markus pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git commit 89c41e1b5a245322b27e8dd0728b543faa171e9d Author: Markus Jelsma AuthorDate: Fri Feb 22 16:44:25 2019 +0100 NUTCH

[nutch] branch master updated (78af89f -> 0085ee7)

2019-02-22 Thread markus
This is an automated email from the ASF dual-hosted git repository. markus pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git. from 78af89f Merge pull request #436 from r0ann3l/NUTCH-2684 new 89c41e1 NUTCH-2692 Subcollection to support case

[nutch] 03/03: Merge branch 'master' of https://gitbox.apache.org/repos/asf/nutch

2019-02-22 Thread markus
This is an automated email from the ASF dual-hosted git repository. markus pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git commit 0085ee740e78b58091d1aa39614277f1a612810c Merge: 3fa2f4a 78af89f Author: Markus Jelsma AuthorDate: Fri Feb 22 16:48:45

[nutch] branch master updated: NUTCH-2694 HostDB to aggregate by long instead of integer

2019-02-22 Thread markus
This is an automated email from the ASF dual-hosted git repository. markus pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git The following commit(s) were added to refs/heads/master by this push: new 33922fe NUTCH-2694 HostDB to aggregate by long

[nutch] branch master updated: NUTCH-2687 Regex for reading title from Content-Disposition is wrong

2019-01-18 Thread markus
This is an automated email from the ASF dual-hosted git repository. markus pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git The following commit(s) were added to refs/heads/master by this push: new 9cc076f NUTCH-2687 Regex for reading title from

[nutch] branch master updated: NUTCH-2647 Skip TLS certificate checks in protocol-http plugin

2018-09-28 Thread markus
This is an automated email from the ASF dual-hosted git repository. markus pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git The following commit(s) were added to refs/heads/master by this push: new 61d7e8c NUTCH-2647 Skip TLS certificate checks

[nutch] branch master updated: NUTCH-2411 Index-metadata to support indexing multiple values for a field

2018-03-08 Thread markus
This is an automated email from the ASF dual-hosted git repository. markus pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git The following commit(s) were added to refs/heads/master by this push: new 9a77f43 NUTCH-2411 Index-metadata to support

[nutch] branch master updated: NUTCH-2458

2017-11-10 Thread markus
This is an automated email from the ASF dual-hosted git repository. markus pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git The following commit(s) were added to refs/heads/master by this push: new c345618 NUTCH-2458 new 705686e Merge

[nutch] branch master updated: NUTCH-2420 Bug in variable generate.max.count and fetcher.server.delay

2017-11-06 Thread markus
This is an automated email from the ASF dual-hosted git repository. markus pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git The following commit(s) were added to refs/heads/master by this push: new 6199492 NUTCH-2420 Bug in variable

[nutch] branch master updated: NUTCH-2386 BasicURLNormalizer does not encode curly braces

2017-10-25 Thread markus
This is an automated email from the ASF dual-hosted git repository. markus pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git The following commit(s) were added to refs/heads/master by this push: new bd8c847 NUTCH-2386 BasicURLNormalizer does

[nutch] branch master updated: NUTCH-2445 Fetcher following outlinks to keep track of already fetched items

2017-10-23 Thread markus
This is an automated email from the ASF dual-hosted git repository. markus pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git The following commit(s) were added to refs/heads/master by this push: new 0cdd095 NUTCH-2445 Fetcher following outlinks

[nutch] branch master updated: NUTCH-2444 HostDB CSV dumper to emit field header by default

2017-10-23 Thread markus
This is an automated email from the ASF dual-hosted git repository. markus pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git The following commit(s) were added to refs/heads/master by this push: new d7e4046 NUTCH-2444 HostDB CSV dumper to emit

[nutch] branch master updated: NUTCH-2367 Get single record from HostDB

2017-03-16 Thread markus
This is an automated email from the ASF dual-hosted git repository. markus pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git The following commit(s) were added to refs/heads/master by this push: new be3aea1 NUTCH-2367 Get single record from

[nutch] branch master updated: NUTCH-2366 Deprecated Job constructor in hostdb/ReadHostDb.java

2017-03-15 Thread markus
This is an automated email from the ASF dual-hosted git repository. markus pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git The following commit(s) were added to refs/heads/master by this push: new 3926910 NUTCH-2366 Deprecated Job

[nutch] branch master updated: remove test again

2017-03-15 Thread markus
This is an automated email from the ASF dual-hosted git repository. markus pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git The following commit(s) were added to refs/heads/master by this push: new 6d47e14 remove test again 6d47e14

[nutch] branch master updated: test markus using git box

2017-03-15 Thread markus
This is an automated email from the ASF dual-hosted git repository. markus pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/nutch.git The following commit(s) were added to refs/heads/master by this push: new 7143a4c test markus using git box 7143a4c

nutch git commit: NUTCH-2359 Parsefilter-regex raises IndexOutOfBoundsException when rules are ill-formed

2017-02-14 Thread markus
mit/9a9c4b32 Tree: http://git-wip-us.apache.org/repos/asf/nutch/tree/9a9c4b32 Diff: http://git-wip-us.apache.org/repos/asf/nutch/diff/9a9c4b32 Branch: refs/heads/master Commit: 9a9c4b32b9c1ab9c47583a217665e4694272d58a Parents: 76aedcb Author: Markus Jelsma <mar...@apache.org> Authored: Tue Feb 14

nutch git commit: revert 2320

2016-10-06 Thread markus
ttp://git-wip-us.apache.org/repos/asf/nutch/diff/d4c924e5 Branch: refs/heads/master Commit: d4c924e56030d6b1fa3b115686e80c8cf516db61 Parents: 836b2e0 Author: Markus Jelsma <mar...@apache.org> Authored: Thu Oct 6 10:56:50 2016 +0200 Committer: Markus Jelsma <mar...@apache.org> Committe

nutch git commit: NUTCH-2320 URLFilterChecker to run as TCP Telnet service

2016-10-05 Thread markus
wip-us.apache.org/repos/asf/nutch/tree/836b2e01 Diff: http://git-wip-us.apache.org/repos/asf/nutch/diff/836b2e01 Branch: refs/heads/master Commit: 836b2e01d1a4e0e9443601da755ea37de91b8c7d Parents: e53b34b Author: Markus Jelsma <mar...@apache.org> Authored: Wed Oct 5 14:53:05 2016 +0200 Committer: Mark

nutch git commit: NUTCH-2272 Index checker server to optionally keep client connection open

2016-06-03 Thread markus
ttp://git-wip-us.apache.org/repos/asf/nutch/tree/beb48a84 Diff: http://git-wip-us.apache.org/repos/asf/nutch/diff/beb48a84 Branch: refs/heads/master Commit: beb48a84b2be52f92af24956ae59286ad116913c Parents: 7956dae Author: Markus Jelsma <mar...@apache.org> Authored: Fri Jun 3 15:02:12 2

svn commit: r1732332 - /nutch/trunk/src/java/org/apache/nutch/util/JexlUtil.java

2016-02-25 Thread markus
Author: markus Date: Thu Feb 25 16:44:18 2016 New Revision: 1732332 URL: http://svn.apache.org/viewvc?rev=1732332&view=rev Log: NUTCH-2231 Jexl support in generator job Modified: nutch/trunk/src/java/org/apache/nutch/util/JexlUtil.java Modified: nutch/trunk/src/java/org/apache/nutch/util

svn commit: r1732177 - in /nutch/trunk: CHANGES.txt src/java/org/apache/nutch/crawl/CrawlDatum.java src/java/org/apache/nutch/crawl/CrawlDbReader.java src/java/org/apache/nutch/crawl/Generator.java sr

2016-02-24 Thread markus
Author: markus Date: Wed Feb 24 15:51:21 2016 New Revision: 1732177 URL: http://svn.apache.org/viewvc?rev=1732177&view=rev Log: NUTCH-2231 Jexl support in generator job Added: nutch/trunk/src/java/org/apache/nutch/util/JexlUtil.java Modified: nutch/trunk/CHANGES.txt nutch/trunk/src/java

svn commit: r1732160 - in /nutch/trunk: CHANGES.txt src/java/org/apache/nutch/crawl/DeduplicationJob.java

2016-02-24 Thread markus
Author: markus Date: Wed Feb 24 14:12:42 2016 New Revision: 1732160 URL: http://svn.apache.org/viewvc?rev=1732160&view=rev Log: NUTCH-2232 DeduplicationJob should decode URL's before length is compared Modified: nutch/trunk/CHANGES.txt nutch/trunk/src/java/org/apache/nutch/crawl

svn commit: r1732140 - in /nutch/trunk: CHANGES.txt src/java/org/apache/nutch/crawl/CrawlDatum.java src/java/org/apache/nutch/crawl/CrawlDbReader.java

2016-02-24 Thread markus
Author: markus Date: Wed Feb 24 13:05:02 2016 New Revision: 1732140 URL: http://svn.apache.org/viewvc?rev=1732140&view=rev Log: NUTCH-2229 Allow Jexl expressions on CrawlDatum's fixed attributes Modified: nutch/trunk/CHANGES.txt nutch/trunk/src/java/org/apache/nutch/crawl/CrawlDatum.java

svn commit: r1731849 - in /nutch/trunk: ./ conf/ src/plugin/ src/plugin/parsefilter-regex/ src/plugin/parsefilter-regex/data/ src/plugin/parsefilter-regex/src/ src/plugin/parsefilter-regex/src/java/ s

2016-02-23 Thread markus
Author: markus Date: Tue Feb 23 12:58:54 2016 New Revision: 1731849 URL: http://svn.apache.org/viewvc?rev=1731849&view=rev Log: NUTCH-2227 RegexParseFilter Added: nutch/trunk/conf/regex-parsefilter.txt nutch/trunk/src/plugin/parsefilter-regex/ nutch/trunk/src/plugin/parsefilter-regex

svn commit: r1731836 - in /nutch/trunk: CHANGES.txt conf/nutch-default.xml src/java/org/apache/nutch/fetcher/FetcherThread.java src/java/org/apache/nutch/parse/ParseOutputFormat.java

2016-02-23 Thread markus
Author: markus Date: Tue Feb 23 10:38:31 2016 New Revision: 1731836 URL: http://svn.apache.org/viewvc?rev=1731836&view=rev Log: NUTCH-2221 Introduce db.ignore.internal.links to FetcherThread Modified: nutch/trunk/CHANGES.txt nutch/trunk/conf/nutch-default.xml nutch/trunk/src/java/org

svn commit: r1731831 - in /nutch/trunk: CHANGES.txt conf/nutch-default.xml src/java/org/apache/nutch/crawl/LinkDb.java src/java/org/apache/nutch/crawl/LinkDbMerger.java

2016-02-23 Thread markus
Author: markus Date: Tue Feb 23 10:23:24 2016 New Revision: 1731831 URL: http://svn.apache.org/viewvc?rev=1731831&view=rev Log: NUTCH-2220 Rename db.* options used only by the linkdb to linkdb.* Modified: nutch/trunk/CHANGES.txt nutch/trunk/conf/nutch-default.xml nutch/trunk/src/java/org

svn commit: r1731824 - in /nutch/trunk: CHANGES.txt src/plugin/index-replace/src/test/org/apache/nutch/indexer/replace/TestIndexReplace.java

2016-02-23 Thread markus
Author: markus Date: Tue Feb 23 09:50:05 2016 New Revision: 1731824 URL: http://svn.apache.org/viewvc?rev=1731824&view=rev Log: NUTCH-2228 Plugin index-replace unit test broken on Java 8 Modified: nutch/trunk/CHANGES.txt nutch/trunk/src/plugin/index-replace/src/test/org/apache/nutch/indexer

svn commit: r1731651 - in /nutch/trunk: CHANGES.txt src/java/org/apache/nutch/crawl/DeduplicationJob.java

2016-02-22 Thread markus
Author: markus Date: Mon Feb 22 14:41:37 2016 New Revision: 1731651 URL: http://svn.apache.org/viewvc?rev=1731651&view=rev Log: NUTCH-2219 Criteria order to be configurable in DeduplicationJob Modified: nutch/trunk/CHANGES.txt nutch/trunk/src/java/org/apache/nutch/crawl/DeduplicationJob.java

svn commit: r1730803 - in /nutch/trunk: CHANGES.txt src/java/org/apache/nutch/fetcher/Fetcher.java

2016-02-17 Thread markus
Author: markus Date: Wed Feb 17 09:55:27 2016 New Revision: 1730803 URL: http://svn.apache.org/viewvc?rev=1730803&view=rev Log: NUTCH-2224 Average bytes/second calculated incorrectly in fetcher Modified: nutch/trunk/CHANGES.txt nutch/trunk/src/java/org/apache/nutch/fetcher/Fetcher.java

svn commit: r1730802 - in /nutch/trunk: CHANGES.txt src/java/org/apache/nutch/parse/ParseSegment.java

2016-02-17 Thread markus
Author: markus Date: Wed Feb 17 09:51:14 2016 New Revision: 1730802 URL: http://svn.apache.org/viewvc?rev=1730802&view=rev Log: NUTCH-2225 Parsed time calculated incorrectly Modified: nutch/trunk/CHANGES.txt nutch/trunk/src/java/org/apache/nutch/parse/ParseSegment.java Modified: nutch/trunk

svn commit: r1730687 - in /nutch/trunk: ./ src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/

2016-02-16 Thread markus
Author: markus Date: Tue Feb 16 13:39:18 2016 New Revision: 1730687 URL: http://svn.apache.org/viewvc?rev=1730687&view=rev Log: NUTCH-1233 Rely on Tika for outlink extraction Modified: nutch/trunk/CHANGES.txt nutch/trunk/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika

svn commit: r1728313 - in /nutch/trunk: ./ src/plugin/indexer-solr/ src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/

2016-02-03 Thread markus
Author: markus Date: Wed Feb 3 13:51:10 2016 New Revision: 1728313 URL: http://svn.apache.org/viewvc?rev=1728313&view=rev Log: NUTCH-2197 Add Solr 5 cloud indexer support Modified: nutch/trunk/CHANGES.txt nutch/trunk/src/plugin/indexer-solr/ivy.xml nutch/trunk/src/plugin/indexer-solr

svn commit: r1725981 - in /nutch/trunk: ./ src/java/org/apache/nutch/scoring/webgraph/

2016-01-21 Thread markus
Author: markus Date: Thu Jan 21 15:18:07 2016 New Revision: 1725981 URL: http://svn.apache.org/viewvc?rev=1725981&view=rev Log: NUTCH-2201 Remove loops program from webgraph package Removed: nutch/trunk/src/java/org/apache/nutch/scoring/webgraph/LoopReader.java nutch/trunk/src/java/org/apache

svn commit: r1725538 - in /nutch/trunk: CHANGES.txt src/plugin/urlfilter-suffix/src/java/org/apache/nutch/urlfilter/suffix/SuffixURLFilter.java

2016-01-19 Thread markus
Author: markus Date: Tue Jan 19 14:53:05 2016 New Revision: 1725538 URL: http://svn.apache.org/viewvc?rev=1725538&view=rev Log: NUTCH-2203 Suffix URL filter can't handle trailing/leading whitespaces Modified: nutch/trunk/CHANGES.txt nutch/trunk/src/plugin/urlfilter-suffix/src/java/org/apache

svn commit: r1724771 - in /nutch/trunk: CHANGES.txt src/java/org/apache/nutch/indexer/IndexingFiltersChecker.java

2016-01-15 Thread markus
Author: markus Date: Fri Jan 15 10:45:27 2016 New Revision: 1724771 URL: http://svn.apache.org/viewvc?rev=1724771&view=rev Log: NUTCH-2194 Run IndexingFilterChecker as simple Telnet server Modified: nutch/trunk/CHANGES.txt nutch/trunk/src/java/org/apache/nutch/indexer

svn commit: r1724418 - in /nutch/trunk: CHANGES.txt src/java/org/apache/nutch/indexer/IndexingFiltersChecker.java

2016-01-13 Thread markus
Author: markus Date: Wed Jan 13 13:10:19 2016 New Revision: 1724418 URL: http://svn.apache.org/viewvc?rev=1724418&view=rev Log: NUTCH-2196 IndexingFilterChecker to optionally normalize Modified: nutch/trunk/CHANGES.txt nutch/trunk/src/java/org/apache/nutch/indexer/IndexingFiltersChecker.java

svn commit: r1724409 - in /nutch/trunk: CHANGES.txt src/java/org/apache/nutch/indexer/IndexingFiltersChecker.java

2016-01-13 Thread markus
Author: markus Date: Wed Jan 13 12:17:03 2016 New Revision: 1724409 URL: http://svn.apache.org/viewvc?rev=1724409&view=rev Log: NUTCH-2195 IndexingFilterChecker to optionally follow N redirects Modified: nutch/trunk/CHANGES.txt nutch/trunk/src/java/org/apache/nutch/indexer

svn commit: r1724199 - /nutch/trunk/conf/protocols.txt

2016-01-12 Thread markus
Author: markus Date: Tue Jan 12 10:33:59 2016 New Revision: 1724199 URL: http://svn.apache.org/viewvc?rev=1724199&view=rev Log: NUTCH-2190 Protocol normalizer Added: nutch/trunk/conf/protocols.txt Added: nutch/trunk/conf/protocols.txt URL: http://svn.apache.org/viewvc/nutch/trunk/conf

svn commit: r1724085 - in /nutch/trunk: ./ src/plugin/ src/plugin/urlnormalizer-protocol/ src/plugin/urlnormalizer-protocol/data/ src/plugin/urlnormalizer-protocol/src/ src/plugin/urlnormalizer-protoc

2016-01-11 Thread markus
Author: markus Date: Mon Jan 11 17:10:30 2016 New Revision: 1724085 URL: http://svn.apache.org/viewvc?rev=1724085&view=rev Log: NUTCH-2190 Protocol normalizer Added: nutch/trunk/src/plugin/urlnormalizer-protocol/ nutch/trunk/src/plugin/urlnormalizer-protocol/build.xml nutch/trunk/src

svn commit: r1723688 - in /nutch/trunk: CHANGES.txt conf/nutch-default.xml src/java/org/apache/nutch/indexer/IndexerMapReduce.java

2016-01-08 Thread markus
Author: markus Date: Fri Jan 8 11:10:38 2016 New Revision: 1723688 URL: http://svn.apache.org/viewvc?rev=1723688&view=rev Log: NUTCH-1449 Optionally delete documents skipped by IndexingFilters Modified: nutch/trunk/CHANGES.txt nutch/trunk/conf/nutch-default.xml nutch/trunk/src/java/org

svn commit: r1723690 - in /nutch/trunk: CHANGES.txt src/java/org/apache/nutch/crawl/DeduplicationJob.java

2016-01-08 Thread markus
Author: markus Date: Fri Jan 8 11:14:33 2016 New Revision: 1723690 URL: http://svn.apache.org/viewvc?rev=1723690&view=rev Log: NUTCH-2178 DeduplicationJob to optionally group on host or domain Modified: nutch/trunk/CHANGES.txt nutch/trunk/src/java/org/apache/nutch/crawl/DeduplicationJob.java

svn commit: r1723710 - in /nutch/trunk: ./ src/plugin/lib-regex-filter/src/java/org/apache/nutch/urlfilter/api/ src/plugin/urlfilter-automaton/src/java/org/apache/nutch/urlfilter/automaton/ src/plugin

2016-01-08 Thread markus
Author: markus Date: Fri Jan 8 12:11:18 2016 New Revision: 1723710 URL: http://svn.apache.org/viewvc?rev=1723710&view=rev Log: NUTCH-1838 Host and domain based regex and automaton filtering Added: nutch/trunk/src/plugin/urlfilter-regex/sample/nutch1838.rules nutch/trunk/src/plugin

svn commit: r1721615 - in /nutch/trunk: CHANGES.txt src/plugin/urlfilter-domain/src/java/org/apache/nutch/urlfilter/domain/DomainURLFilter.java src/plugin/urlfilter-domain/src/test/org/apache/nutch/ur

2015-12-24 Thread markus
Author: markus Date: Thu Dec 24 12:45:27 2015 New Revision: 1721615 URL: http://svn.apache.org/viewvc?rev=1721615&view=rev Log: NUTCH-2189 Domain filter must deactivate if no rules are present Modified: nutch/trunk/CHANGES.txt nutch/trunk/src/plugin/urlfilter-domain/src/java/org/apache/nutch

svn commit: r1717622 - in /nutch/trunk: CHANGES.txt conf/log4j.properties

2015-12-02 Thread markus
Author: markus Date: Wed Dec 2 12:40:27 2015 New Revision: 1717622 URL: http://svn.apache.org/viewvc?rev=1717622&view=rev Log: NUTCH-2176 Clean up of log4j.properties Modified: nutch/trunk/CHANGES.txt nutch/trunk/conf/log4j.properties Modified: nutch/trunk/CHANGES.txt URL: http

svn commit: r1703111 - in /nutch/trunk: CHANGES.txt src/java/org/apache/nutch/indexer/IndexerMapReduce.java

2015-09-15 Thread markus
Author: markus Date: Tue Sep 15 06:51:48 2015 New Revision: 1703111 URL: http://svn.apache.org/r1703111 Log: NUTCH-2093 Indexing filters to use current signatures Modified: nutch/trunk/CHANGES.txt nutch/trunk/src/java/org/apache/nutch/indexer/IndexerMapReduce.java Modified: nutch/trunk

svn commit: r1688566 - in /nutch/trunk: CHANGES.txt src/java/org/apache/nutch/segment/SegmentReader.java

2015-07-01 Thread markus
Author: markus Date: Wed Jul 1 07:00:40 2015 New Revision: 1688566 URL: http://svn.apache.org/r1688566 Log: NUTCH-1692 SegmentReader was broken in distributed mode Modified: nutch/trunk/CHANGES.txt nutch/trunk/src/java/org/apache/nutch/segment/SegmentReader.java Modified: nutch/trunk

svn commit: r1688561 - in /nutch/trunk: CHANGES.txt src/java/org/apache/nutch/crawl/CrawlDbReducer.java

2015-07-01 Thread markus
Author: markus Date: Wed Jul 1 06:56:32 2015 New Revision: 1688561 URL: http://svn.apache.org/r1688561 Log: NUTCH-1684 ParseMeta to be added before fetch schedulers are run Modified: nutch/trunk/CHANGES.txt nutch/trunk/src/java/org/apache/nutch/crawl/CrawlDbReducer.java Modified: nutch

svn commit: r1675058 - in /nutch/trunk: CHANGES.txt src/java/org/apache/nutch/segment/SegmentMerger.java

2015-04-21 Thread markus
Author: markus Date: Tue Apr 21 07:43:32 2015 New Revision: 1675058 URL: http://svn.apache.org/r1675058 Log: NUTCH-1697 SegmentMerger to implement Tool Modified: nutch/trunk/CHANGES.txt nutch/trunk/src/java/org/apache/nutch/segment/SegmentMerger.java Modified: nutch/trunk/CHANGES.txt

svn commit: r1666471 - in /nutch/trunk: CHANGES.txt src/java/org/apache/nutch/crawl/NutchWritable.java

2015-03-13 Thread markus
Author: markus Date: Fri Mar 13 14:58:05 2015 New Revision: 1666471 URL: http://svn.apache.org/r1666471 Log: NUTCH-1955 ByteWritable missing in NutchWritable Modified: nutch/trunk/CHANGES.txt nutch/trunk/src/java/org/apache/nutch/crawl/NutchWritable.java Modified: nutch/trunk

svn commit: r1663698 - in /nutch/trunk: ./ conf/ src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/ src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/ src/plugin/protocol-

2015-03-03 Thread markus
Author: markus Date: Tue Mar 3 13:16:39 2015 New Revision: 1663698 URL: http://svn.apache.org/r1663698 Log: NUTCH 1921 Optionally disable HTTP if-modified-since header Modified: nutch/trunk/CHANGES.txt nutch/trunk/conf/nutch-default.xml nutch/trunk/src/plugin/lib-http/src/java/org

svn commit: r1659532 - in /nutch/branches/2.x: CHANGES.txt ivy/ivy.xml src/plugin/parse-tika/ivy.xml src/plugin/parse-tika/plugin.xml

2015-02-13 Thread markus
Author: markus Date: Fri Feb 13 12:25:13 2015 New Revision: 1659532 URL: http://svn.apache.org/r1659532 Log: NUTCH-1925 Upgrade to Apache Tika 1.7 Modified: nutch/branches/2.x/CHANGES.txt nutch/branches/2.x/ivy/ivy.xml nutch/branches/2.x/src/plugin/parse-tika/ivy.xml nutch

svn commit: r1659533 - in /nutch/trunk: CHANGES.txt src/java/org/apache/nutch/crawl/LinkDbReader.java

2015-02-13 Thread markus
Author: markus Date: Fri Feb 13 12:28:13 2015 New Revision: 1659533 URL: http://svn.apache.org/r1659533 Log: NUTCH-1724 LinkDBReader to support regex output filtering Modified: nutch/trunk/CHANGES.txt nutch/trunk/src/java/org/apache/nutch/crawl/LinkDbReader.java Modified: nutch/trunk

svn commit: r1659169 - in /nutch/trunk: CHANGES.txt src/java/org/apache/nutch/crawl/LinkDb.java

2015-02-12 Thread markus
Author: markus Date: Thu Feb 12 08:42:49 2015 New Revision: 1659169 URL: http://svn.apache.org/r1659169 Log: NUTCH-1913 LinkDB to implement db.ignore.external.links Modified: nutch/trunk/CHANGES.txt nutch/trunk/src/java/org/apache/nutch/crawl/LinkDb.java Modified: nutch/trunk

svn commit: r1659167 - in /nutch/trunk: ./ src/plugin/ src/plugin/urlnormalizer-ajax/ src/plugin/urlnormalizer-ajax/src/ src/plugin/urlnormalizer-ajax/src/java/ src/plugin/urlnormalizer-ajax/src/java/

2015-02-12 Thread markus
Author: markus Date: Thu Feb 12 08:30:31 2015 New Revision: 1659167 URL: http://svn.apache.org/r1659167 Log: NUTCH-1323 AjaxNormalizer Added: nutch/trunk/src/plugin/urlnormalizer-ajax/ nutch/trunk/src/plugin/urlnormalizer-ajax/build.xml nutch/trunk/src/plugin/urlnormalizer-ajax

svn commit: r1607043 - /nutch/cms_site/trunk/templates/std.html

2014-07-01 Thread markus
Author: markus Date: Tue Jul 1 11:07:57 2014 New Revision: 1607043 URL: http://svn.apache.org/r1607043 Log: have at least a title on all pages Modified: nutch/cms_site/trunk/templates/std.html Modified: nutch/cms_site/trunk/templates/std.html URL: http://svn.apache.org/viewvc/nutch

svn commit: r914579 - /websites/production/nutch/content/

2014-07-01 Thread markus
Author: markus Date: Tue Jul 1 11:09:09 2014 New Revision: 914579 Log: Add title to pages. Added: websites/production/nutch/content/ - copied from r914578, websites/staging/nutch/trunk/content/

svn commit: r1606693 - /nutch/cms_site/trunk/content/index.md

2014-06-30 Thread markus
Author: markus Date: Mon Jun 30 11:44:03 2014 New Revision: 1606693 URL: http://svn.apache.org/r1606693 Log: page title missing Modified: nutch/cms_site/trunk/content/index.md Modified: nutch/cms_site/trunk/content/index.md URL: http://svn.apache.org/viewvc/nutch/cms_site/trunk/content

svn commit: r1606694 - /nutch/cms_site/trunk/content/index.md

2014-06-30 Thread markus
Author: markus Date: Mon Jun 30 11:46:31 2014 New Revision: 1606694 URL: http://svn.apache.org/r1606694 Log: Apparently the page header input box does not result in a title Modified: nutch/cms_site/trunk/content/index.md Modified: nutch/cms_site/trunk/content/index.md URL: http

svn commit: r1606695 - /nutch/cms_site/trunk/templates/std.html

2014-06-30 Thread markus
Author: markus Date: Mon Jun 30 11:50:44 2014 New Revision: 1606695 URL: http://svn.apache.org/r1606695 Log: added title Modified: nutch/cms_site/trunk/templates/std.html Modified: nutch/cms_site/trunk/templates/std.html URL: http://svn.apache.org/viewvc/nutch/cms_site/trunk/templates

svn commit: r1606696 - /nutch/cms_site/trunk/templates/std.html

2014-06-30 Thread markus
Author: markus Date: Mon Jun 30 11:52:25 2014 New Revision: 1606696 URL: http://svn.apache.org/r1606696 Log: actually put something in the title Modified: nutch/cms_site/trunk/templates/std.html Modified: nutch/cms_site/trunk/templates/std.html URL: http://svn.apache.org/viewvc/nutch

svn commit: r1606703 - /nutch/cms_site/trunk/content/index.md

2014-06-30 Thread markus
Author: markus Date: Mon Jun 30 12:01:53 2014 New Revision: 1606703 URL: http://svn.apache.org/r1606703 Log: CMS commit to nutch by markus Modified: nutch/cms_site/trunk/content/index.md Modified: nutch/cms_site/trunk/content/index.md URL: http://svn.apache.org/viewvc/nutch/cms_site/trunk

svn commit: r1606704 - /nutch/cms_site/trunk/content/index.md

2014-06-30 Thread markus
Author: markus Date: Mon Jun 30 12:03:56 2014 New Revision: 1606704 URL: http://svn.apache.org/r1606704 Log: will this work? Modified: nutch/cms_site/trunk/content/index.md Modified: nutch/cms_site/trunk/content/index.md URL: http://svn.apache.org/viewvc/nutch/cms_site/trunk/content

svn commit: r1606705 - /nutch/cms_site/trunk/content/index.md

2014-06-30 Thread markus
Author: markus Date: Mon Jun 30 12:04:50 2014 New Revision: 1606705 URL: http://svn.apache.org/r1606705 Log: restore stuff i broke Modified: nutch/cms_site/trunk/content/index.md Modified: nutch/cms_site/trunk/content/index.md URL: http://svn.apache.org/viewvc/nutch/cms_site/trunk/content

svn commit: r1600566 - in /nutch/trunk: CHANGES.txt src/java/org/apache/nutch/util/NodeWalker.java

2014-06-05 Thread markus
Author: markus Date: Thu Jun 5 08:34:01 2014 New Revision: 1600566 URL: http://svn.apache.org/r1600566 Log: NUTCH-1782 NodeWalker to return current node Modified: nutch/trunk/CHANGES.txt nutch/trunk/src/java/org/apache/nutch/util/NodeWalker.java Modified: nutch/trunk/CHANGES.txt URL

svn commit: r1562058 - in /nutch/trunk: CHANGES.txt src/java/org/apache/nutch/util/hostdb/HostDb.java

2014-01-28 Thread markus
Author: markus Date: Tue Jan 28 13:07:09 2014 New Revision: 1562058 URL: http://svn.apache.org/r1562058 Log: NUTCH-1717 HostDB not to complain if filters/normalizers are disabled Modified: nutch/trunk/CHANGES.txt nutch/trunk/src/java/org/apache/nutch/util/hostdb/HostDb.java Modified

svn commit: r1560985 - in /nutch/trunk: CHANGES.txt src/java/org/apache/nutch/plugin/Extension.java src/java/org/apache/nutch/plugin/PluginClassLoader.java src/java/org/apache/nutch/plugin/PluginRepos

2014-01-24 Thread markus
Author: markus Date: Fri Jan 24 13:12:00 2014 New Revision: 1560985 URL: http://svn.apache.org/r1560985 Log: NUTCH-356 Plugin repository cache can lead to memory leak Modified: nutch/trunk/CHANGES.txt nutch/trunk/src/java/org/apache/nutch/plugin/Extension.java nutch/trunk/src/java

svn commit: r1559657 - in /nutch/trunk: CHANGES.txt src/java/org/apache/nutch/crawl/CrawlDbReader.java

2014-01-20 Thread markus
Author: markus Date: Mon Jan 20 09:29:42 2014 New Revision: 1559657 URL: http://svn.apache.org/r1559657 Log: NUTCH-1680 CrawlDbReader to dump minRetry value Modified: nutch/trunk/CHANGES.txt nutch/trunk/src/java/org/apache/nutch/crawl/CrawlDbReader.java Modified: nutch/trunk/CHANGES.txt

svn commit: r1556474 - in /nutch/trunk: CHANGES.txt src/java/org/apache/nutch/indexer/NutchDocument.java

2014-01-08 Thread markus
Author: markus Date: Wed Jan 8 09:39:47 2014 New Revision: 1556474 URL: http://svn.apache.org/r1556474 Log: NUTCH-1695 Add NutchDocument.toString() to ease debugging Modified: nutch/trunk/CHANGES.txt nutch/trunk/src/java/org/apache/nutch/indexer/NutchDocument.java Modified: nutch/trunk

svn commit: r1554791 - /nutch/trunk/conf/nutch-default.xml

2014-01-02 Thread markus
Author: markus Date: Thu Jan 2 11:53:36 2014 New Revision: 1554791 URL: http://svn.apache.org/r1554791 Log: NUTCH-1360 fix entity in configuration Modified: nutch/trunk/conf/nutch-default.xml Modified: nutch/trunk/conf/nutch-default.xml URL: http://svn.apache.org/viewvc/nutch/trunk/conf

svn commit: r1553115 - in /nutch/trunk: CHANGES.txt src/java/org/apache/nutch/util/URLUtil.java src/test/org/apache/nutch/util/TestURLUtil.java

2013-12-23 Thread markus
Author: markus Date: Mon Dec 23 14:17:40 2013 New Revision: 1553115 URL: http://svn.apache.org/r1553115 Log: NUTCH-1681 In URLUtil.java, toUNICODE method does not work correctly Modified: nutch/trunk/CHANGES.txt nutch/trunk/src/java/org/apache/nutch/util/URLUtil.java nutch/trunk/src

svn commit: r1528072 - in /nutch/trunk: CHANGES.txt src/java/org/apache/nutch/indexer/IndexerMapReduce.java

2013-10-01 Thread markus
Author: markus Date: Tue Oct 1 12:50:06 2013 New Revision: 1528072 URL: http://svn.apache.org/r1528072 Log: NUTCH-1646 IndexerMapReduce to consider DB status Modified: nutch/trunk/CHANGES.txt nutch/trunk/src/java/org/apache/nutch/indexer/IndexerMapReduce.java Modified: nutch/trunk

svn commit: r1499948 - in /nutch/trunk: CHANGES.txt src/java/org/apache/nutch/segment/SegmentMerger.java

2013-07-05 Thread markus
Author: markus Date: Fri Jul 5 08:52:51 2013 New Revision: 1499948 URL: http://svn.apache.org/r1499948 Log: NUTCH-1520 SegmentMerger loses records Modified: nutch/trunk/CHANGES.txt nutch/trunk/src/java/org/apache/nutch/segment/SegmentMerger.java Modified: nutch/trunk/CHANGES.txt URL

svn commit: r1499959 - in /nutch/branches/2.x: CHANGES.txt ivy/ivy.xml src/plugin/parse-tika/howto_upgrade_tika.txt src/plugin/parse-tika/ivy.xml src/plugin/parse-tika/plugin.xml

2013-07-05 Thread markus
Author: markus Date: Fri Jul 5 10:27:47 2013 New Revision: 1499959 URL: http://svn.apache.org/r1499959 Log: NUTCH-1595 Upgrade to Tika 1.4 (jnioche, markus) Added: nutch/branches/2.x/src/plugin/parse-tika/howto_upgrade_tika.txt Modified: nutch/branches/2.x/CHANGES.txt nutch/branches

svn commit: r1499960 - in /nutch/trunk: CHANGES.txt ivy/ivy.xml src/plugin/parse-tika/howto_upgrade_tika.txt src/plugin/parse-tika/ivy.xml src/plugin/parse-tika/plugin.xml

2013-07-05 Thread markus
Author: markus Date: Fri Jul 5 10:28:46 2013 New Revision: 1499960 URL: http://svn.apache.org/r1499960 Log: NUTCH-1595 Upgrade to Tika 1.4 Added: nutch/trunk/src/plugin/parse-tika/howto_upgrade_tika.txt Modified: nutch/trunk/CHANGES.txt nutch/trunk/ivy/ivy.xml nutch/trunk/src

svn commit: r1499684 - in /nutch/trunk: CHANGES.txt src/java/org/apache/nutch/crawl/Injector.java

2013-07-04 Thread markus
Author: markus Date: Thu Jul 4 08:50:25 2013 New Revision: 1499684 URL: http://svn.apache.org/r1499684 Log: NUTCH-1600 Injector overwrite does not always work properly Modified: nutch/trunk/CHANGES.txt nutch/trunk/src/java/org/apache/nutch/crawl/Injector.java Modified: nutch/trunk

svn commit: r1499696 - in /nutch/trunk: CHANGES.txt src/plugin/headings/src/java/org/apache/nutch/parse/headings/HeadingsParseFilter.java

2013-07-04 Thread markus
Author: markus Date: Thu Jul 4 09:07:12 2013 New Revision: 1499696 URL: http://svn.apache.org/r1499696 Log: NUTCH-1597 HeadingsParseFilter to trim and remove excess whitespace Modified: nutch/trunk/CHANGES.txt nutch/trunk/src/plugin/headings/src/java/org/apache/nutch/parse/headings

svn commit: r1499722 - in /nutch/trunk: CHANGES.txt src/plugin/headings/src/java/org/apache/nutch/parse/headings/HeadingsParseFilter.java

2013-07-04 Thread markus
Author: markus Date: Thu Jul 4 11:13:34 2013 New Revision: 1499722 URL: http://svn.apache.org/r1499722 Log: NUTCH-1596 HeadingsParseFilter not thread safe Modified: nutch/trunk/CHANGES.txt nutch/trunk/src/plugin/headings/src/java/org/apache/nutch/parse/headings/HeadingsParseFilter.java

svn commit: r1498830 - in /nutch/trunk: CHANGES.txt src/java/org/apache/nutch/crawl/CrawlDbReader.java

2013-07-02 Thread markus
Author: markus Date: Tue Jul 2 08:36:13 2013 New Revision: 1498830 URL: http://svn.apache.org/r1498830 Log: NUTCH-1327 QueryStringNormalizer Modified: nutch/trunk/CHANGES.txt nutch/trunk/src/java/org/apache/nutch/crawl/CrawlDbReader.java Modified: nutch/trunk/CHANGES.txt URL: http

svn commit: r1498832 - in /nutch/trunk: ./ src/plugin/ src/plugin/urlnormalizer-querystring/ src/plugin/urlnormalizer-querystring/src/ src/plugin/urlnormalizer-querystring/src/java/ src/plugin/urlnorm

2013-07-02 Thread markus
Author: markus Date: Tue Jul 2 08:37:40 2013 New Revision: 1498832 URL: http://svn.apache.org/r1498832 Log: NUTCH-1581 CrawlDB csv output to include metadata Added: nutch/trunk/src/plugin/urlnormalizer-querystring/ nutch/trunk/src/plugin/urlnormalizer-querystring/build.xml nutch

svn commit: r1498346 - in /nutch/trunk: CHANGES.txt src/java/org/apache/nutch/segment/SegmentMerger.java

2013-07-01 Thread markus
Author: markus Date: Mon Jul 1 10:03:12 2013 New Revision: 1498346 URL: http://svn.apache.org/r1498346 Log: NUTCH-1593 Normalize option missing in SegmentMerger's usage Modified: nutch/trunk/CHANGES.txt nutch/trunk/src/java/org/apache/nutch/segment/SegmentMerger.java Modified: nutch

svn commit: r1496023 - in /nutch/branches/2.x: ./ src/plugin/ src/plugin/urlfilter-prefix/src/test/ src/plugin/urlfilter-prefix/src/test/org/ src/plugin/urlfilter-prefix/src/test/org/apache/ src/plugi

2013-06-24 Thread markus
Author: markus Date: Mon Jun 24 13:12:59 2013 New Revision: 1496023 URL: http://svn.apache.org/r1496023 Log: NUTCH-1126 JUnit test for urlfilter-prefix Added: nutch/branches/2.x/src/plugin/urlfilter-prefix/src/test/ nutch/branches/2.x/src/plugin/urlfilter-prefix/src/test/org/ nutch
