svn commit: r1387357 - in /nutch/trunk: CHANGES.txt build.xml

2012-09-18 Thread snagel
Author: snagel Date: Tue Sep 18 20:54:05 2012 New Revision: 1387357 URL: http://svn.apache.org/viewvc?rev=1387357&view=rev Log: NUTCH-1415 release packages to contain top level folder apache-nutch-x.x Modified: nutch/trunk/CHANGES.txt nutch/trunk/build.xml Modified: nutch/trunk

svn commit: r1396796 - in /nutch/trunk: CHANGES.txt conf/regex-normalize.xml.template src/plugin/urlnormalizer-regex/sample/regex-normalize-default.test src/plugin/urlnormalizer-regex/sample/regex-nor

2012-10-10 Thread snagel
Author: snagel Date: Wed Oct 10 21:06:27 2012 New Revision: 1396796 URL: http://svn.apache.org/viewvc?rev=1396796&view=rev Log: NUTCH-706 Url regex normalizer: pattern for session id removal not to match newsId Modified: nutch/trunk/CHANGES.txt nutch/trunk/conf/regex

svn commit: r1396817 - in /nutch/trunk: conf/regex-normalize.xml.template src/plugin/urlnormalizer-regex/sample/regex-normalize-default.test src/plugin/urlnormalizer-regex/sample/regex-normalize-defau

2012-10-10 Thread snagel
Author: snagel Date: Wed Oct 10 21:54:37 2012 New Revision: 1396817 URL: http://svn.apache.org/viewvc?rev=1396817&view=rev Log: NUTCH-706 (applied correct patch) Modified: nutch/trunk/conf/regex-normalize.xml.template nutch/trunk/src/plugin/urlnormalizer-regex/sample/regex-normalize

svn commit: r1401458 - /nutch/branches/2.x/CHANGES.txt

2012-10-23 Thread snagel
Author: snagel Date: Tue Oct 23 20:47:16 2012 New Revision: 1401458 URL: http://svn.apache.org/viewvc?rev=1401458&view=rev Log: NUTCH-1344 BasicURLNormalizer to normalize https same as http - forgot to add committer Modified: nutch/branches/2.x/CHANGES.txt Modified: nutch/branches/2.x

svn commit: r1401459 - in /nutch/trunk: CHANGES.txt src/plugin/urlnormalizer-regex/src/java/org/apache/nutch/net/urlnormalizer/regex/RegexURLNormalizer.java

2012-10-23 Thread snagel
Author: snagel Date: Tue Oct 23 20:51:35 2012 New Revision: 1401459 URL: http://svn.apache.org/viewvc?rev=1401459&view=rev Log: NUTCH-1421 RegexURLNormalizer to only skip rules with invalid patterns Modified: nutch/trunk/CHANGES.txt nutch/trunk/src/plugin/urlnormalizer-regex/src/java/org

svn commit: r1461854 - in /nutch/trunk: CHANGES.txt src/java/org/apache/nutch/indexer/IndexingFiltersChecker.java src/java/org/apache/nutch/parse/ParserChecker.java

2013-03-27 Thread snagel
Author: snagel Date: Wed Mar 27 21:31:42 2013 New Revision: 1461854 URL: http://svn.apache.org/r1461854 Log: parsechecker and indexchecker to report truncated content Modified: nutch/trunk/CHANGES.txt nutch/trunk/src/java/org/apache/nutch/indexer/IndexingFiltersChecker.java nutch

svn commit: r1461857 - in /nutch/branches/2.x: CHANGES.txt src/java/org/apache/nutch/indexer/IndexingFiltersChecker.java src/java/org/apache/nutch/parse/ParserChecker.java

2013-03-27 Thread snagel
Author: snagel Date: Wed Mar 27 21:33:38 2013 New Revision: 1461857 URL: http://svn.apache.org/r1461857 Log: parsechecker and indexchecker to report truncated content Modified: nutch/branches/2.x/CHANGES.txt nutch/branches/2.x/src/java/org/apache/nutch/indexer

svn commit: r1480484 - in /nutch/branches/2.x: CHANGES.txt conf/schema-solr4.xml conf/schema.xml src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java

2013-05-08 Thread snagel
Author: snagel Date: Wed May 8 22:04:04 2013 New Revision: 1480484 URL: http://svn.apache.org/r1480484 Log: NUTCH-956 solrindex issues Modified: nutch/branches/2.x/CHANGES.txt nutch/branches/2.x/conf/schema-solr4.xml nutch/branches/2.x/conf/schema.xml nutch/branches/2.x/src

svn commit: r1480485 - in /nutch/trunk: CHANGES.txt conf/schema-solr4.xml conf/schema.xml src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java

2013-05-08 Thread snagel
Author: snagel Date: Wed May 8 22:04:53 2013 New Revision: 1480485 URL: http://svn.apache.org/r1480485 Log: NUTCH-956 solrindex issues Modified: nutch/trunk/CHANGES.txt nutch/trunk/conf/schema-solr4.xml nutch/trunk/conf/schema.xml nutch/trunk/src/plugin/index-more/src/java/org

svn commit: r1494776 - in /nutch/trunk: CHANGES.txt src/java/org/apache/nutch/crawl/AbstractFetchSchedule.java

2013-06-19 Thread snagel
Author: snagel Date: Wed Jun 19 21:26:07 2013 New Revision: 1494776 URL: http://svn.apache.org/r1494776 Log: NUTCH-1245 URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again Modified: nutch/trunk/CHANGES.txt nutch/trunk/src/java

svn commit: r1494785 - /nutch/trunk/src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java

2013-06-19 Thread snagel
Author: snagel Date: Wed Jun 19 22:22:00 2013 New Revision: 1494785 URL: http://svn.apache.org/r1494785 Log: NUTCH-1475 (fix after fix) fill field date with fetch time (as before) if modified time is unset Modified: nutch/trunk/src/plugin/index-more/src/java/org/apache/nutch/indexer/more

svn commit: r1497557 - in /nutch/trunk: ./ conf/ src/plugin/index-static/src/java/org/apache/nutch/indexer/staticfield/ src/plugin/index-static/src/test/org/apache/nutch/indexer/staticfield/

2013-06-27 Thread snagel
Author: snagel Date: Thu Jun 27 20:16:22 2013 New Revision: 1497557 URL: http://svn.apache.org/r1497557 Log: NUTCH-1580 index-static returns object instead of value for index.static Modified: nutch/trunk/CHANGES.txt nutch/trunk/conf/nutch-default.xml nutch/trunk/src/plugin/index

svn commit: r1507130 - in /nutch/trunk: CHANGES.txt conf/log4j.properties

2013-07-25 Thread snagel
Author: snagel Date: Thu Jul 25 21:14:45 2013 New Revision: 1507130 URL: http://svn.apache.org/r1507130 Log: NUTCH-1587 misspelled property threshold in conf/log4j.properties Modified: nutch/trunk/CHANGES.txt nutch/trunk/conf/log4j.properties Modified: nutch/trunk/CHANGES.txt URL: http

svn commit: r1507131 - in /nutch/branches/2.x: CHANGES.txt conf/log4j.properties

2013-07-25 Thread snagel
Author: snagel Date: Thu Jul 25 21:15:02 2013 New Revision: 1507131 URL: http://svn.apache.org/r1507131 Log: NUTCH-1587 misspelled property threshold in conf/log4j.properties Modified: nutch/branches/2.x/CHANGES.txt nutch/branches/2.x/conf/log4j.properties Modified: nutch/branches/2.x

svn commit: r1511479 - in /nutch/trunk: CHANGES.txt src/plugin/protocol-file/src/java/org/apache/nutch/protocol/file/File.java

2013-08-07 Thread snagel
Author: snagel Date: Wed Aug 7 20:44:01 2013 New Revision: 1511479 URL: http://svn.apache.org/r1511479 Log: NUTCH-911 protocol-file to return proper protocol status for notmodified, gone, access_denied Modified: nutch/trunk/CHANGES.txt nutch/trunk/src/plugin/protocol-file/src/java/org

svn commit: r1511496 - in /nutch/branches/2.x: CHANGES.txt src/plugin/protocol-file/src/java/org/apache/nutch/protocol/file/File.java

2013-08-07 Thread snagel
Author: snagel Date: Wed Aug 7 21:10:17 2013 New Revision: 1511496 URL: http://svn.apache.org/r1511496 Log: NUTCH-911 protocol-file to return proper protocol status for notmodified, gone, access_denied Modified: nutch/branches/2.x/CHANGES.txt nutch/branches/2.x/src/plugin/protocol

svn commit: r1544341 - /nutch/branches/2.x/src/test/log4j.properties

2013-11-21 Thread snagel
Author: snagel Date: Thu Nov 21 22:04:13 2013 New Revision: 1544341 URL: http://svn.apache.org/r1544341 Log: NUTCH-1587 misspelled property threshold in log4j.properties Modified: nutch/branches/2.x/src/test/log4j.properties Modified: nutch/branches/2.x/src/test/log4j.properties URL: http

svn commit: r1544340 - /nutch/trunk/src/test/log4j.properties

2013-11-21 Thread snagel
Author: snagel Date: Thu Nov 21 22:03:18 2013 New Revision: 1544340 URL: http://svn.apache.org/r1544340 Log: NUTCH-1587 misspelled property threshold in log4j.properties Modified: nutch/trunk/src/test/log4j.properties Modified: nutch/trunk/src/test/log4j.properties URL: http

svn commit: r1560512 - in /nutch/trunk: CHANGES.txt conf/nutch-default.xml src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java

2014-01-22 Thread snagel
Author: snagel Date: Wed Jan 22 21:13:01 2014 New Revision: 1560512 URL: http://svn.apache.org/r1560512 Log: NUTCH-1413 Record response time Modified: nutch/trunk/CHANGES.txt nutch/trunk/conf/nutch-default.xml nutch/trunk/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http

svn commit: r1575350 - in /nutch/trunk: CHANGES.txt src/java/org/apache/nutch/crawl/NutchWritable.java

2014-03-07 Thread snagel
Author: snagel Date: Fri Mar 7 18:13:20 2014 New Revision: 1575350 URL: http://svn.apache.org/r1575350 Log: removed HostDB from Nutch 1.8 trunk: fix build, remove HostDb related entries from change log Modified: nutch/trunk/CHANGES.txt nutch/trunk/src/java/org/apache/nutch/crawl

svn commit: r1575351 - in /nutch/trunk: CHANGES.txt src/java/org/apache/nutch/indexer/IndexerMapReduce.java

2014-03-07 Thread snagel
Author: snagel Date: Fri Mar 7 18:15:50 2014 New Revision: 1575351 URL: http://svn.apache.org/r1575351 Log: NUTCH-1706 IndexerMapReduce does not remove db_redir_temp Modified: nutch/trunk/CHANGES.txt nutch/trunk/src/java/org/apache/nutch/indexer/IndexerMapReduce.java Modified: nutch

svn commit: r1578620 - in /nutch/branches/2.x: CHANGES.txt src/java/org/apache/nutch/indexer/IndexingFiltersChecker.java

2014-03-17 Thread snagel
Author: snagel Date: Mon Mar 17 21:56:32 2014 New Revision: 1578620 URL: http://svn.apache.org/r1578620 Log: NUTCH-1671 indexchecker to add digest field Modified: nutch/branches/2.x/CHANGES.txt nutch/branches/2.x/src/java/org/apache/nutch/indexer/IndexingFiltersChecker.java Modified

svn commit: r1580046 - in /nutch/trunk: CHANGES.txt src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java src/plugin/parse-html/src/test/org/apache/nutch/parse/html/TestHtmlParser

2014-03-21 Thread snagel
Author: snagel Date: Fri Mar 21 20:56:13 2014 New Revision: 1580046 URL: http://svn.apache.org/r1580046 Log: NUTCH-1733 parse-html to support HTML5 charset definitions Added: nutch/trunk/src/plugin/parse-html/src/test/org/apache/nutch/parse/html/TestHtmlParser.java (with props) Modified

svn commit: r1580270 - in /nutch/site: forrest/src/documentation/content/xdocs/downloads.xml publish/downloads.html

2014-03-22 Thread snagel
Author: snagel Date: Sat Mar 22 18:04:10 2014 New Revision: 1580270 URL: http://svn.apache.org/r1580270 Log: NUTCH-1742 update remaining references of 1.7 - 1.8 Modified: nutch/site/forrest/src/documentation/content/xdocs/downloads.xml nutch/site/publish/downloads.html Modified: nutch

svn commit: r4777 - /release/nutch/1.7/

2014-03-22 Thread snagel
Author: snagel Date: Sat Mar 22 18:13:52 2014 New Revision: 4777 Log: NUTCH-1742 removed 1.7 packages from svn (svnpubsub) Removed: release/nutch/1.7/

svn commit: r1583193 - in /nutch/trunk: CHANGES.txt src/test/org/apache/nutch/crawl/TestAdaptiveFetchSchedule.java

2014-03-30 Thread snagel
Author: snagel Date: Sun Mar 30 19:58:59 2014 New Revision: 1583193 URL: http://svn.apache.org/r1583193 Log: NUTCH-1645 Junit Test Case for Adaptive Fetch Schedule class Added: nutch/trunk/src/test/org/apache/nutch/crawl/TestAdaptiveFetchSchedule.java (with props) Modified: nutch

svn commit: r1585144 - in /nutch/trunk: CHANGES.txt src/java/org/apache/nutch/fetcher/Fetcher.java

2014-04-05 Thread snagel
Author: snagel Date: Sat Apr 5 17:06:04 2014 New Revision: 1585144 URL: http://svn.apache.org/r1585144 Log: NUTCH-1735 code dedup fetcher queue redirects Modified: nutch/trunk/CHANGES.txt nutch/trunk/src/java/org/apache/nutch/fetcher/Fetcher.java Modified: nutch/trunk/CHANGES.txt URL

svn commit: r1590315 - in /nutch/trunk: CHANGES.txt src/java/org/apache/nutch/crawl/CrawlDbReader.java

2014-04-26 Thread snagel
Author: snagel Date: Sat Apr 26 22:12:46 2014 New Revision: 1590315 URL: http://svn.apache.org/r1590315 Log: NUTCH-1764 readdb to show command-line help if no action (-stats, -dump, etc.) given Modified: nutch/trunk/CHANGES.txt nutch/trunk/src/java/org/apache/nutch/crawl

svn commit: r1592414 - in /nutch/branches/2.x: CHANGES.txt src/java/org/apache/nutch/fetcher/FetcherReducer.java

2014-05-04 Thread snagel
Author: snagel Date: Sun May 4 20:18:50 2014 New Revision: 1592414 URL: http://svn.apache.org/r1592414 Log: NUTCH-1182 fetcher to log hung threads Modified: nutch/branches/2.x/CHANGES.txt nutch/branches/2.x/src/java/org/apache/nutch/fetcher/FetcherReducer.java Modified: nutch/branches

svn commit: r1594071 - in /nutch: branches/2.x/ branches/2.x/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/ trunk/ trunk/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/a

2014-05-12 Thread snagel
Author: snagel Date: Mon May 12 19:39:43 2014 New Revision: 1594071 URL: http://svn.apache.org/r1594071 Log: NUTCH-1752 Cache robots.txt rules per protocol:host:port Modified: nutch/branches/2.x/CHANGES.txt nutch/branches/2.x/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http

svn commit: r1593595 - in /nutch/site: forrest/src/documentation/content/xdocs/index.xml publish/index.html

2014-05-15 Thread snagel
Author: snagel Date: Fri May 9 18:48:29 2014 New Revision: 1593595 URL: http://svn.apache.org/r1593595 Log: Nutch 1.8 includes Tika 1.5 Modified: nutch/site/forrest/src/documentation/content/xdocs/index.xml nutch/site/publish/index.html Modified: nutch/site/forrest/src/documentation

svn commit: r1604291 - in /nutch: branches/2.x/ branches/2.x/conf/ branches/2.x/src/java/org/apache/nutch/fetcher/ branches/2.x/src/java/org/apache/nutch/protocol/ trunk/ trunk/conf/ trunk/src/java/or

2014-06-20 Thread snagel
Author: snagel Date: Fri Jun 20 22:15:43 2014 New Revision: 1604291 URL: http://svn.apache.org/r1604291 Log: NUTCH-1718 redefine http.robots.agent as additional agent names Modified: nutch/branches/2.x/CHANGES.txt nutch/branches/2.x/conf/nutch-default.xml nutch/branches/2.x/src/java

svn commit: r1604298 - in /nutch: branches/2.x/ branches/2.x/src/java/org/apache/nutch/util/ branches/2.x/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/ branches/2.x/src/plugin/parse-html

2014-06-20 Thread snagel
Author: snagel Date: Fri Jun 20 22:56:32 2014 New Revision: 1604298 URL: http://svn.apache.org/r1604298 Log: NUTCH-1767 remove special treatment of params in relative links Modified: nutch/branches/2.x/CHANGES.txt nutch/branches/2.x/src/java/org/apache/nutch/util/URLUtil.java nutch

svn commit: r1605204 [3/3] - in /nutch: branches/2.x/ branches/2.x/src/java/org/apache/nutch/api/ branches/2.x/src/java/org/apache/nutch/api/impl/ branches/2.x/src/java/org/apache/nutch/crawl/ branche

2014-06-24 Thread snagel
Modified: nutch/trunk/src/plugin/urlnormalizer-basic/src/java/org/apache/nutch/net/urlnormalizer/basic/BasicURLNormalizer.java URL:

svn commit: r1607929 - /nutch/trunk/build.xml

2014-07-04 Thread snagel
Author: snagel Date: Fri Jul 4 20:15:12 2014 New Revision: 1607929 URL: http://svn.apache.org/r1607929 Log: add dependency init (calling ivy-init) to compile-core-test to fix nightly build failures introduced with NUTCH-1803 Modified: nutch/trunk/build.xml Modified: nutch/trunk/build.xml

svn commit: r1608130 - in /nutch: branches/2.x/ branches/2.x/src/java/org/apache/nutch/util/ branches/2.x/src/test/org/apache/nutch/util/ branches/2.x/src/testresources/test-mime-util/ trunk/ trunk/sr

2014-07-05 Thread snagel
Author: snagel Date: Sat Jul 5 20:36:33 2014 New Revision: 1608130 URL: http://svn.apache.org/r1608130 Log: NUTCH-1605 MIME type detector recognizes xlsx as zip file Added: nutch/branches/2.x/src/test/org/apache/nutch/util/TestMimeUtil.java (with props) nutch/branches/2.x/src

svn commit: r1608135 - in /nutch: branches/2.x/CHANGES.txt branches/2.x/src/bin/crawl branches/2.x/src/bin/nutch trunk/CHANGES.txt trunk/src/bin/crawl trunk/src/bin/nutch

2014-07-05 Thread snagel
Author: snagel Date: Sat Jul 5 21:13:19 2014 New Revision: 1608135 URL: http://svn.apache.org/r1608135 Log: NUTCH-1566 bin/nutch to allow whitespace in paths Modified: nutch/branches/2.x/CHANGES.txt nutch/branches/2.x/src/bin/crawl nutch/branches/2.x/src/bin/nutch nutch/trunk

svn commit: r1608136 - in /nutch: branches/2.x/ branches/2.x/src/java/org/apache/nutch/plugin/ trunk/ trunk/src/java/org/apache/nutch/plugin/

2014-07-05 Thread snagel
Author: snagel Date: Sat Jul 5 21:42:20 2014 New Revision: 1608136 URL: http://svn.apache.org/r1608136 Log: NUTCH-1776 Log incorrect plugin.folder file path Modified: nutch/branches/2.x/CHANGES.txt nutch/branches/2.x/src/java/org/apache/nutch/plugin/PluginManifestParser.java nutch

svn commit: r1609568 - in /nutch: branches/2.x/CHANGES.txt branches/2.x/src/bin/nutch trunk/CHANGES.txt trunk/src/bin/nutch

2014-07-10 Thread snagel
Author: snagel Date: Thu Jul 10 20:50:27 2014 New Revision: 1609568 URL: http://svn.apache.org/r1609568 Log: NUTCH-1811 bin/nutch junit to use junit 4 test runner Modified: nutch/branches/2.x/CHANGES.txt nutch/branches/2.x/src/bin/nutch nutch/trunk/CHANGES.txt nutch/trunk/src/bin

svn commit: r1614375 - in /nutch: branches/2.x/ branches/2.x/conf/ branches/2.x/src/java/org/apache/nutch/indexer/ branches/2.x/src/plugin/indexer-elastic/src/java/org/apache/nutch/indexwriter/elastic

2014-07-29 Thread snagel
Author: snagel Date: Tue Jul 29 15:13:20 2014 New Revision: 1614375 URL: http://svn.apache.org/r1614375 Log: NUTCH-1708 use same id when indexing and deleting redirects Modified: nutch/branches/2.x/CHANGES.txt nutch/branches/2.x/conf/schema.xml nutch/branches/2.x/src/java/org/apache

svn commit: r1618521 - /nutch/cms_site/trunk/content/index.md

2014-08-17 Thread snagel
Author: snagel Date: Sun Aug 17 20:24:29 2014 New Revision: 1618521 URL: http://svn.apache.org/r1618521 Log: CMS commit to nutch by snagel Modified: nutch/cms_site/trunk/content/index.md Modified: nutch/cms_site/trunk/content/index.md URL: http://svn.apache.org/viewvc/nutch/cms_site/trunk

svn commit: r919651 - /websites/production/nutch/content/

2014-08-17 Thread snagel
Author: snagel Date: Sun Aug 17 20:26:24 2014 New Revision: 919651 Log: announce tutorial at ApacheCon Europe in Budapest Added: websites/production/nutch/content/ - copied from r919650, websites/staging/nutch/trunk/content/

svn commit: r1619934 - in /nutch: branches/2.x/ branches/2.x/src/java/org/apache/nutch/crawl/ trunk/ trunk/src/java/org/apache/nutch/crawl/

2014-08-22 Thread snagel
Author: snagel Date: Fri Aug 22 21:23:32 2014 New Revision: 1619934 URL: http://svn.apache.org/r1619934 Log: NUTCH-1409 remove deprecated properties db.{default,max}.fetch.interval, generate.max.per.host.by.ip Modified: nutch/branches/2.x/CHANGES.txt nutch/branches/2.x/src/java/org

svn commit: r1619942 - in /nutch: branches/2.x/ branches/2.x/src/java/org/apache/nutch/crawl/ branches/2.x/src/java/org/apache/nutch/parse/ trunk/ trunk/src/java/org/apache/nutch/crawl/

2014-08-22 Thread snagel
Author: snagel Date: Fri Aug 22 22:23:27 2014 New Revision: 1619942 URL: http://svn.apache.org/r1619942 Log: NUTCH-1693 TextMD5Signature computed on textual content Added: nutch/branches/2.x/src/java/org/apache/nutch/crawl/TextMD5Signature.java (with props) nutch/trunk/src/java/org

svn commit: r1619944 - in /nutch/trunk: CHANGES.txt src/java/org/apache/nutch/indexer/IndexingFilter.java

2014-08-22 Thread snagel
Author: snagel Date: Fri Aug 22 22:28:12 2014 New Revision: 1619944 URL: http://svn.apache.org/r1619944 Log: NUTCH-1775 IndexingFilter: document origin of passed CrawlDatum Modified: nutch/trunk/CHANGES.txt nutch/trunk/src/java/org/apache/nutch/indexer/IndexingFilter.java Modified

svn commit: r1625821 - in /nutch/cms_site/trunk/content/apidocs/apidocs-1.9: ./ org/ org/apache/ org/apache/nutch/ org/apache/nutch/analysis/ org/apache/nutch/analysis/lang/ org/apache/nutch/analysis/

2014-09-17 Thread snagel
Author: snagel Date: Wed Sep 17 20:52:17 2014 New Revision: 1625821 URL: http://svn.apache.org/r1625821 Log: add 1.9 Java apidocs [This commit notification would consist of 137 parts, which exceeds the limit of 50 ones, so it was shortened to the summary.]

svn commit: r1625826 - /nutch/cms_site/trunk/content/javadoc.md

2014-09-17 Thread snagel
Author: snagel Date: Wed Sep 17 21:07:29 2014 New Revision: 1625826 URL: http://svn.apache.org/r1625826 Log: add apidoc 1.9 Modified: nutch/cms_site/trunk/content/javadoc.md Modified: nutch/cms_site/trunk/content/javadoc.md URL: http://svn.apache.org/viewvc/nutch/cms_site/trunk/content

svn commit: r922601 - /websites/production/nutch/content/

2014-09-17 Thread snagel
Author: snagel Date: Wed Sep 17 21:08:05 2014 New Revision: 922601 Log: add Java apidoc 1.9 Added: websites/production/nutch/content/ - copied from r922599, websites/staging/nutch/trunk/content/

svn commit: r922608 - /websites/production/nutch/content/

2014-09-17 Thread snagel
Author: snagel Date: Wed Sep 17 21:32:43 2014 New Revision: 922608 Log: update Java apidoc 1.9 Added: websites/production/nutch/content/ - copied from r922607, websites/staging/nutch/trunk/content/

svn commit: r1626581 - in /nutch: branches/2.x/KEYS branches/2.x/ivy/mvn.template trunk/KEYS trunk/ivy/mvn.template

2014-09-21 Thread snagel
Author: snagel Date: Sun Sep 21 14:18:26 2014 New Revision: 1626581 URL: http://svn.apache.org/r1626581 Log: add committer snagel Modified: nutch/branches/2.x/KEYS nutch/branches/2.x/ivy/mvn.template nutch/trunk/KEYS nutch/trunk/ivy/mvn.template Modified: nutch/branches/2.x/KEYS

svn commit: r1629076 - in /nutch/trunk: CHANGES.txt src/java/org/apache/nutch/indexer/IndexingFiltersChecker.java

2014-10-02 Thread snagel
Author: snagel Date: Thu Oct 2 21:37:04 2014 New Revision: 1629076 URL: http://svn.apache.org/r1629076 Log: NUTCH-1826 indexchecker fails if solr.server.url not configured Modified: nutch/trunk/CHANGES.txt nutch/trunk/src/java/org/apache/nutch/indexer/IndexingFiltersChecker.java

svn commit: r1630565 - in /nutch/trunk: ./ src/plugin/ src/plugin/protocol-http/ src/plugin/protocol-http/jsp/ src/plugin/protocol-http/src/test/conf/ src/plugin/protocol-http/src/test/org/apache/nutc

2014-10-09 Thread snagel
Author: snagel Date: Thu Oct 9 19:20:51 2014 New Revision: 1630565 URL: http://svn.apache.org/r1630565 Log: NUTCH-1164 JUnit tests for protocol-http Added: nutch/trunk/src/plugin/protocol-http/jsp/ nutch/trunk/src/plugin/protocol-http/jsp/basic-http.jsp (with props) nutch/trunk

svn commit: r1633222 - in /nutch/branches/2.x: ./ conf/ src/java/org/apache/nutch/parse/ src/plugin/index-metadata/src/java/org/apache/nutch/indexer/metadata/ src/plugin/parse-html/src/java/org/apache

2014-10-20 Thread snagel
Author: snagel Date: Mon Oct 20 20:44:00 2014 New Revision: 1633222 URL: http://svn.apache.org/r1633222 Log: NUTCH-1827 Port issues 1467 and 1561 to 2.x Modified: nutch/branches/2.x/CHANGES.txt nutch/branches/2.x/conf/nutch-default.xml nutch/branches/2.x/src/java/org/apache/nutch

svn commit: r1633426 - in /nutch: branches/2.x/CHANGES.txt branches/2.x/build.xml trunk/CHANGES.txt trunk/build.xml

2014-10-21 Thread snagel
Author: snagel Date: Tue Oct 21 17:52:27 2014 New Revision: 1633426 URL: http://svn.apache.org/r1633426 Log: NUTCH-1882 ant eclipse target to add output path to src/test Modified: nutch/branches/2.x/CHANGES.txt nutch/branches/2.x/build.xml nutch/trunk/CHANGES.txt nutch/trunk

svn commit: r1634694 - in /nutch: branches/2.x/CHANGES.txt branches/2.x/src/bin/crawl trunk/CHANGES.txt trunk/src/bin/crawl

2014-10-27 Thread snagel
Author: snagel Date: Mon Oct 27 21:38:50 2014 New Revision: 1634694 URL: http://svn.apache.org/r1634694 Log: NUTCH-1883 bin/crawl: use function to run bin/nutch and check exit value Modified: nutch/branches/2.x/CHANGES.txt nutch/branches/2.x/src/bin/crawl nutch/trunk/CHANGES.txt

svn commit: r1638203 - in /nutch: branches/2.x/src/bin/crawl trunk/src/bin/crawl

2014-11-11 Thread snagel
Author: snagel Date: Tue Nov 11 16:20:01 2014 New Revision: 1638203 URL: http://svn.apache.org/r1638203 Log: NUTCH-1883 in case of generate: break loop and do not exit with error Modified: nutch/branches/2.x/src/bin/crawl nutch/trunk/src/bin/crawl Modified: nutch/branches/2.x/src/bin

svn commit: r1643412 - in /nutch: branches/2.x/CHANGES.txt branches/2.x/conf/suffix-urlfilter.txt.template trunk/CHANGES.txt trunk/conf/suffix-urlfilter.txt.template

2014-12-05 Thread snagel
Author: snagel Date: Fri Dec 5 19:53:35 2014 New Revision: 1643412 URL: http://svn.apache.org/r1643412 Log: NUTCH-1877 Suffix URL filter to ignore query string by default Modified: nutch/branches/2.x/CHANGES.txt nutch/branches/2.x/conf/suffix-urlfilter.txt.template nutch/trunk

svn commit: r1655169 - in /nutch/branches/2.x: CHANGES.txt src/plugin/parse-tika/ivy.xml src/plugin/parse-tika/plugin.xml

2015-01-27 Thread snagel
Author: snagel Date: Tue Jan 27 21:45:39 2015 New Revision: 1655169 URL: http://svn.apache.org/r1655169 Log: NUTCH-1893 Parse-tika failes to parse feed files Modified: nutch/branches/2.x/CHANGES.txt nutch/branches/2.x/src/plugin/parse-tika/ivy.xml nutch/branches/2.x/src/plugin/parse

svn commit: r1651193 - in /nutch/trunk: CHANGES.txt build.xml

2015-01-12 Thread snagel
Author: snagel Date: Mon Jan 12 20:45:16 2015 New Revision: 1651193 URL: http://svn.apache.org/r1651193 Log: NUTCH-1881 ant target resolve-default to keep test libs Modified: nutch/trunk/CHANGES.txt nutch/trunk/build.xml Modified: nutch/trunk/CHANGES.txt URL: http://svn.apache.org

svn commit: r1650181 - in /nutch/trunk: CHANGES.txt src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java src/plugin/index-more/src/test/org/apache/nutch/indexer/more/Te

2015-01-07 Thread snagel
Author: snagel Date: Wed Jan 7 22:25:18 2015 New Revision: 1650181 URL: http://svn.apache.org/r1650181 Log: NUTCH-1140 index-more plugin, resetTitle creates multiple values in title field Modified: nutch/trunk/CHANGES.txt nutch/trunk/src/plugin/index-more/src/java/org/apache/nutch

svn commit: r1670442 - /nutch/trunk/src/test/org/apache/nutch/crawl/TestCrawlDbMerger.java

2015-03-31 Thread snagel
Author: snagel Date: Tue Mar 31 19:28:14 2015 New Revision: 1670442 URL: http://svn.apache.org/r1670442 Log: NUTCH-1979 CrawlDbReader to implement Tool: fix unit test Modified: nutch/trunk/src/test/org/apache/nutch/crawl/TestCrawlDbMerger.java Modified: nutch/trunk/src/test/org/apache/nutch

svn commit: r1669692 - in /nutch: branches/2.x/ branches/2.x/conf/ branches/2.x/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/ branches/2.x/src/plugin/protocol-httpclient/src/java/or

2015-03-27 Thread snagel
Author: snagel Date: Fri Mar 27 21:42:35 2015 New Revision: 1669692 URL: http://svn.apache.org/r1669692 Log: NUTCH-1941 Optional rolling http.agent.names Modified: nutch/branches/2.x/CHANGES.txt nutch/branches/2.x/conf/nutch-default.xml nutch/branches/2.x/src/plugin/lib-http/src

svn commit: r1678824 - in /nutch/trunk: CHANGES.txt src/test/org/apache/nutch/tools/TestCommonCrawlDataDumper.java

2015-05-11 Thread snagel
Author: snagel Date: Mon May 11 21:04:59 2015 New Revision: 1678824 URL: http://svn.apache.org/r1678824 Log: NUTCH-1998 Add support for user-defined file extension to CommonCrawlDataDumper: fix unit test Modified: nutch/trunk/CHANGES.txt nutch/trunk/src/test/org/apache/nutch/tools

svn commit: r1680110 - in /nutch/trunk: CHANGES.txt conf/log4j.properties

2015-05-18 Thread snagel
Author: snagel Date: Mon May 18 21:39:23 2015 New Revision: 1680110 URL: http://svn.apache.org/r1680110 Log: NUTCH-2013 Fetcher: missing logs fetching ... on stdout Modified: nutch/trunk/CHANGES.txt nutch/trunk/conf/log4j.properties Modified: nutch/trunk/CHANGES.txt URL: http

svn commit: r1680109 - in /nutch/trunk: CHANGES.txt src/java/org/apache/nutch/fetcher/Fetcher.java

2015-05-18 Thread snagel
Author: snagel Date: Mon May 18 21:35:03 2015 New Revision: 1680109 URL: http://svn.apache.org/r1680109 Log: NUTCH-2014 Fetcher hang-up on completion Modified: nutch/trunk/CHANGES.txt nutch/trunk/src/java/org/apache/nutch/fetcher/Fetcher.java Modified: nutch/trunk/CHANGES.txt URL: http

svn commit: r1674399 - in /nutch/trunk: ./ conf/ src/java/org/apache/nutch/protocol/ src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/ src/plugin/protocol-ftp/src/java/org/apache/nutch/

2015-04-17 Thread snagel
Author: snagel Date: Fri Apr 17 20:49:19 2015 New Revision: 1674399 URL: http://svn.apache.org/r1674399 Log: NUTCH-1927 Create a whitelist of IPs/hostnames to allow skipping of RobotRules parsing Removed: nutch/trunk/src/java/org/apache/nutch/protocol/RobotRules.java Modified: nutch

svn commit: r1674581 - in /nutch/trunk: CHANGES.txt src/java/org/apache/nutch/parse/ParseSegment.java src/java/org/apache/nutch/segment/SegmentChecker.java

2015-04-18 Thread snagel
Author: snagel Date: Sat Apr 18 20:41:13 2015 New Revision: 1674581 URL: http://svn.apache.org/r1674581 Log: NUTCH-1854 bin/crawl fails with a parsing fetcher Modified: nutch/trunk/CHANGES.txt nutch/trunk/src/java/org/apache/nutch/parse/ParseSegment.java nutch/trunk/src/java/org

svn commit: r1672939 - in /nutch: branches/2.x/CHANGES.txt branches/2.x/ivy/ivy.xml trunk/CHANGES.txt trunk/ivy/ivy.xml

2015-04-11 Thread snagel
Author: snagel Date: Sat Apr 11 22:07:52 2015 New Revision: 1672939 URL: http://svn.apache.org/r1672939 Log: NUTCH-1981 Upgrade to icu4j 55.1 Modified: nutch/branches/2.x/CHANGES.txt nutch/branches/2.x/ivy/ivy.xml nutch/trunk/CHANGES.txt nutch/trunk/ivy/ivy.xml Modified: nutch

svn commit: r1687604 - in /nutch/trunk: CHANGES.txt src/java/org/apache/nutch/crawl/LinkDb.java

2015-06-25 Thread snagel
Author: snagel Date: Thu Jun 25 18:41:26 2015 New Revision: 1687604 URL: http://svn.apache.org/r1687604 Log: NUTCH-2000 Link inversion fails with .locked already exists Modified: nutch/trunk/CHANGES.txt nutch/trunk/src/java/org/apache/nutch/crawl/LinkDb.java Modified: nutch/trunk

svn commit: r1682103 - in /nutch/trunk: CHANGES.txt src/bin/nutch

2015-05-27 Thread snagel
Author: snagel Date: Wed May 27 19:31:51 2015 New Revision: 1682103 URL: http://svn.apache.org/r1682103 Log: NUTCH-2007 add test libs to classpath of bin/nutch junit Modified: nutch/trunk/CHANGES.txt nutch/trunk/src/bin/nutch Modified: nutch/trunk/CHANGES.txt URL: http://svn.apache.org

svn commit: r1691436 - /nutch/trunk/CHANGES.txt

2015-07-16 Thread snagel
Author: snagel Date: Thu Jul 16 19:52:00 2015 New Revision: 1691436 URL: http://svn.apache.org/r1691436 Log: remove duplicate entries Modified: nutch/trunk/CHANGES.txt Modified: nutch/trunk/CHANGES.txt URL: http://svn.apache.org/viewvc/nutch/trunk/CHANGES.txt?rev=1691436&r1=1691435&r2

svn commit: r1714655 - in /nutch/branches/2.x: CHANGES.txt conf/schema.xml

2015-11-16 Thread snagel
Author: snagel Date: Mon Nov 16 20:29:33 2015 New Revision: 1714655 URL: http://svn.apache.org/viewvc?rev=1714655&view=rev Log: NUTCH-2130 copyField rawcontent creates error within schema.xml Modified: nutch/branches/2.x/CHANGES.txt nutch/branches/2.x/conf/schema.xml Modified: nutch/branches

svn commit: r1707360 - in /nutch/trunk: CHANGES.txt src/java/org/apache/nutch/fetcher/FetcherThread.java

2015-10-07 Thread snagel
Author: snagel Date: Wed Oct 7 19:02:42 2015 New Revision: 1707360 URL: http://svn.apache.org/viewvc?rev=1707360&view=rev Log: NUTCH-2124 Fetcher following same redirect again and again Modified: nutch/trunk/CHANGES.txt nutch/trunk/src/java/org/apache/nutch/fetcher/FetcherThread.java

svn commit: r1704425 - in /nutch/trunk: ./ src/plugin/lib-selenium/ src/plugin/protocol-interactiveselenium/ src/plugin/protocol-selenium/

2015-09-21 Thread snagel
Author: snagel Date: Mon Sep 21 21:14:55 2015 New Revision: 1704425 URL: http://svn.apache.org/viewvc?rev=1704425&view=rev Log: NUTCH-2106 Runtime to contain Selenium and dependencies only once Modified: nutch/trunk/CHANGES.txt nutch/trunk/src/plugin/lib-selenium/build-ivy.xml nutch/trunk

svn commit: r1718678 - in /nutch/trunk: conf/nutch-default.xml default.properties src/bin/nutch

2015-12-08 Thread snagel
Author: snagel Date: Tue Dec 8 19:18:19 2015 New Revision: 1718678 URL: http://svn.apache.org/viewvc?rev=1718678&view=rev Log: Update Nutch trunk for new development: 1.11 -> 1.12 Modified: nutch/trunk/conf/nutch-default.xml nutch/trunk/default.properties nutch/trunk/src/bin/nu

svn commit: r1717537 - in /nutch/branches/2.x: CHANGES.txt src/plugin/subcollection/plugin.xml src/plugin/urlnormalizer-regex/plugin.xml

2015-12-01 Thread snagel
Author: snagel Date: Tue Dec 1 21:17:14 2015 New Revision: 1717537 URL: http://svn.apache.org/viewvc?rev=1717537&view=rev Log: NUTCH-2107 plugin.xml to validate against plugin.dtd Modified: nutch/branches/2.x/CHANGES.txt nutch/branches/2.x/src/plugin/subcollection/plugin.xml nutch

svn commit: r1717536 - in /nutch/trunk: CHANGES.txt src/plugin/subcollection/plugin.xml src/plugin/urlnormalizer-regex/plugin.xml

2015-12-01 Thread snagel
Author: snagel Date: Tue Dec 1 21:15:21 2015 New Revision: 1717536 URL: http://svn.apache.org/viewvc?rev=1717536&view=rev Log: NUTCH-2107 plugin.xml to validate against plugin.dtd Modified: nutch/trunk/CHANGES.txt nutch/trunk/src/plugin/subcollection/plugin.xml nutch/trunk/src/plugin

svn commit: r1718223 - in /nutch/trunk: CHANGES.txt conf/contenttype-mapping.txt.template src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java

2015-12-06 Thread snagel
Author: snagel Date: Sun Dec 6 21:14:06 2015 New Revision: 1718223 URL: http://svn.apache.org/viewvc?rev=1718223&view=rev Log: NUTCH-2172 index-more: document format of contenttype-mapping.txt Added: nutch/trunk/conf/contenttype-mapping.txt.template Modified: nutch/trunk/CHANGES.txt

svn commit: r1718718 - in /nutch: branches/2.x/CHANGES.txt branches/2.x/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java trunk/CHANGES.txt trunk/src/plugin/parse-html/src/jav

2015-12-08 Thread snagel
Author: snagel Date: Tue Dec 8 21:45:47 2015 New Revision: 1718718 URL: http://svn.apache.org/viewvc?rev=1718718&view=rev Log: NUTCH-2042 parse-html increase chunk size used to detect charset Modified: nutch/branches/2.x/CHANGES.txt nutch/branches/2.x/src/plugin/parse-html/src/java/org

svn commit: r1723851 - in /nutch/branches/2.x: CHANGES.txt src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/TikaParser.java

2016-01-09 Thread snagel
Author: snagel Date: Sat Jan 9 13:01:31 2016 New Revision: 1723851 URL: http://svn.apache.org/viewvc?rev=1723851&view=rev Log: NUTCH-2168 Parse-tika fails to retrieve parser Modified: nutch/branches/2.x/CHANGES.txt nutch/branches/2.x/src/plugin/parse-tika/src/java/org/apache/nutch/parse

svn commit: r1723626 - in /nutch/branches/2.x: CHANGES.txt src/java/org/apache/nutch/crawl/GeneratorJob.java

2016-01-07 Thread snagel
Author: snagel Date: Thu Jan 7 20:57:13 2016 New Revision: 1723626 URL: http://svn.apache.org/viewvc?rev=1723626&view=rev Log: NUTCH-2143 GeneratorJob ignores batch id passed as argument Modified: nutch/branches/2.x/CHANGES.txt nutch/branches/2.x/src/java/org/apache/nutch/crawl

svn commit: r1716177 - in /nutch/trunk: CHANGES.txt conf/nutch-default.xml

2015-11-24 Thread snagel
Author: snagel Date: Tue Nov 24 15:37:32 2015 New Revision: 1716177 URL: http://svn.apache.org/viewvc?rev=1716177&view=rev Log: NUTCH-2175 Typos in property descriptions Modified: nutch/trunk/CHANGES.txt nutch/trunk/conf/nutch-default.xml Modified: nutch/trunk/CHANGES.txt URL: http

nutch git commit: NUTCH-2272 Index checker server to optionally keep client connection open - removed from change log for release 1.12 as it is not included

2016-06-23 Thread snagel
Repository: nutch Updated Branches: refs/heads/master af6d8763f -> d29be63bd NUTCH-2272 Index checker server to optionally keep client connection open - removed from change log for release 1.12 as it is not included Project: http://git-wip-us.apache.org/repos/asf/nutch/repo Commit:

[5/5] nutch git commit: fix unit test: CrawlDbFilter still writes reduce output dirs as part-00000 (not part-r-00000)

2016-02-25 Thread snagel
fix unit test: CrawlDbFilter still writes reduce output dirs as part-00000 (not part-r-00000) Project: http://git-wip-us.apache.org/repos/asf/nutch/repo Commit: http://git-wip-us.apache.org/repos/asf/nutch/commit/f5e430e5 Tree: http://git-wip-us.apache.org/repos/asf/nutch/tree/f5e430e5 Diff:

[1/5] nutch git commit: update tests to reflect change of reduce outputs by new API (part-nnnnn -> part-r-nnnnn): all unit tests pass now

2016-02-25 Thread snagel
Repository: nutch Updated Branches: refs/heads/master 25e879afc -> f5e430e55 update tests to reflect change of reduce outputs by new API (part-nnnnn -> part-r-nnnnn): all unit tests pass now Project: http://git-wip-us.apache.org/repos/asf/nutch/repo Commit:

[3/5] nutch git commit: NUTCH-1712 applied to current trunk; run first simple tests (inject + merge)

2016-02-25 Thread snagel
NUTCH-1712 applied to current trunk; run first simple tests (inject + merge) Project: http://git-wip-us.apache.org/repos/asf/nutch/repo Commit: http://git-wip-us.apache.org/repos/asf/nutch/commit/3c691eb2 Tree: http://git-wip-us.apache.org/repos/asf/nutch/tree/3c691eb2 Diff:

[4/5] nutch git commit: NUTCH-1712 Use MultipleInputs in Injector to make it a single mapreduce job, this closes #86

2016-02-25 Thread snagel
g +* NUTCH-1712 Use MultipleInputs in Injector to make it a single mapreduce job (tejasp, snagel) + * NUTCH-2231 Jexl support in generator job (markus) * NUTCH-2232 DeduplicationJob should decode URL's before length is compared (Ron van der Vegt via markus)

[2/5] nutch git commit: add unit tests based on MRUnit

2016-02-25 Thread snagel
add unit tests based on MRUnit Project: http://git-wip-us.apache.org/repos/asf/nutch/repo Commit: http://git-wip-us.apache.org/repos/asf/nutch/commit/288dceed Tree: http://git-wip-us.apache.org/repos/asf/nutch/tree/288dceed Diff: http://git-wip-us.apache.org/repos/asf/nutch/diff/288dceed

svn commit: r1726314 - in /nutch/trunk: CHANGES.txt conf/regex-normalize.xml.template ivy/ivy.xml

2016-01-22 Thread snagel
Author: snagel Date: Fri Jan 22 21:26:12 2016 New Revision: 1726314 URL: http://svn.apache.org/viewvc?rev=1726314&view=rev Log: NUTCH-2204 Remove junit lib from runtime Modified: nutch/trunk/CHANGES.txt nutch/trunk/conf/regex-normalize.xml.template nutch/trunk/ivy/ivy.xml Modified: nutch

nutch git commit: Inconsistent log level

2016-04-29 Thread snagel
+ b/CHANGES.txt @@ -10,6 +10,8 @@ in the release announcement and keep it on top in this CHANGES.txt for the Nutch Nutch Change Log +* NUTCH-2256 Inconsistent log level (songwanging via snagel) + * NUTCH-2254 Indexer: character set issue with -addBinaryContent and -base64 (Federico Bonelli, sna

nutch git commit: NUTCH-2254 Indexer: character set issue with -addBinaryContent and -base64 - generate base64 encoded string directly from content bytes (patch provided by Federico Bonelli) - add JUn

2016-04-27 Thread snagel
ent and -base64 (Federico Bonelli, snagel) + * NUTCH-2250 CommonCrawlDumper : Invalid format and skipped parts (Thamme Gowda N.,lewismc via mattmann) * NUTCH-2245 Developed the NGram Model on the existing Unigram Cosine Similarity Model (bhavyasanghavi via sujen) http://git-wip-us.apache.org/

nutch git commit: Inconsistent log level

2016-04-29 Thread snagel
GES.txt @@ -2,6 +2,8 @@ Nutch Change Log Nutch 2.4 Development + * NUTCH-2256 Inconsistent log level (songwanging via snagel) + * NUTCH-961 GitHub-92 Add the boilerpipe parsing adapted from NUTCH-961 (Jeremie Bourseaux <jeremie.bours...@xilopix.com> via mattmann) * GitHub-94 Fix

nutch git commit: fix for NUTCH-2191 - fixing Nutch build - contributed by karanjeets

2016-04-18 Thread snagel
Repository: nutch Updated Branches: refs/heads/master 044e8e77e -> 8572fd955 fix for NUTCH-2191 - fixing Nutch build - contributed by karanjeets Project: http://git-wip-us.apache.org/repos/asf/nutch/repo Commit: http://git-wip-us.apache.org/repos/asf/nutch/commit/8572fd95 Tree:

[2/2] nutch git commit: NUTCH-1553 Property 'indexer.delete.robots.noindex' not working when using parser-html - fix broken unit test (fix HTML markup, make test for meta data extraction obligatory) -

2016-07-01 Thread snagel
NUTCH-1553 Property 'indexer.delete.robots.noindex' not working when using parser-html - fix broken unit test (fix HTML markup, make test for meta data extraction obligatory) - add all values of general metadata to parse metadata Project: http://git-wip-us.apache.org/repos/asf/nutch/repo

[1/2] nutch git commit: NUTCH-2291 - Fix mrunit dependencies - remove classifier from dependency because pom file name on Maven repository does not contain a classifier

2016-07-01 Thread snagel
Repository: nutch Updated Branches: refs/heads/master cb6fbae51 -> 34050adae NUTCH-2291 - Fix mrunit dependencies - remove classifier from dependency because pom file name on Maven repository does not contain a classifier Project: http://git-wip-us.apache.org/repos/asf/nutch/repo Commit:

[3/4] nutch git commit: CrawlDb statistics: add fetch time (earliest, latest, average)

2016-07-02 Thread snagel
CrawlDb statistics: add fetch time (earliest, latest, average) Project: http://git-wip-us.apache.org/repos/asf/nutch/repo Commit: http://git-wip-us.apache.org/repos/asf/nutch/commit/ea2843b9 Tree: http://git-wip-us.apache.org/repos/asf/nutch/tree/ea2843b9 Diff:

[2/4] nutch git commit: CrawlDb statistics: add fetch interval (shortest, longest, average)

2016-07-02 Thread snagel
CrawlDb statistics: add fetch interval (shortest, longest, average) Project: http://git-wip-us.apache.org/repos/asf/nutch/repo Commit: http://git-wip-us.apache.org/repos/asf/nutch/commit/39f6c713 Tree: http://git-wip-us.apache.org/repos/asf/nutch/tree/39f6c713 Diff:

[1/2] nutch git commit: Remove obsolete properties protocol.plugin.check.blocking and protocol.plugin.check.robots

2016-08-16 Thread snagel
Repository: nutch Updated Branches: refs/heads/master d27c351f4 -> d37b7ce13 Remove obsolete properties protocol.plugin.check.blocking and protocol.plugin.check.robots Project: http://git-wip-us.apache.org/repos/asf/nutch/repo Commit:

[2/2] nutch git commit: Merge branch 'NUTCH-2299' of https://github.com/sebastian-nagel/nutch this closes #140 - Remove obsolete properties protocol.plugin.check.*

2016-08-16 Thread snagel
Merge branch 'NUTCH-2299' of https://github.com/sebastian-nagel/nutch this closes #140 - Remove obsolete properties protocol.plugin.check.* Project: http://git-wip-us.apache.org/repos/asf/nutch/repo Commit: http://git-wip-us.apache.org/repos/asf/nutch/commit/d37b7ce1 Tree:

nutch git commit: NUTCH-2349 urlnormalizer-basic: NPE for URLs without authority - check whether URL.getAuthority() returns null - recompose URLs without authority with empty authority/host

2017-02-01 Thread snagel
Repository: nutch Updated Branches: refs/heads/2.x 022ed5c03 -> 700857d16 NUTCH-2349 urlnormalizer-basic: NPE for URLs without authority - check whether URL.getAuthority() returns null - recompose URLs without authority with empty authority/host Project:
