I'm new to Nutch. I've been trying to get through the tutorials
(Nutch2tutorial and the older ones) but I'm getting an error when I try to
do a crawl:
==========================================
^Ccocofan@cocofan-notebook-PC:~/Dropbox/project/apache-nutch-2.1/runtime/local$
bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 5
FetcherJob: threads: 10
FetcherJob: parsing: false
FetcherJob: resuming: false
FetcherJob : timelimit set for : -1
Using queue mode : byHost
Fetcher: threads: 10
QueueFeeder finished: total 1 records. Hit by time limit :0
fetching http://nutch.apache.org/
-finishing thread FetcherThread2, activeThreads=5
-finishing thread FetcherThread1, activeThreads=4
-finishing thread FetcherThread4, activeThreads=3
-finishing thread FetcherThread6, activeThreads=1
-finishing thread FetcherThread5, activeThreads=2
-finishing thread FetcherThread8, activeThreads=1
Fetcher: throughput threshold: -1
Fetcher: throughput threshold sequence: 5
-finishing thread FetcherThread3, activeThreads=1
-finishing thread FetcherThread7, activeThreads=1
-finishing thread FetcherThread9, activeThreads=1
-finishing thread FetcherThread0, activeThreads=0
0/0 spinwaiting/active, 1 pages, 0 errors, 0.2 0.2 pages/s, 51 51 kb/s, 0 URLs in 0 queues
-activeThreads=0
ParserJob: resuming: false
ParserJob: forced reparse: false
ParserJob: parsing all
Parsing http://nutch.apache.org/
FetcherJob: threads: 10
FetcherJob: parsing: false
FetcherJob: resuming: false
FetcherJob : timelimit set for : -1
Using queue mode : byHost
Fetcher: threads: 10
Fetcher: throughput threshold: -1
Fetcher: throughput threshold sequence: 5
fetching http://nutch.apache.org/about.html
QueueFeeder finished: total 5 records. Hit by time limit :0
10/10 spinwaiting/active, 1 pages, 0 errors, 0.2 0.2 pages/s, 16 16 kb/s, 4 URLs in 1 queues
* queue: http://nutch.apache.org
maxThreads = 1
inProgress = 0
crawlDelay = 5000
minCrawlDelay = 0
nextFetchTime = 1351824493524
now = 1351824493242
0. http://nutch.apache.org/credits.html
1. http://nutch.apache.org/apidocs-2.1/index.html
2. http://nutch.apache.org/bot.html
3. http://nutch.apache.org/apidocs-1.5/index.html
fetching http://nutch.apache.org/credits.html
10/10 spinwaiting/active, 2 pages, 0 errors, 0.2 0.2 pages/s, 18 19 kb/s, 3 URLs in 1 queues
* queue: http://nutch.apache.org
maxThreads = 1
inProgress = 0
crawlDelay = 5000
minCrawlDelay = 0
nextFetchTime = 1351824498764
now = 1351824498244
0. http://nutch.apache.org/apidocs-2.1/index.html
1. http://nutch.apache.org/bot.html
2. http://nutch.apache.org/apidocs-1.5/index.html
fetching http://nutch.apache.org/apidocs-2.1/index.html
10/10 spinwaiting/active, 3 pages, 0 errors, 0.2 0.2 pages/s, 12 2 kb/s, 2 URLs in 1 queues
* queue: http://nutch.apache.org
maxThreads = 1
inProgress = 0
crawlDelay = 5000
minCrawlDelay = 0
nextFetchTime = 1351824503930
now = 1351824503246
0. http://nutch.apache.org/bot.html
1. http://nutch.apache.org/apidocs-1.5/index.html
fetching http://nutch.apache.org/bot.html
10/10 spinwaiting/active, 4 pages, 0 errors, 0.2 0.2 pages/s, 14 18 kb/s, 1 URLs in 1 queues
* queue: http://nutch.apache.org
maxThreads = 1
inProgress = 0
crawlDelay = 5000
minCrawlDelay = 0
nextFetchTime = 1351824509093
now = 1351824508247
0. http://nutch.apache.org/apidocs-1.5/index.html
fetching http://nutch.apache.org/apidocs-1.5/index.html
-finishing thread FetcherThread5, activeThreads=9
-finishing thread FetcherThread3, activeThreads=8
-finishing thread FetcherThread2, activeThreads=7
-finishing thread FetcherThread8, activeThreads=3
-finishing thread FetcherThread7, activeThreads=6
-finishing thread FetcherThread0, activeThreads=1
-finishing thread FetcherThread4, activeThreads=2
-finishing thread FetcherThread6, activeThreads=4
-finishing thread FetcherThread1, activeThreads=5
-finishing thread FetcherThread9, activeThreads=0
0/0 spinwaiting/active, 5 pages, 0 errors, 0.2 0.2 pages/s, 11 2 kb/s, 0 URLs in 0 queues
-activeThreads=0
ParserJob: resuming: false
ParserJob: forced reparse: false
ParserJob: parsing all
Skipping http://nutch.apache.org/; different batch id (null)
Parsing http://nutch.apache.org/about.html
Parsing http://nutch.apache.org/apidocs-1.5/index.html
Parsing http://nutch.apache.org/apidocs-2.1/index.html
Parsing http://nutch.apache.org/bot.html
Parsing http://nutch.apache.org/credits.html
Skipping http://nutch.apache.org/faq.html; different batch id (null)
Skipping http://nutch.apache.org/index.html; different batch id (null)
Skipping http://nutch.apache.org/index.pdf; different batch id (null)
Skipping http://nutch.apache.org/issue_tracking.html; different batch id (null)
Skipping http://nutch.apache.org/mailing_lists.html; different batch id (null)
Skipping http://nutch.apache.org/nightly.html; different batch id (null)
Skipping http://nutch.apache.org/old_downloads.html; different batch id (null)
Skipping http://nutch.apache.org/sonar.html; different batch id (null)
Skipping http://nutch.apache.org/tutorial.html; different batch id (null)
Skipping http://nutch.apache.org/version_control.html; different batch id (null)
Skipping http://nutch.apache.org/wiki.html; different batch id (null)
FetcherJob: threads: 10
FetcherJob: parsing: false
FetcherJob: resuming: false
FetcherJob : timelimit set for : -1
Using queue mode : byHost
Fetcher: threads: 10
Fetcher: throughput threshold: -1
Fetcher: throughput threshold sequence: 5
fetching http://nutch.apache.org/nightly.html
QueueFeeder finished: total 5 records. Hit by time limit :0
10/10 spinwaiting/active, 1 pages, 0 errors, 0.2 0.2 pages/s, 15 15 kb/s, 4 URLs in 1 queues
* queue: http://nutch.apache.org
maxThreads = 1
inProgress = 0
crawlDelay = 5000
minCrawlDelay = 0
nextFetchTime = 1351824544459
now = 1351824543944
0. http://nutch.apache.org/mailing_lists.html
1. http://nutch.apache.org/faq.html
2. http://nutch.apache.org/index.html
3. http://nutch.apache.org/issue_tracking.html
fetching http://nutch.apache.org/mailing_lists.html
10/10 spinwaiting/active, 2 pages, 0 errors, 0.2 0.2 pages/s, 18 21 kb/s, 3 URLs in 1 queues
* queue: http://nutch.apache.org
maxThreads = 1
inProgress = 0
crawlDelay = 5000
minCrawlDelay = 0
nextFetchTime = 1351824550073
now = 1351824548946
0. http://nutch.apache.org/faq.html
1. http://nutch.apache.org/index.html
2. http://nutch.apache.org/issue_tracking.html
fetching http://nutch.apache.org/faq.html
10/10 spinwaiting/active, 3 pages, 0 errors, 0.2 0.2 pages/s, 17 15 kb/s, 2 URLs in 1 queues
* queue: http://nutch.apache.org
maxThreads = 1
inProgress = 0
crawlDelay = 5000
minCrawlDelay = 0
nextFetchTime = 1351824555568
now = 1351824553948
0. http://nutch.apache.org/index.html
1. http://nutch.apache.org/issue_tracking.html
fetching http://nutch.apache.org/index.html
10/10 spinwaiting/active, 4 pages, 0 errors, 0.2 0.2 pages/s, 25 51 kb/s, 1 URLs in 1 queues
* queue: http://nutch.apache.org
maxThreads = 1
inProgress = 0
crawlDelay = 5000
minCrawlDelay = 0
nextFetchTime = 1351824561359
now = 1351824558949
0. http://nutch.apache.org/issue_tracking.html
fetching http://nutch.apache.org/issue_tracking.html
-finishing thread FetcherThread4, activeThreads=8
-finishing thread FetcherThread8, activeThreads=8
-finishing thread FetcherThread3, activeThreads=7
-finishing thread FetcherThread5, activeThreads=6
-finishing thread FetcherThread0, activeThreads=5
-finishing thread FetcherThread7, activeThreads=4
-finishing thread FetcherThread2, activeThreads=3
-finishing thread FetcherThread6, activeThreads=2
-finishing thread FetcherThread1, activeThreads=1
-finishing thread FetcherThread9, activeThreads=0
0/0 spinwaiting/active, 5 pages, 0 errors, 0.2 0.2 pages/s, 23 15 kb/s, 0 URLs in 0 queues
-activeThreads=0
ParserJob: resuming: false
ParserJob: forced reparse: false
ParserJob: parsing all
Skipping http://nutch.apache.org/; different batch id (null)
Skipping http://nutch.apache.org/about.html; different batch id (null)
Skipping http://nutch.apache.org/about.pdf; different batch id (null)
Skipping http://nutch.apache.org/apidocs-1.5/allclasses-frame.html; different batch id (null)
Skipping http://nutch.apache.org/apidocs-1.5/index.html; different batch id (null)
Skipping http://nutch.apache.org/apidocs-1.5/overview-frame.html; different batch id (null)
Skipping http://nutch.apache.org/apidocs-1.5/overview-summary.html; different batch id (null)
Skipping http://nutch.apache.org/apidocs-2.1/allclasses-frame.html; different batch id (null)
Skipping http://nutch.apache.org/apidocs-2.1/index.html; different batch id (null)
Skipping http://nutch.apache.org/apidocs-2.1/overview-frame.html; different batch id (null)
Skipping http://nutch.apache.org/apidocs-2.1/overview-summary.html; different batch id (null)
Skipping http://nutch.apache.org/bot.html; different batch id (null)
Skipping http://nutch.apache.org/bot.pdf; different batch id (null)
Skipping http://nutch.apache.org/credits.html; different batch id (null)
Skipping http://nutch.apache.org/credits.pdf; different batch id (null)
Parsing http://nutch.apache.org/faq.html
Parsing http://nutch.apache.org/index.html
Skipping http://nutch.apache.org/index.pdf; different batch id (null)
Parsing http://nutch.apache.org/issue_tracking.html
Parsing http://nutch.apache.org/mailing_lists.html
Parsing http://nutch.apache.org/nightly.html
Skipping http://nutch.apache.org/old_downloads.html; different batch id (null)
Skipping http://nutch.apache.org/sonar.html; different batch id (null)
Skipping http://nutch.apache.org/tutorial.html; different batch id (null)
Skipping http://nutch.apache.org/version_control.html; different batch id (null)
Skipping http://nutch.apache.org/wiki.html; different batch id (null)
Exception in thread "main" java.lang.NullPointerException
at java.util.Hashtable.put(Hashtable.java:411)
at java.util.Properties.setProperty(Properties.java:160)
at org.apache.hadoop.conf.Configuration.set(Configuration.java:438)
at org.apache.nutch.indexer.IndexerJob.createIndexJob(IndexerJob.java:128)
at org.apache.nutch.indexer.solr.SolrIndexerJob.run(SolrIndexerJob.java:44)
at org.apache.nutch.crawl.Crawler.runTool(Crawler.java:68)
at org.apache.nutch.crawl.Crawler.run(Crawler.java:192)
at org.apache.nutch.crawl.Crawler.run(Crawler.java:250)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.Crawler.main(Crawler.java:257)
==========================================
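For what it's worth, the top three frames of the trace look like what happens when a null value reaches java.util.Properties.setProperty, which does not accept null values. The sketch below reproduces only that behaviour in isolation. The class name NullPropertyDemo, the method setThrowsNpe and the key "example.key" are made up for illustration, and the sketch does not show which setting was null at IndexerJob.java:128.

```java
import java.util.Properties;

public class NullPropertyDemo {

    // Hypothetical helper: returns true if storing the given value in a
    // fresh Properties object throws a NullPointerException.
    static boolean setThrowsNpe(String key, String value) {
        Properties props = new Properties();
        try {
            props.setProperty(key, value);
            return false;
        } catch (NullPointerException e) {
            // Properties is backed by Hashtable semantics, which reject null values.
            return true;
        }
    }

    public static void main(String[] args) {
        // A normal string value is stored without complaint.
        System.out.println(setThrowsNpe("example.key", "some-value"));
        // A null value triggers the same exception type seen in the trace.
        System.out.println(setThrowsNpe("example.key", null));
    }
}
```

If the same thing is happening here, some value that the indexing step copies into the Hadoop configuration was never set during the crawl.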
I'm using HBase 0.90.6 because the latest release didn't work for me. For
the same reason, I'm using Solr 3.6.1 instead of Solr 4.0.
I was wondering which versions of Nutch, HBase and Solr other users who
have gotten Nutch to work are using. I'm getting the feeling that only
the right combination of versions of all the parts works.
cocofan