I'm new to Nutch. I've been working through the tutorials
(Nutch2tutorial and the older ones), but I get an error when I try
to run a crawl:
==========================================
^Ccocofan@cocofan-notebook-PC:~/Dropbox/project/apache-nutch-2.1/runtime/local$
bin/nutch crawl urls olr http://localhost:8983/solr/ -depth 3 -topN 5
FetcherJob: threads: 10
FetcherJob: parsing: false
FetcherJob: resuming: false
FetcherJob : timelimit set for : -1
Using queue mode : byHost
Fetcher: threads: 10
QueueFeeder finished: total 1 records. Hit by time limit :0
fetching http://nutch.apache.org/
-finishing thread FetcherThread2, activeThreads=5
-finishing thread FetcherThread1, activeThreads=4
-finishing thread FetcherThread4, activeThreads=3
-finishing thread FetcherThread6, activeThreads=1
-finishing thread FetcherThread5, activeThreads=2
-finishing thread FetcherThread8, activeThreads=1
Fetcher: throughput threshold: -1
Fetcher: throughput threshold sequence: 5
-finishing thread FetcherThread3, activeThreads=1
-finishing thread FetcherThread7, activeThreads=1
-finishing thread FetcherThread9, activeThreads=1
-finishing thread FetcherThread0, activeThreads=0
0/0 spinwaiting/active, 1 pages, 0 errors, 0.2 0.2 pages/s, 51 51 kb/s, 0 URLs in 0 queues
-activeThreads=0
ParserJob: resuming: false
ParserJob: forced reparse: false
ParserJob: parsing all
Parsing http://nutch.apache.org/
FetcherJob: threads: 10
FetcherJob: parsing: false
FetcherJob: resuming: false
FetcherJob : timelimit set for : -1
Using queue mode : byHost
Fetcher: threads: 10
Fetcher: throughput threshold: -1
Fetcher: throughput threshold sequence: 5
fetching http://nutch.apache.org/about.html
QueueFeeder finished: total 5 records. Hit by time limit :0
10/10 spinwaiting/active, 1 pages, 0 errors, 0.2 0.2 pages/s, 16 16 kb/s, 4 URLs in 1 queues
* queue: http://nutch.apache.org
maxThreads = 1
inProgress = 0
crawlDelay = 5000
minCrawlDelay = 0
nextFetchTime = 1351824493524
now = 1351824493242
0. http://nutch.apache.org/credits.html
1. http://nutch.apache.org/apidocs-2.1/index.html
2. http://nutch.apache.org/bot.html
3. http://nutch.apache.org/apidocs-1.5/index.html
fetching http://nutch.apache.org/credits.html
10/10 spinwaiting/active, 2 pages, 0 errors, 0.2 0.2 pages/s, 18 19 kb/s, 3 URLs in 1 queues
* queue: http://nutch.apache.org
maxThreads = 1
inProgress = 0
crawlDelay = 5000
minCrawlDelay = 0
nextFetchTime = 1351824498764
now = 1351824498244
0. http://nutch.apache.org/apidocs-2.1/index.html
1. http://nutch.apache.org/bot.html
2. http://nutch.apache.org/apidocs-1.5/index.html
fetching http://nutch.apache.org/apidocs-2.1/index.html
10/10 spinwaiting/active, 3 pages, 0 errors, 0.2 0.2 pages/s, 12 2 kb/s, 2 URLs in 1 queues
* queue: http://nutch.apache.org
maxThreads = 1
inProgress = 0
crawlDelay = 5000
minCrawlDelay = 0
nextFetchTime = 1351824503930
now = 1351824503246
0. http://nutch.apache.org/bot.html
1. http://nutch.apache.org/apidocs-1.5/index.html
fetching http://nutch.apache.org/bot.html
10/10 spinwaiting/active, 4 pages, 0 errors, 0.2 0.2 pages/s, 14 18 kb/s, 1 URLs in 1 queues
* queue: http://nutch.apache.org
maxThreads = 1
inProgress = 0
crawlDelay = 5000
minCrawlDelay = 0
nextFetchTime = 1351824509093
now = 1351824508247
0. http://nutch.apache.org/apidocs-1.5/index.html
fetching http://nutch.apache.org/apidocs-1.5/index.html
-finishing thread FetcherThread5, activeThreads=9
-finishing thread FetcherThread3, activeThreads=8
-finishing thread FetcherThread2, activeThreads=7
-finishing thread FetcherThread8, activeThreads=3
-finishing thread FetcherThread7, activeThreads=6
-finishing thread FetcherThread0, activeThreads=1
-finishing thread FetcherThread4, activeThreads=2
-finishing thread FetcherThread6, activeThreads=4
-finishing thread FetcherThread1, activeThreads=5
-finishing thread FetcherThread9, activeThreads=0
0/0 spinwaiting/active, 5 pages, 0 errors, 0.2 0.2 pages/s, 11 2 kb/s, 0 URLs in 0 queues
-activeThreads=0
ParserJob: resuming: false
ParserJob: forced reparse: false
ParserJob: parsing all
Skipping http://nutch.apache.org/; different batch id (null)
Parsing http://nutch.apache.org/about.html
Parsing http://nutch.apache.org/apidocs-1.5/index.html
Parsing http://nutch.apache.org/apidocs-2.1/index.html
Parsing http://nutch.apache.org/bot.html
Parsing http://nutch.apache.org/credits.html
Skipping http://nutch.apache.org/faq.html; different batch id (null)
Skipping http://nutch.apache.org/index.html; different batch id (null)
Skipping http://nutch.apache.org/index.pdf; different batch id (null)
Skipping http://nutch.apache.org/issue_tracking.html; different batch id (null)
Skipping http://nutch.apache.org/mailing_lists.html; different batch id (null)
Skipping http://nutch.apache.org/nightly.html; different batch id (null)
Skipping http://nutch.apache.org/old_downloads.html; different batch id (null)
Skipping http://nutch.apache.org/sonar.html; different batch id (null)
Skipping http://nutch.apache.org/tutorial.html; different batch id (null)
Skipping http://nutch.apache.org/version_control.html; different batch id (null)
Skipping http://nutch.apache.org/wiki.html; different batch id (null)
FetcherJob: threads: 10
FetcherJob: parsing: false
FetcherJob: resuming: false
FetcherJob : timelimit set for : -1
Using queue mode : byHost
Fetcher: threads: 10
Fetcher: throughput threshold: -1
Fetcher: throughput threshold sequence: 5
fetching http://nutch.apache.org/nightly.html
QueueFeeder finished: total 5 records. Hit by time limit :0
10/10 spinwaiting/active, 1 pages, 0 errors, 0.2 0.2 pages/s, 15 15 kb/s, 4 URLs in 1 queues
* queue: http://nutch.apache.org
maxThreads = 1
inProgress = 0
crawlDelay = 5000
minCrawlDelay = 0
nextFetchTime = 1351824544459
now = 1351824543944
0. http://nutch.apache.org/mailing_lists.html
1. http://nutch.apache.org/faq.html
2. http://nutch.apache.org/index.html
3. http://nutch.apache.org/issue_tracking.html
fetching http://nutch.apache.org/mailing_lists.html
10/10 spinwaiting/active, 2 pages, 0 errors, 0.2 0.2 pages/s, 18 21 kb/s, 3 URLs in 1 queues
* queue: http://nutch.apache.org
maxThreads = 1
inProgress = 0
crawlDelay = 5000
minCrawlDelay = 0
nextFetchTime = 1351824550073
now = 1351824548946
0. http://nutch.apache.org/faq.html
1. http://nutch.apache.org/index.html
2. http://nutch.apache.org/issue_tracking.html
fetching http://nutch.apache.org/faq.html
10/10 spinwaiting/active, 3 pages, 0 errors, 0.2 0.2 pages/s, 17 15 kb/s, 2 URLs in 1 queues
* queue: http://nutch.apache.org
maxThreads = 1
inProgress = 0
crawlDelay = 5000
minCrawlDelay = 0
nextFetchTime = 1351824555568
now = 1351824553948
0. http://nutch.apache.org/index.html
1. http://nutch.apache.org/issue_tracking.html
fetching http://nutch.apache.org/index.html
10/10 spinwaiting/active, 4 pages, 0 errors, 0.2 0.2 pages/s, 25 51 kb/s, 1 URLs in 1 queues
* queue: http://nutch.apache.org
maxThreads = 1
inProgress = 0
crawlDelay = 5000
minCrawlDelay = 0
nextFetchTime = 1351824561359
now = 1351824558949
0. http://nutch.apache.org/issue_tracking.html
fetching http://nutch.apache.org/issue_tracking.html
-finishing thread FetcherThread4, activeThreads=8
-finishing thread FetcherThread8, activeThreads=8
-finishing thread FetcherThread3, activeThreads=7
-finishing thread FetcherThread5, activeThreads=6
-finishing thread FetcherThread0, activeThreads=5
-finishing thread FetcherThread7, activeThreads=4
-finishing thread FetcherThread2, activeThreads=3
-finishing thread FetcherThread6, activeThreads=2
-finishing thread FetcherThread1, activeThreads=1
-finishing thread FetcherThread9, activeThreads=0
0/0 spinwaiting/active, 5 pages, 0 errors, 0.2 0.2 pages/s, 23 15 kb/s, 0 URLs in 0 queues
-activeThreads=0
ParserJob: resuming: false
ParserJob: forced reparse: false
ParserJob: parsing all
Skipping http://nutch.apache.org/; different batch id (null)
Skipping http://nutch.apache.org/about.html; different batch id (null)
Skipping http://nutch.apache.org/about.pdf; different batch id (null)
Skipping http://nutch.apache.org/apidocs-1.5/allclasses-frame.html; different batch id (null)
Skipping http://nutch.apache.org/apidocs-1.5/index.html; different batch id (null)
Skipping http://nutch.apache.org/apidocs-1.5/overview-frame.html; different batch id (null)
Skipping http://nutch.apache.org/apidocs-1.5/overview-summary.html; different batch id (null)
Skipping http://nutch.apache.org/apidocs-2.1/allclasses-frame.html; different batch id (null)
Skipping http://nutch.apache.org/apidocs-2.1/index.html; different batch id (null)
Skipping http://nutch.apache.org/apidocs-2.1/overview-frame.html; different batch id (null)
Skipping http://nutch.apache.org/apidocs-2.1/overview-summary.html; different batch id (null)
Skipping http://nutch.apache.org/bot.html; different batch id (null)
Skipping http://nutch.apache.org/bot.pdf; different batch id (null)
Skipping http://nutch.apache.org/credits.html; different batch id (null)
Skipping http://nutch.apache.org/credits.pdf; different batch id (null)
Parsing http://nutch.apache.org/faq.html
Parsing http://nutch.apache.org/index.html
Skipping http://nutch.apache.org/index.pdf; different batch id (null)
Parsing http://nutch.apache.org/issue_tracking.html
Parsing http://nutch.apache.org/mailing_lists.html
Parsing http://nutch.apache.org/nightly.html
Skipping http://nutch.apache.org/old_downloads.html; different batch id (null)
Skipping http://nutch.apache.org/sonar.html; different batch id (null)
Skipping http://nutch.apache.org/tutorial.html; different batch id (null)
Skipping http://nutch.apache.org/version_control.html; different batch id (null)
Skipping http://nutch.apache.org/wiki.html; different batch id (null)
Exception in thread "main" java.lang.NullPointerException
at java.util.Hashtable.put(Hashtable.java:411)
at java.util.Properties.setProperty(Properties.java:160)
at org.apache.hadoop.conf.Configuration.set(Configuration.java:438)
at org.apache.nutch.indexer.IndexerJob.createIndexJob(IndexerJob.java:128)
at org.apache.nutch.indexer.solr.SolrIndexerJob.run(SolrIndexerJob.java:44)
at org.apache.nutch.crawl.Crawler.runTool(Crawler.java:68)
at org.apache.nutch.crawl.Crawler.run(Crawler.java:192)
at org.apache.nutch.crawl.Crawler.run(Crawler.java:250)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.Crawler.main(Crawler.java:257)
====================================
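As far as I can tell, the top three frames of the trace only say that a null
key or value reached Properties.setProperty, since Hashtable.put rejects
both. Here is a minimal sketch of that behaviour in plain Java. It is not
Nutch code: the class name NullPropertyDemo and the key "example.key" are made
up for illustration, and it does not show which Nutch setting was actually
null in my run.

```java
import java.util.Properties;

public class NullPropertyDemo {

    // Returns true if storing the given key/value pair in a Properties
    // object throws a NullPointerException, false if it is stored normally.
    static boolean throwsNpe(String key, String value) {
        Properties props = new Properties();
        try {
            props.setProperty(key, value);
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // A normal string value is accepted.
        System.out.println(throwsNpe("example.key", "http://localhost:8983/solr/"));
        // A null value is rejected with a NullPointerException.
        System.out.println(throwsNpe("example.key", null));
    }
}
```

So my guess is that some setting the indexer expects was never filled in,
but I can't tell from the trace alone which one.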
I'm using HBase 0.90.6 because the latest version didn't work for me.
For the same reason, I'm using Solr 3.6.1 instead of Solr 4.0.
Which versions of Nutch, HBase and Solr are other users who have
gotten Nutch to work running? I'm getting the feeling that only
certain version combinations of all the parts work.
cocofan