Hello people,
I was trying to do a manual crawl as described in the Nutch tutorial
on http://wiki.apache.org/nutch/NutchTutorial
First of all: if I do a crawl with the same seed URLs using the "nutch
crawl" command, everything works fine.
Here's what I was trying to do:
1.) Trying to create a new crawlDB with:
./nutch inject crawl/crawldb seedUrls
The directory crawl was empty, and the directory seedUrls contains
one file, "urls", with this content:
http://www.uni-kassel.de
http://portal.uni-kassel.de
http://www.asta-kassel.de
http://www.uni-kassel.de/fb16
http://www.cs.uni-kassel.de
http://www.studentenwerk-kassel.de
The command runs without any error:
./nutch inject crawl/crawldb seedUrls
Injector: starting
Injector: crawlDb: crawl/crawldb
Injector: urlDir: seedUrls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
After that, a new directory named crawldb exists in crawl/.
2.) Trying to generate new segments:
./nutch generate crawl/crawldb/ crawl/segments -noFilter
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: filtering: false
Generator: normalizing: true
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
So I am wondering why the generator does not create segments. It says
that 0 records were selected for fetching. It seems to me that the
injector didn't actually inject the URLs into the db.
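One thing I am not sure about: the "nutch crawl" command uses
conf/crawl-urlfilter.txt, while the individual tools read
conf/regex-urlfilter.txt — could the latter be rejecting my seeds? (As
far as I know, -noFilter only affects generate; the injector still
applies the filters.) The tutorial's crawl-urlfilter.txt contains a
domain-restricting line like the one below, where MY.DOMAIN.NAME is the
tutorial's placeholder, not something from my setup:

```
# From the tutorial's conf/crawl-urlfilter.txt: accept only URLs in one domain.
# The per-step tools read conf/regex-urlfilter.txt instead, which — as far
# as I can tell — ends with a catch-all "+." that accepts everything.
+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
```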
When I run:
./nutch readdb crawl/crawldb/ -stats
It outputs:
CrawlDb statistics start: crawl/crawldb/
Statistics for CrawlDb: crawl/crawldb/
Exception in thread "main" java.lang.NullPointerException
at
org.apache.nutch.crawl.CrawlDbReader.processStatJob(CrawlDbReader.java:352)
at
org.apache.nutch.crawl.CrawlDbReader.main(CrawlDbReader.java:502)
Does anybody have an idea what I am doing wrong?
Is there any way to get more verbose output / logging from these
commands?
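I assume conf/log4j.properties controls this and that the details end up
in logs/hadoop.log, so perhaps something like the snippet below — but I
am guessing that DEBUG is the right level to set there:

```
# conf/log4j.properties — raising the Nutch and Hadoop loggers to DEBUG
# (assumption: the shipped default is INFO); output goes to logs/hadoop.log
log4j.logger.org.apache.nutch=DEBUG
log4j.logger.org.apache.hadoop=DEBUG
```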