I see that too, and readdb shouldn't dump an exception when there's nothing in the CrawlDB. That, however, doesn't seem to be your actual problem. If you inject and the CrawlDB is still empty, then you have some filters running that skip your seed URLs. Check your domain filter settings or other URL filter settings, depending on the plugins you defined.
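
To illustrate how a filter can silently drop every seed, here is a minimal Python sketch of regex-style URL filter rules. It is not Nutch code. It assumes the usual convention for such rule files: a leading '+' accepts, a leading '-' rejects, the first matching rule decides, and a URL that matches no rule is dropped. The rule text, the example.org domain and the names parse_rules and accepts are all made up for illustration.

```python
import re

# Hypothetical rule set: the only accept rule targets a placeholder domain
# that was never adapted to the real seed hosts.
RULES = r"""
# skip file:, ftp: and mailto: URLs
-^(file|ftp|mailto):
# accept hosts under example.org only (assumed leftover placeholder)
+^http://([a-z0-9-]*\.)*example\.org
"""

SEEDS = [
    "http://www.uni-kassel.de",
    "http://portal.uni-kassel.de",
    "http://www.asta-kassel.de",
]


def parse_rules(text):
    """Turn '+regex' / '-regex' lines into (allow, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # ignore blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """First matching rule wins; a URL matching no rule is rejected."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False


if __name__ == "__main__":
    rules = parse_rules(RULES)
    for url in SEEDS:
        print(url, "->", "kept" if accepts(rules, url) else "skipped")
```

With rules like these every seed is skipped and the injector still finishes without an error, which would match the "0 records selected" symptom. Adding an accept rule for the real domain changes the outcome for the matching hosts only.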

On Monday 16 May 2011 15:56:26 Marek Bachmann wrote:
> Hello people,
>
> I was trying to do an manual crawl like described in the nutch tutorial
> on http://wiki.apache.org/nutch/NutchTutorial
>
> First of all: If I do a crawl, with the same seed urls, using the "nutch
> crawl" command, everything works fine.
>
> Here's what I was trying to do:
>
> 1.) Trying to create a new crawlDB with:
>
> ./nutch inject crawl/crawldb seedUrls
>
> The directory crawl was empty and in the directory seedUrls is
> one file "urls" with this content:
> http://www.uni-kassel.de
> http://portal.uni-kassel.de
> http://www.asta-kassel.de
> http://www.uni-kassel.de/fb16
> http://www.cs.uni-kassel.de
> http://www.studentenwerk-kassel.de
>
> The command runs without any error:
> ./nutch inject crawl/crawldb seedUrls
> Injector: starting
> Injector: crawlDb: crawl/crawldb
> Injector: urlDir: seedUrls
> Injector: Converting injected urls to crawl db entries.
> Injector: Merging injected urls into crawl db.
> Injector: done
>
> After that a new directory with the name crawldb exists in crawl/
>
> 2.) Trying to generate new segments:
>
> ./nutch generate crawl/crawldb/ crawl/segments -noFilter
> Generator: Selecting best-scoring urls due for fetch.
> Generator: starting
> Generator: filtering: false
> Generator: normalizing: true
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: 0 records selected for fetching, exiting ...
>
> So I am wondering why the generator does not create segements. It says
> that it had 0 records selected for fetching. It seems to me, that the
> injector hadn't injected the urls into the db.
>
> When I run:
> ./nutch readdb crawl/crawldb/ -stats
>
> It outputs:
> CrawlDb statistics start: crawl/crawldb/
> Statistics for CrawlDb: crawl/crawldb/
> Exception in thread "main" java.lang.NullPointerException
> at
> org.apache.nutch.crawl.CrawlDbReader.processStatJob(CrawlDbReader.java:352)
> at
> org.apache.nutch.crawl.CrawlDbReader.main(CrawlDbReader.java:502)
>
> Anybody has an idea what am I doing wrong?
>
> Is there any possibility to get more verbose output / logging from the
> commands?

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

