Thank you very much for the help.
I checked ALL the files in nutch/conf for any further expressions that would exclude my URLs.
I found nothing like this.
In fact, as I mentioned before, the ./nutch crawl command works fine on exactly the same input data.

Once again: if I delete all entries in my crawl directory and then run:

./nutch crawl seedUrls/ -dir crawl -threads 30 -depth 10

   crawl started in: crawl
   rootUrlDir = seedUrls
   threads = 30
   depth = 10
   indexer=lucene
   Injector: starting
   Injector: crawlDb: crawl/crawldb
   Injector: urlDir: seedUrls
   Injector: Converting injected urls to crawl db entries.
   Injector: Merging injected urls into crawl db.
   Injector: done
   Generator: Selecting best-scoring urls due for fetch.
   Generator: starting
   Generator: filtering: true
   Generator: normalizing: true
   Generator: jobtracker is 'local', generating exactly one partition.
   Generator: Partitioning selected urls for politeness.
   Generator: segment: crawl/segments/20110518111323
   Generator: done.
   Fetcher: Your 'http.agent.name' value should be listed first in
   'http.robots.agents' property.
   Fetcher: starting
   Fetcher: segment: crawl/segments/20110518111323
   Fetcher: threads: 30
   QueueFeeder finished: total 10 records + hit by time limit :0
   fetching http://portal.uni-kassel.de/
   fetching http://www.studentenwerk-kassel.de/
   fetching http://www.asta-kassel.de/
   fetching http://www.uni-kassel.de/fb16
   fetching http://www.uni-kassel.de/
   fetching http://www.uni-kassel.de/uni/studium/
   fetching http://www.uni-kassel.de/uni/fachbereiche/
   fetching http://www.uni-kassel.de/uni/
   fetching http://www.uni-kassel.de/uni/forschung/
   fetching http://www.cs.uni-kassel.de/

But if I try it manually (after deleting the crawldb once again):

./nutch inject crawl/crawldb seedUrls/

   Injector: starting
   Injector: crawlDb: crawl/crawldb
   Injector: urlDir: seedUrls
   Injector: Converting injected urls to crawl db entries.
   Injector: Merging injected urls into crawl db.
   Injector: done

./nutch generate crawl/crawldb/ crawl/segments

   Generator: Selecting best-scoring urls due for fetch.
   Generator: starting
   Generator: filtering: true
   Generator: normalizing: true
   Generator: jobtracker is 'local', generating exactly one partition.
   Generator: 0 records selected for fetching, exiting ...


So my conclusion is that the crawl command does the URL injection in some other way? I just don't get why it works with the crawl command but doesn't work when injecting manually. Any further suggestions on where I could find my mistake would be great :-)
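For completeness, here is how I check what actually ended up in the crawldb after the manual inject (a sketch; the dump directory name is arbitrary, and the part file name assumes a local run):

   ./nutch readdb crawl/crawldb -dump crawldb_dump
   cat crawldb_dump/part-00000

If the seed URLs show up there with status db_unfetched, the inject step worked and the problem must be in the generate step.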


On 16.05.2011 16:14, Markus Jelsma wrote:
I see that too, and it shouldn't dump an exception if there's nothing in the
CrawlDB.
This is, however, not your problem, it seems. If you inject but there's nothing
in the CrawlDB, then you have some filters running that skip your seed URLs.
Check your domain filter settings or other URL filter settings, depending on
the plugins you defined.
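For illustration, the stock conf/regex-urlfilter.txt ships with rules like the following (Nutch 1.x defaults, your copy may differ). Any '-' rule that matches a seed URL before the final '+.' catch-all will silently drop it:

   # skip file:, ftp:, and mailto: urls
   -^(file|ftp|mailto):
   # skip URLs containing certain characters as probable queries, etc.
   -[?*!@=]
   # accept anything else
   +.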

On Monday 16 May 2011 15:56:26 Marek Bachmann wrote:
Hello people,

I was trying to do a manual crawl as described in the nutch tutorial
on http://wiki.apache.org/nutch/NutchTutorial

First of all: If I do a crawl, with the same seed urls, using the "nutch
crawl" command, everything works fine.

Here's what I was trying to do:

1.) Trying to create a new crawlDB with:

      ./nutch inject crawl/crawldb seedUrls

          The directory crawl was empty, and in the directory seedUrls there
is one file "urls" with this content:
              http://www.uni-kassel.de
              http://portal.uni-kassel.de
              http://www.asta-kassel.de
              http://www.uni-kassel.de/fb16
              http://www.cs.uni-kassel.de
              http://www.studentenwerk-kassel.de

      The command runs without any error:
      ./nutch inject crawl/crawldb seedUrls
      Injector: starting
      Injector: crawlDb: crawl/crawldb
      Injector: urlDir: seedUrls
      Injector: Converting injected urls to crawl db entries.
      Injector: Merging injected urls into crawl db.
      Injector: done

      After that a new directory with the name crawldb exists in crawl/

2.) Trying to generate new segments:

      ./nutch generate crawl/crawldb/ crawl/segments -noFilter
      Generator: Selecting best-scoring urls due for fetch.
      Generator: starting
      Generator: filtering: false
      Generator: normalizing: true
      Generator: jobtracker is 'local', generating exactly one partition.
      Generator: 0 records selected for fetching, exiting ...

So I am wondering why the generator does not create segments. It says
that it had 0 records selected for fetching. It seems to me that the
injector hadn't injected the URLs into the db.

When I run:
      ./nutch readdb crawl/crawldb/ -stats

It outputs:
      CrawlDb statistics start: crawl/crawldb/
      Statistics for CrawlDb: crawl/crawldb/
      Exception in thread "main" java.lang.NullPointerException
          at
org.apache.nutch.crawl.CrawlDbReader.processStatJob(CrawlDbReader.java:352)
          at
org.apache.nutch.crawl.CrawlDbReader.main(CrawlDbReader.java:502)

Does anybody have an idea what I am doing wrong?

Is there any possibility to get more verbose output / logging from the
commands?
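The only knob I have found so far is conf/log4j.properties; I assume (untested sketch, class names taken from the stock Nutch 1.x file) that raising the levels there makes the tools more talkative, with the full output going to logs/hadoop.log:

      log4j.logger.org.apache.nutch=DEBUG
      log4j.logger.org.apache.nutch.crawl.Injector=DEBUG,cmdstdout
      log4j.logger.org.apache.nutch.crawl.Generator=DEBUG,cmdstdout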
