I have installed and successfully web crawled thousands of pages using
Nutch 2.3.1 with MongoDB.

But suddenly the Nutch 2.3.1 generator is no longer generating any URLs.
The seed list URLs are accepted (InjectorJob: total number of urls injected
after normalization and filtering: 3), and
./bin/nutch parsechecker -dumpText http://xxx.com shows hundreds of URLs.

The generator output is as follows:

GeneratorJob: starting at 2016-06-09 07:26:15
GeneratorJob: Selecting best-scoring urls due for fetch.
GeneratorJob: starting
GeneratorJob: filtering: false
GeneratorJob: normalizing: false
GeneratorJob: topN: 50000
GeneratorJob: finished at 2016-06-09 07:26:28, time elapsed: 00:00:13
GeneratorJob: generated batch id: 1465471572-2463 containing 0 URLs

What is interesting is that if I delete the webpage collection in the
MongoDB nutch database, the crawler works fine again, so I'm assuming
there's a record in the collection that is causing the issue. Can
anyone recommend how to fix this problem? (I tried deleting every record
that doesn't have a status field, but that did not help.)
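
To narrow down which records might be at fault, here is a minimal sketch of
the kind of check I have in mind, run over records exported from the webpage
collection and loaded as plain dictionaries. It rests on assumptions I have
not confirmed: that the default Gora MongoDB mapping is in use, that fetchTime
holds the next fetch time as a plain integer in epoch milliseconds, and that a
leftover generate mark is stored under markers._gnmrk_. The helper name
classify_records is hypothetical, not part of Nutch.

```python
import time

# Assumed field names (default gora-mongodb-mapping.xml); adjust if yours differ.
FETCH_TIME_FIELD = "fetchTime"
MARKERS_FIELD = "markers"
GENERATE_MARK = "_gnmrk_"


def classify_records(records, now_ms=None):
    """Group record ids by a likely reason the generator would skip them.

    records: iterable of dicts, one per document in the webpage collection.
    now_ms:  reference time in epoch milliseconds (defaults to the current time).
    """
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    report = {
        "already_marked": [],      # still carries a generate mark from an earlier batch
        "missing_fetch_time": [],  # no fetch time stored at all
        "not_due": [],             # next fetch time lies in the future
        "eligible": [],            # none of the above applies
    }
    for rec in records:
        rec_id = rec.get("_id")
        markers = rec.get(MARKERS_FIELD) or {}
        fetch_time = rec.get(FETCH_TIME_FIELD)
        if GENERATE_MARK in markers:
            report["already_marked"].append(rec_id)
        elif fetch_time is None:
            report["missing_fetch_time"].append(rec_id)
        elif fetch_time > now_ms:
            report["not_due"].append(rec_id)
        else:
            report["eligible"].append(rec_id)
    return report
```

If nearly every record lands in already_marked or not_due, that would at
least fit a batch containing 0 URLs, though it would not by itself show
which record triggered the problem.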

Many thanks,

Jean
