Hi Jean,

On Mon, Jun 13, 2016 at 1:57 PM, <[email protected]> wrote:

> From: Jean Vence <[email protected]>
> To: [email protected]
> Cc:
> Date: Mon, 13 Jun 2016 21:57:30 +0100
> Subject: Nutch 2.3.1 with MongoDB not generating any URLs
>
> I have installed and successfully web crawled thousands of pages using
> Nutch 2.3.1 with MongoDB.
>
> But suddenly, the Nutch 2.3.1 Generator is not generating any URLs. Seed
> list URLs are accepted (InjectorJob: total number of urls injected
> after normalization and filtering: 3) and
> ./bin/nutch parsechecker -dumpText http://xxx.com shows hundreds of URLs
>
> Error as follows:
>
> GeneratorJob: starting at 2016-06-09 07:26:15
> GeneratorJob: Selecting best-scoring urls due for fetch.
> GeneratorJob: starting
> GeneratorJob: filtering: false
> GeneratorJob: normalizing: false
> GeneratorJob: topN: 50000
> GeneratorJob: finished at 2016-06-09 07:26:28, time elapsed: 00:00:13
> GeneratorJob: generated batch id: 1465471572-2463 containing 0 URLs
>
> What is interesting is that if I delete the webpage collection in the
> mongodb nutch database, then the crawler works fine so I'm assuming
> there's a record in the collection that is causing the issue. Can
> anyone recommend how to fix this problem? (tried deleting any record
> that doesn't have a status field but that did not help).
>

Can you please read the metadata of your records? This will indicate
whether any outlinks have been extracted and are suitable for fetch.

AFAIK, this is fixed in the Nutch 2.x branch. It would be very helpful if
you could please verify and get back to us here.

Thanks
Lewis
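
In case it helps with reading the records, here is a rough Python sketch for tallying why records might be passed over by the generator. It is only an illustration under assumptions: that the Gora MongoDB mapping stores the fetch time in milliseconds under `fetchTime` and the generate mark under `markers._gnmrk_` (please check both against your gora-mongodb-mapping.xml), and that, as far as I understand, the generator only picks records that are due and not already marked. The function name `summarize_pages` and the category labels are made up for this example.

```python
from collections import Counter


def summarize_pages(pages, now_ms):
    """Tally records by the likely reason they would (not) be generated.

    `pages` is any iterable of dicts shaped like webpage documents.
    Field names `fetchTime` and `markers._gnmrk_` are assumptions;
    verify them against your own Gora mapping file.
    """
    summary = Counter()
    for page in pages:
        markers = page.get("markers") or {}
        fetch_time = page.get("fetchTime")
        if "_gnmrk_" in markers:
            # Record still carries a generate mark from an earlier batch.
            summary["already_generated"] += 1
        elif fetch_time is None:
            # Record has no fetch time at all, possibly a malformed row.
            summary["no_fetch_time"] += 1
        elif fetch_time > now_ms:
            # Scheduled fetch time lies in the future.
            summary["not_yet_due"] += 1
        else:
            summary["due_for_fetch"] += 1
    return summary


# Possible usage with pymongo (not run here; database and collection
# names are taken from the message above and may differ on your setup):
#   import time
#   from pymongo import MongoClient
#   docs = MongoClient()["nutch"]["webpage"].find(
#       {}, {"fetchTime": 1, "markers": 1})
#   print(summarize_pages(docs, int(time.time() * 1000)))
```

If nearly everything lands in "already_generated" or "not_yet_due", that would at least be consistent with a batch of 0 URLs, though it does not by itself prove the cause.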

