Re: Nutch not crawling all pages

2019-10-30 Thread Dave Beckstrom
not having ambiguous URLs, redirects or 404s or otherwise bogus entries. > > Markus > > > -Original message- > > From:Bruno Osiek > > Sent: Wednesday 30th October 2019 23:51 > > To: user@nutch.apache.org > > Subject: Re: Nutch not crawling all pages >

RE: Nutch not crawling all pages

2019-10-30 Thread Markus Jelsma
entries. Markus -Original message- > From: Bruno Osiek > Sent: Wednesday 30th October 2019 23:51 > To: user@nutch.apache.org > Subject: Re: Nutch not crawling all pages > > What is the output of the inject command, i.e., when you inject the 5 > seeds just before

Re: Nutch not crawling all pages

2019-10-30 Thread Bruno Osiek
What is the output of the inject command, i.e., when you inject the 5 seeds just before generating the first segment? On Wed, Oct 30, 2019 at 3:18 PM Dave Beckstrom wrote: > Hi Markus, > > Thank you so much for the reply and the help! The seed URL list is > generated from a CMS. I'm

Re: Nutch not crawling all pages

2019-10-30 Thread Dave Beckstrom
Hi Markus, Thank you so much for the reply and the help! The seed URL list is generated from a CMS. I'm doubtful that many of the URLs would be for redirects or missing pages, as the CMS only writes out the URLs for valid pages. It's got me stumped! Here is the result of the readdb. Not sure

RE: Nutch not crawling all pages

2019-10-30 Thread Markus Jelsma
Hello Dave, First, you should check the CrawlDB using readdb -stats. My bet is that your set contains some redirects and gone (404) pages, or transient errors. The counts for fetched and notModified, added together, should be about the same as the number of documents indexed. Regards, Markus
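
The check Markus describes can be sketched roughly as follows. This is only an illustration: it assumes the stats output contains lines of the form `status N (db_xxx): count`, and the sample text and its numbers are made up here, not taken from Dave's readdb result. The function names are hypothetical helpers, not part of Nutch.

```python
import re

# Hypothetical sample of `readdb -stats` output. The numbers are invented
# for illustration and do not come from the thread.
SAMPLE_STATS = """\
TOTAL urls:\t100
status 1 (db_unfetched):\t10
status 2 (db_fetched):\t70
status 3 (db_gone):\t8
status 4 (db_redir_temp):\t2
status 5 (db_redir_perm):\t5
status 6 (db_notmodified):\t5
"""

# Matches e.g. "status 2 (db_fetched):   70", with or without a log prefix.
STATUS_LINE = re.compile(r"status\s+\d+\s+\((\w+)\):\s*(\d+)")


def parse_status_counts(stats_text):
    """Return a dict mapping status name (e.g. 'db_fetched') to its count."""
    counts = {}
    for line in stats_text.splitlines():
        match = STATUS_LINE.search(line)
        if match:
            counts[match.group(1)] = int(match.group(2))
    return counts


def expected_indexed(counts):
    """Fetched plus notModified: roughly how many documents should be indexed."""
    return counts.get("db_fetched", 0) + counts.get("db_notmodified", 0)


if __name__ == "__main__":
    counts = parse_status_counts(SAMPLE_STATS)
    print("expected indexed (approx.):", expected_indexed(counts))
    # Anything counted here will not show up in the index.
    for name in ("db_unfetched", "db_gone", "db_redir_temp", "db_redir_perm"):
        print(name, counts.get(name, 0))
```

If the sum is close to the number of documents in the index, the seeds that appear to be missing are most likely sitting in the gone, redirect, or unfetched buckets rather than being lost by the indexer.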