ors, e.g.
> not having ambiguous URLs, redirects or 404s or otherwise bogus entries.
>
> Markus
>
>
-Original message-
> From:Bruno Osiek
> Sent: Wednesday 30th October 2019 23:51
> To: user@nutch.apache.org
> Subject: Re: Nutch not crawling all pages
What is the output of the inject command, i.e., when you inject the 5
seeds just before generating the first segment?
On Wed, Oct 30, 2019 at 3:18 PM Dave Beckstrom wrote:
Hi Markus,
Thank you so much for the reply and the help! The seed URL list is
generated from a CMS. I'm doubtful that many of the URLs would be for
redirects or missing pages, as the CMS only writes out the URLs for valid
pages. It's got me stumped!
Here is the result of the readdb. Not sure w
Hello Dave,
First you should check the CrawlDB using readdb -stats. My bet is that your set
contains some redirects, gone pages (404), or transient errors. The counts for
fetched and notModified, added together, should be about the same as the number
of documents indexed.
Regards,
Markus
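
The check Markus describes can be sketched in a few lines of Python. This is only an illustration: it assumes the readdb -stats output reports each status on a line shaped like "status 2 (db_fetched): 145", and the helper names and the sample numbers below are made up, not taken from Dave's crawl.

```python
import re

# Assumed line layout: "status 2 (db_fetched):   145".
STATUS_LINE = re.compile(r"status\s+\d+\s+\((\w+)\):\s*(\d+)")


def parse_status_counts(stats_text):
    """Return a dict mapping each status name to its URL count."""
    counts = {}
    for line in stats_text.splitlines():
        match = STATUS_LINE.search(line)
        if match:
            counts[match.group(1)] = int(match.group(2))
    return counts


def expected_indexed(counts):
    """Add up fetched and notModified, the two statuses named above."""
    return counts.get("db_fetched", 0) + counts.get("db_notmodified", 0)


# Hypothetical sample text for illustration only.
SAMPLE = """\
TOTAL urls: 100
status 1 (db_unfetched): 10
status 2 (db_fetched): 70
status 3 (db_gone): 8
status 5 (db_redir_perm): 7
status 6 (db_notmodified): 5
"""

if __name__ == "__main__":
    counts = parse_status_counts(SAMPLE)
    print("fetched + notModified:", expected_indexed(counts))
    print("gone:", counts.get("db_gone", 0))
```

If the sum is close to the number of indexed documents, the remaining seeds are probably accounted for by the other statuses (gone, redirects, unfetched) rather than lost by the indexer.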