Hi Srini,

> mark a page as DB_GONE if the server is busy and the page cannot be
> fetched for 3 consecutive times within a few minutes?

It may only happen ... for 3 consecutive times within 3 DAYS.
The time for the next retry is set by the scheduler in
  setPageRetrySchedule(...)
The default is to retry next after 24 hours.

> Do we consider temporary redirects also as DB_GONE ?

No. They become DB_REDIR_TEMP.

You should also try to fetch the page with Nutch, e.g. using
parsechecker or indexchecker, or make curl use exactly the same
request headers (agent name, etc.)

Best,
Sebastian

On 3/15/19 3:56 AM, Srinivasan Ramaswamy wrote:
> Hi Sebastian
>
> Is it possible for nutch to mark a page as DB_GONE if the server is busy and
> the page cannot be fetched for 3 consecutive times within a few minutes? I do
> see a bunch of cases where the page is marked as DB_GONE and I don't see any
> robots directive or 4xx or 301. It's a little puzzling though.
> Do we consider temporary redirects also as DB_GONE ?
>
> Yes, I am checking the crawlDB record. For the logs, I am checking logs from
> FetcherThread class. Is that what you are referring to?
>
> Thanks
> Srini
>
> On Thu, Mar 14, 2019 at 1:06 PM Sebastian Nagel <wastl.na...@googlemail.com> wrote:
>
> > remove from index, but later we found that some valid pages (when we curl
> > them we get 200) are also marked as DB_GONE.
>
> Also URLs forbidden in the robots.txt are marked as DB_GONE.
>
> Check the CrawlDb record and in doubt, also the logs.
>
> On 3/14/19 8:39 PM, Srinivasan Ramaswamy wrote:
> > Hi All
> >
> > Looks like DB_GONE flag is set for pages that are 404 or for pages where
> > fetch failed 3 or more times.
> >
> > We are looking for a way to detect pages that are truly 404 or 301, to
> > remove them from our index. Our initial plan was to use DB_GONE flag to
> > remove from index, but later we found that some valid pages (when we curl
> > them we get 200) are also marked as DB_GONE.
> >
> > Any suggestions would be appreciated.
> >
> > Thanks
> > Srini
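
For illustration, the retry behaviour described in Sebastian's reply (a failed
fetch is retried after 24 hours, and a page only becomes DB_GONE after 3
consecutive failures) can be sketched in Python. This is a simplified model
under stated assumptions, not Nutch code: PageRecord, record_failed_fetch,
simulate_failures and RETRY_MAX are hypothetical names, and the limit of 3 and
the 24-hour delay are simply the values mentioned in the thread.

```python
from dataclasses import dataclass

SECONDS_PER_DAY = 24 * 60 * 60
RETRY_MAX = 3  # assumed limit, taken from "3 consecutive times" in the thread


@dataclass
class PageRecord:
    """Hypothetical stand-in for a CrawlDb record (not the Nutch class)."""
    status: str = "DB_UNFETCHED"
    retries: int = 0
    next_fetch_time: int = 0  # seconds since an arbitrary start


def record_failed_fetch(page, now):
    """Apply one transient fetch failure observed at time `now` (seconds)."""
    page.retries += 1
    # The next attempt is scheduled 24 hours later (the default in the thread).
    page.next_fetch_time = now + SECONDS_PER_DAY
    # The page is only marked gone once the retry limit has been reached.
    page.status = "DB_GONE" if page.retries >= RETRY_MAX else "DB_UNFETCHED"
    return page


def simulate_failures(count):
    """Fail `count` fetches in a row; return (day, status) after each one."""
    page = PageRecord()
    now = 0
    history = []
    for _ in range(count):
        # A page is not refetched before its scheduled retry time.
        now = max(now, page.next_fetch_time)
        record_failed_fetch(page, now)
        history.append((now // SECONDS_PER_DAY, page.status))
    return history


if __name__ == "__main__":
    for day, status in simulate_failures(3):
        print(f"day {day}: {status}")
```

In this model the three failures land on days 0, 1 and 2, so the page cannot
reach DB_GONE within a few minutes; it takes about 3 days of consecutive
failures, which matches the explanation in the reply.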