Hi Srini,

> mark a page as DB_GONE if the server is busy and the page cannot be
> fetched for 3 consecutive time within few minutes?

That can only happen after 3 consecutive failed fetches within 3 days,
not within a few minutes.
The time of the next retry is set by the scheduler in
 setPageRetrySchedule(...)
The default is to retry again after 24 hours.
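
To make the timing concrete, here is a minimal Python sketch of that
retry logic. It is a simplified model, not Nutch's actual code: the
function name, the status strings and the limit of 3 are assumptions
taken from this thread, and the interval is the 24 hour default
mentioned above.

```python
from datetime import datetime, timedelta

RETRY_MAX = 3                          # assumed limit: "3 consecutive" failures
RETRY_INTERVAL = timedelta(hours=24)   # default retry delay mentioned above


def simulate(outcomes, start):
    """Replay fetch outcomes (True = success, False = transient failure).

    Returns a list of (attempt_time, status) pairs. For simplicity the
    same interval is used after a success, which a real scheduler
    would not do.
    """
    history = []
    retries = 0
    when = start
    for ok in outcomes:
        if ok:
            retries = 0                # a success resets the failure streak
            status = "DB_FETCHED"
        else:
            retries += 1
            status = "DB_GONE" if retries >= RETRY_MAX else "DB_UNFETCHED"
        history.append((when, status))
        if status == "DB_GONE":
            break                      # the streak reached the limit
        when += RETRY_INTERVAL         # next attempt one interval later
    return history


# Example: three failures in a row, first attempt on 2019-03-12 09:00.
for when, status in simulate([False, False, False], datetime(2019, 3, 12, 9, 0)):
    print(when.isoformat(), status)
```

In this model the third failure comes 48 hours after the first one, so
a busy server that recovers within minutes cannot produce DB_GONE.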

> Do we consider temporary redirects also as DB_GONE ?

No. They become DB_REDIR_TEMP.
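
As a rough summary of why DB_GONE alone does not identify a 404, the
causes named in this thread can be put into a small Python lookup.
This is only an illustration: the table and the helper name are
hypothetical, and the DB_REDIR_PERM entry is not stated in this thread.

```python
# Causes per CrawlDb status, as discussed in this thread.
CAUSES_BY_STATUS = {
    "DB_GONE": [
        "HTTP 404",
        "forbidden by robots.txt",
        "3 consecutive failed fetches",
    ],
    "DB_REDIR_TEMP": ["temporary redirect"],
    # Not stated in this thread: permanent redirects have their own status.
    "DB_REDIR_PERM": ["permanent redirect (e.g. 301)"],
}


def possible_causes(status):
    """Return the causes listed above for a status (empty list if unknown)."""
    return CAUSES_BY_STATUS.get(status, [])


def is_unambiguous(status):
    """True if the status maps to exactly one cause in the table above."""
    return len(possible_causes(status)) == 1
```

Since DB_GONE has several possible causes, the CrawlDb record and the
logs are still needed to tell a real 404 from the other cases.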


You should also try to fetch the page with Nutch itself,
e.g. using parsechecker or indexchecker, or make curl
send exactly the same request headers (agent name, etc.).
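
If a script is easier than curl, the same check can be sketched in
Python with the standard library. The URL, the agent string and the
extra headers below are placeholders (assumptions); use the values
your crawler really sends. Redirects are deliberately not followed, so
a 301 or 302 is reported as such instead of the final 200.

```python
import urllib.error
import urllib.request


class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Do not follow redirects, so 3xx codes stay visible."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # urllib then raises HTTPError for the 3xx response


def build_request(url, agent, extra_headers=None):
    """Build a GET request carrying the crawler's headers."""
    headers = {"User-Agent": agent}
    headers.update(extra_headers or {})
    return urllib.request.Request(url, headers=headers)


def fetch_status(request, timeout=10):
    """Send the request and return the HTTP status code."""
    opener = urllib.request.build_opener(NoRedirect)
    try:
        with opener.open(request, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # 3xx, 4xx and 5xx all end up here


# Usage (placeholder values, needs network access):
# req = build_request("https://example.com/page.html", "MyCrawler/1.0",
#                     {"Accept-Language": "en-us,en;q=0.5"})
# print(fetch_status(req))
```

A server may answer 200 to a browser-like curl call but something else
to the crawler's agent name, which is why the headers should match.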

Best,
Sebastian


On 3/15/19 3:56 AM, Srinivasan Ramaswamy wrote:
> Hi Sebastian
> 
> Is it possible for nutch to mark a page as DB_GONE if the server is busy and
> the page cannot be fetched for 3 consecutive time within few minutes? I do see
> a bunch of cases where the page is marked as DB_GONE and I don't see any robots
> directive or 4xx or 301. It's a little puzzling though.
> Do we consider temporary redirects also as DB_GONE ?
> 
> Yes, I am checking the crawlDB record. For the logs, I am checking logs from
> FetcherThread class.
> Is that what you are referring to?
> 
> Thanks
> Srini
> 
> On Thu, Mar 14, 2019 at 1:06 PM Sebastian Nagel <wastl.na...@googlemail.com
> <mailto:wastl.na...@googlemail.com>> wrote:
> 
>     > remove from index, but later we found that some valid pages (when we curl
>     > them we get 200) are also marked as DB_GONE.
> 
>     Also URLs forbidden in the robots.txt are marked as DB_GONE.
> 
>     Check the CrawlDb record and in doubt, also the logs.
> 
>     On 3/14/19 8:39 PM, Srinivasan Ramaswamy wrote:
>     > Hi All
>     >
>     > Looks like DB_GONE flag is set for pages that are 404 or for pages where
>     > fetch failed for 3 or more times.
>     >
>     > We are looking for a way to detect pages that are truly 404 or 301, to
>     > remove them from our index. Our initial plan was to use DB_GONE flag to
>     > remove from index, but later we found that some valid pages (when we curl
>     > them we get 200) are also marked as DB_GONE.
>     >
>     > Any suggestions would be appreciated.
>     >
>     > Thanks
>     > Srini
>     >
> 
