Re: how to find pages that are truly deleted/moved

2019-03-15 Thread Sebastian Nagel
Hi Srini,

> mark a page as DB_GONE if the server is busy and the page cannot be
> fetched for 3 consecutive times within a few minutes?

It may only happen
... for 3 consecutive times within 3 DAYS.
The time of the next retry is set by the scheduler in
 setPageRetrySchedule(...).
By default, the next retry happens 24 hours later.
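
To make the timing concrete, here is a simplified model of that retry
schedule in Python. It is not Nutch code: the function name `failure_times`
and its parameters are made up for illustration, and it assumes a fixed
24-hour retry interval and a limit of 3 consecutive failures, as described
above.

```python
from datetime import datetime, timedelta


def failure_times(first_failure, max_failures=3,
                  retry_interval=timedelta(hours=24)):
    """Return the earliest times at which consecutive failed fetches can occur.

    Simplified model (assumption): every failed fetch postpones the next
    attempt by exactly `retry_interval`, and the page is re-fetched as soon
    as it becomes due.
    """
    return [first_failure + i * retry_interval for i in range(max_failures)]


times = failure_times(datetime(2019, 3, 12, 9, 0))
for attempt, when in enumerate(times, start=1):
    print(f"failed attempt {attempt}: {when:%Y-%m-%d %H:%M}")

# Time between the first and the last failure in this model.
print("span:", times[-1] - times[0])
```

In this model the failures are spread over days, not minutes, so a server
that is busy for only a few minutes would not be enough to reach DB_GONE.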

> Do we consider temporary redirects also as DB_GONE ?

No. They become DB_REDIR_TEMP.
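
The outcomes mentioned so far in this thread can be written down as a small
lookup table. This is only a summary sketch in Python: the cause labels used
as dictionary keys are hypothetical names chosen for illustration, and the
table lists only what was said in this thread, not every status Nutch knows.

```python
# Causes and resulting CrawlDb statuses as described in this thread.
# The cause labels (dictionary keys) are hypothetical names for illustration.
STATUS_BY_CAUSE = {
    "http_404": "DB_GONE",
    "robots_txt_forbidden": "DB_GONE",
    "three_consecutive_fetch_failures": "DB_GONE",
    "temporary_redirect": "DB_REDIR_TEMP",
}


def possible_causes(status):
    """List every cause in the table that leads to the given status."""
    return sorted(cause for cause, s in STATUS_BY_CAUSE.items() if s == status)


print(possible_causes("DB_GONE"))
```

Several different causes end up as DB_GONE, which is why the status alone
does not tell you whether a page was really deleted.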


You should also try to fetch the page with Nutch itself,
e.g. using parsechecker or indexchecker, or make curl
use exactly the same request headers (agent name, etc.) as Nutch.
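
If you prefer a script over curl, the same check can be sketched in Python
with the standard library. This is a minimal sketch under assumptions: the
helper names are made up, the agent name must be replaced with the exact
string your crawler sends, and only the User-Agent header is set here. It
uses `http.client` so that redirects are not followed and a 301 stays
visible.

```python
import http.client
from urllib.parse import urlsplit


def build_headers(agent_name):
    """Request headers imitating the crawler.

    Only User-Agent is set; add any other headers your crawler sends.
    """
    return {"User-Agent": agent_name}


def raw_status(url, agent_name, timeout=10):
    """Fetch `url` once without following redirects.

    Returns (HTTP status code, Location header or None).
    """
    parts = urlsplit(url)
    conn_cls = (http.client.HTTPSConnection if parts.scheme == "https"
                else http.client.HTTPConnection)
    conn = conn_cls(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("GET", path, headers=build_headers(agent_name))
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    finally:
        conn.close()


def classify(status):
    """Coarse classification of a single HTTP status code."""
    if status == 404:
        return "deleted"
    if status == 301:
        return "moved"
    if status == 200:
        return "ok"
    return "other"  # e.g. 5xx or temporary redirects: do not treat as deleted


# Example call (performs a network request; URL and agent name are placeholders):
# status, location = raw_status("https://example.com/some/page", "MyCrawler")
# print(status, location, classify(status))
```

If this returns 200 with the crawler's agent name but the page is still
DB_GONE, the CrawlDb record and the fetcher logs are the next place to look.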

Best,
Sebastian


On 3/15/19 3:56 AM, Srinivasan Ramaswamy wrote:
> Hi Sebastian
> 
> Is it possible for Nutch to mark a page as DB_GONE if the server is busy and
> the page cannot be fetched for 3 consecutive times within a few minutes? I do
> see a bunch of cases where the page is marked as DB_GONE and I don't see any
> robots directive or 4xx or 301. It's a little puzzling though.
> Do we consider temporary redirects also as DB_GONE?
> 
> Yes, I am checking the CrawlDb record. For the logs, I am checking logs from
> the FetcherThread class.
> Is that what you are referring to?
> 
> Thanks
> Srini
> 
> On Thu, Mar 14, 2019 at 1:06 PM Sebastian Nagel wrote:
> 
> > remove from index, but later we found that some valid pages (when we curl
> > them we get 200) are also marked as DB_GONE.
> 
> URLs forbidden in robots.txt are also marked as DB_GONE.
> 
> Check the CrawlDb record and, if in doubt, also the logs.
> 
> On 3/14/19 8:39 PM, Srinivasan Ramaswamy wrote:
> > Hi All
> >
> > It looks like the DB_GONE flag is set for pages that return 404 or for
> > pages where the fetch failed 3 or more times.
> >
> > We are looking for a way to detect pages that are truly 404 or 301, to
> > remove them from our index. Our initial plan was to use the DB_GONE flag to
> > remove from index, but later we found that some valid pages (when we curl
> > them we get 200) are also marked as DB_GONE.
> >
> > Any suggestions would be appreciated.
> >
> > Thanks
> > Srini
> >
> 



how to find pages that are truly deleted/moved

2019-03-14 Thread Srinivasan Ramaswamy
Hi All

It looks like the DB_GONE flag is set for pages that return 404 or for
pages where the fetch failed 3 or more times.

We are looking for a way to detect pages that are truly 404 or 301, to
remove them from our index. Our initial plan was to use the DB_GONE flag to
remove from index, but later we found that some valid pages (when we curl
them we get 200) are also marked as DB_GONE.

Any suggestions would be appreciated.

Thanks
Srini