But all the old segment data is still sitting there in HDFS.

On Friday, March 23, 2018, 1:34:21 PM PDT, Sebastian Nagel wrote:

Hi Michael,

When segments are merged, only the most recent record for each URL is kept.
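
To illustrate that rule, here is a minimal Python sketch of "keep the newest record per URL". It is a toy model, not Nutch code: the `Record` tuple, the segment names, and the helper functions are all hypothetical, and real segment data would have to be read from the segments themselves. Under those assumptions, a segment in which every record has been superseded by a newer fetch would contribute nothing to a merge:

```python
from collections import namedtuple

# Hypothetical, simplified stand-in for a fetched record; not a Nutch class.
Record = namedtuple("Record", ["url", "fetch_time", "segment"])


def latest_per_url(records):
    """Return a dict mapping each URL to its most recent record."""
    latest = {}
    for rec in records:
        kept = latest.get(rec.url)
        if kept is None or rec.fetch_time > kept.fetch_time:
            latest[rec.url] = rec
    return latest


def obsolete_segments(records):
    """Return segments whose every record was superseded by a newer fetch."""
    all_segments = {rec.segment for rec in records}
    live_segments = {rec.segment for rec in latest_per_url(records).values()}
    return all_segments - live_segments


# Made-up sample data: both URLs in seg1 were re-fetched later.
records = [
    Record("http://example.com/a", 20180101, "seg1"),
    Record("http://example.com/b", 20180101, "seg1"),
    Record("http://example.com/a", 20180215, "seg2"),
    Record("http://example.com/b", 20180301, "seg3"),
]

print(sorted(obsolete_segments(records)))
```

In this sample, only `seg1` ends up with no surviving records, since both of its URLs have newer fetches in later segments.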

Sebastian


On 03/23/2018 09:25 PM, Michael Coffey wrote:
> Greetings Nutchlings,
> 
> How can I identify segments that are no longer useful, now that I have been 
> using AdaptiveFetchSchedule for several months?
> 
> I have db.fetch.interval.max = 31536000 (365 days), but I know that tons of 
> pages get re-fetched every 30-60 days because I have 
> db.fetch.interval.default at 60 days and 
> db.fetch.schedule.adaptive.min_interval at 30 days.
> 
> I have thousands of segments, with a total of about 48 million documents, so 
> I can't afford to inspect each one manually.
> 
> Can anyone suggest a strategy for this?
> 
> 
