But all the old segment data is still sitting there in HDFS.

On Friday, March 23, 2018, 1:34:21 PM PDT, Sebastian Nagel wrote:

Hi Michael,

when segments are merged only the most recent record of one URL is kept.

Sebastian

On 03/23/2018 09:25 PM, Michael Coffey wrote:
> Greetings Nutchlings,
>
> How can I identify segments that are no longer useful, now that I have been
> using AdaptiveFetchSchedule for several months?
>
> I have db.fetch.interval.max = 31536000 (365 days), but I know that tons of
> pages get re-fetched every 30-60 days because I have
> db.fetch.interval.default at 60 days and
> db.fetch.schedule.adaptive.min_interval at 30 days.
>
> I have thousands of segments, with a total of about 48 million documents, so
> I can't afford to inspect each one manually.
>
> Can anyone suggest a strategy for this?

