Re: Elastic Block Store as checkpoint storage
Using EBS as checkpoint storage doesn't work in a distributed environment if you need to move state between TaskManagers (TMs), e.g., for rescaling and non-local recovery. You'd need something along the lines of read-write multi-attach, with the volumes set up in a smart way. That won't be easy to set up, and I'm not aware of anyone doing it.

Best,
D.

On Wed, Jul 19, 2023 at 11:10 AM Prabhu Joseph wrote:
> Thanks for sharing the information.
>
> I also observed the same: S3 (primary checkpoint storage) + EBS (task-local
> recovery) performs better than EBS as primary checkpoint storage.
Re: Elastic Block Store as checkpoint storage
Thanks for sharing the information.

I also observed the same: S3 (primary checkpoint storage) + EBS (task-local recovery) performs better than EBS as primary checkpoint storage.

On Tue, Jul 18, 2023 at 12:21 PM Konstantin Knauf wrote:
> this should be possible, but it is quite expensive compared to AWS S3, and
> you have to remount the EBS volumes to the new TaskManagers after a failure.
> Remounting takes a non-trivial amount of time, which slows down recovery.
> So, overall, I don't think it's preferable to S3.
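For readers who want to try the combination described in this message (S3 as primary checkpoint storage, an EBS-backed path for task-local recovery), here is a minimal sketch that prints the relevant Flink settings. The bucket name and mount point are hypothetical placeholders, not values from this thread, and the option names are the ones I believe Flink 1.17 uses; please check them against the documentation for your Flink version.

```python
# Sketch only: renders flink-conf.yaml style lines for
# "S3 as primary checkpoint storage + EBS for task-local recovery".
# S3_BASE and EBS_MOUNT are hypothetical placeholders.
S3_BASE = "s3://my-flink-bucket"
EBS_MOUNT = "/mnt/ebs"


def checkpoint_settings(s3_base, ebs_mount):
    """Return the settings as an ordered dict of option name to value."""
    return {
        # Durable checkpoints go to S3, reachable from every TaskManager.
        "state.checkpoints.dir": f"{s3_base}/checkpoints",
        # Keep a secondary local copy of the state for faster recovery.
        "state.backend.local-recovery": "true",
        # Put that local copy on the EBS-backed mount of each TaskManager.
        "taskmanager.state.local.root-dirs": f"{ebs_mount}/local-recovery",
    }


def render(settings):
    """Format the settings as 'key: value' lines."""
    return "\n".join(f"{key}: {value}" for key, value in settings.items())


if __name__ == "__main__":
    print(render(checkpoint_settings(S3_BASE, EBS_MOUNT)))
```

With this layout the EBS volume never has to be shared or moved between TaskManagers: if the local copy is unavailable after a failure, recovery falls back to the checkpoint on S3.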
Re: Elastic Block Store as checkpoint storage
Hi Prabhu,

this should be possible, but it is quite expensive compared to AWS S3, and you have to remount the EBS volumes to the new TaskManagers after a failure. Remounting takes a non-trivial amount of time, which slows down recovery. So, overall, I don't think it's preferable to S3.

We do use EBS volumes, though, for the local RocksDB working directory. For now, we don't remount them on failure because of the additional latency that would introduce.

Cheers,

Konstantin

On Wed, Jul 12, 2023 at 6:55 PM Prabhu Joseph <prabhujose.ga...@gmail.com> wrote:
> Hi,
>
> We are investigating the feasibility of setting up an Elastic Block Store
> (EBS) as checkpoint storage by mounting a volume (a shared local file
> system path) to JobManager and all the TaskManager pods. I want to hear any
> feedback on this approach if anyone has already tried it.
>
> Thanks,
> Prabhu Joseph

--
https://twitter.com/snntrable
https://github.com/knaufk
Elastic Block Store as checkpoint storage
Hi,

We are investigating the feasibility of using an Elastic Block Store (EBS) volume as checkpoint storage by mounting it (as a shared local file system path) on the JobManager and all the TaskManager pods. I'd like to hear feedback on this approach from anyone who has already tried it.

Thanks,
Prabhu Joseph