Re: Elastic Block Store as checkpoint storage

2023-07-20 Thread David Morávek
Using EBS as checkpoint storage doesn't work in a distributed environment
if you need to move state between TaskManagers (e.g., for rescaling or
non-local recovery). You'd need something along the lines of read-write
Multi-Attach, with the volumes set up in a smart way. That won't be easy to
set up, and I'm not aware of anyone doing it.

Best,
D.



Re: Elastic Block Store as checkpoint storage

2023-07-19 Thread Prabhu Joseph
Thanks for sharing the information.

I observed the same: S3 (primary checkpoint storage) + EBS (task-local
recovery) performs better than EBS as primary checkpoint storage.
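
For reference, a minimal sketch of what that combination might look like in
flink-conf.yaml. The bucket name and mount path are placeholders, not values
from this thread, and the option names should be checked against the
configuration docs for your Flink version. It also assumes one of the S3
filesystem plugins is enabled.

```yaml
# Primary checkpoint storage on S3 (bucket name is a placeholder).
state.checkpoints.dir: s3://my-bucket/flink/checkpoints

# Keep a secondary, local copy of the state for task-local recovery.
state.backend.local-recovery: true

# Store the local copies on the EBS mount (path is an assumption).
taskmanager.state.local.root-dirs: /mnt/ebs/flink-local-recovery
```

With this layout S3 remains the source of truth. The local copy on EBS only
helps when a task is rescheduled onto a TaskManager that still has it.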





Re: Elastic Block Store as checkpoint storage

2023-07-18 Thread Konstantin Knauf
Hi Prabhu,

This should be possible, but it is quite expensive compared to AWS S3, and
you have to remount the EBS volumes to the new TaskManagers in case of a
failure, which takes a non-trivial amount of time and slows down recovery.
So, overall, I don't think it's preferable to S3.

We do use EBS volumes, though, for the local RocksDB working directory.
Right now we don't remount them on failure, because of the additional
latency that would introduce.
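
Roughly, and only as a sketch, pointing the RocksDB working directory at an
EBS mount could look like this in flink-conf.yaml. The path is a hypothetical
mount point, not our actual setup, and the keys should be verified against
the docs for your Flink version.

```yaml
# Use the RocksDB state backend.
state.backend: rocksdb

# Keep RocksDB's local working files on the EBS mount (path is an assumption).
state.backend.rocksdb.localdir: /mnt/ebs/rocksdb
```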

Cheers,

Konstantin



-- 
https://twitter.com/snntrable
https://github.com/knaufk


Elastic Block Store as checkpoint storage

2023-07-12 Thread Prabhu Joseph
Hi,

We are investigating the feasibility of using Elastic Block Store (EBS) as
checkpoint storage by mounting a volume (as a shared local file system path)
on the JobManager and all the TaskManager pods. I would like to hear feedback
from anyone who has already tried this approach.


Thanks,
Prabhu Joseph