That's really helpful, thanks Till!

On Thu, Apr 8, 2021 at 6:32 AM Till Rohrmann <trohrm...@apache.org> wrote:

> Hi Kevin,
>
> When decreasing the TaskManager count, I assume that you also decrease the
> parallelism of the Flink job. There are three aspects that can then cause
> a slower recovery.
>
> 1) Each Task is assigned a larger key range. Therefore, each TaskManager
> has to download more data in order to restart its Tasks. Moreover, fewer
> nodes are each downloading a larger portion of the data (less
> parallelization).
> 2) If you rescale the parallelism, a Task can be assigned a key range that
> requires downloading multiple key-range parts from the previous
> run/savepoint. The new key range might not need all the data in those
> parts, so some of the downloaded data is never actually used in the end
> (the sketch after this list illustrates both of these effects).
> 3) When rescaling the job, Flink has to rebuild the RocksDB instance,
> which is an expensive and slow operation. For every savepoint part that
> overlaps its key range, Flink opens a RocksDB instance and then extracts
> only the data relevant to its key range into a new RocksDB instance. This
> causes a lot of read and write amplification.
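>
> To make 1) and 2) concrete, here is a small self-contained sketch (plain
> Java, not Flink's internal classes, though the arithmetic mirrors
> KeyGroupRangeAssignment); the maxParallelism of 4096 is just an assumed
> example value:
>
> public class KeyGroupRangeDemo {
>
>     // Start/end (inclusive) of the key-group range owned by one subtask;
>     // mirrors computeKeyGroupRangeForOperatorIndex in Flink.
>     static int rangeStart(int maxParallelism, int parallelism, int index) {
>         return (index * maxParallelism + parallelism - 1) / parallelism;
>     }
>
>     static int rangeEnd(int maxParallelism, int parallelism, int index) {
>         return ((index + 1) * maxParallelism - 1) / parallelism;
>     }
>
>     public static void main(String[] args) {
>         int maxParallelism = 4096; // assumed value, for illustration only
>
>         // 1) The range one subtask owns grows as parallelism shrinks.
>         System.out.printf("p=156, subtask 0 owns key groups [%d, %d]%n",
>                 rangeStart(maxParallelism, 156, 0),
>                 rangeEnd(maxParallelism, 156, 0));   // [0, 26]
>         System.out.printf("p=30,  subtask 0 owns key groups [%d, %d]%n",
>                 rangeStart(maxParallelism, 30, 0),
>                 rangeEnd(maxParallelism, 30, 0));    // [0, 136]
>
>         // 2) Count how many old (p=156) ranges the new subtask's range
>         // overlaps; each overlap is another savepoint part to fetch.
>         int newStart = rangeStart(maxParallelism, 30, 0);
>         int newEnd = rangeEnd(maxParallelism, 30, 0);
>         int overlaps = 0;
>         for (int i = 0; i < 156; i++) {
>             if (rangeStart(maxParallelism, 156, i) <= newEnd
>                     && rangeEnd(maxParallelism, 156, i) >= newStart) {
>                 overlaps++;
>             }
>         }
>         System.out.println("old parts overlapping new range: " + overlaps); // 6
>     }
> }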
>
> Cheers,
> Till
>
> On Wed, Apr 7, 2021 at 4:07 PM Kevin Lam <kevin....@shopify.com> wrote:
>
>> Hi all,
>>
>> We are trying to benchmark savepoint size vs. restore time.
>>
>> One thing we've observed is that when we reduce the number of task
>> managers, the time to restore from a savepoint increases drastically:
>>
>> 1/ Restoring from a 9.7 TB savepoint onto 156 task managers takes 28
>> minutes
>> 2/ Restoring from the same savepoint onto 30 task managers takes over 3
>> hours
>>
>> *Is this expected? How does the restore process work? Is this just a
>> matter of having lower restore parallelism for 30 task managers vs 156
>> task managers?*
>>
>> Some details
>>
>> - Running on Kubernetes
>> - RocksDB with a local SSD for the state backend (a minimal config sketch
>> follows this list)
>> - Savepoint is hosted on GCS
>> - The smaller task manager case is important to us because we expect to
>> deploy our application with a high number of task managers, and downscale
>> once a backfill is completed
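>>
>> A minimal sketch of the relevant flink-conf.yaml entries (these are
>> standard Flink configuration keys; the local path and bucket name are
>> placeholders rather than our real values):
>>
>> state.backend: rocksdb
>> state.backend.rocksdb.localdir: /mnt/local-ssd/rocksdb
>> state.savepoints.dir: gs://<bucket>/savepoints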
>>
>> Differences between 1/ and 2/:
>>
>> 2/ has decreased task manager count 156 -> 30
>> 2/ has decreased operator parallelism by a factor of ~10
>> 2/ uses a striped SSD (3 SSDs mounted as a single logical volume) to hold
>> RocksDB files
>>
>> Thanks in advance for your help!
>>
>
