That's really helpful, thanks Till!

On Thu, Apr 8, 2021 at 6:32 AM Till Rohrmann <trohrm...@apache.org> wrote:
> Hi Kevin,
>
> When decreasing the TaskManager count, I assume that you also decrease the
> parallelism of the Flink job. There are three aspects which can then cause
> a slower recovery.
>
> 1) Each Task gets a larger key range assigned. Therefore, each TaskManager
> has to download more data in order to restart the Task. Moreover, there are
> fewer nodes downloading larger portions of the data (less parallelization).
> 2) If you rescaled the parallelism, then it can happen that a Task gets a
> key range assigned which requires downloading multiple key range parts
> from the previous run/savepoint. The new key range might not need all the
> data from those savepoint parts, and hence you download some data which is
> not really used in the end.
> 3) When rescaling the job, Flink has to rebuild the RocksDB instance, which
> is an expensive and slow operation. For every savepoint part that its key
> range needs, Flink creates a RocksDB instance and then extracts only the
> portion relevant for its key range into a new RocksDB instance. This causes
> a lot of read and write amplification.
>
> Cheers,
> Till
>
> On Wed, Apr 7, 2021 at 4:07 PM Kevin Lam <kevin....@shopify.com> wrote:
>
>> Hi all,
>>
>> We are trying to benchmark savepoint size vs. restore time.
>>
>> One thing we've observed is that when we reduce the number of task
>> managers, the time to restore from a savepoint increases drastically:
>>
>> 1/ Restoring from a 9.7 TB savepoint onto 156 task managers takes 28 minutes
>> 2/ Restoring from the same savepoint onto 30 task managers takes over 3
>> hours
>>
>> *Is this expected? How does the restore process work? Is this just a
>> matter of having lower restore parallelism for 30 task managers vs. 156
>> task managers?*
>>
>> Some details:
>>
>> - Running on Kubernetes
>> - Using RocksDB with a local SSD for the state backend
>> - Savepoint is hosted on GCS
>> - The smaller task manager case is important to us because we expect to
>> deploy our application with a high number of task managers, and downscale
>> once a backfill is completed
>>
>> Differences between 1/ and 2/:
>>
>> 2/ has decreased task manager count 156 -> 30
>> 2/ has decreased operator parallelism by a factor of ~10
>> 2/ uses a striped SSD (3 SSDs mounted as a single logical volume) to hold
>> the RocksDB files
>>
>> Thanks in advance for your help!
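
To make points 1) and 2) above more concrete, here is a small standalone sketch of how keyed state is split into key groups and assigned to operator subtasks. It is not Till's or Flink's code: the formula mirrors the contiguous-range assignment that Flink's org.apache.flink.runtime.state.KeyGroupRangeAssignment uses, and the maxParallelism of 4096, the class/method names, and the parallelism values are assumptions chosen only to illustrate the 156 -> 30 downscale discussed in the thread.

```java
// Sketch only: shows why fewer subtasks each own more key groups on restore,
// and why a rescaled subtask's range spans several old savepoint parts.
public class KeyGroupRescaleSketch {

    // Contiguous range of key groups [start, end] owned by one subtask.
    record KeyGroupRange(int start, int end) {
        int size() { return end - start + 1; }
        boolean overlaps(KeyGroupRange other) {
            return start <= other.end && other.start <= end;
        }
    }

    // Key groups assigned to subtask `index` out of `parallelism`, with
    // `maxParallelism` total key groups (fixed for the lifetime of the job).
    static KeyGroupRange rangeFor(int maxParallelism, int parallelism, int index) {
        int start = (index * maxParallelism + parallelism - 1) / parallelism;
        int end = ((index + 1) * maxParallelism - 1) / parallelism;
        return new KeyGroupRange(start, end);
    }

    public static void main(String[] args) {
        int maxParallelism = 4096;   // assumed; the real job's max parallelism may differ
        int oldParallelism = 156;    // parallelism when the savepoint was taken
        int newParallelism = 30;     // parallelism the job is restored with

        // Point 1): each of the 30 new subtasks owns roughly 5x more key groups
        // than one of the 156 old subtasks, so each TaskManager downloads more state.
        KeyGroupRange newRange = rangeFor(maxParallelism, newParallelism, 0);
        System.out.printf("new subtask 0 owns %d key groups%n", newRange.size());

        // Point 2): the new range overlaps several old subtasks' savepoint parts,
        // so the subtask fetches all of them and later discards the key groups
        // that fall outside its own range.
        int overlappingOldParts = 0;
        for (int i = 0; i < oldParallelism; i++) {
            if (rangeFor(maxParallelism, oldParallelism, i).overlaps(newRange)) {
                overlappingOldParts++;
            }
        }
        System.out.printf("new subtask 0 overlaps %d old savepoint parts%n", overlappingOldParts);
    }
}
```

With these assumed numbers, one new subtask covers about 137 key groups and overlaps half a dozen old savepoint parts; extracting only the relevant key groups from each of those parts into a fresh RocksDB instance is the read/write amplification Till describes in point 3).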