Hey Rex,

The second approach (spinning up a standby job and then doing a handover)
sounds more promising to implement, since it doesn't require rewriting half
of the Flink codebase ;)
What you need is a tool that orchestrates the handover: creating a
savepoint, starting a second job from that savepoint, and then communicating
with a custom sink implementation that can be switched on/off in the two
jobs.
With that approach, you should have almost no downtime, just increased
resource requirements during such a handover.
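
The savepoint/start part of that tool is covered by the regular CLI
("flink savepoint <jobId>" and "flink run -s <savepointPath> ..."); the piece
you would have to build yourself is the switchable sink. A rough sketch of
what I mean (checkExternalFlag() and writeToExternalSystem() are placeholders
for your setup, not existing Flink APIs):

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

/**
 * Sketch of a sink that only writes when an external flag says so. The
 * orchestration tool flips that flag (e.g. a ZooKeeper node or a tiny config
 * service) once the new job has caught up and the old one is being torn down.
 */
public abstract class SwitchableSink<T> extends RichSinkFunction<T> {

    private transient volatile boolean writesEnabled;

    @Override
    public void open(Configuration parameters) throws Exception {
        super.open(parameters);
        writesEnabled = checkExternalFlag();
        // a real implementation would re-check the flag periodically so the
        // orchestrator can flip it while the job is running
    }

    @Override
    public void invoke(T value, Context context) throws Exception {
        if (writesEnabled) {
            writeToExternalSystem(value); // the actual write, e.g. JDBC / Elasticsearch
        }
        // otherwise drop the record (old job about to be torn down) or buffer
        // it, depending on which side of the handover this job is on
    }

    // placeholders you'd implement for your environment
    protected abstract boolean checkExternalFlag() throws Exception;
    protected abstract void writeToExternalSystem(T value) throws Exception;
}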

Keep in mind that this handover process only works for scheduled
maintenance. If you have a system failure, you'll still have downtime until
the job has restored from the last checkpoint.
If you are trying to reduce potential downtime overall, I would rather
recommend optimizing the checkpoint restore time, since that covers both
scheduled maintenance and system failures.
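
For the restore time, the usual knobs are incremental RocksDB checkpoints,
task-local recovery and more threads for transferring state to/from S3.
Roughly, in flink-conf.yaml (please double-check the exact keys and defaults
against your Flink version):

state.backend: rocksdb
# only upload the RocksDB files that changed since the last checkpoint
state.backend.incremental: true
# keep a local copy of the state so failure recovery can skip the S3 download
# (this helps failures, not savepoint-based restores or rescaling)
state.backend.local-recovery: true
# parallelize down-/uploading RocksDB files from/to S3
state.backend.rocksdb.checkpoint.transfer.thread.num: 4

Whether that gets you close to the 30s you mentioned depends mostly on the
state size and the network/S3 throughput of your nodes, so I'd benchmark it
on a copy of your job first.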

Best,
Robert





On Wed, Nov 11, 2020 at 8:56 PM Rex Fenley <r...@remind101.com> wrote:

> Another thought: would it be possible to
> * Spin up new core or task nodes.
> * Run a new copy of the same job on these new nodes from a savepoint.
> * Have the new job *not* write to the sink until the other job is torn
> down?
>
> This would allow us to be eventually consistent and keep writes going
> through without downtime. As long as whatever is buffering ahead of the sink
> doesn't run out of space, it should work just fine.
>
> We're hoping to achieve consistency in less than 30s ideally.
>
> Again though, if we could get savepoints to restore in less than 30s that
> would likely be sufficient for our purposes.
>
> Thanks!
>
> On Wed, Nov 11, 2020 at 11:42 AM Rex Fenley <r...@remind101.com> wrote:
>
>> Hello,
>>
>> I'm trying to find a solution for auto-scaling our Flink EMR cluster with
>> zero downtime, using RocksDB as state storage and S3 as the backing store.
>>
>> My current thoughts are like so:
>> * Scaling an Operator dynamically would require all keyed state to be
>> available to the set of subtasks for that operator; therefore, that set of
>> subtasks must be reading from and writing to the same RocksDB. I.e., to
>> scale the subtasks in that set in and out, they need to read from the same
>> RocksDB.
>> * Since subtasks can run on different core nodes, is it possible to have
>> different core nodes read/write to the same RocksDB?
>> * When is it safe to scale an operator in or out? Possibly only right
>> after a checkpoint?
>>
>> If the above is not possible, then we'll have to use savepoints, which
>> means some downtime; therefore:
>> * Scaling out during high traffic is arguably more important to react to
>> quickly than scaling in during low traffic. Is it possible to add more
>> core nodes to EMR without disturbing a job? If so, then maybe we can
>> orchestrate running a new job on the new nodes and then loading a savepoint
>> from the currently running job.
>>
>> Lastly
>> * Savepoints for ~70 GiB of data take on the order of minutes to tens of
>> minutes for us to restore from. Is there any way to speed up restoration?
>>
>> Thanks!
>>
>> --
>>
>> Rex Fenley  |  Software Engineer - Mobile and Backend
>>
>>
>> Remind.com <https://www.remind.com/> |  BLOG <http://blog.remind.com/>
>>  |  FOLLOW US <https://twitter.com/remindhq>  |  LIKE US
>> <https://www.facebook.com/remindhq>
>>
>
