Re: [Flink 1.7.0] initial failures with starting high-parallelism job without checkpoint/savepoint

2019-01-24 Thread Steven Wu
Hi Andrey, Weird that I didn't see your reply in my email inbox. My colleague happened to see it in the Apache archive :) Nope, we didn't experience it with 1.4 (our previous version). Yes, we did use an HA setup. high-availability: zookeeper high-availability.zookeeper.quorum: ...

Re: [Flink 1.7.0] initial failures with starting high-parallelism job without checkpoint/savepoint

2019-01-24 Thread Andrey Zagrebin
Hi Steven, Did you not experience this problem with the previous Flink release (you marked the topic with 1.7)? Do you use an HA setup? Without an HA setup, the blob data, which belongs to the job, will be distributed from the job master node to all task executors. Depending on the size of the blob data (jars,

[Flink 1.7.0] initial failures with starting high-parallelism job without checkpoint/savepoint

2019-01-23 Thread Steven Wu
When we started a high-parallelism (1,600) job without any checkpoint/savepoint, the job struggled to deploy. After a few restarts, it eventually got deployed and ran fine after the initial struggle. The JobManager was very busy and the web UI was very slow. I saw these two exceptions/failures