On Wed, Dec 27, 2017 at 4:41 PM, Hao Sun <[email protected]> wrote: > Somehow TM detected JM leadership loss from ZK and self disconnected? > And couple of seconds later, JM failed to connect to ZK? >
Yes, exactly as you describe. The TM noticed the loss of leadership before the JM did. > After all the cluster recovered nicely by its own, but I am wondering does > this break the exactly-once semantics? If yes, what should I take care? > Great :-) It does not break exactly-once guarantees *within* the Flink pipeline as the state of the latest completed checkpoint will be restored after recovery. This rewinds your job and might result in duplicate or changed output if you don't use an exactly once or idempotent sink. – Ufuk
