Re: [DISCUSS] Drop Jepsen tests

2022-02-09 Thread Yang Wang
@Austin We already have some e2e tests[1] that guard k8s deployments (both session and application, with or without HA). And I agree with you that a network partition could be simulated with a K8s network policy. [1]. https://github.com/apache/flink/blob/master/flink-end-to-end-tests/test-scripts/test_k

Re: [DISCUSS] Drop Jepsen tests

2022-02-09 Thread Austin Cawley-Edwards
Are there e2e tests that run on Kubernetes? Perhaps k8s network policies[1] would be a more approachable way to simulate asymmetric network partitions, without modifying iptables? Austin [1]: https://kubernetes.io/docs/concepts/services-networking/network-policies/ On Wed, Feb 9, 20
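To make the network-policy idea concrete, here is a minimal sketch that builds a NetworkPolicy manifest under which TaskManager pods accept connections only from JobManager pods, so TM-to-TM traffic is cut while TM-to-JM traffic keeps working. The namespace, the policy name, and the `component: taskmanager` / `component: jobmanager` labels are assumptions made for illustration, not values taken from this thread or from the existing e2e tests, and the policy only takes effect if the cluster's network plugin enforces NetworkPolicy.

```python
import json


def tm_isolation_policy(namespace, tm_labels, jm_labels, name="isolate-taskmanagers"):
    """Build a NetworkPolicy dict: selected TM pods accept ingress only from JM pods.

    All arguments are hypothetical; adjust them to the labels your deployment uses.
    """
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # The policy applies to the TaskManager pods.
            "podSelector": {"matchLabels": dict(tm_labels)},
            # Only ingress is restricted; egress from the TMs stays open.
            "policyTypes": ["Ingress"],
            # Allow connections coming from JobManager pods only.
            "ingress": [
                {"from": [{"podSelector": {"matchLabels": dict(jm_labels)}}]}
            ],
        },
    }


if __name__ == "__main__":
    policy = tm_isolation_policy(
        namespace="flink-test",                    # assumed namespace
        tm_labels={"component": "taskmanager"},    # assumed label
        jm_labels={"component": "jobmanager"},     # assumed label
    )
    # JSON is accepted by `kubectl apply -f -`, so the output can be piped directly.
    print(json.dumps(policy, indent=2))
```

Applying the manifest would start the partition and deleting it would heal it, which keeps the fault injection declarative and avoids touching the host's packet filter.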

Re: [DISCUSS] Drop Jepsen tests

2022-02-09 Thread David Morávek
Network partitions are trickier than simply crashing a process. For example, they can be asymmetric: as a TM you're still able to talk to the JM, but you're not able to talk to other TMs. In general this could be achieved by manipulating iptables on the host machine (considering we spawn all the p
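As a rough illustration of the asymmetric case described above, the sketch below only generates the iptables commands; it does not run them. It assumes every TM is reachable under its own IP address (for example, one process per host or container), which is an assumption of this example and not something the thread states. The IP addresses and function names are made up for illustration.

```python
def asymmetric_partition_plan(tm_ips):
    """Map each TM IP to the iptables commands that drop inbound packets from every other TM.

    No rule mentions the JM, so TM <-> JM traffic is left untouched.
    """
    plan = {}
    for tm in tm_ips:
        peers = [ip for ip in tm_ips if ip != tm]
        plan[tm] = [
            ["iptables", "-A", "INPUT", "-s", peer, "-j", "DROP"] for peer in peers
        ]
    return plan


def heal_plan(plan):
    """Return the matching delete commands (-D) that remove the rules again."""
    return {
        host: [[cmd[0], "-D"] + cmd[2:] for cmd in cmds]
        for host, cmds in plan.items()
    }


if __name__ == "__main__":
    # Hypothetical TM addresses.
    plan = asymmetric_partition_plan(["10.0.0.2", "10.0.0.3", "10.0.0.4"])
    for host, cmds in plan.items():
        for cmd in cmds:
            print(host, " ".join(cmd))
```

Each command list would have to be executed with root privileges on the corresponding TM's host, and the commands from `heal_plan` resolve the partition afterwards.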

Re: [DISCUSS] Drop Jepsen tests

2022-02-09 Thread Chesnay Schepler
b/c are part of the same test:
* We have a job running,
* trigger a network partition (failing the job),
* then crash HDFS (preventing checkpoints and access to the HA storageDir),
* then the partition is resolved and HDFS is started again.
Conceptually I would think we can replicate this

Re: [DISCUSS] Drop Jepsen tests

2022-02-09 Thread Chesnay Schepler
The Jepsen tests cover 3 cases:
a) JM/TM crashes
b) HDFS namenode crash (i.e., can't checkpoint because HDFS is down)
c) network partitions
a) can be (and probably is) reasonably covered by existing ITCases and e2e tests
b) We could probably figure this out ourselves if we wanted to.
c) is the diff

Re: [DISCUSS] Drop Jepsen tests

2022-02-09 Thread Konstantin Knauf
Thank you for raising this issue. What risks do you see if we drop it? Do you see any cheaper alternative to (partially) mitigate those risks? On Wed, Feb 9, 2022 at 12:40 PM Chesnay Schepler wrote: > For a few years by now we had a set of Jepsen tests that verify the > correctness of Flinks coo

[DISCUSS] Drop Jepsen tests

2022-02-09 Thread Chesnay Schepler
For a few years now we have had a set of Jepsen tests that verify the correctness of Flink's coordination layer in the case of process crashes. In the past they have indeed found issues and thus provided value to the project, and in general the core idea of them (and of Jepsen for that matter) is very soun