[jira] [Commented] (FLINK-9004) Cluster test: Run general purpose job with failures with Yarn session
[ https://issues.apache.org/jira/browse/FLINK-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16543005#comment-16543005 ]

ASF GitHub Bot commented on FLINK-9004:
---------------------------------------

Github user asfgit closed the pull request at:

    https://github.com/apache/flink/pull/6240

> Cluster test: Run general purpose job with failures with Yarn session
> ---------------------------------------------------------------------
>
>                 Key: FLINK-9004
>                 URL: https://issues.apache.org/jira/browse/FLINK-9004
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Tests
>    Affects Versions: 1.5.0
>            Reporter: Till Rohrmann
>            Assignee: Gary Yao
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 1.6.0
>
>
> Similar to FLINK-8973, we should run the general purpose job (FLINK-8971) on
> a Yarn session cluster and simulate failures.
> The job jar should be ill-packaged, meaning that we include too many
> dependencies in the user jar. We should include the Scala library, Hadoop and
> Flink itself to verify that there are no class loading issues.
> The general purpose job should run with misbehavior activated. Additionally,
> we should simulate at least the following failure scenarios:
> * Kill Flink processes
> * Kill connection to storage system for checkpoints and jobs
> * Simulate network partition
> We should run the test at least with the following state backend: RocksDB
> incremental async and checkpointing to S3.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
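The failure scenarios and the state backend listed in the issue description can be read as a small test matrix. The following is only a rough Python sketch of that reading: the scenario and backend labels are paraphrased from the issue text, and `build_matrix` is a hypothetical helper, not something taken from Flink or Jepsen.

```python
from itertools import product

# Illustrative labels paraphrased from the issue text; they are not
# identifiers from Flink or from the Jepsen test suite.
FAILURE_SCENARIOS = [
    "kill-flink-processes",
    "kill-storage-connection",
    "network-partition",
]
STATE_BACKENDS = ["rocksdb-incremental-async-s3"]


def build_matrix(scenarios, backends):
    """Pair every failure scenario with every state backend."""
    return [{"failure": f, "backend": b} for f, b in product(scenarios, backends)]


for case in build_matrix(FAILURE_SCENARIOS, STATE_BACKENDS):
    print(case["backend"], "+", case["failure"])
```

With one backend and three scenarios this yields three combinations; adding a backend to the list extends the matrix without touching the rest.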
[jira] [Commented] (FLINK-9004) Cluster test: Run general purpose job with failures with Yarn session
[ https://issues.apache.org/jira/browse/FLINK-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16542670#comment-16542670 ]

ASF GitHub Bot commented on FLINK-9004:
---------------------------------------

Github user tillrohrmann commented on the issue:

    https://github.com/apache/flink/pull/6240

    Nice, thank you. Merging this PR.
[jira] [Commented] (FLINK-9004) Cluster test: Run general purpose job with failures with Yarn session
[ https://issues.apache.org/jira/browse/FLINK-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16542611#comment-16542611 ]

ASF GitHub Bot commented on FLINK-9004:
---------------------------------------

Github user GJL commented on the issue:

    https://github.com/apache/flink/pull/6240

    I extended the regular `README.md`.
[jira] [Commented] (FLINK-9004) Cluster test: Run general purpose job with failures with Yarn session
[ https://issues.apache.org/jira/browse/FLINK-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16538459#comment-16538459 ]

ASF GitHub Bot commented on FLINK-9004:
---------------------------------------

Github user GJL commented on a diff in the pull request:

    https://github.com/apache/flink/pull/6240#discussion_r201312394

    --- Diff: flink-jepsen/src/jepsen/flink/db.clj ---
    @@ -175,7 +175,7 @@
       (c/su (c/exec (c/lit (str "HADOOP_CLASSPATH=`" hadoop/install-dir "/bin/hadoop classpath` "
                                 "HADOOP_CONF_DIR=" hadoop/hadoop-conf-dir
    -                            " " install-dir "/bin/yarn-session.sh -d -jm 2048 -tm 2048")))
    +                            " " install-dir "/bin/yarn-session.sh -d -jm 2048m -tm 2048m")))
    --- End diff --

    See https://issues.apache.org/jira/browse/FLINK-9777
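The diff above adds an explicit `m` suffix to the `-jm` and `-tm` values passed to `yarn-session.sh`, with FLINK-9777 given as the reason. As a rough illustration only, the Python sketch below normalizes such values before a command line is assembled. It assumes that a bare number is meant as megabytes, which is inferred from the `2048` to `2048m` change and not from Flink's documentation; `with_megabyte_unit` and `yarn_session_args` are hypothetical helpers.

```python
import re

# Hypothetical helpers for illustration; not part of Flink or the Jepsen tests.
_HAS_UNIT = re.compile(r"^\d+[a-zA-Z]+$")


def with_megabyte_unit(value):
    """Return value unchanged if it already carries a unit, else append 'm'.

    Assumption: a bare number is intended as megabytes.
    """
    value = value.strip()
    if _HAS_UNIT.match(value):
        return value
    if value.isdigit():
        return value + "m"
    raise ValueError("not a memory size: %r" % value)


def yarn_session_args(jm, tm):
    # -d, -jm and -tm are the flags that appear in the diff.
    return ["-d", "-jm", with_megabyte_unit(jm), "-tm", with_megabyte_unit(tm)]


print(" ".join(yarn_session_args("2048", "2048m")))
```

Rejecting anything that is neither a number nor a number with a unit keeps a typo from silently turning into a malformed argument.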
[jira] [Commented] (FLINK-9004) Cluster test: Run general purpose job with failures with Yarn session
[ https://issues.apache.org/jira/browse/FLINK-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16534854#comment-16534854 ]

ASF GitHub Bot commented on FLINK-9004:
---------------------------------------

Github user GJL commented on a diff in the pull request:

    https://github.com/apache/flink/pull/6240#discussion_r200667325

    --- Diff: jepsen-flink/.gitignore ---
    @@ -0,0 +1,17 @@
    +*.class
    +*.iml
    +*.jar
    +*.retry
    +.DS_Store
    +.hg/
    +.hgignore
    +.idea/
    +/.lein-*
    +/.nrepl-port
    +/checkouts
    +/classes
    +/target
    +pom.xml
    +pom.xml.asc
    +store
    +bin/DataStreamAllroundTestProgram.jar
    --- End diff --

    Good point. I fixed it.
[jira] [Commented] (FLINK-9004) Cluster test: Run general purpose job with failures with Yarn session
[ https://issues.apache.org/jira/browse/FLINK-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16534853#comment-16534853 ]

ASF GitHub Bot commented on FLINK-9004:
---------------------------------------

Github user GJL commented on the issue:

    https://github.com/apache/flink/pull/6240

    It seems that we have to stick with `0.1.8` for now. According to legal, there are no concerns so far: https://issues.apache.org/jira/browse/LEGAL-392
[jira] [Commented] (FLINK-9004) Cluster test: Run general purpose job with failures with Yarn session
[ https://issues.apache.org/jira/browse/FLINK-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16533042#comment-16533042 ]

ASF GitHub Bot commented on FLINK-9004:
---------------------------------------

Github user GJL commented on the issue:

    https://github.com/apache/flink/pull/6240

    `0.1.9` was problematic. Currently testing `0.1.10`.
[jira] [Commented] (FLINK-9004) Cluster test: Run general purpose job with failures with Yarn session
[ https://issues.apache.org/jira/browse/FLINK-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16533037#comment-16533037 ]

ASF GitHub Bot commented on FLINK-9004:
---------------------------------------

Github user tillrohrmann commented on the issue:

    https://github.com/apache/flink/pull/6240

    Gary is reaching out to Apache legal to ask for advice on how to handle the SOLIPSISTIC license. Apparently, we cannot easily upgrade to `0.1.10` because it contains a bug.
[jira] [Commented] (FLINK-9004) Cluster test: Run general purpose job with failures with Yarn session
[ https://issues.apache.org/jira/browse/FLINK-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16533007#comment-16533007 ]

ASF GitHub Bot commented on FLINK-9004:
---------------------------------------

Github user zentol commented on the issue:

    https://github.com/apache/flink/pull/6240

    Note that this may also imply that the files that we add cannot be apache licensed, but I'm not completely sure about that.
[jira] [Commented] (FLINK-9004) Cluster test: Run general purpose job with failures with Yarn session
[ https://issues.apache.org/jira/browse/FLINK-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16532832#comment-16532832 ]

ASF GitHub Bot commented on FLINK-9004:
---------------------------------------

Github user tillrohrmann commented on the issue:

    https://github.com/apache/flink/pull/6240

    I think you are right @zentol because we are linking against Jepsen `0.1.8` and Jepsen was licensed under EPL only in version `0.1.10`.
[jira] [Commented] (FLINK-9004) Cluster test: Run general purpose job with failures with Yarn session
[ https://issues.apache.org/jira/browse/FLINK-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16532810#comment-16532810 ]

ASF GitHub Bot commented on FLINK-9004:
---------------------------------------

Github user zentol commented on the issue:

    https://github.com/apache/flink/pull/6240

    My information is outdated, they updated the license roughly a month ago to EPL: https://github.com/jepsen-io/jepsen/commit/d9c8530b7a2deac2389d93639581dfa1ff0f08fc
[jira] [Commented] (FLINK-9004) Cluster test: Run general purpose job with failures with Yarn session
[ https://issues.apache.org/jira/browse/FLINK-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16532789#comment-16532789 ]

ASF GitHub Bot commented on FLINK-9004:
---------------------------------------

Github user tillrohrmann commented on the issue:

    https://github.com/apache/flink/pull/6240

    Where exactly do you see the license conflicts @zentol? We do not include Jepsen code in our repository and since it is licensed under EPL 1.0 we should be able to link against it.
[jira] [Commented] (FLINK-9004) Cluster test: Run general purpose job with failures with Yarn session
[ https://issues.apache.org/jira/browse/FLINK-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16532784#comment-16532784 ]

ASF GitHub Bot commented on FLINK-9004:
---------------------------------------

Github user zentol commented on the issue:

    https://github.com/apache/flink/pull/6240

    The jepsen license is not compatible with the apache license and thus must not be part of the release.
[jira] [Commented] (FLINK-9004) Cluster test: Run general purpose job with failures with Yarn session
[ https://issues.apache.org/jira/browse/FLINK-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16532770#comment-16532770 ]

ASF GitHub Bot commented on FLINK-9004:
---------------------------------------

Github user tillrohrmann commented on the issue:

    https://github.com/apache/flink/pull/6240

    Concerning @zentol's comment: We also include the `flink-end-to-end-tests` in the source release. Therefore, I think we could also include the jepsen-tests in the src release. Especially since one can execute them easily via `docker-compose`.
[jira] [Commented] (FLINK-9004) Cluster test: Run general purpose job with failures with Yarn session
[ https://issues.apache.org/jira/browse/FLINK-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16532768#comment-16532768 ]

ASF GitHub Bot commented on FLINK-9004:
---------------------------------------

Github user tillrohrmann commented on a diff in the pull request:

    https://github.com/apache/flink/pull/6240#discussion_r200127094

    --- Diff: jepsen-flink/.gitignore ---
    @@ -0,0 +1,17 @@
    +*.class
    +*.iml
    +*.jar
    +*.retry
    +.DS_Store
    +.hg/
    +.hgignore
    +.idea/
    +/.lein-*
    +/.nrepl-port
    +/checkouts
    +/classes
    +/target
    +pom.xml
    +pom.xml.asc
    +store
    +bin/DataStreamAllroundTestProgram.jar
    --- End diff --

    Maybe we could ignore the complete `bin/` folder. I put, for example, my `flink-dist.tgz` there and git reports it now as an untracked file.
[jira] [Commented] (FLINK-9004) Cluster test: Run general purpose job with failures with Yarn session
[ https://issues.apache.org/jira/browse/FLINK-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16531019#comment-16531019 ]

ASF GitHub Bot commented on FLINK-9004:
---------------------------------------

Github user GJL commented on the issue:

    https://github.com/apache/flink/pull/6239

    @zentol No problem, I opened a new one.
[jira] [Commented] (FLINK-9004) Cluster test: Run general purpose job with failures with Yarn session
[ https://issues.apache.org/jira/browse/FLINK-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16531021#comment-16531021 ]

ASF GitHub Bot commented on FLINK-9004:
---------------------------------------

Github user GJL commented on the issue:

    https://github.com/apache/flink/pull/6240

    As @zentol mentioned, we might want to exclude this from create_source_release.sh.
[jira] [Commented] (FLINK-9004) Cluster test: Run general purpose job with failures with Yarn session
[ https://issues.apache.org/jira/browse/FLINK-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16531018#comment-16531018 ]

ASF GitHub Bot commented on FLINK-9004:
---------------------------------------

GitHub user GJL opened a pull request:

    https://github.com/apache/flink/pull/6240

    [FLINK-9004][tests] Implement Jepsen tests to test job availability.

    ## What is the purpose of the change

    *Use the Jepsen framework (https://github.com/jepsen-io/jepsen) to implement tests that verify Flink's HA capabilities under real-world faults, such as sudden TaskManager/JobManager termination, HDFS NameNode unavailability, network partitions, etc. The Flink cluster under test is automatically deployed on YARN (session & job mode) and Mesos.*

    Previous PR got closed accidentally: https://github.com/apache/flink/pull/6239

    ## Brief change log

      - *Implement Jepsen tests.*

    ## Verifying this change

    This change added tests and can be verified as follows:

      - *The changes themselves are tests.*
      - *Run Jepsen tests in docker containers.*
      - *Run unit tests with `lein test`*

    ## Does this pull request potentially affect one of the following parts:

      - Dependencies (does it add or upgrade a dependency): (yes / **no** (at least not to Flink))
      - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: (yes / **no**)
      - The serializers: (yes / **no** / don't know)
      - The runtime per-record code paths (performance sensitive): (yes / **no** / don't know)
      - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (yes / **no** (but it will as soon as test failures appear) / don't know)
      - The S3 file system connector: (yes / **no** / don't know)

    ## Documentation

      - Does this pull request introduce a new feature? (yes / **no**)
      - If yes, how is the feature documented? (**not applicable** / docs / JavaDocs / not documented)

    cc: @tillrohrmann @cewood @zentol @aljoscha

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/GJL/flink FLINK-9004

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/6240.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #6240

----
commit 063e4621a5982b55ee7f7b0935290bbc717a5a45
Author: gyao
Date:   2018-03-05T21:23:33Z

    [FLINK-9004][tests] Implement Jepsen tests to test job availability.

    Use the Jepsen framework (https://github.com/jepsen-io/jepsen) to implement tests that verify Flink's HA capabilities under real-world faults, such as sudden TaskManager/JobManager termination, HDFS NameNode unavailability, network partitions, etc. The Flink cluster under test is automatically deployed on YARN (session & job mode) and Mesos.

    Provide Dockerfiles for local test development.

commit 46f0ea7b14c9c59d6cc40903486978f4fd8354d3
Author: gyao
Date:   2018-07-02T12:21:18Z

    fixup! [FLINK-9004][tests] Implement Jepsen tests to test job availability.
[jira] [Commented] (FLINK-9004) Cluster test: Run general purpose job with failures with Yarn session
[ https://issues.apache.org/jira/browse/FLINK-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16530985#comment-16530985 ] ASF GitHub Bot commented on FLINK-9004: --- Github user zentol commented on the issue: https://github.com/apache/flink/pull/6239 whoops, sorry I accidentally closed this PR, added the wrong PR ID to a commit :( I'm so sorry. > Cluster test: Run general purpose job with failures with Yarn session > - > > Key: FLINK-9004 > URL: https://issues.apache.org/jira/browse/FLINK-9004 > Project: Flink > Issue Type: Sub-task > Components: Tests >Affects Versions: 1.5.0 >Reporter: Till Rohrmann >Assignee: Gary Yao >Priority: Blocker > Labels: pull-request-available > Fix For: 1.6.0, 1.5.1 > > > Similar to FLINK-8973, we should run the general purpose job (FLINK-8971) on > a Yarn session cluster and simulate failures. > The job jar should be ill-packaged, meaning that we include too many > dependencies in the user jar. We should include the Scala library, Hadoop and > Flink itself to verify that there are no class loading issues. > The general purpose job should run with misbehavior activated. Additionally, > we should simulate at least the following failure scenarios: > * Kill Flink processes > * Kill connection to storage system for checkpoints and jobs > * Simulate network partition > We should run the test at least with the following state backend: RocksDB > incremental async and checkpointing to S3. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (FLINK-9004) Cluster test: Run general purpose job with failures with Yarn session
[ https://issues.apache.org/jira/browse/FLINK-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16530983#comment-16530983 ] ASF GitHub Bot commented on FLINK-9004: --- Github user asfgit closed the pull request at: https://github.com/apache/flink/pull/6239 > Cluster test: Run general purpose job with failures with Yarn session > - > > Key: FLINK-9004 > URL: https://issues.apache.org/jira/browse/FLINK-9004 > Project: Flink > Issue Type: Sub-task > Components: Tests >Affects Versions: 1.5.0 >Reporter: Till Rohrmann >Assignee: Gary Yao >Priority: Blocker > Labels: pull-request-available > Fix For: 1.6.0, 1.5.1 > > > Similar to FLINK-8973, we should run the general purpose job (FLINK-8971) on > a Yarn session cluster and simulate failures. > The job jar should be ill-packaged, meaning that we include too many > dependencies in the user jar. We should include the Scala library, Hadoop and > Flink itself to verify that there are no class loading issues. > The general purpose job should run with misbehavior activated. Additionally, > we should simulate at least the following failure scenarios: > * Kill Flink processes > * Kill connection to storage system for checkpoints and jobs > * Simulate network partition > We should run the test at least with the following state backend: RocksDB > incremental async and checkpointing to S3. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (FLINK-9004) Cluster test: Run general purpose job with failures with Yarn session
[ https://issues.apache.org/jira/browse/FLINK-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16530977#comment-16530977 ] ASF GitHub Bot commented on FLINK-9004: --- Github user zentol commented on the issue: https://github.com/apache/flink/pull/6239 As I understand it, we don't want this to be part of the source release; as such, we need an exclusion in [create_source_release.sh](https://github.com/apache/flink/blob/master/tools/releasing/create_source_release.sh). > Cluster test: Run general purpose job with failures with Yarn session > - > > Key: FLINK-9004 > URL: https://issues.apache.org/jira/browse/FLINK-9004 > Project: Flink > Issue Type: Sub-task > Components: Tests >Affects Versions: 1.5.0 >Reporter: Till Rohrmann >Assignee: Gary Yao >Priority: Blocker > Labels: pull-request-available > Fix For: 1.6.0, 1.5.1 > > > Similar to FLINK-8973, we should run the general purpose job (FLINK-8971) on > a Yarn session cluster and simulate failures. > The job jar should be ill-packaged, meaning that we include too many > dependencies in the user jar. We should include the Scala library, Hadoop and > Flink itself to verify that there are no class loading issues. > The general purpose job should run with misbehavior activated. Additionally, > we should simulate at least the following failure scenarios: > * Kill Flink processes > * Kill connection to storage system for checkpoints and jobs > * Simulate network partition > We should run the test at least with the following state backend: RocksDB > incremental async and checkpointing to S3. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
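For readers unfamiliar with what such an exclusion looks like, here is a minimal sketch of copying a source tree into a staging directory while leaving one directory out. It is not taken from `create_source_release.sh`, which may use a different tool and different options. The function name `stage_sources`, its arguments, and the use of `tar --exclude` are all assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical sketch, not the real release script: copy a source tree into a
# staging directory while leaving out one directory (e.g. jepsen-flink).
# Arguments (all illustrative): source dir, staging dir, directory name to skip.
# The staging dir must not live inside the source dir.
stage_sources() {
  src_dir="$1"
  stage_dir="$2"
  excluded="$3"
  mkdir -p "$stage_dir"
  # --exclude is given before the file argument; it skips every path component
  # with that name, together with everything below it.
  (cd "$src_dir" && tar -cf - --exclude="$excluded" .) | (cd "$stage_dir" && tar -xf -)
}
```

A call such as `stage_sources /path/to/flink /tmp/flink-src jepsen-flink` would then produce a copy without the `jepsen-flink` directory; both paths are placeholders.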
[jira] [Commented] (FLINK-9004) Cluster test: Run general purpose job with failures with Yarn session
[ https://issues.apache.org/jira/browse/FLINK-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16530873#comment-16530873 ] ASF GitHub Bot commented on FLINK-9004: --- Github user GJL commented on a diff in the pull request: https://github.com/apache/flink/pull/6239#discussion_r199690915 --- Diff: jepsen-flink/README.md --- @@ -0,0 +1,60 @@ +# jepsen.flink + +A Clojure project based on the [Jepsen](https://github.com/jepsen-io/jepsen) framework to find bugs in the +distributed coordination of Apache Flink®. + +## Test Coverage +Jepsen is a framework built to test the behavior of distributed systems +under faults. The tests in this particular project deploy Flink on either YARN or Mesos, submit a +job, and examine the availability of the job after injecting faults. +A job is said to be available if all the tasks of the job are running. +The faults that can be currently introduced to the Flink cluster include: +* Killing of TaskManager/JobManager processes +* Stopping HDFS NameNode +* Network partitions + +There are many more properties other than job availability that could be +verified but are not yet covered by this test suite, e.g., end-to-end exactly-once processing +semantics. + +## Usage +See the [Jepsen documentation](https://github.com/jepsen-io/jepsen#setting-up-a-jepsen-environment) +for how to set up the environment to run tests. The `scripts/run-tests.sh` documents how to invoke +tests. The Flink job used for testing is located under +`flink-end-to-end-tests/flink-datastream-allround-test`. You have to build the job first and copy +the resulting jar (`DataStreamAllroundTestProgram.jar`) to the `./bin` directory of this project's +root. + +To simplify development, we have prepared Dockerfiles and a Docker Compose template +so that you can run the tests locally in containers. 
To build the images +and start the containers, simply run: + +$ cd docker +$ ./up.sh + +After the containers started, open a new terminal window and run `docker exec -it jepsen-control bash`. +This will allow you to run arbitrary commands on the control node. +To start the tests, you can use the `run-tests.sh` script in the `docker` directory, +which expects the number of test iterations, and a URI to a Flink distribution, e.g., + +./docker/run-tests.sh 1 https://example.com/flink-dist.tgz + +The project's root is mounted as a volume to all containers under the path `/jepsen`. +This means that changes to the test sources are immediately reflected in the control node container. +Moreover, this allows you to test locally built Flink distributions by copying the tarball to the +project's root and passing a URI with the `file://` scheme to the `run-tests.sh` script, e.g., +`file:///jepsen/flink-dist.tgz`. + +### Checking the output of tests + +Consult the `jepsen.log` file for the particular test run in the `store` folder. The final output of every test will be either + +Everything looks good! ヽ('ー`)ノ + +or + +Analysis invalid! (ノಥ益ಥ)ノ ┻━┻ --- End diff -- The Chinese character is the mouth and the eyebrows of the emoji: https://github.com/jepsen-io/jepsen/blob/dd197bbce0b92c1ab3423709ac6bb0b2ee853365/jepsen/src/jepsen/core.clj#L532 > Cluster test: Run general purpose job with failures with Yarn session > - > > Key: FLINK-9004 > URL: https://issues.apache.org/jira/browse/FLINK-9004 > Project: Flink > Issue Type: Sub-task > Components: Tests >Affects Versions: 1.5.0 >Reporter: Till Rohrmann >Assignee: Gary Yao >Priority: Blocker > Labels: pull-request-available > Fix For: 1.6.0, 1.5.1 > > > Similar to FLINK-8973, we should run the general purpose job (FLINK-8971) on > a Yarn session cluster and simulate failures. > The job jar should be ill-packaged, meaning that we include too many > dependencies in the user jar.
We should include the Scala library, Hadoop and > Flink itself to verify that there are no class loading issues. > The general purpose job should run with misbehavior activated. Additionally, > we should simulate at least the following failure scenarios: > * Kill Flink processes > * Kill connection to storage system for checkpoints and jobs > * Simulate network partition > We should run the test at least with the following state backend: RocksDB > incremental async and checkpointing to S3. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
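The README quoted in the diff above says that every test run ends with one of two verdict lines in `jepsen.log`. A small helper along the following lines could turn that into a machine-readable result. This is only a sketch: the function name `jepsen_verdict` and its return codes are invented for illustration, and it assumes the two verdict strings appear verbatim in the log.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report the verdict recorded in a Jepsen log file.
# Assumes the phrases "Everything looks good!" and "Analysis invalid!" appear
# verbatim in the log, as the README describes.
jepsen_verdict() {
  log_file="$1"
  if [ ! -f "$log_file" ]; then
    echo "missing"
    return 2
  fi
  # Keep only the last verdict line in case the log contains several.
  last_verdict="$(grep -E 'Everything looks good!|Analysis invalid!' "$log_file" | tail -n 1 || true)"
  case "$last_verdict" in
    *"Everything looks good!"*) echo "valid"; return 0 ;;
    *"Analysis invalid!"*) echo "invalid"; return 1 ;;
    *) echo "unknown"; return 3 ;;
  esac
}
```

It could be called as `jepsen_verdict path/to/jepsen.log`, where the path is a placeholder for the log of the particular run under the `store` folder.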
[jira] [Commented] (FLINK-9004) Cluster test: Run general purpose job with failures with Yarn session
[ https://issues.apache.org/jira/browse/FLINK-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16530673#comment-16530673 ] ASF GitHub Bot commented on FLINK-9004: --- Github user yanghua commented on a diff in the pull request: https://github.com/apache/flink/pull/6239#discussion_r199664502 --- Diff: jepsen-flink/README.md --- @@ -0,0 +1,60 @@ +# jepsen.flink + +A Clojure project based on the [Jepsen](https://github.com/jepsen-io/jepsen) framework to find bugs in the +distributed coordination of Apache Flink®. + +## Test Coverage +Jepsen is a framework built to test the behavior of distributed systems +under faults. The tests in this particular project deploy Flink on either YARN or Mesos, submit a +job, and examine the availability of the job after injecting faults. +A job is said to be available if all the tasks of the job are running. +The faults that can be currently introduced to the Flink cluster include: +* Killing of TaskManager/JobManager processes +* Stopping HDFS NameNode +* Network partitions + +There are many more properties other than job availability that could be +verified but are not yet covered by this test suite, e.g., end-to-end exactly-once processing +semantics. + +## Usage +See the [Jepsen documentation](https://github.com/jepsen-io/jepsen#setting-up-a-jepsen-environment) +for how to set up the environment to run tests. The `scripts/run-tests.sh` documents how to invoke +tests. The Flink job used for testing is located under +`flink-end-to-end-tests/flink-datastream-allround-test`. You have to build the job first and copy +the resulting jar (`DataStreamAllroundTestProgram.jar`) to the `./bin` directory of this project's +root. + +To simplify development, we have prepared Dockerfiles and a Docker Compose template +so that you can run the tests locally in containers. 
To build the images +and start the containers, simply run: + +$ cd docker +$ ./up.sh + +After the containers started, open a new terminal window and run `docker exec -it jepsen-control bash`. +This will allow you to run arbitrary commands on the control node. +To start the tests, you can use the `run-tests.sh` script in the `docker` directory, +which expects the number of test iterations, and a URI to a Flink distribution, e.g., + +./docker/run-tests.sh 1 https://example.com/flink-dist.tgz + +The project's root is mounted as a volume to all containers under the path `/jepsen`. +This means that changes to the test sources are immediately reflected in the control node container. +Moreover, this allows you to test locally built Flink distributions by copying the tarball to the +project's root and passing a URI with the `file://` scheme to the `run-tests.sh` script, e.g., +`file:///jepsen/flink-dist.tgz`. + +### Checking the output of tests + +Consult the `jepsen.log` file for the particular test run in the `store` folder. The final output of every test will be either + +Everything looks good! ヽ('ー`)ノ + +or + +Analysis invalid! (ノಥ益ಥ)ノ ┻━┻ --- End diff -- @GJL I really do see a Chinese character; it shows `(ノಥ益ಥ)ノ ┻━┻` to me on the web page. It contains `益`, which is a Chinese character. Maybe it's a GitHub problem? > Cluster test: Run general purpose job with failures with Yarn session > - > > Key: FLINK-9004 > URL: https://issues.apache.org/jira/browse/FLINK-9004 > Project: Flink > Issue Type: Sub-task > Components: Tests >Affects Versions: 1.5.0 >Reporter: Till Rohrmann >Assignee: Gary Yao >Priority: Blocker > Labels: pull-request-available > Fix For: 1.6.0, 1.5.1 > > > Similar to FLINK-8973, we should run the general purpose job (FLINK-8971) on > a Yarn session cluster and simulate failures. > The job jar should be ill-packaged, meaning that we include too many > dependencies in the user jar.
We should include the Scala library, Hadoop and > Flink itself to verify that there are no class loading issues. > The general purpose job should run with misbehavior activated. Additionally, > we should simulate at least the following failure scenarios: > * Kill Flink processes > * Kill connection to storage system for checkpoints and jobs > * Simulate network partition > We should run the test at least with the following state backend: RocksDB > incremental async and checkpointing to S3. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (FLINK-9004) Cluster test: Run general purpose job with failures with Yarn session
[ https://issues.apache.org/jira/browse/FLINK-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16530445#comment-16530445 ] ASF GitHub Bot commented on FLINK-9004: --- Github user GJL commented on a diff in the pull request: https://github.com/apache/flink/pull/6239#discussion_r199622404 --- Diff: jepsen-flink/README.md --- @@ -0,0 +1,60 @@ +# jepsen.flink + +A Clojure project based on the [Jepsen](https://github.com/jepsen-io/jepsen) framework to find bugs in the +distributed coordination of Apache Flink®. + +## Test Coverage +Jepsen is a framework built to test the behavior of distributed systems +under faults. The tests in this particular project deploy Flink on either YARN or Mesos, submit a +job, and examine the availability of the job after injecting faults. +A job is said to be available if all the tasks of the job are running. +The faults that can be currently introduced to the Flink cluster include: +* Killing of TaskManager/JobManager processes +* Stopping HDFS NameNode +* Network partitions + +There are many more properties other than job availability that could be +verified but are not yet covered by this test suite, e.g., end-to-end exactly-once processing +semantics. + +## Usage +See the [Jepsen documentation](https://github.com/jepsen-io/jepsen#setting-up-a-jepsen-environment) +for how to set up the environment to run tests. The `scripts/run-tests.sh` documents how to invoke +tests. The Flink job used for testing is located under +`flink-end-to-end-tests/flink-datastream-allround-test`. You have to build the job first and copy +the resulting jar (`DataStreamAllroundTestProgram.jar`) to the `./bin` directory of this project's +root. + +To simplify development, we have prepared Dockerfiles and a Docker Compose template +so that you can run the tests locally in containers. 
To build the images +and start the containers, simply run: + +$ cd docker +$ ./up.sh + +After the containers started, open a new terminal window and run `docker exec -it jepsen-control bash`. +This will allow you to run arbitrary commands on the control node. +To start the tests, you can use the `run-tests.sh` script in the `docker` directory, +which expects the number of test iterations, and a URI to a Flink distribution, e.g., + +./docker/run-tests.sh 1 https://example.com/flink-dist.tgz + +The project's root is mounted as a volume to all containers under the path `/jepsen`. +This means that changes to the test sources are immediately reflected in the control node container. +Moreover, this allows you to test locally built Flink distributions by copying the tarball to the +project's root and passing a URI with the `file://` scheme to the `run-tests.sh` script, e.g., +`file:///jepsen/flink-dist.tgz`. + +### Checking the output of tests + +Consult the `jepsen.log` file for the particular test run in the `store` folder. The final output of every test will be either + +Everything looks good! ヽ('ー`)ノ + +or + +Analysis invalid! (ノಥ益ಥ)ノ ┻━┻ --- End diff -- it's a "flipping table emoji" > Cluster test: Run general purpose job with failures with Yarn session > - > > Key: FLINK-9004 > URL: https://issues.apache.org/jira/browse/FLINK-9004 > Project: Flink > Issue Type: Sub-task > Components: Tests >Affects Versions: 1.5.0 >Reporter: Till Rohrmann >Assignee: Gary Yao >Priority: Blocker > Labels: pull-request-available > Fix For: 1.6.0, 1.5.1 > > > Similar to FLINK-8973, we should run the general purpose job (FLINK-8971) on > a Yarn session cluster and simulate failures. > The job jar should be ill-packaged, meaning that we include too many > dependencies in the user jar. We should include the Scala library, Hadoop and > Flink itself to verify that there are no class loading issues. > The general purpose job should run with misbehavior activated. 
Additionally, > we should simulate at least the following failure scenarios: > * Kill Flink processes > * Kill connection to storage system for checkpoints and jobs > * Simulate network partition > We should run the test at least with the following state backend: RocksDB > incremental async and checkpointing to S3. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (FLINK-9004) Cluster test: Run general purpose job with failures with Yarn session
[ https://issues.apache.org/jira/browse/FLINK-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16530310#comment-16530310 ] ASF GitHub Bot commented on FLINK-9004: --- Github user zentol commented on a diff in the pull request: https://github.com/apache/flink/pull/6239#discussion_r199592285 --- Diff: jepsen-flink/README.md --- @@ -0,0 +1,60 @@ +# jepsen.flink + +A Clojure project based on the [Jepsen](https://github.com/jepsen-io/jepsen) framework to find bugs in the +distributed coordination of Apache Flink®. + +## Test Coverage +Jepsen is a framework built to test the behavior of distributed systems +under faults. The tests in this particular project deploy Flink on either YARN or Mesos, submit a +job, and examine the availability of the job after injecting faults. +A job is said to be available if all the tasks of the job are running. +The faults that can be currently introduced to the Flink cluster include: +* Killing of TaskManager/JobManager processes +* Stopping HDFS NameNode +* Network partitions + +There are many more properties other than job availability that could be +verified but are not yet covered by this test suite, e.g., end-to-end exactly-once processing +semantics. + +## Usage +See the [Jepsen documentation](https://github.com/jepsen-io/jepsen#setting-up-a-jepsen-environment) +for how to set up the environment to run tests. The `scripts/run-tests.sh` documents how to invoke +tests. The Flink job used for testing is located under +`flink-end-to-end-tests/flink-datastream-allround-test`. You have to build the job first and copy +the resulting jar (`DataStreamAllroundTestProgram.jar`) to the `./bin` directory of this project's +root. + +To simplify development, we have prepared Dockerfiles and a Docker Compose template +so that you can run the tests locally in containers. 
To build the images +and start the containers, simply run: + +$ cd docker +$ ./up.sh + +After the containers started, open a new terminal window and run `docker exec -it jepsen-control bash`. +This will allow you to run arbitrary commands on the control node. +To start the tests, you can use the `run-tests.sh` script in the `docker` directory, +which expects the number of test iterations, and a URI to a Flink distribution, e.g., + +./docker/run-tests.sh 1 https://example.com/flink-dist.tgz + +The project's root is mounted as a volume to all containers under the path `/jepsen`. +This means that changes to the test sources are immediately reflected in the control node container. +Moreover, this allows you to test locally built Flink distributions by copying the tarball to the +project's root and passing a URI with the `file://` scheme to the `run-tests.sh` script, e.g., +`file:///jepsen/flink-dist.tgz`. + +### Checking the output of tests + +Consult the `jepsen.log` file for the particular test run in the `store` folder. The final output of every test will be either + +Everything looks good! ヽ('ー`)ノ --- End diff -- it's a smiley... > Cluster test: Run general purpose job with failures with Yarn session > - > > Key: FLINK-9004 > URL: https://issues.apache.org/jira/browse/FLINK-9004 > Project: Flink > Issue Type: Sub-task > Components: Tests >Affects Versions: 1.5.0 >Reporter: Till Rohrmann >Assignee: Gary Yao >Priority: Blocker > Labels: pull-request-available > Fix For: 1.6.0, 1.5.1 > > > Similar to FLINK-8973, we should run the general purpose job (FLINK-8971) on > a Yarn session cluster and simulate failures. > The job jar should be ill-packaged, meaning that we include too many > dependencies in the user jar. We should include the Scala library, Hadoop and > Flink itself to verify that there are no class loading issues. > The general purpose job should run with misbehavior activated. 
Additionally, > we should simulate at least the following failure scenarios: > * Kill Flink processes > * Kill connection to storage system for checkpoints and jobs > * Simulate network partition > We should run the test at least with the following state backend: RocksDB > incremental async and checkpointing to S3. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (FLINK-9004) Cluster test: Run general purpose job with failures with Yarn session
[ https://issues.apache.org/jira/browse/FLINK-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16530047#comment-16530047 ] ASF GitHub Bot commented on FLINK-9004: --- Github user yanghua commented on a diff in the pull request: https://github.com/apache/flink/pull/6239#discussion_r199527828 --- Diff: jepsen-flink/README.md --- @@ -0,0 +1,60 @@ +# jepsen.flink + +A Clojure project based on the [Jepsen](https://github.com/jepsen-io/jepsen) framework to find bugs in the +distributed coordination of Apache Flink®. + +## Test Coverage +Jepsen is a framework built to test the behavior of distributed systems +under faults. The tests in this particular project deploy Flink on either YARN or Mesos, submit a +job, and examine the availability of the job after injecting faults. +A job is said to be available if all the tasks of the job are running. +The faults that can be currently introduced to the Flink cluster include: +* Killing of TaskManager/JobManager processes +* Stopping HDFS NameNode +* Network partitions + +There are many more properties other than job availability that could be +verified but are not yet covered by this test suite, e.g., end-to-end exactly-once processing +semantics. + +## Usage +See the [Jepsen documentation](https://github.com/jepsen-io/jepsen#setting-up-a-jepsen-environment) +for how to set up the environment to run tests. The `scripts/run-tests.sh` documents how to invoke +tests. The Flink job used for testing is located under +`flink-end-to-end-tests/flink-datastream-allround-test`. You have to build the job first and copy +the resulting jar (`DataStreamAllroundTestProgram.jar`) to the `./bin` directory of this project's +root. + +To simplify development, we have prepared Dockerfiles and a Docker Compose template +so that you can run the tests locally in containers. 
To build the images +and start the containers, simply run: + +$ cd docker +$ ./up.sh + +After the containers started, open a new terminal window and run `docker exec -it jepsen-control bash`. +This will allow you to run arbitrary commands on the control node. +To start the tests, you can use the `run-tests.sh` script in the `docker` directory, +which expects the number of test iterations, and a URI to a Flink distribution, e.g., + +./docker/run-tests.sh 1 https://example.com/flink-dist.tgz + +The project's root is mounted as a volume to all containers under the path `/jepsen`. +This means that changes to the test sources are immediately reflected in the control node container. +Moreover, this allows you to test locally built Flink distributions by copying the tarball to the +project's root and passing a URI with the `file://` scheme to the `run-tests.sh` script, e.g., +`file:///jepsen/flink-dist.tgz`. + +### Checking the output of tests + +Consult the `jepsen.log` file for the particular test run in the `store` folder. The final output of every test will be either + +Everything looks good! ヽ('ー`)ノ + +or + +Analysis invalid! (ノಥ益ಥ)ノ ┻━┻ --- End diff -- the end of this line is gibberish? > Cluster test: Run general purpose job with failures with Yarn session > - > > Key: FLINK-9004 > URL: https://issues.apache.org/jira/browse/FLINK-9004 > Project: Flink > Issue Type: Sub-task > Components: Tests >Affects Versions: 1.5.0 >Reporter: Till Rohrmann >Assignee: Gary Yao >Priority: Blocker > Labels: pull-request-available > Fix For: 1.6.0, 1.5.1 > > > Similar to FLINK-8973, we should run the general purpose job (FLINK-8971) on > a Yarn session cluster and simulate failures. > The job jar should be ill-packaged, meaning that we include too many > dependencies in the user jar. We should include the Scala library, Hadoop and > Flink itself to verify that there are no class loading issues. > The general purpose job should run with misbehavior activated. 
Additionally, > we should simulate at least the following failure scenarios: > * Kill Flink processes > * Kill connection to storage system for checkpoints and jobs > * Simulate network partition > We should run the test at least with the following state backend: RocksDB > incremental async and checkpointing to S3. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (FLINK-9004) Cluster test: Run general purpose job with failures with Yarn session
[ https://issues.apache.org/jira/browse/FLINK-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16530048#comment-16530048 ] ASF GitHub Bot commented on FLINK-9004: --- Github user yanghua commented on a diff in the pull request: https://github.com/apache/flink/pull/6239#discussion_r199527590 --- Diff: jepsen-flink/README.md --- @@ -0,0 +1,60 @@ +# jepsen.flink + +A Clojure project based on the [Jepsen](https://github.com/jepsen-io/jepsen) framework to find bugs in the +distributed coordination of Apache Flink®. + +## Test Coverage +Jepsen is a framework built to test the behavior of distributed systems +under faults. The tests in this particular project deploy Flink on either YARN or Mesos, submit a +job, and examine the availability of the job after injecting faults. +A job is said to be available if all the tasks of the job are running. +The faults that can be currently introduced to the Flink cluster include: +* Killing of TaskManager/JobManager processes +* Stopping HDFS NameNode +* Network partitions + +There are many more properties other than job availability that could be +verified but are not yet covered by this test suite, e.g., end-to-end exactly-once processing +semantics. + +## Usage +See the [Jepsen documentation](https://github.com/jepsen-io/jepsen#setting-up-a-jepsen-environment) +for how to set up the environment to run tests. The `scripts/run-tests.sh` documents how to invoke +tests. The Flink job used for testing is located under +`flink-end-to-end-tests/flink-datastream-allround-test`. You have to build the job first and copy +the resulting jar (`DataStreamAllroundTestProgram.jar`) to the `./bin` directory of this project's +root. + +To simplify development, we have prepared Dockerfiles and a Docker Compose template +so that you can run the tests locally in containers. 
To build the images +and start the containers, simply run: + +$ cd docker +$ ./up.sh + +After the containers started, open a new terminal window and run `docker exec -it jepsen-control bash`. +This will allow you to run arbitrary commands on the control node. +To start the tests, you can use the `run-tests.sh` script in the `docker` directory, +which expects the number of test iterations, and a URI to a Flink distribution, e.g., + +./docker/run-tests.sh 1 https://example.com/flink-dist.tgz + +The project's root is mounted as a volume to all containers under the path `/jepsen`. +This means that changes to the test sources are immediately reflected in the control node container. +Moreover, this allows you to test locally built Flink distributions by copying the tarball to the +project's root and passing a URI with the `file://` scheme to the `run-tests.sh` script, e.g., +`file:///jepsen/flink-dist.tgz`. + +### Checking the output of tests + +Consult the `jepsen.log` file for the particular test run in the `store` folder. The final output of every test will be either + +Everything looks good! ヽ('ー`)ノ --- End diff -- @GJL the end of this line is gibberish? > Cluster test: Run general purpose job with failures with Yarn session > - > > Key: FLINK-9004 > URL: https://issues.apache.org/jira/browse/FLINK-9004 > Project: Flink > Issue Type: Sub-task > Components: Tests >Affects Versions: 1.5.0 >Reporter: Till Rohrmann >Assignee: Gary Yao >Priority: Blocker > Labels: pull-request-available > Fix For: 1.6.0, 1.5.1 > > > Similar to FLINK-8973, we should run the general purpose job (FLINK-8971) on > a Yarn session cluster and simulate failures. > The job jar should be ill-packaged, meaning that we include too many > dependencies in the user jar. We should include the Scala library, Hadoop and > Flink itself to verify that there are no class loading issues. > The general purpose job should run with misbehavior activated. 
Additionally, > we should simulate at least the following failure scenarios: > * Kill Flink processes > * Kill connection to storage system for checkpoints and jobs > * Simulate network partition > We should run the test at least with the following state backend: RocksDB > incremental async and checkpointing to S3. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (FLINK-9004) Cluster test: Run general purpose job with failures with Yarn session
[ https://issues.apache.org/jira/browse/FLINK-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16529959#comment-16529959 ] ASF GitHub Bot commented on FLINK-9004: --- Github user GJL commented on a diff in the pull request: https://github.com/apache/flink/pull/6239#discussion_r199507669 --- Diff: jepsen-flink/docker/nodes --- @@ -0,0 +1,3 @@ +n1 --- End diff -- file must be excluded from RAT plugin > Cluster test: Run general purpose job with failures with Yarn session > - > > Key: FLINK-9004 > URL: https://issues.apache.org/jira/browse/FLINK-9004 > Project: Flink > Issue Type: Sub-task > Components: Tests >Affects Versions: 1.5.0 >Reporter: Till Rohrmann >Assignee: Gary Yao >Priority: Blocker > Labels: pull-request-available > Fix For: 1.6.0, 1.5.1 > > > Similar to FLINK-8973, we should run the general purpose job (FLINK-8971) on > a Yarn session cluster and simulate failures. > The job jar should be ill-packaged, meaning that we include too many > dependencies in the user jar. We should include the Scala library, Hadoop and > Flink itself to verify that there are no class loading issues. > The general purpose job should run with misbehavior activated. Additionally, > we should simulate at least the following failure scenarios: > * Kill Flink processes > * Kill connection to storage system for checkpoints and jobs > * Simulate network partition > We should run the test at least with the following state backend: RocksDB > incremental async and checkpointing to S3. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (FLINK-9004) Cluster test: Run general purpose job with failures with Yarn session
[ https://issues.apache.org/jira/browse/FLINK-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16529796#comment-16529796 ]

ASF GitHub Bot commented on FLINK-9004:
---------------------------------------

GitHub user GJL opened a pull request:

    https://github.com/apache/flink/pull/6239

    [FLINK-9004][tests] Implement Jepsen tests to test job availability.

    ## What is the purpose of the change

    *Use the Jepsen framework (https://github.com/jepsen-io/jepsen) to implement tests that verify Flink's HA capabilities under real-world faults, such as sudden TaskManager/JobManager termination, HDFS NameNode unavailability, network partitions, etc. The Flink cluster under test is automatically deployed on YARN (session & job mode) and Mesos. Provide Dockerfiles for local test development.*

    ## Brief change log

      - *Implement Jepsen tests.*

    ## Verifying this change

    This change added tests and can be verified as follows:

      - *The changes themselves are tests.*
      - *Run Jepsen tests in docker containers.*
      - *Run unit tests with `lein test`*

    ## Does this pull request potentially affect one of the following parts:

      - Dependencies (does it add or upgrade a dependency): (yes / **no** (at least not to Flink))
      - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: (yes / **no**)
      - The serializers: (yes / **no** / don't know)
      - The runtime per-record code paths (performance sensitive): (yes / **no** / don't know)
      - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (yes / **no** (but it will as soon as test failures appear) / don't know)
      - The S3 file system connector: (yes / **no** / don't know)

    ## Documentation

      - Does this pull request introduce a new feature? (yes / **no**)
      - If yes, how is the feature documented? (**not applicable** / docs / JavaDocs / not documented)

    cc: @tillrohrmann @cewood

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/GJL/flink FLINK-9004

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/6239.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #6239

commit 063e4621a5982b55ee7f7b0935290bbc717a5a45
Author: gyao
Date:   2018-03-05T21:23:33Z

    [FLINK-9004][tests] Implement Jepsen tests to test job availability.

    Use the Jepsen framework (https://github.com/jepsen-io/jepsen) to implement
    tests that verify Flink's HA capabilities under real-world faults, such as
    sudden TaskManager/JobManager termination, HDFS NameNode unavailability, network
    partitions, etc. The Flink cluster under test is automatically deployed on YARN
    (session & job mode) and Mesos.

    Provide Dockerfiles for local test development.

> Cluster test: Run general purpose job with failures with Yarn session
> ---------------------------------------------------------------------
>
>                 Key: FLINK-9004
>                 URL: https://issues.apache.org/jira/browse/FLINK-9004
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Tests
>    Affects Versions: 1.5.0
>            Reporter: Till Rohrmann
>            Assignee: Gary Yao
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 1.6.0, 1.5.1
>
>
> Similar to FLINK-8973, we should run the general purpose job (FLINK-8971) on
> a Yarn session cluster and simulate failures.
> The job jar should be ill-packaged, meaning that we include too many
> dependencies in the user jar. We should include the Scala library, Hadoop and
> Flink itself to verify that there are no class loading issues.
> The general purpose job should run with misbehavior activated. Additionally,
> we should simulate at least the following failure scenarios:
> * Kill Flink processes
> * Kill connection to storage system for checkpoints and jobs
> * Simulate network partition
> We should run the test at least with the following state backend: RocksDB
> incremental async and checkpointing to S3.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)