[jira] [Commented] (FLINK-14968) Kerberized YARN on Docker test (custom fs plugin) fails on Travis
[ https://issues.apache.org/jira/browse/FLINK-14968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16984565#comment-16984565 ] Aljoscha Krettek commented on FLINK-14968:
--
Something very strange is going on. When I try this on a (dockerized) YARN cluster, the job sometimes needs 3 slots to run and sometimes needs 4 slots. I run this job:
{code}
bin/flink run -m yarn-cluster -p 3 -yjm 2000 -ytm 2000 examples/streaming/WordCount.jar --input hdfs:///wc-in-1 --input hdfs:///wc-in-2 --output hdfs:///wc-out
{code}
The attached logs show the (DEBUG) jobmanager.log of two different runs.

> Kerberized YARN on Docker test (custom fs plugin) fails on Travis
> -----------------------------------------------------------------
>
> Key: FLINK-14968
> URL: https://issues.apache.org/jira/browse/FLINK-14968
> Project: Flink
> Issue Type: Bug
> Components: FileSystems, Tests
> Affects Versions: 1.10.0
> Reporter: Gary Yao
> Priority: Blocker
> Labels: test-stability
> Fix For: 1.10.0
>
> Attachments: run-with-3-slots.txt, run-with-4-slots.txt
>
>
> This change made the test flaky:
> https://github.com/apache/flink/commit/749965348170e4608ff2a23c9617f67b8c341df5.
> It changes the job to have two sources instead of one which, under normal
> circumstances, requires too many slots to run and therefore the job will fail.
> The setup of this test is very intricate: we configure YARN to have two
> NodeManagers with 2500mb memory each:
> https://github.com/apache/flink/blob/413a77157caf25dbbfb8b0caaf2c9e12c7374d98/flink-end-to-end-tests/test-scripts/docker-hadoop-secure-cluster/config/yarn-site.xml#L39.
> We run the job with parallelism 3 and configure Flink to use 1000mb as
> TaskManager memory and 1000mb of JobManager memory. This means that the job
> fits into the YARN memory budget but more TaskManagers would not fit. We also
> don't simply increase the YARN resources, because we want the Flink job to use
> TMs on different NMs: we had a bug where Kerberos config file shipping
> was not working correctly, but the bug was not materialising if all TMs were
> on the same NM.
> https://api.travis-ci.org/v3/job/612782888/log.txt

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
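The memory budget in the description can be sketched as a toy first-fit packing check. This is illustrative only, not YARN's actual scheduler: real YARN rounds every request up to yarn.scheduler.minimum-allocation-mb, so the exact headroom differs in practice.

```python
# Toy first-fit check of the test's memory budget: two NodeManagers with
# 2500mb each, a 1000mb JobManager container and 1000mb TaskManager
# containers (illustrative sizes taken from the issue description).
def fits(requests_mb, nm_capacity_mb, num_nms):
    """Place container requests on NodeManagers first-fit; True if all fit."""
    free = [nm_capacity_mb] * num_nms
    for req in sorted(requests_mb, reverse=True):
        for i, headroom in enumerate(free):
            if req <= headroom:
                free[i] -= req
                break
        else:
            return False  # no NodeManager has room for this request
    return True

# 1 JobManager + 3 TaskManagers (parallelism 3) fits...
print(fits([1000] * 4, 2500, 2))  # True
# ...but a 4th TaskManager, as in the flaky 4-slot runs, does not:
# each NM has only 500mb headroom left.
print(fits([1000] * 5, 2500, 2))  # False
```

This is why the 4-slot runs fail: the extra TaskManager request can never be satisfied within the 2x2500mb cluster.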
[ https://issues.apache.org/jira/browse/FLINK-14968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16984335#comment-16984335 ] Aljoscha Krettek commented on FLINK-14968:
--
It's very strange: I'm running this on my machine against a standalone 3-TM cluster and it hasn't failed so far. It is reproducible on my local machine using the yarn/kerberos docker test, though. But that's hard to debug.
[ https://issues.apache.org/jira/browse/FLINK-14968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16984225#comment-16984225 ] Gary Yao commented on FLINK-14968:
--
Does this also happen when running the WordCount job from the IDE, or at least in standalone mode? It would be good to be able to set a breakpoint for debugging.
[ https://issues.apache.org/jira/browse/FLINK-14968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16983776#comment-16983776 ] Aljoscha Krettek commented on FLINK-14968:
--
[~fly_in_gis] You might be right. I'll keep investigating this myself, but it would be good if more people could look at it. Maybe the reason is the same as FLINK-14834 in the end.
[ https://issues.apache.org/jira/browse/FLINK-14968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16983349#comment-16983349 ] Yang Wang commented on FLINK-14968:
--
[~gjy] [~pnowojski] [~aljoscha] I think the Flink job should not fail with "not enough slots". We start 3 TaskManagers with 1 slot each, so each slot can run a complete pipeline. I have tested on a real YARN cluster, and it always works as expected: only 3 TaskManagers are started and the job finishes successfully.
{code:java}
./bin/flink run -d -p 3 -m yarn-cluster examples/streaming/WordCount.jar --input dummy://localhost/words --input anotherDummy://localhost/words
{code}
I have gone over the logs and found that the `SlotPoolImpl` allocates 4 slots. It should be only 3. There may be a bug in the Scheduler internally.
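Yang Wang's expectation follows from Flink's default slot sharing: subtasks of different operators can share a single slot, so a job needs as many slots as its maximum operator parallelism, not the sum over operators. A minimal sketch of that arithmetic (the operator names are illustrative, not the actual WordCount plan):

```python
# With default slot sharing, one slot hosts one parallel subtask of every
# operator in the same sharing group, so required slots = max parallelism
# across the group -- adding a second source does not add a slot.
def required_slots(parallelism_by_operator):
    return max(parallelism_by_operator.values())

# WordCount with two sources, everything at parallelism 3:
plan = {"source-1": 3, "source-2": 3, "flatMap": 3, "sum": 3, "sink": 3}
print(required_slots(plan))  # 3 -- not 4, even with the second source
```

Under this model the observed 4th slot allocation in `SlotPoolImpl` is the anomaly to explain, which matches Yang Wang's suspicion of a scheduler-internal bug.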
[ https://issues.apache.org/jira/browse/FLINK-14968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16983286#comment-16983286 ] Piotr Nowojski commented on FLINK-14968:
--
[~aljoscha] I would expect the 1.9 branch to be affected by this as well. [~gjy], why does adding another source exhaust the available slots? Shouldn't the slots be shared? Also, do you know why it happens only "under normal circumstances" and not always? [~fly_in_gis] FYI
[ https://issues.apache.org/jira/browse/FLINK-14968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16983279#comment-16983279 ] Aljoscha Krettek commented on FLINK-14968:
--
I disabled the tests for now to get a better CI signal: https://github.com/apache/flink/commit/5374cfec2231fb14a70bd786424a18daee383231