[jira] [Commented] (FLINK-25749) YARNSessionFIFOSecuredITCase.testDetachedMode fails on AZP

2022-01-21 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-25749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17480219#comment-17480219
 ] 

Till Rohrmann commented on FLINK-25749:
---

The problem is indeed caused by this commit: 
https://github.com/apache/flink/commit/dd6069fabf8a7ff65fbd9ff8dd7b0c47f492288f#diff-5ff30e09fc23978573250c9d95969a549be12648c085bd581696cf0b84da3a0b
Due to the shutdown hook it introduces, it can happen that the TM 
deregisters from the RM, which queues up an operation in the 
{{NMClientAsync}}. If the RM then stops and closes the {{NMClientAsync}}, this 
can lead to exceptions being logged. Luckily, 
https://github.com/apache/flink/pull/18169 will solve this problem properly 
([~dmvk], correct me if I have said anything incorrect).
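
To illustrate the kind of race described above, here is a minimal sketch in plain Java. It is a toy model under stated assumptions, not Hadoop or Flink code: {{ToyAsyncClient}}, its worker thread, and its in-memory log are hypothetical stand-ins for an async client with a callback thread. An operation is queued, the client is closed right afterwards, and the interrupted worker logs an {{InterruptedException}}, which is enough to trip a check that prohibits the string "Exception" in log files.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.LinkedBlockingQueue;

class ShutdownRaceSketch {

    // Hypothetical stand-in for an async client with a callback thread (not Hadoop code).
    static final class ToyAsyncClient {
        private final LinkedBlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
        final List<String> log = new CopyOnWriteArrayList<>();
        private final Thread worker = new Thread(this::drain, "toy-callback-handler");

        void start() {
            worker.start();
        }

        void submit(Runnable operation) {
            queue.add(operation);
        }

        private void drain() {
            try {
                while (true) {
                    // Blocks until an operation arrives or the thread is interrupted.
                    queue.take().run();
                }
            } catch (InterruptedException e) {
                // Logging the exception writes the word "Exception" into the log.
                log.add("Interrupted while waiting for queue: " + e);
            }
        }

        void close() {
            worker.interrupt();
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    // Queues one late operation, then closes the client immediately.
    static List<String> run() {
        ToyAsyncClient client = new ToyAsyncClient();
        client.start();
        client.submit(() -> { });
        client.close();
        return client.log;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

In this sketch the log always ends up with exactly one entry mentioning {{java.lang.InterruptedException}}, regardless of whether the worker is interrupted before, during, or after running the queued operation. The real interaction between the TM, the RM, and {{NMClientAsync}} may differ in its details.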

> YARNSessionFIFOSecuredITCase.testDetachedMode fails on AZP
> --
>
> Key: FLINK-25749
> URL: https://issues.apache.org/jira/browse/FLINK-25749
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / YARN
>Affects Versions: 1.15.0
>Reporter: Till Rohrmann
>Assignee: Till Rohrmann
>Priority: Critical
>  Labels: test-stability
>
> The test {{YARNSessionFIFOSecuredITCase.testDetachedMode}} fails on AZP:
> {code}
> 2022-01-21T03:28:18.3712993Z Jan 21 03:28:18 java.lang.AssertionError: 
> 2022-01-21T03:28:18.3715115Z Jan 21 03:28:18 Found a file 
> /__w/2/s/flink-yarn-tests/target/flink-yarn-tests-fifo-secured/flink-yarn-tests-fifo-secured-logDir-nm-0_0/application_1642735639007_0002/container_1642735639007_0002_01_01/jobmanager.log
>  with a prohibited string (one of [Exception, Started 
> SelectChannelConnector@0.0.0.0:8081]). Excerpts:
> 2022-01-21T03:28:18.3716389Z Jan 21 03:28:18 [
> 2022-01-21T03:28:18.3717531Z Jan 21 03:28:18 2022-01-21 03:27:56,921 INFO  
> org.apache.flink.runtime.resourcemanager.ResourceManagerServiceImpl [] - 
> Resource manager service is not running. Ignore revoking leadership.
> 2022-01-21T03:28:18.3720496Z Jan 21 03:28:18 2022-01-21 03:27:56,922 INFO  
> org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Stopped 
> dispatcher akka.tcp://flink@11c5f741db81:37697/user/rpc/dispatcher_0.
> 2022-01-21T03:28:18.3722401Z Jan 21 03:28:18 2022-01-21 03:27:56,922 INFO  
> org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl [] - 
> Interrupted while waiting for queue
> 2022-01-21T03:28:18.3723661Z Jan 21 03:28:18 java.lang.InterruptedException: 
> null
> 2022-01-21T03:28:18.3724529Z Jan 21 03:28:18  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
>  ~[?:1.8.0_292]
> 2022-01-21T03:28:18.3725450Z Jan 21 03:28:18  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2048)
>  ~[?:1.8.0_292]
> 2022-01-21T03:28:18.3726239Z Jan 21 03:28:18  at 
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442) 
> ~[?:1.8.0_292]
> 2022-01-21T03:28:18.3727618Z Jan 21 03:28:18  at 
> org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$CallbackHandlerThread.run(AMRMClientAsyncImpl.java:323)
>  [hadoop-yarn-client-2.8.5.jar:?]
> 2022-01-21T03:28:18.3729147Z Jan 21 03:28:18 2022-01-21 03:27:56,927 WARN  
> org.apache.hadoop.ipc.Client [] - Failed to 
> connect to server: 11c5f741db81/172.25.0.2:39121: retries get failed due to 
> exceeded maximum allowed retries number: 0
> 2022-01-21T03:28:18.3730293Z Jan 21 03:28:18 
> java.nio.channels.ClosedByInterruptException: null
> 2022-01-21T03:28:18.3730834Z Jan 21 03:28:18 
> java.nio.channels.ClosedByInterruptException: null
> 2022-01-21T03:28:18.3731499Z Jan 21 03:28:18  at 
> java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
>  ~[?:1.8.0_292]
> 2022-01-21T03:28:18.3732203Z Jan 21 03:28:18  at 
> sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:658) 
> ~[?:1.8.0_292]
> 2022-01-21T03:28:18.3733478Z Jan 21 03:28:18  at 
> org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:192)
>  ~[hadoop-common-2.8.5.jar:?]
> 2022-01-21T03:28:18.3734470Z Jan 21 03:28:18  at 
> org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531) 
> ~[hadoop-common-2.8.5.jar:?]
> 2022-01-21T03:28:18.3735432Z Jan 21 03:28:18  at 
> org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:685) 
> [hadoop-common-2.8.5.jar:?]
> 2022-01-21T03:28:18.3736414Z Jan 21 03:28:18  at 
> org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:788) 
> [hadoop-common-2.8.5.jar:?]
> 2022-01-21T03:28:18.3737734Z Jan 21 03:28:18  at 
> org.apache.hadoop.ipc.Client$Connection.access$3500(Client.java:410) 
> [hadoop-common-2.8.5.jar:?]
> 2022-01-21T03:28:18.3738853Z Jan 21 03:2

[jira] [Commented] (FLINK-25749) YARNSessionFIFOSecuredITCase.testDetachedMode fails on AZP

2022-01-21 Thread Jira


[ 
https://issues.apache.org/jira/browse/FLINK-25749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17480214#comment-17480214
 ] 

David Morávek commented on FLINK-25749:
---

Merging https://github.com/apache/flink/pull/18446 once the CI passes should 
fix the issue.


[jira] [Commented] (FLINK-25749) YARNSessionFIFOSecuredITCase.testDetachedMode fails on AZP

2022-01-21 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-25749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17480212#comment-17480212
 ] 

Till Rohrmann commented on FLINK-25749:
---

David raised the suspicion that the instability could be caused by this commit: 
https://github.com/apache/flink/commit/dd6069fabf8a7ff65fbd9ff8dd7b0c47f492288f#diff-5ff30e09fc23978573250c9d95969a549be12648c085bd581696cf0b84da3a0b
Let me quickly double-check whether I can reproduce it. Since I am responsible 
for this change, let me first try to clean up my mess.


[jira] [Commented] (FLINK-25749) YARNSessionFIFOSecuredITCase.testDetachedMode fails on AZP

2022-01-21 Thread Junfan Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-25749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17480118#comment-17480118
 ] 

Junfan Zhang commented on FLINK-25749:
--

All of the failures above look like lost connections to the mini YARN cluster, 
possibly due to an unstable network.

Maybe we could tune the IPC client configuration to address this, for example:
{code:java}
// Proposed IPC client settings for the test's YARN cluster configuration
// (idle time, retry interval, and connect timeout are in milliseconds).
yarnClusterConf.setInt("ipc.client.connection.maxidletime", 1000);
yarnClusterConf.setInt("ipc.client.connect.max.retries", 3);
yarnClusterConf.setInt("ipc.client.connect.retry.interval", 10);
yarnClusterConf.setInt("ipc.client.connect.timeout", 1000);
yarnClusterConf.setInt("ipc.client.connect.max.retries.on.timeouts", 3);
{code}
 

[~trohrmann] What do you think? I could take over this ticket to improve test 
stability.
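
As a rough illustration of what a retry limit and a retry interval change, here is a minimal sketch of a bounded retry loop in plain Java. It is not Hadoop's IPC client implementation: {{connectWithRetries}}, {{demo}}, and the simulated flaky endpoint are hypothetical, and mapping {{maxRetries}} and {{retryIntervalMs}} to {{ipc.client.connect.max.retries}} and {{ipc.client.connect.retry.interval}} is an assumption made only for this example. The log excerpt in the issue description reports "exceeded maximum allowed retries number: 0", which corresponds to the {{maxRetries = 0}} case below.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

class BoundedRetrySketch {

    // Tries once, then retries up to maxRetries more times, sleeping between attempts.
    // Returns true as soon as one attempt succeeds, false if all attempts fail.
    static boolean connectWithRetries(BooleanSupplier attempt, int maxRetries, long retryIntervalMs) {
        for (int i = 0; i <= maxRetries; i++) {
            if (attempt.getAsBoolean()) {
                return true;
            }
            if (i < maxRetries) {
                try {
                    Thread.sleep(retryIntervalMs);
                } catch (InterruptedException e) {
                    // Give up early and preserve the interrupt status for the caller.
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
        }
        return false;
    }

    // Simulated endpoint (hypothetical): the first two attempts fail, the third succeeds.
    static boolean demo(int maxRetries) {
        AtomicInteger calls = new AtomicInteger();
        return connectWithRetries(() -> calls.incrementAndGet() > 2, maxRetries, 10);
    }

    public static void main(String[] args) {
        System.out.println("maxRetries=0 connected: " + demo(0));
        System.out.println("maxRetries=3 connected: " + demo(3));
    }
}
```

With no retries the simulated connection fails on the first attempt, while three retries are enough to get past the two simulated failures. Note that retries of this kind only help with transient connection problems. They would not help if the connection attempt is interrupted because the process is shutting down.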


[jira] [Commented] (FLINK-25749) YARNSessionFIFOSecuredITCase.testDetachedMode fails on AZP

2022-01-21 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-25749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17480057#comment-17480057
 ] 

Till Rohrmann commented on FLINK-25749:
---

Here {{YARNSessionFIFOITCase.testDetachedMode}} failed, but it is probably 
for the same reason.

https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=29871&view=logs&j=fc5181b0-e452-5c8f-68de-1097947f6483&t=995c650b-6573-581c-9ce6-7ad4cc038461&l=30735


[jira] [Commented] (FLINK-25749) YARNSessionFIFOSecuredITCase.testDetachedMode fails on AZP

2022-01-21 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-25749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17480056#comment-17480056
 ] 

Till Rohrmann commented on FLINK-25749:
---

Another instance: 
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=29867&view=logs&j=fc5181b0-e452-5c8f-68de-1097947f6483&t=995c650b-6573-581c-9ce6-7ad4cc038461&l=30719


[jira] [Commented] (FLINK-25749) YARNSessionFIFOSecuredITCase.testDetachedMode fails on AZP

2022-01-21 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-25749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17479930#comment-17479930
 ] 

Till Rohrmann commented on FLINK-25749:
---

Another instance: 
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=29841&view=logs&j=245e1f2e-ba5b-5570-d689-25ae21e5302f&t=d04c9862-880c-52f5-574b-a7a79fef8e0f
