[jira] [Commented] (FLINK-14048) Flink client hangs after trying to kill Yarn Job during deployment
[ https://issues.apache.org/jira/browse/FLINK-14048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16957691#comment-16957691 ]

Zili Chen commented on FLINK-14048:
-----------------------------------

Thanks for your update [~gyfora]. I think you're right. I found an earlier report, FLINK-10435, and closed this one as a duplicate. FLINK-10435 has a detailed message.

> Flink client hangs after trying to kill Yarn Job during deployment
> ------------------------------------------------------------------
>
> Key: FLINK-14048
> URL: https://issues.apache.org/jira/browse/FLINK-14048
> Project: Flink
> Issue Type: Improvement
> Components: Client / Job Submission, Deployment / YARN
> Reporter: Gyula Fora
> Priority: Major
> Attachments: patch.diff
>
>
> If we kill the flink client run command from the terminal while deploying to
> YARN (let's say we realize we used the wrong parameters), the YARN
> application will be killed immediately but the client won't shut down.
> We get the following messages over and over:
> 19/09/10 23:35:55 INFO retry.RetryInvocationHandler: java.io.IOException: The
> client is stopped, while invoking
> ApplicationClientProtocolPBClientImpl.forceKillApplication over null after 14
> failover attempts. Trying to failover after sleeping for 16296ms.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Commented] (FLINK-14048) Flink client hangs after trying to kill Yarn Job during deployment
[ https://issues.apache.org/jira/browse/FLINK-14048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16957670#comment-16957670 ]

Gyula Fora commented on FLINK-14048:
------------------------------------

[~tison] I think your patch doesn't fix the problem. It looks like a bug in the AbstractYarnClusterDescriptor when it tries to kill the already failed app.

19/10/23 01:50:50 INFO yarn.AbstractYarnClusterDescriptor: Cancelling deployment from Deployment Failure Hook
19/10/23 01:50:50 INFO yarn.AbstractYarnClusterDescriptor: Killing YARN application
19/10/23 01:50:50 INFO retry.RetryInvocationHandler: java.io.IOException: The client is stopped, while invoking ApplicationClientProtocolPBClientImpl.forceKillApplication over null. Trying to failover immediately.
19/10/23 01:50:50 INFO retry.RetryInvocationHandler: java.io.IOException: The client is stopped, while invoking ApplicationClientProtocolPBClientImpl.forceKillApplication over null after 1 failover attempts. Trying to failover after sleeping for 40495ms.
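The log excerpt in the comment above shows a kill request being retried against a client that has already been stopped. As a rough illustration only, the sketch below guards a failure hook so that it never retries against a stopped client and gives up after a bounded number of attempts. Everything in it is hypothetical: DemoYarnClient, killFromFailureHook and maxAttempts are stand-ins invented for this example, not Hadoop's or Flink's API, and this is not the fix applied in Flink.

```java
import java.io.IOException;

public class GuardedKillSketch {

    /** Hypothetical stand-in for a YARN client; NOT Hadoop's YarnClient. */
    static class DemoYarnClient {
        private boolean stopped;
        int killAttempts;

        void stop() {
            stopped = true;
        }

        boolean isStopped() {
            return stopped;
        }

        /** Mimics the failure in the log: a stopped client rejects the call. */
        void killApplication() throws IOException {
            killAttempts++;
            if (stopped) {
                throw new IOException("The client is stopped");
            }
        }
    }

    /**
     * Hypothetical failure hook: returns true if the kill request went through,
     * false if the client was stopped or all attempts failed.
     */
    static boolean killFromFailureHook(DemoYarnClient client, int maxAttempts) {
        if (client.isStopped()) {
            // Retrying against a stopped client can never succeed.
            return false;
        }
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                client.killApplication();
                return true;
            } catch (IOException e) {
                if (client.isStopped()) {
                    // The client was stopped while we were retrying; give up.
                    return false;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        DemoYarnClient running = new DemoYarnClient();
        System.out.println("running client killed: " + killFromFailureHook(running, 3));

        DemoYarnClient stopped = new DemoYarnClient();
        stopped.stop();
        System.out.println("stopped client killed: " + killFromFailureHook(stopped, 3));
    }
}
```

The point of the guard is that the hook returns promptly instead of sleeping between failover attempts that cannot succeed.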
[jira] [Commented] (FLINK-14048) Flink client hangs after trying to kill Yarn Job during deployment
[ https://issues.apache.org/jira/browse/FLINK-14048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16927372#comment-16927372 ]

TisonKun commented on FLINK-14048:
----------------------------------

[~gyfora] this also looks like a duplicate of FLINK-13895. Could you please check whether the two issues share the same root cause?
[jira] [Commented] (FLINK-14048) Flink client hangs after trying to kill Yarn Job during deployment
[ https://issues.apache.org/jira/browse/FLINK-14048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16927350#comment-16927350 ]

TisonKun commented on FLINK-14048:
----------------------------------

I tried to refactor the code for proper exception handling. Could you apply the attached patch and see whether it addresses the issue?
[jira] [Commented] (FLINK-14048) Flink client hangs after trying to kill Yarn Job during deployment
[ https://issues.apache.org/jira/browse/FLINK-14048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16927313#comment-16927313 ]

Gyula Fora commented on FLINK-14048:
------------------------------------

Yes, it was in per-job mode.
[jira] [Commented] (FLINK-14048) Flink client hangs after trying to kill Yarn Job during deployment
[ https://issues.apache.org/jira/browse/FLINK-14048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16927308#comment-16927308 ]

TisonKun commented on FLINK-14048:
----------------------------------

[~gyfora] did you notice this problem when deploying a per-job cluster? I found the relevant code snippet in {{CliFrontend#runProgram}}, and it seems that when an exception is thrown (in this case, one caused by a signal) we don't close the {{ClusterClient}} properly. But that should only happen in per-job mode.
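To illustrate the pattern the comment above describes, here is a minimal sketch of closing a client even when submission throws, using Java's try-with-resources. It does not reproduce the code in CliFrontend#runProgram: DemoClusterClient, submit and runProgram are hypothetical stand-ins written for this example, and the exception type is an assumption standing in for whatever the interrupted deployment raises.

```java
import java.util.ArrayList;
import java.util.List;

public class CloseOnFailureSketch {

    /** Hypothetical stand-in for a cluster client; NOT Flink's ClusterClient. */
    static class DemoClusterClient implements AutoCloseable {
        private final List<String> events;

        DemoClusterClient(List<String> events) {
            this.events = events;
            events.add("open");
        }

        /** Simulates a submission that may be interrupted mid-deployment. */
        void submit(boolean fail) {
            if (fail) {
                throw new IllegalStateException("interrupted during deployment");
            }
            events.add("submitted");
        }

        @Override
        public void close() {
            events.add("closed");
        }
    }

    /** Runs a submission and records, in order, what happened to the client. */
    static List<String> runProgram(boolean failDuringSubmit) {
        List<String> events = new ArrayList<>();
        // try-with-resources closes the client on both the normal and the
        // exceptional path, and it does so before the catch block runs.
        try (DemoClusterClient client = new DemoClusterClient(events)) {
            client.submit(failDuringSubmit);
        } catch (IllegalStateException e) {
            events.add("failed: " + e.getMessage());
        }
        return events;
    }

    public static void main(String[] args) {
        System.out.println(runProgram(false));
        System.out.println(runProgram(true));
    }
}
```

In the failing case the recorded order is open, closed, failed, which shows the client is released before the error is handled.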