Hi everyone,

we are running Flink 1.2.1 on YARN 2.4 (I know, way too old :(). Coinciding with our last Flink upgrade from 1.1.3 -> 1.2.1, we are experiencing regular TaskManager failures due to:
[Taskmanager Logs]
2017-07-10 15:25:26,448 ERROR Remoting - Association to [akka.tcp://flink@<jobmanager>:45303] with UID [-382428140] irrecoverably failed. Quarantining address.
java.lang.IllegalStateException: Error encountered while processing system message acknowledgement buffer: [1 {0, 1}] ack: ACK[3, {}]
    at akka.remote.ReliableDeliverySupervisor$$anonfun$receive$1.applyOrElse(Endpoint.scala:289)
    at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
    at ...

As far as I understand https://issues.apache.org/jira/browse/FLINK-3345, the TaskManager should be restarted in this case. In our case, however, YARN does not start a new TaskManager container; the container just stays missing indefinitely.

Is it known that this does not work on YARN 2.4?

If it helps, I can also provide the full job and TaskManager logs...

Cheers & Thanks,

Konstantin

--
Konstantin Knauf * konstantin.kn...@tngtech.com * +49-174-3413182
TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterföhring
Geschäftsführer: Henrik Klagges, Dr. Robert Dahlke, Gerhard Müller
Sitz: Unterföhring * Amtsgericht München * HRB 135082