Hi everyone,

we are running Flink 1.2.1 on YARN 2.4 (I know, way too old :().
Since our last Flink upgrade from 1.1.3 to 1.2.1, we have been
experiencing regular TaskManager failures with the following error:

[TaskManager logs]
2017-07-10 15:25:26,448 ERROR Remoting - Association to [akka.tcp://flink@<jobmanager>:45303] with UID [-382428140] irrecoverably failed. Quarantining address.
java.lang.IllegalStateException: Error encountered while processing system message acknowledgement buffer: [1 {0, 1}] ack: ACK[3, {}]
        at akka.remote.ReliableDeliverySupervisor$$anonfun$receive$1.applyOrElse(Endpoint.scala:289)
        at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
        at ...

As far as I understand https://issues.apache.org/jira/browse/FLINK-3345,
the TaskManager should be restarted in this case. In our setup, however,
YARN does not start a new TaskManager container; the container simply
stays missing indefinitely. Is it a known issue that this does not work
on YARN 2.4?

If it helps, I can also provide the full job and taskmanager logs...

Cheers & Thanks,

Konstantin

-- 
Konstantin Knauf * konstantin.kn...@tngtech.com * +49-174-3413182
TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterföhring
Geschäftsführer: Henrik Klagges, Dr. Robert Dahlke, Gerhard Müller
Sitz: Unterföhring * Amtsgericht München * HRB 135082
