Hi Nico,

thanks for the quick response! No, this was not enabled :(

Since we are in the process of upgrading to 1.3.1: I could only find this option in 1.2, not in 1.3. Is this the default behaviour in 1.3, or is this configuration just not documented?
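For reference, a minimal sketch of how the option Nico mentions would be set in `flink-conf.yaml`. The key name is taken from his mail, where it is described for 1.2; whether 1.3.x still honours it is exactly the open question, so treat that part as an assumption.

```yaml
# flink-conf.yaml
# Assumption: key name as quoted in this thread (described for Flink 1.2.x).
# The quarantine monitor is said to be disabled by default; setting this to
# true should make a quarantined TaskManager shut itself down.
taskmanager.exit-on-fatal-akka-error: true
```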
Cheers,

Konstantin

On 03.08.2017 17:11, Nico Kruber wrote:
> Hi Konstantin,
> I dug through the linked pull requests (of
> https://issues.apache.org/jira/browse/FLINK-3347) a bit just to notice that
> the fix-version tag was wrong (should have been 1.2.1, not 1.2.0) but you
> have that already.
>
> In there, it was also mentioned that the quarantine monitor is disabled by
> default and can be enabled by setting `taskmanager.exit-on-fatal-akka-error`
> to true. If enabled, it should detect a quarantined task manager and shut it
> down. In that case, YARN should notice it and start a new one, if I'm not
> mistaken.
>
> Are you already working with `taskmanager.exit-on-fatal-akka-error` enabled?
>
>
> Nico
>
> On Thursday, 3 August 2017 10:53:00 CEST Konstantin Knauf wrote:
>> Hi everyone,
>>
>> we are running Flink 1.2.1 on YARN 2.4 (I know, way too old :().
>> Correlated with the last Flink Upgrade from 1.1.3 -> 1.2.1 we are
>> experiencing regular TaskManager failures due to
>>
>> [Taskmanager Logs]
>> 2017-07-10 15:25:26,448 ERROR Remoting
>> - Association to
>> [akka.tcp://flink@<jobmanager>:45303] with UID [-382428140]
>> irrecoverably failed. Quarantining address.
>> java.lang.IllegalStateException: Error encountered while processing
>> system message acknowledgement buffer: [1 {0, 1}] ack: ACK[3, {}]
>> at akka.remote.ReliableDeliverySupervisor$$anonfun$receive$1.applyOrElse(Endpoint.scala:289)
>> at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
>> at ...
>>
>> As far as I understand https://issues.apache.org/jira/browse/FLINK-3345
>> the taskmanager should be restarted in this case. In our case YARN does
>> not start a new taskmanager container, but the container is just missing
>> indefinitely. Is it known that this does not work on YARN 2.4?
>>
>> If it helps, I can also provide the full job and taskmanager logs...
>>
>> Cheers & Thanks,
>>
>> Konstantin
>

--
Konstantin Knauf * konstantin.kn...@tngtech.com * +49-174-3413182
TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterföhring
Managing Directors: Henrik Klagges, Dr. Robert Dahlke, Gerhard Müller
Registered office: Unterföhring * Amtsgericht München * HRB 135082