Hi Juho,

Thanks for the follow-up. Regarding the BlobServerConnection error, Nico (cc'ed) might have an idea.
Best,
Gary

On Thu, Mar 29, 2018 at 4:08 PM, Juho Autio <juho.au...@rovio.com> wrote:

> Sorry, my bad. I checked the persisted jobmanager logs and can see that
> the job was still being restarted at 15:31 and then at 15:36. If I hadn't
> terminated the cluster, I believe the Flink job / YARN app would
> eventually have exited as failed.
>
> On Thu, Mar 29, 2018 at 4:49 PM, Juho Autio <juho.au...@rovio.com> wrote:
>
>> Thanks again, Gary.
>>
>> It's true that I only let the job remain in the stuck state for
>> 10-15 minutes. Then I shut down the cluster.
>>
>> But: if the restart strategy is being applied, shouldn't I have seen
>> those messages in the job manager log? In my case it kept quiet from
>> ~2018-03-28 15:27 until I terminated it at ~2018-03-28 15:36.
>>
>> Do you happen to know what that BlobServerConnection error means in the
>> code? Whether it may lead to some unrecoverable state (where neither a
>> restart is attempted, nor is the job failed for good)?
>>
>> On Thu, Mar 29, 2018 at 4:39 PM, Gary Yao <g...@data-artisans.com> wrote:
>>
>>> Hi Juho,
>>>
>>> The log message
>>>
>>>     Could not allocate all requires slots within timeout of 300000 ms.
>>>     Slots required: 20, slots allocated: 8
>>>
>>> indicates that you do not have enough resources left in your cluster.
>>> Can you verify that after you started the job submission the YARN
>>> cluster does not reach its maximum capacity? You can also try
>>> submitting the job with a lower parallelism.
>>>
>>> I think the reason why the YARN application is not immediately shown as
>>> failed is that your restart strategy attempts to restart the job 3
>>> times. On every attempt the job is blocked on the slot allocation
>>> timeout for at least 300000 ms (5 minutes). I have tried submitting
>>> examples/streaming/WordCount.jar with the same restart strategy on EMR,
>>> and the CLI only returned after around 20 minutes.
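Gary's ~20-minute figure can be sanity-checked with quick arithmetic (a sketch based only on the settings quoted in this thread: 3 fixed-delay restart attempts with a 30 s delay, plus the initial attempt, each blocking on the 300000 ms slot allocation timeout):

```shell
attempts=3          # restart-strategy.fixed-delay.attempts
delay_s=30          # restart-strategy.fixed-delay.delay ("30 s")
slot_timeout_s=300  # slot allocation timeout (300000 ms)

# The initial attempt plus each of the 3 restarts blocks on the slot
# allocation timeout, with a fixed delay between restarts.
total_s=$(( (attempts + 1) * slot_timeout_s + attempts * delay_s ))
echo "${total_s} s"
```

This gives 1290 s (about 21.5 minutes), which lines up with the roughly 20 minutes Gary observed before the CLI returned.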
>>> As a side note, beginning from Flink 1.5 you do not need to specify -yn
>>> and -ys, because resource allocations are dynamic by default (FLIP-6).
>>> The parameter -yst is deprecated and should not be needed either.
>>>
>>> Best,
>>> Gary
>>>
>>> On Thu, Mar 29, 2018 at 8:59 AM, Juho Autio <juho.au...@rovio.com> wrote:
>>>
>>>> I built a new Flink distribution from the release-1.5 branch yesterday.
>>>>
>>>> The first time I tried to run a job with it, it ended up in a stalled
>>>> state: the job didn't manage to (re)start, but what makes it worse is
>>>> that it didn't exit as failed either.
>>>>
>>>> The next time I ran the same job (on a new EMR cluster, all from
>>>> scratch) it just worked normally.
>>>>
>>>> On the problematic run, the YARN job was started and the Flink UI was
>>>> being served, but the UI kept showing status CREATED for all sub-tasks
>>>> and nothing seemed to be happening.
>>>>
>>>> I found this in the job manager log first (could be unrelated):
>>>>
>>>> 2018-03-28 15:26:17,449 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph
>>>>   - Job UniqueIdStream (43ed4ace55974d3c486452a45ee5db93) switched from state RUNNING to FAILING.
>>>> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException:
>>>> Could not allocate all requires slots within timeout of 300000 ms. Slots required: 20, slots allocated: 8
>>>>   at org.apache.flink.runtime.executiongraph.ExecutionGraph.lambda$scheduleEager$36(ExecutionGraph.java:984)
>>>>   at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
>>>>   at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852)
>>>>   at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>>>>   at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
>>>>   at org.apache.flink.runtime.concurrent.FutureUtils$ResultConjunctFuture.handleCompletedFuture(FutureUtils.java:551)
>>>>   at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
>>>>   at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
>>>>   at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>>>>   at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
>>>>   at org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:789)
>>>>   at akka.dispatch.OnComplete.internal(Future.scala:258)
>>>>   at akka.dispatch.OnComplete.internal(Future.scala:256)
>>>>   at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186)
>>>>   at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183)
>>>>   at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
>>>>   at org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:83)
>>>>   at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
>>>>   at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
>>>>   at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:603)
>>>>   at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)
>>>>   at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
>>>>   at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
>>>>   at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
>>>>   at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)
>>>>   at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)
>>>>   at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)
>>>>   at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)
>>>>   at java.lang.Thread.run(Thread.java:748)
>>>>
>>>> After this there was:
>>>>
>>>> 2018-03-28 15:26:17,521 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph
>>>>   - Restarting the job UniqueIdStream (43ed4ace55974d3c486452a45ee5db93).
>>>>
>>>> And some time after that:
>>>>
>>>> 2018-03-28 15:27:39,125 ERROR org.apache.flink.runtime.blob.BlobServerConnection
>>>>   - GET operation failed
>>>> java.io.EOFException: Premature end of GET request
>>>>   at org.apache.flink.runtime.blob.BlobServerConnection.get(BlobServerConnection.java:275)
>>>>   at org.apache.flink.runtime.blob.BlobServerConnection.run(BlobServerConnection.java:117)
>>>>
>>>> Task manager logs didn't have any errors.
>>>>
>>>> Is that error about BlobServerConnection severe enough to make the job
>>>> get stuck like this? It seems like a Flink bug, at least in that the
>>>> job just gets stuck and neither restarts nor makes the YARN app exit
>>>> as failed?
>>>> My launch command is basically:
>>>>
>>>> flink-${FLINK_VERSION}/bin/flink run -m yarn-cluster -yn ${NODE_COUNT} \
>>>>     -ys ${SLOT_COUNT} -yjm ${JOB_MANAGER_MEMORY} -ytm ${TASK_MANAGER_MEMORY} \
>>>>     -yst -yD restart-strategy=fixed-delay \
>>>>     -yD restart-strategy.fixed-delay.attempts=3 \
>>>>     -yD "restart-strategy.fixed-delay.delay=30 s" -p ${PARALLELISM} $@
>>>>
>>>> I'm also setting this to fix a classloading error (with the previous
>>>> build that still works):
>>>>
>>>> -yD classloader.resolve-order=parent-first
>>>>
>>>> The cluster was AWS EMR, release 5.12.0.
>>>>
>>>> Thanks.
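Following Gary's FLIP-6 note earlier in the thread, a trimmed version of that launch command for Flink 1.5 might look like the sketch below. This is a hypothetical command, not verified against a cluster; it simply drops the deprecated -yn/-ys/-yst flags and keeps the poster's placeholder variables:

```shell
# Hypothetical Flink 1.5 launch command without -yn/-ys/-yst, which are
# unnecessary or deprecated under FLIP-6 dynamic resource allocation.
# All ${...} variables are placeholders from the original command.
flink-${FLINK_VERSION}/bin/flink run -m yarn-cluster \
    -yjm ${JOB_MANAGER_MEMORY} -ytm ${TASK_MANAGER_MEMORY} \
    -yD restart-strategy=fixed-delay \
    -yD restart-strategy.fixed-delay.attempts=3 \
    -yD "restart-strategy.fixed-delay.delay=30 s" \
    -yD classloader.resolve-order=parent-first \
    -p ${PARALLELISM} "$@"
```

With dynamic resource allocation, task managers are requested from YARN as slots are needed, so the parallelism flag effectively drives how many containers get allocated.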