Hi Juho,

Thanks for the follow-up. Regarding the BlobServerConnection error, Nico (cc'ed) might have an idea.
Best,
Gary

On Thu, Mar 29, 2018 at 4:08 PM, Juho Autio <juho.au...@rovio.com> wrote:

> Sorry, my bad. I checked the persisted jobmanager logs and can see that
> the job was still being restarted at 15:31 and then at 15:36. If I hadn't
> terminated the cluster, I believe the Flink job / YARN app would
> eventually have exited as failed.
>
> On Thu, Mar 29, 2018 at 4:49 PM, Juho Autio <juho.au...@rovio.com> wrote:
>
>> Thanks again, Gary.
>>
>> It's true that I only let the job remain in the stuck state for
>> 10-15 minutes. Then I shut down the cluster.
>>
>> But: if the restart strategy is being applied, shouldn't I have seen
>> those messages in the job manager log? In my case it kept quiet from
>> ~2018-03-28 15:27 until I terminated it at ~2018-03-28 15:36.
>>
>> Do you happen to know what that BlobServerConnection error means in the
>> code? Whether it may lead to some unrecoverable state (where neither a
>> restart is attempted, nor is the job failed for good)?
>>
>> On Thu, Mar 29, 2018 at 4:39 PM, Gary Yao <g...@data-artisans.com> wrote:
>>
>>> Hi Juho,
>>>
>>> The log message
>>>
>>>     Could not allocate all requires slots within timeout of 300000 ms.
>>>     Slots required: 20, slots allocated: 8
>>>
>>> indicates that you do not have enough resources left in your cluster.
>>> Can you verify that after you started the job submission the YARN
>>> cluster does not reach its maximum capacity? You can also try
>>> submitting the job with a lower parallelism.
>>>
>>> I think the reason why the YARN application is not immediately shown as
>>> failed is that your restart strategy attempts to restart the job 3
>>> times. On every attempt the job is blocked on the slot allocation
>>> timeout for at least 300000 ms (5 minutes). I have tried submitting
>>> examples/streaming/WordCount.jar with the same restart strategy on EMR,
>>> and the CLI only returned after around 20 minutes.
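Gary's ~20-minute figure can be sanity-checked with quick arithmetic (a sketch based only on the settings quoted in this thread: 3 fixed-delay restart attempts with a 30 s delay, plus the initial attempt, each blocking on the 300000 ms slot allocation timeout):

```shell
attempts=3          # restart-strategy.fixed-delay.attempts
delay_s=30          # restart-strategy.fixed-delay.delay ("30 s")
slot_timeout_s=300  # slot allocation timeout (300000 ms)

# The initial attempt plus each of the 3 restarts blocks on the slot
# allocation timeout, with a fixed delay between restarts.
total_s=$(( (attempts + 1) * slot_timeout_s + attempts * delay_s ))
echo "${total_s} s"
```

This gives 1290 s (about 21.5 minutes), which lines up with the roughly 20 minutes Gary observed before the CLI returned.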
>>> As a side note, beginning from Flink 1.5 you do not need to specify -yn
>>> and -ys, because resource allocations are dynamic by default (FLIP-6).
>>> The parameter -yst is deprecated and should not be needed either.
>>>
>>> Best,
>>> Gary
>>>
>>> On Thu, Mar 29, 2018 at 8:59 AM, Juho Autio <juho.au...@rovio.com> wrote:
>>>
>>>> I built a new Flink distribution from the release-1.5 branch yesterday.
>>>>
>>>> The first time I tried to run a job with it, it ended up in a stalled
>>>> state: the job didn't manage to (re)start, but what makes it worse is
>>>> that it didn't exit as failed either.
>>>>
>>>> The next time I ran the same job (on a new EMR cluster, all from
>>>> scratch) it just worked normally.
>>>>
>>>> On the problematic run, the YARN job was started and the Flink UI was
>>>> being served, but the UI kept showing status CREATED for all sub-tasks
>>>> and nothing seemed to be happening.
>>>>
>>>> I found this in the job manager log first (could be unrelated):
>>>>
>>>> 2018-03-28 15:26:17,449 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph
>>>>   - Job UniqueIdStream (43ed4ace55974d3c486452a45ee5db93) switched from state RUNNING to FAILING.
>>>> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException:
>>>> Could not allocate all requires slots within timeout of 300000 ms. Slots required: 20, slots allocated: 8
>>>>   at org.apache.flink.runtime.executiongraph.ExecutionGraph.lambda$scheduleEager$36(ExecutionGraph.java:984)
>>>>   at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
>>>>   at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852)
>>>>   at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>>>>   at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
>>>>   at org.apache.flink.runtime.concurrent.FutureUtils$ResultConjunctFuture.handleCompletedFuture(FutureUtils.java:551)
>>>>   at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
>>>>   at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
>>>>   at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>>>>   at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
>>>>   at org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:789)
>>>>   at akka.dispatch.OnComplete.internal(Future.scala:258)
>>>>   at akka.dispatch.OnComplete.internal(Future.scala:256)
>>>>   at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186)
>>>>   at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183)
>>>>   at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
>>>>   at org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:83)
>>>>   at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
>>>>   at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
>>>>   at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:603)
>>>>   at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)
>>>>   at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
>>>>   at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
>>>>   at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
>>>>   at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)
>>>>   at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)
>>>>   at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)
>>>>   at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)
>>>>   at java.lang.Thread.run(Thread.java:748)
>>>>
>>>> After this there was:
>>>>
>>>> 2018-03-28 15:26:17,521 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph
>>>>   - Restarting the job UniqueIdStream (43ed4ace55974d3c486452a45ee5db93).
>>>>
>>>> And some time after that:
>>>>
>>>> 2018-03-28 15:27:39,125 ERROR org.apache.flink.runtime.blob.BlobServerConnection
>>>>   - GET operation failed
>>>> java.io.EOFException: Premature end of GET request
>>>>   at org.apache.flink.runtime.blob.BlobServerConnection.get(BlobServerConnection.java:275)
>>>>   at org.apache.flink.runtime.blob.BlobServerConnection.run(BlobServerConnection.java:117)
>>>>
>>>> Task manager logs didn't have any errors.
>>>>
>>>> Is that error about BlobServerConnection severe enough to make the job
>>>> get stuck like this? It seems like a Flink bug, at least in that the
>>>> job just gets stuck and neither restarts nor makes the YARN app exit
>>>> as failed?
>>>> My launch command is basically:
>>>>
>>>> flink-${FLINK_VERSION}/bin/flink run -m yarn-cluster -yn ${NODE_COUNT} \
>>>>     -ys ${SLOT_COUNT} -yjm ${JOB_MANAGER_MEMORY} -ytm ${TASK_MANAGER_MEMORY} \
>>>>     -yst -yD restart-strategy=fixed-delay \
>>>>     -yD restart-strategy.fixed-delay.attempts=3 \
>>>>     -yD "restart-strategy.fixed-delay.delay=30 s" -p ${PARALLELISM} $@
>>>>
>>>> I'm also setting this to fix a classloading error (with the previous
>>>> build that still works):
>>>>
>>>> -yD classloader.resolve-order=parent-first
>>>>
>>>> The cluster was AWS EMR, release 5.12.0.
>>>>
>>>> Thanks.
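Following Gary's FLIP-6 note earlier in the thread, a trimmed version of that launch command for Flink 1.5 might look like the sketch below. This is a hypothetical command, not verified against a cluster; it simply drops the deprecated -yn/-ys/-yst flags and keeps the poster's placeholder variables:

```shell
# Hypothetical Flink 1.5 launch command without -yn/-ys/-yst, which are
# unnecessary or deprecated under FLIP-6 dynamic resource allocation.
# All ${...} variables are placeholders from the original command.
flink-${FLINK_VERSION}/bin/flink run -m yarn-cluster \
    -yjm ${JOB_MANAGER_MEMORY} -ytm ${TASK_MANAGER_MEMORY} \
    -yD restart-strategy=fixed-delay \
    -yD restart-strategy.fixed-delay.attempts=3 \
    -yD "restart-strategy.fixed-delay.delay=30 s" \
    -yD classloader.resolve-order=parent-first \
    -p ${PARALLELISM} "$@"
```

With dynamic resource allocation, task managers are requested from YARN as slots are needed, so the parallelism flag effectively drives how many containers get allocated.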