Docker and AWS taskmanager configuration

Colin Williams Mon, 20 Nov 2017 21:45:45 -0800

Hi,

We noticed that we couldn't parallelize our Flink Docker containers, and
this looks like an issue that others have experienced. In our environment we
were not setting any hostname in the Flink configuration. This worked for
a single node, but with multiple nodes the taskmanagers fail with an
exception similar to the one others have reported:


org.apache.flink.runtime.io.network.partition.PartitionNotFoundException: Partition 2ec10eeac8969a16945b3713b63c0f4f@11052a989c7921423e44653285481e23 not found.
      at org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.failPartitionRequest(RemoteInputChannel.java:203)
      at org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.retriggerSubpartitionRequest(RemoteInputChannel.java:128)
      at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.retriggerPartitionRequest(SingleInputGate.java:345)
      at org.apache.flink.runtime.taskmanager.Task.onPartitionStateUpdate(Task.java:1286)
      at org.apache.flink.runtime.taskmanager.Task$2.apply(Task.java:1123)
      at org.apache.flink.runtime.taskmanager.Task$2.apply(Task.java:1118)
      at org.apache.flink.runtime.concurrent.impl.FlinkFuture$5.onComplete(FlinkFuture.java:272)
      at akka.dispatch.OnComplete.internal(Future.scala:248)
      at akka.dispatch.OnComplete.internal(Future.scala:245)
      at akka.dispatch.japi$CallbackBridge.apply(Future.scala:175)
      at akka.dispatch.japi$CallbackBridge.apply(Future.scala:172)
      at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
      at akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55)
      at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:91)
      at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:91)
      at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:91)
      at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
      at akka.dispatch.BatchingExecutor$BlockableBatch.run(BatchingExecutor.scala:90)
      at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)
      at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
      at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
      at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
      at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
      at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

In our AWS environment we run only one container per EC2 instance,
and each instance has a "unique-dns-address" associated with it. The
unique-dns-address looks like ip-XXX-XX-XX-XXX.aws-region-X.

To avoid any additional DNS configuration, it would be convenient to use
this DNS address for the taskmanagers to talk to each other.

I tested that I could reach each taskmanager from the unique-dns-address
via telnet to one of the taskmanager ports and I was able to connect. This
made me think that setting taskmanager.hostname to the address would solve
my issue.
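
The telnet check above can also be scripted. The following is only a
minimal sketch in Python; the helper name can_connect is hypothetical, and
the host and port you pass in are whatever address and taskmanager port
apply in your own setup. It only shows that a TCP connection can be opened;
it says nothing about whether Flink advertises that address to its peers.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host name and opens a TCP connection;
        # the "with" block closes the socket again right away.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS resolution failures.
        return False
```

For example, can_connect("unique-dns-address", port) would be called once
per taskmanager, with port replaced by the port that was tested with telnet.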

However, when I set taskmanager.hostname: unique-dns-address in
flink-conf.yaml, I ended up with a java.net.BindException: Cannot assign
requested address. I'm not entirely sure why this happened.
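
For reference, "Cannot assign requested address" is the operating system's
message for the EADDRNOTAVAIL socket error, which is raised when a process
tries to bind to an address that is not assigned to any local network
interface. One possible explanation here (an assumption, not something
verified in this setup) is that the instance's address is not present on
any interface inside the container. The Python sketch below, with a
hypothetical helper name try_bind, reproduces that class of failure outside
of Flink:

```python
import errno
import socket


def try_bind(address: str, port: int = 0):
    """Try to bind a TCP socket to address:port.

    Returns None on success, or the errno of the failure otherwise.
    Port 0 lets the operating system pick any free port.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((address, port))
        return None
    except OSError as exc:
        return exc.errno
    finally:
        sock.close()


def is_address_unavailable(address: str) -> bool:
    """True if binding fails because the address is not local to this machine."""
    return try_bind(address) == errno.EADDRNOTAVAIL
```

Running is_address_unavailable with the instance's IP address from inside
the container would show whether the container itself can bind to it.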

I looked around and found another list message that mentioned
https://doc.akka.io/docs/akka/2.4.1/additional/faq.html regarding remote actors.

So I set *akka.remote.netty.tcp.hostname*: unique-dns-address, again for
each instance. However, I was not certain which taskmanager port to set
for akka.remote.netty.tcp.port, so I left it unset. When I tried again, I
got the same PartitionNotFoundException.
I realize this is a complicated issue that varies with each environment,
but I would appreciate advice on what else I should try in order to
tackle it.

Furthermore, if I'm on the right track, which taskmanager service port
should correspond to akka.remote.netty.tcp.port?
