Hi,

We noticed that we couldn't parallelize our Flink Docker containers, and this looks like an issue that others have experienced. In our environment we were not setting any hostname in the Flink configuration. This worked for a single node, but with more than one node the taskmanagers fail with an exception similar to the one others have reported:
org.apache.flink.runtime.io.network.partition.PartitionNotFoundException: Partition 2ec10eeac8969a16945b3713b63c0f4f@11052a989c7921423e44653285481e23 not found.
    at org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.failPartitionRequest(RemoteInputChannel.java:203)
    at org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.retriggerSubpartitionRequest(RemoteInputChannel.java:128)
    at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.retriggerPartitionRequest(SingleInputGate.java:345)
    at org.apache.flink.runtime.taskmanager.Task.onPartitionStateUpdate(Task.java:1286)
    at org.apache.flink.runtime.taskmanager.Task$2.apply(Task.java:1123)
    at org.apache.flink.runtime.taskmanager.Task$2.apply(Task.java:1118)
    at org.apache.flink.runtime.concurrent.impl.FlinkFuture$5.onComplete(FlinkFuture.java:272)
    at akka.dispatch.OnComplete.internal(Future.scala:248)
    at akka.dispatch.OnComplete.internal(Future.scala:245)
    at akka.dispatch.japi$CallbackBridge.apply(Future.scala:175)
    at akka.dispatch.japi$CallbackBridge.apply(Future.scala:172)
    at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
    at akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55)
    at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:91)
    at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:91)
    at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:91)
    at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
    at akka.dispatch.BatchingExecutor$BlockableBatch.run(BatchingExecutor.scala:90)
    at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

In our AWS environment we run only one container per EC2 instance, and each instance has a "unique-dns-address" associated with it. The unique-dns-address looks like ip-XXX-XX-XX-XXX.aws-region-X. To avoid any additional DNS configuration, it would be convenient to use this DNS address so that the taskmanagers can talk to each other. I tested that I could reach each taskmanager at its unique-dns-address by running telnet against one of the taskmanager ports, and I was able to connect. This made me think that setting taskmanager.hostname to the address would solve my issue.

However, when I set taskmanager.hostname: unique-dns-address in flink-conf.yaml, I ended up with a java.net.BindException: Cannot assign requested address. I'm not entirely sure why this happened. I looked around and found another list message that mentioned https://doc.akka.io/docs/akka/2.4.1/additional/faq.html / RE: remote actor. So I set *akka.remote.netty.tcp.hostname*: unique-dns-address, again for each instance. However, I was not certain which taskmanager port to set for akka.remote.netty.tcp.port, so I left it unset. When I tried again, I got the same PartitionNotFoundException.

I realize this is a complicated issue that varies with each environment, but I am asking for advice on what else I should try. Furthermore, if I'm on the right track, which taskmanager service port should correspond to akka.remote.netty.tcp.port?
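
One possible reading of the BindException above, offered as an assumption rather than a confirmed diagnosis: "Cannot assign requested address" is typically what the JVM reports when a process tries to bind a socket to an IP address that is not assigned to any network interface visible to it. That could happen inside a Docker container on bridge networking if the unique-dns-address resolves to the EC2 host's address rather than the container's. A minimal Java sketch to check this from inside the container follows; `BindCheck` and `tryBind` are hypothetical names used only for illustration and are not part of Flink or Akka.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.UnknownHostException;

public class BindCheck {

    /**
     * Tries to bind a server socket to the address the hostname resolves to.
     * Returns null when the bind succeeds, otherwise a short reason string.
     * (Hypothetical helper for diagnosis only; not a Flink or Akka API.)
     */
    static String tryBind(String hostname) {
        InetAddress addr;
        try {
            addr = InetAddress.getByName(hostname);
        } catch (UnknownHostException e) {
            return "cannot resolve " + hostname;
        }
        try (ServerSocket socket = new ServerSocket()) {
            // Port 0 asks the OS for any free port, so only the address is being tested.
            socket.bind(new InetSocketAddress(addr, 0));
            return null;
        } catch (IOException e) {
            return addr.getHostAddress() + ": " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        // Pass the unique-dns-address as the first argument; defaults to localhost.
        String host = args.length > 0 ? args[0] : "localhost";
        String problem = tryBind(host);
        System.out.println(problem == null
                ? host + " is bindable from this process"
                : host + " is NOT bindable from this process: " + problem);
    }
}
```

Running it inside the taskmanager container with the unique-dns-address as the argument would show whether the container can bind to that address at all. If it cannot, the name is only usable as an address that other nodes connect to, not as a local bind address, which would be consistent with telnet from outside succeeding while the bind fails.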