Hi Arvid,

I listened on ports with netcat and connected via telnet; each node can connect to the other and to itself.
The `/etc/hosts` file looks like this:
```
127.0.0.1 localhost
127.0.1.1 node-2.example.com node-2
<ip-node-1> node-1
```
Is the second line the reason it fails?

I also replaced all hostnames with IP addresses in the config files (flink-conf, workers, masters), but without effect...

Do you have any ideas what else I could try?

Thanks again,
Matthias

On 2/24/21 2:17 PM, Arvid Heise wrote:
> Hi Matthias,
>
> most of the debug statements are just noise. You can ignore that.
>
> Something with your network seems fishy to me. Either taskmanager 1
> cannot connect to taskmanager 2 (and vice versa), or the taskmanager
> cannot connect locally.
>
> I found this fragment, which seems suspicious
>
> Failed to connect to /127.0.*1*.1:32797. Giving up.
>
> localhost is usually 127.0.0.1. Can you double check that you connect
> from all machines to all machines (including themselves) by opening
> trivial text sockets on random ports?
>
> On Fri, Feb 19, 2021 at 10:59 AM Matthias Seiler
> <matthias.sei...@campus.tu-berlin.de> wrote:
>
> Hi Till,
>
> thanks for the hint, you seem about right. Setting the log level
> to DEBUG reveals more information, but I don't know what to do
> about it.
>
> All logs throw some Java-related exceptions:
> `java.lang.UnsupportedOperationException: Reflective
> setAccessible(true) disabled`
> and
> `java.lang.IllegalAccessException: class
> org.apache.flink.shaded.netty4.io.netty.util.internal.PlatformDependent0$6
> cannot access class jdk.internal.misc.Unsafe (in module java.base)
> because module java.base does not export jdk.internal.misc to
> unnamed module`
>
> The log of node-2's TaskManager reveals connection problems:
> `org.apache.flink.runtime.net.ConnectionUtils [] -
> Failed to connect from address 'node-2/127.0.1.1': Invalid argument (connect failed)`
> `java.net.ConnectException: Invalid argument (connect failed)`
>
> What's more, both TaskManagers (node-1 and node-2) are having
> trouble loading
> `org_apache_flink_shaded_netty4_netty_transport_native_epoll_x86_64`,
> but load some version eventually.
>
> There is quite a lot going on here that I don't understand. Can
> you (or someone) shed some light on it and tell me what I could try?
>
> Some more information:
> I appended the following to the `/etc/hosts` file:
> ```
> <ip-node-1> node-1
> <ip-node-2> node-2
> ```
> And the `flink/conf/workers` consists of:
> ```
> node-1
> node-2
> ```
>
> Thanks,
> Matthias
>
> P.S. I attached the logs for further reference. `<ip-node-1>` is
> of course the real IP address instead.
>
> On 2/16/21 1:56 PM, Till Rohrmann wrote:
>> Hi Matthias,
>>
>> Can you make sure that node-1 and node-2 can talk to each other?
>> It looks to me that node-2 fails to open a connection to the
>> other TaskManager. Maybe the logs give some more insights. You
>> can change the log level to DEBUG to gather more information.
>>
>> Cheers,
>> Till
>>
>> On Tue, Feb 16, 2021 at 10:57 AM Matthias Seiler
>> <matthias.sei...@campus.tu-berlin.de> wrote:
>>
>> Hi Everyone,
>>
>> I'm trying to set up a Flink cluster in standalone mode with two
>> machines.
>> However, running a job throws the following exception:
>> `org.apache.flink.runtime.io.network.netty.exception.LocalTransportException:
>> Sending the partition request to 'null' failed`
>>
>> Here is some background:
>>
>> Machines:
>> - node-1: JobManager, TaskManager
>> - node-2: TaskManager
>>
>> `flink-conf.yaml` looks like this:
>> ```
>> jobmanager.rpc.address: node-1
>> taskmanager.numberOfTaskSlots: 8
>> parallelism.default: 2
>> cluster.evenly-spread-out-slots: true
>> ```
>>
>> Deploying the cluster works: I can see both TaskManagers in
>> the WebUI.
>>
>> I ran the streaming WordCount example: `flink run
>> flink-1.12.1/examples/streaming/WordCount.jar --input
>> lorem-ipsum.txt`
>> - the job has been submitted
>> - the job failed (with the above exception)
>> - the log of node-2 also shows the exception; the other logs are
>>   fine (graceful stop)
>>
>> I played around with the config and observed that
>> - if parallelism is set to 1, node-1 gets all the slots and
>>   node-2 none
>> - if parallelism is set to 2, each TaskManager occupies 1 TaskSlot
>>   (but fails because of node-2)
>>
>> I suspect that the problem must be with the communication between
>> TaskManagers:
>> - the job runs successfully if
>>   - node-1 is the **only** node with x TaskManagers (tested with
>>     x=1 and x=2)
>>   - node-2 is the **only** node with x TaskManagers (tested with
>>     x=1 and x=2)
>> - the job fails if
>>   - node-1 **and** node-2 have one TaskManager
>>
>> The full exception is:
>> ```
>> org.apache.flink.client.program.ProgramInvocationException: The main method caused an error: org.apache.flink.client.program.ProgramInvocationException: Job failed (JobID: 547d4d29b3650883aa403cdb5eb1ba5c)
>> // ... Job failed, Recovery is suppressed by NoRestartBackoffTimeStrategy, ...
>> Caused by: org.apache.flink.runtime.io.network.netty.exception.LocalTransportException: Sending the partition request to 'null' failed.
>>     at org.apache.flink.runtime.io.network.netty.NettyPartitionRequestClient$1.operationComplete(NettyPartitionRequestClient.java:137)
>>     at org.apache.flink.runtime.io.network.netty.NettyPartitionRequestClient$1.operationComplete(NettyPartitionRequestClient.java:125)
>>     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:577)
>>     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:551)
>>     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:490)
>>     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:615)
>>     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:608)
>>     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117)
>>     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetFailure(AbstractChannel.java:993)
>>     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:865)
>>     at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1367)
>>     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:717)
>>     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:764)
>>     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext$WriteTask.run(AbstractChannelHandlerContext.java:1071)
>>     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
>>     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
>>     at org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384)
>>     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
>>     at org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>>     at java.base/java.lang.Thread.run(Thread.java:834)
>> Caused by: java.nio.channels.ClosedChannelException
>>     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.newClosedChannelException(AbstractChannel.java:957)
>>     ... 11 more
>> ```
>>
>> Thanks in advance,
>> Matthias
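
A side note on the `127.0.1.1` entry asked about at the top of this thread: below is a minimal sketch, assuming Python 3 is available on the nodes, that lists which names in a hosts file point at a loopback address. The function name `loopback_aliases` and the sample address `192.0.2.11` (a documentation-range stand-in for `<ip-node-1>`) are illustrative and not taken from the thread. The sketch only reports the mapping; whether that mapping is what breaks the connection between the TaskManagers is not confirmed here.

```python
import ipaddress

# Names that are expected to map to a loopback address and are not reported.
EXPECTED_LOOPBACK_NAMES = {"localhost", "ip6-localhost", "ip6-loopback"}


def loopback_aliases(hosts_text):
    """Return {hostname: address} for unexpected names mapped to a loopback address."""
    found = {}
    for raw_line in hosts_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        fields = line.split()
        address, names = fields[0], fields[1:]
        try:
            is_loopback = ipaddress.ip_address(address).is_loopback
        except ValueError:
            continue  # not a literal IP address (e.g. a placeholder), skip it
        if not is_loopback:
            continue
        for name in names:
            if name not in EXPECTED_LOOPBACK_NAMES and name not in found:
                found[name] = address
    return found


# Sample mirroring the hosts file quoted in the thread; 192.0.2.11 is a
# hypothetical stand-in for the real <ip-node-1>.
SAMPLE = """\
127.0.0.1 localhost
127.0.1.1 node-2.example.com node-2
192.0.2.11 node-1
"""

if __name__ == "__main__":
    for name, address in loopback_aliases(SAMPLE).items():
        print(f"{name} maps to loopback address {address}")
```

On a node, the same function could be fed the real file, for example `loopback_aliases(open("/etc/hosts").read())`. Any name it reports is one that the machine resolves to itself only, which other machines cannot reach under that address.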