Hi Arvid,

I listened to ports with netcat and connected via telnet and each node
can connect to the other and itself.

The `/etc/hosts` file looks like this
```
127.0.0.1   localhost
127.0.1.1   node-2.example.com   node-2

<ip-node-1>   node-1
```
Is the second line the reason it fails? I also replaced all hostnames
with IP addresses in the config files (flink-conf, workers, masters) but
without effect...

Do you have any ideas what else I could try?

Thanks again,
Matthias

On 2/24/21 2:17 PM, Arvid Heise wrote:
> Hi Matthias,
>
> most of the debug statements are just noise. You can ignore that.
>
> Something with your network seems fishy to me. Either taskmanager 1
> cannot connect to taskmanager 2 (and vice versa), or the taskmanager
> cannot connect locally.
>
> I found this fragment, which seems suspicious
>
> Failed to connect to /127.0.*1*.1:32797. Giving up.
>
> localhost is usually 127.0.0.1. Can you double check that you connect
> from all machines to all machines (including themselves) by opening
> trivial text sockets on random ports?
>
> On Fri, Feb 19, 2021 at 10:59 AM Matthias Seiler
> <matthias.sei...@campus.tu-berlin.de
> <mailto:matthias.sei...@campus.tu-berlin.de>> wrote:
>
>     Hi Till,
>
>     thanks for the hint, you seem about right. Setting the log level
>     to DEBUG reveals more information, but I don't know what to do
>     about it.
>
>     All logs throw some Java related exceptions:
>     `java.lang.UnsupportedOperationException: Reflective
>     setAccessible(true) disabled`
>     and
>     `java.lang.IllegalAccessException: class
>     org.apache.flink.shaded.netty4.io.netty.util.internal.PlatformDependent0$6
>     cannot access class jdk.internal.misc.Unsafe (in module java.base)
>     because module java.base does not export jdk.internal.misc to
>     unnamed module`
>
>     The log of node-2's TaskManager reveals connection problems:
>     `org.apache.flink.runtime.net.ConnectionUtils                 [] -
>     Failed to connect from address 'node-2/127.0.1.1
>     <http://127.0.1.1>': Invalid argument (connect failed)`
>     `java.net.ConnectException: Invalid argument (connect failed)`
>
>     What's more, both TaskManagers (node-1 and node-2) are having
>     trouble to load
>     `org_apache_flink_shaded_netty4_netty_transport_native_epoll_x86_64`,
>     but load some version eventually.
>
>
>     There is quite a lot going on here that I don't understand. Can
>     you (or someone) shed some light on it and tell me what I could try?
>
>     Some more information:
>     I appended the following to the `/etc/hosts` file:
>     ```
>     <ip-node-1> node-1
>     <ip-node-2> node-2
>     ```
>     And the `flink/conf/workers` consists of:
>     ```
>     node-1
>     node-2
>     ```
>
>     Thanks,
>     Matthias
>
>     P.S. I attached the logs for further reference. `<ip-node-1>` is
>     of course the real IP address instead.
>
>
>     On 2/16/21 1:56 PM, Till Rohrmann wrote:
>>     Hi Matthias,
>>
>>     Can you make sure that node-1 and node-2 can talk to each other?
>>     It looks to me that node-2 fails to open a connection to the
>>     other TaskManager. Maybe the logs give some more insights. You
>>     can change the log level to DEBUG to gather more information.
>>
>>     Cheers,
>>     Till
>>
>>     On Tue, Feb 16, 2021 at 10:57 AM Matthias Seiler
>>     <matthias.sei...@campus.tu-berlin.de
>>     <mailto:matthias.sei...@campus.tu-berlin.de>> wrote:
>>
>>         Hi Everyone,
>>
>>         I'm trying to setup a Flink cluster in standealone mode with two
>>         machines. However, running a job throws the following exception:
>>         `org.apache.flink.runtime.io
>>         
>> <http://org.apache.flink.runtime.io>.network.netty.exception.LocalTransportException:
>>         Sending the partition request to 'null' failed`
>>
>>         Here is some background:
>>
>>         Machines:
>>         - node-1: JobManager, TaskManager
>>         - node-2: TaskManager
>>
>>         flink-conf-yaml looks like this:
>>         ```
>>         jobmanager.rpc.address: node-1
>>         taskmanager.numberOfTaskSlots: 8
>>         parallelism.default: 2
>>         cluster.evenly-spread-out-slots: true
>>         ```
>>
>>         Deploying the cluster works: I can see both TaskManagers in
>>         the WebUI.
>>
>>         I ran the streaming WordCount example: `flink run
>>         flink-1.12.1/examples/streaming/WordCount.jar --input
>>         lorem-ipsum.txt`
>>         - the job has been submitted
>>         - job failed (with the above exception)
>>         - the log of the node-2 also shows the exception, the other
>>         logs are
>>         fine (graceful stop)
>>
>>         I played around with the config and observed that
>>         - if parallelism is set to 1, node-1 gets all the slots and
>>         node-2 none
>>         - if parallelism is set to 2, each TaskManager occupies 1
>>         TaskSlot (but
>>         fails because of node-2)
>>
>>         I suspect, that the problem must be with the communication
>>         between
>>         TaskManagers
>>         - job runs successful if
>>             - node-1 is the **only** node with x TaskManagers (tested
>>         with x=1
>>         and x=2)
>>             - node-2 is the **only** node with x TaskManagers (tested
>>         with x=1
>>         and x=2)
>>         - job fails if
>>             - node-1 **and** node-2 have one TaskManager
>>
>>         The full exception is:
>>         ```
>>         org.apache.flink.client.program.ProgramInvocationException:
>>         The main
>>         method caused an error:
>>         org.apache.flink.client.program.ProgramInvocationException:
>>         Job failed
>>         (JobID: 547d4d29b3650883aa403cdb5eb1ba5c)
>>         // ... Job failed, Recovery is suppressed by
>>         NoRestartBackoffTimeStrategy, ...
>>         Caused by:
>>         org.apache.flink.runtime.io
>>         
>> <http://org.apache.flink.runtime.io>.network.netty.exception.LocalTransportException:
>>         Sending the partition request to 'null' failed.
>>             at
>>         org.apache.flink.runtime.io
>>         
>> <http://org.apache.flink.runtime.io>.network.netty.NettyPartitionRequestClient$1.operationComplete(NettyPartitionRequestClient.java:137)
>>             at
>>         org.apache.flink.runtime.io
>>         
>> <http://org.apache.flink.runtime.io>.network.netty.NettyPartitionRequestClient$1.operationComplete(NettyPartitionRequestClient.java:125)
>>             at
>>         
>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:577)
>>             at
>>         
>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:551)
>>             at
>>         
>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:490)
>>             at
>>         
>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:615)
>>             at
>>         
>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:608)
>>             at
>>         
>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117)
>>             at
>>         
>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetFailure(AbstractChannel.java:993)
>>             at
>>         
>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:865)
>>             at
>>         
>> org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1367)
>>             at
>>         
>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:717)
>>             at
>>         
>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:764)
>>             at
>>         
>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext$WriteTask.run(AbstractChannelHandlerContext.java:1071)
>>             at
>>         
>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
>>             at
>>         
>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
>>             at
>>         
>> org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384)
>>             at
>>         
>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
>>             at
>>         
>> org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>>             at java.base/java.lang.Thread.run(Thread.java:834)
>>         Caused by: java.nio.channels.ClosedChannelException
>>             at
>>         
>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.newClosedChannelException(AbstractChannel.java:957)
>>             ... 11 more
>>         ```
>>
>>         Thanks in advance,
>>         Matthias
>>

Reply via email to