Hello Ray,

Without explicit errors in the log, it's not easy to guess what that was.
Since I don't see any errors, it should be a recoverable failure (even if it
takes a long time).
If you have the option, could you please enable the DEBUG log level
for org.apache.ignite.internal.util.nio.GridTcpNioCommunicationClient
and org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi on the server
nodes?
If such a long PME happens again, debug logs from these classes will give us
a lot of useful information for finding the exact cause of the long
communication process.
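For example, if your server nodes use Ignite's Log4j2 logger (an assumption on my side; adjust for whatever logging framework you actually configure), raising those two classes to DEBUG would look roughly like this in the log4j2 configuration file:

```xml
<!-- Sketch only: enable DEBUG for the two communication classes,
     assuming the node is configured with Log4J2Logger.
     Keep the rest of your configuration (Root logger, appenders) as-is. -->
<Loggers>
  <Logger name="org.apache.ignite.internal.util.nio.GridTcpNioCommunicationClient"
          level="DEBUG"/>
  <Logger name="org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi"
          level="DEBUG"/>
</Loggers>
```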

If a client node has a stable connection to the cluster, it should wait for
PME until it ends. My message about reconnecting was mostly about the case
when the client's connection to the cluster breaks.
But if the client still doesn't send any data after PME ends, a thread dump
from the client will be very useful for analyzing why that happened.
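For reference, one simple way to capture that thread dump (assuming a JDK is installed on the client host; jps and jstack ship with the JDK, not the JRE):

```shell
# List running JVMs to find the Spark executor / Ignite client pid.
jps -lm

# Capture a full thread dump (replace <pid> with the executor's pid).
jstack -l <pid> > client-threaddump.txt
```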

2018-07-26 18:36 GMT+03:00 Ray <[email protected]>:

> Hello Pavel,
>
> Thanks for the explanation, it's been a great help.
>
> Can you take a guess why PME took such a long time due to communication
> issues between server nodes?
> From the logs, the "no route to host" exception happened because the server
> can't connect to the client's ports.
> But I didn't see any logs indicating network issues between server nodes.
> I tested connectivity of the communication SPI ports (47100 in this case)
> and discovery SPI ports (49500 in this case) between server nodes, and it's
> all good.
>
> And on client(spark executor) side, there's no exception log when PME takes
> a long time to finish.
> It will hang forever.
> Spark.log
> <http://apache-ignite-users.70518.x6.nabble.com/file/t1346/Spark.log>
>
>
>
> --
> Sent from: http://apache-ignite-users.70518.x6.nabble.com/
>
