Hello Ray,

Without explicit errors in the log, it's hard to guess what happened. Since I don't see any errors, it should be a recoverable failure (even if it takes a long time). If you have the option, could you please enable the DEBUG log level for org.apache.ignite.internal.util.nio.GridTcpNioCommunicationClient and org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi on the server nodes? If such a long PME happens again, debug logs from these classes will give us a lot of useful information for finding the exact cause of the long communication process.
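If your nodes log through Log4j2 (this is an assumption; adjust for whatever logging backend your nodes actually use), a minimal sketch of the relevant fragment of the server-side log4j2 configuration could look like this (the CONSOLE appender name is illustrative):

```xml
<!-- Sketch only: enable DEBUG for the two communication classes on server nodes. -->
<Loggers>
    <Logger name="org.apache.ignite.internal.util.nio.GridTcpNioCommunicationClient" level="DEBUG"/>
    <Logger name="org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi" level="DEBUG"/>
    <!-- Keep the rest of the cluster logging at its usual level. -->
    <Root level="INFO">
        <AppenderRef ref="CONSOLE"/>
    </Root>
</Loggers>
```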
If a client node has a stable connection to the cluster, it should wait for PME until it ends. My message about reconnecting was mostly about the case where the client's connection to the cluster breaks. But if the client still doesn't send any data after PME has finished, a thread dump from the client would be very useful for analyzing why that happened.

2018-07-26 18:36 GMT+03:00 Ray <[email protected]>:

> Hello Pavel,
>
> Thanks for the explanation, it's been a great help.
>
> Can you take a guess why PME took a long time due to communication
> issues between server nodes?
> From the logs, the "no route to host" exception happened because the server
> can't connect to the client's ports.
> But I didn't see any logs indicating network issues between server
> nodes.
> I tested connectivity of the communication SPI ports (47100 in this case) and
> discovery SPI ports (49500 in this case) between server nodes, and it's all
> good.
>
> And on the client (Spark executor) side, there's no exception log when PME takes
> a long time to finish.
> It will hang forever.
> Spark.log
> <http://apache-ignite-users.70518.x6.nabble.com/file/t1346/Spark.log>
>
>
>
> --
> Sent from: http://apache-ignite-users.70518.x6.nabble.com/
>
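P.S. The usual way to take the thread dump is `jstack <pid>` against the Spark executor JVM. If attaching jstack to the executor isn't convenient, the same information can also be captured in-process via the standard ThreadMXBean API; a minimal sketch (the class name is illustrative):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class ThreadDump {
    public static void main(String[] args) {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        // Dump every live thread with its stack, including locked
        // monitors and ownable synchronizers, similar to jstack -l.
        for (ThreadInfo info : bean.dumpAllThreads(true, true))
            System.out.print(info.toString());
    }
}
```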
