Hello,

Our cluster occasionally fails with "partition map exchange failure" errors. I have searched around, and it seems that a lot of people have had a similar issue in the past. My high-level understanding is that when one of the nodes fails (out of memory, exception, GC, etc.), the nodes fail to exchange partition maps. However, I have a few questions:

1) When does partition map exchange happen? Periodically, when a node joins, etc.?
2) Is it done in the same thread as the communication SPI, or in a separate worker?
3) How does the exchange happen? Via a coordinator, peer to peer, etc.?
4) What does the exchange block?
5) When is the exchange retried?
6) How to resolve the error? The only thing I have seen online is to decrease failureDetectionTimeout.

Our settings are:
- Zookeeper SPI
- Persistence enabled

Cheers,
Eugene
