Hi all,

We are using embedded Spark 1.6.2 in our analytics platform[1]. For cluster communication we use Hazelcast's clustering capabilities. On the Hazelcast side, we set the following properties to configure the heartbeat behaviour:
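For context, here is a minimal sketch of how we apply these (Hazelcast picks its tuning properties up from JVM system properties, so they can be set programmatically before the instance starts, or equivalently passed as -D flags on the command line; the class name is just for illustration):

```java
// Illustrative only: sets Hazelcast tuning properties as JVM system
// properties before any Hazelcast instance is created. Equivalent to
// -Dhazelcast.max.no.heartbeat.seconds=30 etc. on the command line.
public class HazelcastHeartbeatConfig {
    public static void main(String[] args) {
        // Consider a member dead after 30s without a heartbeat.
        System.setProperty("hazelcast.max.no.heartbeat.seconds", "30");
        // Consider the master dead after 45s without a master confirmation.
        System.setProperty("hazelcast.max.no.master.confirmation.seconds", "45");

        System.out.println(System.getProperty("hazelcast.max.no.heartbeat.seconds"));
    }
}
```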
hazelcast.max.no.heartbeat.seconds=30
hazelcast.max.no.master.confirmation.seconds=45

With these settings, if a network failure lasts more than 30 seconds, Hazelcast treats it as a network outage, so each node sees the other node leave the cluster (this is a 2-node cluster). What happens then is that the current active node stays active, and because of the outage the passive master also becomes active. Here are the respective messages printed by Hazelcast:

*node 1:*
TID: [-1] [] [2017-11-05 08:59:14,071] INFO {org.wso2.carbon.core.clustering.hazelcast.wka.WKABasedMembershipScheme} - Member left [ff9788ad-8570-465c-846a-82393c636ae5]: /10.36.239.70:4000 {org.wso2.carbon.core.clustering.hazelcast.wka.WKABasedMembershipScheme}

*node 2:*
TID: [-1] [] [2017-11-05 09:00:40,357] INFO {org.wso2.carbon.core.clustering.hazelcast.wka.WKABasedMembershipScheme} - Member left [d02e49f5-78ab-4bfb-ba2c-df24f7e5c058]: /10.36.239.67:4000 {org.wso2.carbon.core.clustering.hazelcast.wka.WKABasedMembershipScheme}

When the network recovers, we see the following error in Spark:

ERROR {org.apache.spark.deploy.worker.Worker} - Worker registration failed: Duplicate worker ID {org.apache.spark.deploy.worker.Worker}

This also causes our server to shut down (note that we spawn the Spark JVM from our server). Please see the following shutdown message:

INFO {org.wso2.carbon.core.init.CarbonServerManager} - Shutdown hook triggered.... {org.wso2.carbon.core.init.CarbonServerManager}

Do you have any idea what happens to a Spark cluster when there is a network outage, and how it handles split-brain situations? Is there a way to restore the previous cluster state once both nodes rejoin the cluster? I am also curious to know why the shutdown hook is triggered.

Thanks in advance!
Gimantha