Yes, you are right, I initially started from the master node. What I am interested in knowing is what happened suddenly after 2 days such that the workers died. Is it possible that the workers got disconnected because of some network issue and then tried restarting themselves but kept failing?
Kartik,
Spark Workers won't start if SPARK_MASTER_IP is wrong. You probably used start-slaves.sh from the Master node to start all worker nodes, in which case the Workers would have received the correct SPARK_MASTER_IP initially. Any later restart from the slave nodes would have failed because of the wrong SPARK_MASTER_IP.
Thanks Prabhu,
I had wrongly configured SPARK_MASTER_IP in the worker nodes to `hostname -f`, which resolves to the worker itself and not the master.
But now the question is *why the cluster was up initially for 2 days* and the workers only noticed this invalid configuration after 2 days? And why are the other workers still up?
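For reference, a minimal sketch of the fix (the hostname below is hypothetical): on every worker, conf/spark-env.sh must name the master node, so a literal `hostname -f` there points the worker at itself.

```shell
# conf/spark-env.sh on each worker node (sketch; hostname is hypothetical).
# SPARK_MASTER_IP must name the MASTER, not the local machine:
export SPARK_MASTER_IP=spark-master.example.com

# The misconfiguration described above is equivalent to:
# export SPARK_MASTER_IP=$(hostname -f)   # resolves to the worker itself
```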
Kartik,
The exception stack trace
*java.util.concurrent.RejectedExecutionException* will happen if
SPARK_MASTER_IP in the worker nodes is configured wrongly, for example if
SPARK_MASTER_IP is set to the hostname of the Master node but the workers try
to connect to the IP of the master node. Check whether SPARK_MASTER_IP in the
Worker nodes is correct.
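A quick way to carry out that check, sketched with hypothetical values standing in for what you would read from each worker's conf/spark-env.sh and from the master node itself:

```shell
# Hypothetical values (assumptions, not taken from the thread):
configured="worker-3.example.com"        # SPARK_MASTER_IP found on a worker
actual_master="spark-master.example.com" # output of `hostname -f` on the master

# Flag any worker whose configured master does not match the real master.
if [ "$configured" = "$actual_master" ]; then
  echo "SPARK_MASTER_IP ok"
else
  echo "SPARK_MASTER_IP mismatch: $configured"
fi
```

Running this per worker (e.g. over ssh) would have surfaced the two misconfigured nodes immediately.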
On Spark 1.5.2:
I have a Spark standalone cluster with 6 workers. I left the cluster idle
for 3 days, and after 3 days I saw only 4 workers on the Spark master UI; 2
workers had died with the same exception.
The strange part is that the cluster ran stably for 2 days, but on the third
day 2 workers abruptly died.