Re: Workers disconnected from master sometimes and never reconnect back
Hi all,

Regarding a post here a few months ago:
http://apache-spark-user-list.1001560.n3.nabble.com/Workers-disconnected-from-master-sometimes-and-never-reconnect-back-tp6240.html

Is there an answer to this? I saw workers that were still running but did not reconnect after they lost their connection to the master. I am using Spark 1.1.0.

If the master server is restarted, should the workers retry registering with it?

Greetings,
--
*Romi Kuntsman*, *Big Data Engineer*
http://www.totango.com
Re: Workers disconnected from master sometimes and never reconnect back
Hi Romi,

I've observed this many times as well. So much so that on some clusters I restart the workers every night in order to maintain the worker-master connections.

I couldn't find an open SPARK ticket for it, so I filed https://issues.apache.org/jira/browse/SPARK-3736 and mentioned you and Piotr in it. Please discuss on that ticket what you think the proper fix should be!

Cheers,
Andrew

On Mon, Sep 29, 2014 at 4:36 AM, Romi Kuntsman r...@totango.com wrote:
> Hi all, Regarding a post here a few months ago
> http://apache-spark-user-list.1001560.n3.nabble.com/Workers-disconnected-from-master-sometimes-and-never-reconnect-back-tp6240.html
> Is there an answer to this? I saw workers being still active and not reconnecting after they lost connection to the master. Using Spark 1.1.0.
> What if a master server is restarted, should worker retry to register on it?
Workers disconnected from master sometimes and never reconnect back
Hi,

Another problem we observed is that on a very heavily loaded cluster, if a worker fails to respond to the heartbeat within 60 seconds, it gets permanently disconnected from the master and never connects back.

It is very easy to reproduce: just set up a Spark standalone cluster on a single machine and suspend the machine for a while. After it wakes up, the cluster no longer works because all workers are lost.

Is there any way to mitigate this?

Thanks,
Piotr

--
Piotr Kolaczkowski, Lead Software Engineer
pkola...@datastax.com
http://www.datastax.com/
777 Mariners Island Blvd., Suite 510
San Mateo, CA 94404
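
To make the failure mode described in this thread concrete, here is a minimal Python sketch. It is a toy model only, not Spark's actual code: the names Master, Worker, WORKER_TIMEOUT_S, and simulate are hypothetical, and the only figure taken from the thread is the 60-second heartbeat timeout. It assumes the master drops a worker whose last heartbeat is too old, and contrasts a worker that never registers again (the reported behaviour) with one that re-registers when its heartbeat is rejected (the behaviour the posters are asking for).

```python
from dataclasses import dataclass, field

# 60 seconds is the timeout mentioned in the thread; the constant name is hypothetical.
WORKER_TIMEOUT_S = 60.0


@dataclass
class Master:
    """Toy master that remembers the last heartbeat time of each registered worker."""
    last_seen: dict = field(default_factory=dict)

    def register(self, worker_id: str, now: float) -> None:
        self.last_seen[worker_id] = now

    def heartbeat(self, worker_id: str, now: float) -> bool:
        """Return True if the heartbeat is accepted, False if the worker is unknown."""
        if worker_id not in self.last_seen:
            return False
        self.last_seen[worker_id] = now
        return True

    def expire(self, now: float) -> list:
        """Drop workers whose last heartbeat is older than the timeout."""
        dead = [w for w, t in self.last_seen.items() if now - t > WORKER_TIMEOUT_S]
        for w in dead:
            del self.last_seen[w]
        return dead


@dataclass
class Worker:
    worker_id: str
    reregister: bool = True  # assumed mitigation: register again after a rejected heartbeat

    def tick(self, master: Master, now: float) -> None:
        """Send a heartbeat; if the master rejects it, optionally register again."""
        if not master.heartbeat(self.worker_id, now) and self.reregister:
            master.register(self.worker_id, now)


def simulate(reregister: bool) -> bool:
    """Suspend a one-worker cluster for two minutes; report whether the worker is still known."""
    master = Master()
    worker = Worker("w1", reregister=reregister)
    master.register("w1", now=0.0)
    master.expire(now=120.0)        # the machine wakes up; the master drops the silent worker
    worker.tick(master, now=121.0)  # the worker resumes sending heartbeats
    return "w1" in master.last_seen
```

Calling simulate(False) models a worker that stays lost after the suspend, while simulate(True) models one that rejoins on its next heartbeat. What the real fix should look like is a matter for the SPARK-3736 ticket.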