Re: Workers disconnected from master sometimes and never reconnect back

2014-09-29 Thread Romi Kuntsman
Hi all,

Regarding a post here a few months ago
http://apache-spark-user-list.1001560.n3.nabble.com/Workers-disconnected-from-master-sometimes-and-never-reconnect-back-tp6240.html

Is there an answer to this?
I have seen workers stay alive but fail to reconnect after losing their
connection to the master. This is on Spark 1.1.0.

If a master server is restarted, should the workers retry registering with it?
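
To make the question concrete, here is a minimal sketch of what "retry registering" could look like: repeated registration attempts with exponential backoff. It is an illustration only, not Spark's worker code. The names `register_with_retry`, `try_register`, `max_attempts` and `base_delay_s` are hypothetical, and `try_register` stands in for whatever call actually contacts the master.

```python
import time


def register_with_retry(try_register, max_attempts=5, base_delay_s=1.0,
                        sleep=time.sleep):
    """Call try_register() until it returns True or attempts run out.

    Returns the number of attempts used on success, or None on failure.
    Hypothetical helper for illustration; not part of Spark.
    """
    for attempt in range(max_attempts):
        if try_register():
            return attempt + 1
        # Back off before the next attempt: 1 s, 2 s, 4 s, ...
        if attempt < max_attempts - 1:
            sleep(base_delay_s * 2 ** attempt)
    return None
```

The `sleep` parameter is injectable so the loop can be exercised without real waiting. A worker that gave up after `None` would still need some later trigger to try again, which is the behaviour this thread is asking about.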

Greetings,

-- 
*Romi Kuntsman*, *Big Data Engineer*
 http://www.totango.com
Join the Customer Success Manifesto  http://youtu.be/XvFi2Wh6wgU


Re: Workers disconnected from master sometimes and never reconnect back

2014-09-29 Thread Andrew Ash
Hi Romi,

I've observed this many times as well, so often that on some clusters I
restart the workers every night to keep the worker-master connections alive.

I couldn't find an open SPARK ticket for it, so I filed
https://issues.apache.org/jira/browse/SPARK-3736 and mentioned you and Piotr.
Please discuss on that ticket what you think the proper fix should be!

Cheers,
Andrew




Workers disconnected from master sometimes and never reconnect back

2014-05-22 Thread Piotr Kołaczkowski
Hi,

Another problem we observed is that on a very heavily loaded cluster, if a
worker fails to respond to the heartbeat within 60 seconds, it gets
disconnected permanently from the master and never connects back again. It
is very easy to reproduce: just set up a Spark standalone cluster on a
single machine, suspend it for a while, and after waking up the cluster
doesn't work anymore because all workers are lost.
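
The failure mode can be illustrated with a toy model, sketched below under stated assumptions. It is not Spark's implementation: the `Master`, `Worker` and `simulate` names are hypothetical, time is passed in explicitly rather than read from a clock, and the only figure taken from the report is the 60-second timeout. The `reconnect` flag contrasts a worker that stays lost after being expired with one that re-registers when its heartbeat is rejected.

```python
# Toy model only: class and method names are hypothetical, not Spark APIs.

class Master:
    def __init__(self, timeout_s=60.0):
        self.timeout_s = timeout_s
        self.last_seen = {}  # worker id -> time of last heartbeat

    def register(self, worker_id, now):
        self.last_seen[worker_id] = now

    def heartbeat(self, worker_id, now):
        """Return True if accepted, False if the worker is unknown."""
        if worker_id not in self.last_seen:
            return False
        self.last_seen[worker_id] = now
        return True

    def expire(self, now):
        """Drop workers whose last heartbeat is older than the timeout."""
        dead = [w for w, t in self.last_seen.items()
                if now - t > self.timeout_s]
        for w in dead:
            del self.last_seen[w]
        return dead


class Worker:
    def __init__(self, worker_id, master, reconnect):
        self.worker_id = worker_id
        self.master = master
        self.reconnect = reconnect
        self.registered = False

    def start(self, now):
        self.master.register(self.worker_id, now)
        self.registered = True

    def tick(self, now):
        """Send one heartbeat; re-register on rejection if allowed."""
        if self.master.heartbeat(self.worker_id, now):
            return
        self.registered = False
        if self.reconnect:
            self.start(now)


def simulate(reconnect):
    master = Master(timeout_s=60.0)
    worker = Worker("worker-1", master, reconnect)
    worker.start(now=0.0)
    worker.tick(now=10.0)      # normal heartbeat
    master.expire(now=100.0)   # machine was suspended: 90 s of silence
    worker.tick(now=101.0)     # worker wakes up and heartbeats again
    return worker.registered, "worker-1" in master.last_seen
```

With `simulate(False)` the worker ends up unregistered and absent from the master's table, which mirrors the behaviour described above. With `simulate(True)` the rejected heartbeat triggers a fresh registration and the worker is back in the table.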

Is there any way to mitigate this?

Thanks,
Piotr

-- 
Piotr Kolaczkowski, Lead Software Engineer
pkola...@datastax.com

http://www.datastax.com/
777 Mariners Island Blvd., Suite 510
San Mateo, CA 94404