subject:"Workers disconnected from master sometimes and never reconnect back"

Re: Workers disconnected from master sometimes and never reconnect back

2014-09-29 Thread Romi Kuntsman

Hi all,

Regarding a post here a few months ago
http://apache-spark-user-list.1001560.n3.nabble.com/Workers-disconnected-from-master-sometimes-and-never-reconnect-back-tp6240.html

Is there an answer to this?
I have seen workers that are still running but do not reconnect after
losing their connection to the master. I am using Spark 1.1.0.

If the master server is restarted, should the workers retry registering with it?
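
To make the question concrete, here is a minimal sketch of the retry behaviour being asked about. It is not Spark's worker code: `register_with_retry`, `try_register`, and the backoff parameters are hypothetical names chosen only for illustration.

```python
import time
from typing import Callable


def register_with_retry(
    try_register: Callable[[], bool],
    max_attempts: int = 5,
    initial_delay: float = 1.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call try_register until it succeeds or the attempts run out.

    try_register is a hypothetical callback that returns True once the
    master has accepted the registration. The wait between attempts
    doubles each time, capped at max_delay seconds.
    """
    delay = initial_delay
    for attempt in range(1, max_attempts + 1):
        if try_register():
            return True
        if attempt < max_attempts:
            sleep(delay)  # back off before the next attempt
            delay = min(delay * 2, max_delay)
    return False
```

With a loop like this, a worker that loses the master would keep trying for a bounded time rather than giving up after the first failure. Whether the retries should be bounded at all is part of the open question.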

Greetings,

-- 
*Romi Kuntsman*, *Big Data Engineer*
 http://www.totango.com
Join the Customer Success Manifesto  http://youtu.be/XvFi2Wh6wgU

Re: Workers disconnected from master sometimes and never reconnect back

2014-09-29 Thread Andrew Ash

Hi Romi,

I've observed this many times as well.  So much so that on some clusters I
restart the workers every night in order to maintain these worker-master
connections.

I couldn't find an open SPARK ticket on it, so I filed
https://issues.apache.org/jira/browse/SPARK-3736 with you and Piotr
mentioned.  Please discuss on that ticket what you think the proper fix
should be!

Cheers,
Andrew

On Mon, Sep 29, 2014 at 4:36 AM, Romi Kuntsman r...@totango.com wrote:

> Hi all,
>
> Regarding a post here a few months ago
>
> http://apache-spark-user-list.1001560.n3.nabble.com/Workers-disconnected-from-master-sometimes-and-never-reconnect-back-tp6240.html
>
> Is there an answer to this?
> I saw workers being still active and not reconnecting after they lost
> connection to the master. Using Spark 1.1.0.
>
> What if a master server is restarted, should worker retry to register on
> it?
>
> Greetings,
>
> --
> *Romi Kuntsman*, *Big Data Engineer*
>  http://www.totango.com
> Join the Customer Success Manifesto  http://youtu.be/XvFi2Wh6wgU

Workers disconnected from master sometimes and never reconnect back

2014-05-22 Thread Piotr Kołaczkowski

Hi,

Another problem we observed is that on a very heavily loaded cluster, if a
worker fails to respond to the heartbeat within 60 seconds, it is
permanently disconnected from the master and never connects back. It is
very easy to reproduce: just set up a Spark standalone cluster on a single
machine, suspend it for a while, and after it wakes up the cluster no
longer works because all workers are lost.
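
The failure mode can be illustrated with a toy model. This is only a sketch under stated assumptions, not Spark's implementation: `ToyMaster`, `worker_tick`, and the explicit `now` timestamps are hypothetical, and only the 60-second window comes from the behaviour described above. It also shows one possible recovery path, where a worker re-registers when the master no longer knows it.

```python
from dataclasses import dataclass, field
from typing import Dict, List

WORKER_TIMEOUT_SECONDS = 60.0  # the 60-second heartbeat window described above


@dataclass
class ToyMaster:
    """Hypothetical master that tracks the last heartbeat time per worker."""

    timeout: float = WORKER_TIMEOUT_SECONDS
    last_heartbeat: Dict[str, float] = field(default_factory=dict)

    def register(self, worker_id: str, now: float) -> None:
        self.last_heartbeat[worker_id] = now

    def heartbeat(self, worker_id: str, now: float) -> bool:
        """Return True if accepted, False if the worker is unknown."""
        if worker_id not in self.last_heartbeat:
            return False
        self.last_heartbeat[worker_id] = now
        return True

    def expire_dead_workers(self, now: float) -> List[str]:
        """Drop every worker whose last heartbeat is older than the timeout."""
        dead = [w for w, t in self.last_heartbeat.items() if now - t > self.timeout]
        for w in dead:
            del self.last_heartbeat[w]
        return dead


def worker_tick(master: ToyMaster, worker_id: str, now: float) -> str:
    """One worker iteration: send a heartbeat, re-register if rejected."""
    if master.heartbeat(worker_id, now):
        return "heartbeat"
    master.register(worker_id, now)  # assumed recovery path, not observed behaviour
    return "re-registered"
```

In this model, a machine suspended for more than 60 seconds wakes up to find its workers expired. Without the re-registration branch in `worker_tick`, they would stay lost, which matches what we see.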

Is there any way to mitigate this?

Thanks,
Piotr

-- 
Piotr Kolaczkowski, Lead Software Engineer
pkola...@datastax.com

http://www.datastax.com/
777 Mariners Island Blvd., Suite 510
San Mateo, CA 94404
