Workers disconnected from master sometimes and never reconnect back

Piotr Kołaczkowski Thu, 22 May 2014 01:40:23 -0700

Hi,

Another problem we observed is that on a very heavily loaded cluster, if a
worker fails to respond to the heartbeat within 60 seconds, it gets
disconnected from the master permanently and never reconnects. It
is very easy to reproduce: just set up a Spark standalone cluster on a
single machine and suspend it for a while. After waking up, the cluster
no longer works because all workers are lost.
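
To make the failure mode concrete, here is a minimal toy model of it in
Python. It is only a sketch of the behaviour described above, not Spark's
actual code: ToyMaster, simulate and the reregister_on_reject flag are
hypothetical names made up for illustration, and the only number taken from
the report is the 60-second timeout.

```python
# Toy model of the reported failure mode; NOT Spark's implementation.
# All names here are hypothetical and exist only for illustration.

WORKER_TIMEOUT_S = 60  # heartbeat timeout mentioned in the report


class ToyMaster:
    def __init__(self, timeout_s=WORKER_TIMEOUT_S):
        self.timeout_s = timeout_s
        self.last_heartbeat = {}  # worker id -> time of last heartbeat

    def register(self, worker_id, now):
        self.last_heartbeat[worker_id] = now

    def heartbeat(self, worker_id, now):
        # A heartbeat from a worker the master no longer knows is rejected.
        if worker_id not in self.last_heartbeat:
            return False
        self.last_heartbeat[worker_id] = now
        return True

    def expire_dead_workers(self, now):
        # Drop every worker whose last heartbeat is older than the timeout.
        dead = [w for w, t in self.last_heartbeat.items()
                if now - t > self.timeout_s]
        for w in dead:
            del self.last_heartbeat[w]
        return dead


def simulate(suspend_s, reregister_on_reject):
    """Return True if the worker is still known to the master after wake-up."""
    master = ToyMaster()
    master.register("worker-1", now=0)

    # The machine is suspended, so no heartbeats arrive for suspend_s seconds.
    now = suspend_s
    master.expire_dead_workers(now)

    # The worker wakes up and sends its next heartbeat.
    accepted = master.heartbeat("worker-1", now)
    if not accepted and reregister_on_reject:
        # Assumed mitigation: the worker registers again when rejected.
        master.register("worker-1", now)

    return "worker-1" in master.last_heartbeat
```

In this model, simulate(120, reregister_on_reject=False) leaves the worker
lost for good, which matches what we see, while the same call with
reregister_on_reject=True brings it back. Whether the real worker can be
made to behave like the second case is exactly what I am asking about.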


Is there any way to mitigate this?

Thanks,
Piotr

-- 
Piotr Kolaczkowski, Lead Software Engineer
pkola...@datastax.com

http://www.datastax.com/
777 Mariners Island Blvd., Suite 510
San Mateo, CA 94404
