Hi,

Another problem we observed is that on a very heavily loaded cluster, if a worker fails to respond to the heartbeat within 60 seconds, it is permanently disconnected from the master and never reconnects. This is very easy to reproduce: set up a Spark standalone cluster on a single machine, suspend the machine for a while, and after it wakes up the cluster no longer works because all workers are lost.
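
To make the failure mode concrete, here is a toy Python model of what we think is happening. It is not Spark's code: `ToyMaster`, `simulate_suspend`, and the heartbeat times are made up for illustration, and the idea that the master simply ignores heartbeats from a worker it has already removed is only an assumption based on the behaviour we see.

```python
# Toy model of the behaviour described above; this is NOT Spark's actual code.
WORKER_TIMEOUT_S = 60  # the 60-second limit mentioned in the report


class ToyMaster:
    """Hypothetical master that tracks the last heartbeat of each worker."""

    def __init__(self, timeout_s=WORKER_TIMEOUT_S):
        self.timeout_s = timeout_s
        self.last_heartbeat = {}  # worker id -> time of last heartbeat (seconds)

    def register(self, worker_id, now):
        self.last_heartbeat[worker_id] = now

    def heartbeat(self, worker_id, now):
        # Assumption: heartbeats from an already-removed worker are ignored,
        # which would explain why the worker never comes back.
        if worker_id not in self.last_heartbeat:
            return False
        self.last_heartbeat[worker_id] = now
        return True

    def expire_dead_workers(self, now):
        # Remove every worker whose last heartbeat is older than the timeout.
        dead = [w for w, t in self.last_heartbeat.items()
                if now - t > self.timeout_s]
        for w in dead:
            del self.last_heartbeat[w]
        return dead


def simulate_suspend(suspend_s):
    """Register one worker, suspend the machine, then wake it up."""
    master = ToyMaster()
    master.register("worker-1", now=0)
    master.heartbeat("worker-1", now=15)
    wake = 15 + suspend_s  # no heartbeats are sent while suspended
    removed = master.expire_dead_workers(now=wake)
    accepted = master.heartbeat("worker-1", now=wake + 1)
    return removed, accepted
```

In this model `simulate_suspend(120)` removes `worker-1` and rejects its next heartbeat, while `simulate_suspend(30)` leaves the worker registered. A worker that re-registered after being rejected would recover, which is the behaviour we would expect.
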
Is there any way to mitigate this?

Thanks,
Piotr

--
Piotr Kolaczkowski, Lead Software Engineer
pkola...@datastax.com

http://www.datastax.com/
777 Mariners Island Blvd., Suite 510
San Mateo, CA 94404