Hi Jie,

When you say the firewall is closed, do you mean that ports are blocked between the worker nodes? I believe workers start up on random ports and send data directly to each other during shuffles, so your firewall may be blocking those connections. Can you try with the firewall temporarily disabled?
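If disabling it outright isn't an option, a quick sanity check is to allow all traffic between the cluster nodes and verify connectivity by hand. This is just a sketch assuming iptables; the exact commands depend on your distro, and the subnet and hostname below are placeholders you'd replace with your own:

    # Inspect the current rules on each node (assuming iptables):
    sudo iptables -L -n

    # Temporarily allow all traffic from the cluster's subnet
    # (192.168.1.0/24 is a placeholder; use your actual subnet):
    sudo iptables -I INPUT -s 192.168.1.0/24 -j ACCEPT

    # From a worker, check that the master's standalone port is reachable
    # (7077 is the default master port; "spark-master" is a placeholder):
    nc -zv spark-master 7077

Note that only a few Spark ports (like the master's 7077 and the web UIs) are fixed; the shuffle connections use ephemeral ports, so per-port firewall rules generally won't be enough.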
Andrew

On Mon, Dec 16, 2013 at 9:58 AM, Jie Deng <[email protected]> wrote:
> Hi,
> Thanks for reading.
>
> I am trying to run a Spark program on a cluster. The program runs
> successfully in local mode. The standalone topology is working: I can
> see the workers from the master web UI, the master and worker are on
> different machines, and the worker's status is ALIVE.
> The thing is, no matter whether I start the program from Eclipse or via
> ./run-example, it stops at the same point. The stages page shows:
>
>     Stage Id:                0
>     Description:             count at SparkExample.java:31
>                              <http://jie-optiplex-7010.local:4040/stages/stage?id=0>
>     Submitted:               2013/12/16 14:50:36
>     Duration:                7 min
>     Tasks: Succeeded/Total:  0/2
>     Shuffle Read:            -
>     Shuffle Write:           -
>
> After a while, the worker's state becomes DEAD.
>
> The Spark directory on the worker is a copy from the master made with
> ./make-distribution, and the firewall is all closed.
>
> Has anyone had the same issue before?
