OK, I figured this out. I was running the app (avocado) using
spark-submit, when it was actually designed to take command line
arguments to connect to a Spark cluster. Since I didn't provide any
such arguments, it started a nested local-mode Spark context *inside*
the YARN Spark executor, so of course everything ran on one node. If I
spin up a Spark cluster manually and provide the Spark master URI to
avocado, it works fine.
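
For reference, the fallback looks roughly like this (a minimal sketch,
not avocado's actual code; the argument handling is hypothetical):

    import org.apache.spark.{SparkConf, SparkContext}

    object MasterFallbackSketch {
      def main(args: Array[String]): Unit = {
        // Hypothetical argument handling: take the master URI from the
        // first argument, otherwise fall back to local mode.
        val master = args.headOption.getOrElse("local[*]")
        val conf = new SparkConf()
          .setAppName("master-fallback-sketch")
          .setMaster(master)
        val sc = new SparkContext(conf)
        // With master = "local[*]" everything runs inside this one JVM,
        // which is exactly what happens when that JVM is itself a YARN
        // executor.
        sc.stop()
      }
    }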

Now, I've tried running a reasonable-sized job through (400GB of data on 10
HDFS/Spark nodes), and the partitioning is strange. Eight nodes get almost
nothing, and the other two nodes each get half the work. This happens
whether I use coalesce with shuffle=true or false before the work stage.
(Though with shuffle=true, the shuffle stage runs 3000 tasks and still
ends up with the same skewed distribution!) Any suggestions on how to
figure out what's going on?
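
In case it's useful, here is a rough spark-shell sketch for checking
the per-partition record counts (the input path and partition count
below are placeholders, and sc is the shell's SparkContext):

    // Count the records in each partition and print them by partition id.
    val data = sc.textFile("hdfs:///placeholder/input")
    val sizes = data
      .mapPartitionsWithIndex((idx, iter) => Iterator((idx, iter.size)))
      .collect()
      .sortBy(_._1)
    sizes.foreach { case (idx, n) => println(s"partition $idx: $n records") }

    // repartition(n) is just coalesce(n, shuffle = true); coalesce with
    // shuffle = false only merges existing partitions, so it can carry an
    // existing imbalance straight through.
    val rebalanced = data.repartition(200)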

Thanks,

Ravi



