OK, I did figure this out. I was running the app (avocado) through spark-submit, when it was actually designed to take command-line arguments to connect to a Spark cluster. Since I didn't provide any such arguments, it started a nested local Spark cluster *inside* the YARN Spark executor, so of course everything ran on one node. If I spin up a Spark cluster manually and pass its master URI to avocado, it works fine.
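For anyone hitting the same thing, a minimal sketch of the difference (this is not avocado's actual code; the master host and port are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

// What was effectively happening: no master supplied, so Spark falls back
// to local mode and the whole job runs inside whatever JVM launched it.
val localConf = new SparkConf()
  .setAppName("avocado")
  .setMaster("local[*]")

// What fixes it: point the app at the manually started standalone master.
val clusterConf = new SparkConf()
  .setAppName("avocado")
  .setMaster("spark://spark-master-host:7077")

val sc = new SparkContext(clusterConf)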
Now I've tried running a reasonably sized job (400 GB of data on 10 HDFS/Spark nodes), and the partitioning is badly skewed: eight nodes get almost nothing, and the other two each get half the work. This happens whether I call coalesce with shuffle=true or shuffle=false before the work stage. (Though with shuffle=true it creates 3000 tasks to do the shuffle, and still ends up with this skewed distribution!) Any suggestions on how to figure out what's going on?

Thanks,
Ravi
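P.S. For reference, a minimal sketch of one way to inspect how records land across partitions; "records" is a placeholder for the RDD feeding the work stage, and the repartition count is only illustrative:

// Assumes `sc` is a SparkContext already connected to the cluster
// (e.g. as in the sketch above). Swap the stand-in RDD for the real input.
val records = sc.parallelize(1 to 1000000)

// Count how many records each partition holds.
val partitionSizes = records
  .mapPartitionsWithIndex((idx, iter) => Iterator((idx, iter.size)))
  .collect()
  .sortBy(_._1)

partitionSizes.foreach { case (idx, n) => println(s"partition $idx: $n records") }

// Force an even redistribution with a full shuffle; the target count (40)
// is illustrative -- a few partitions per executor core is a common start.
val rebalanced = records.repartition(40)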