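Hi Phil,

This sounds a lot like a thread-safety problem in Hadoop's Configuration object that I ran into a while back. It is not strictly a deadlock: concurrent writes to the HashMap inside Configuration can leave a thread spinning in a resize, which would match the 100% CPU you're seeing. If you jstack the JVM and see a thread that looks like the one below, it could be https://issues.apache.org/jira/browse/SPARK-2546
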
"Executor task launch worker-6" daemon prio=10 tid=0x00007f91f01fe000 nid=0x54b1 runnable [0x00007f92d74f1000]
   java.lang.Thread.State: RUNNABLE
        at java.util.HashMap.transfer(HashMap.java:601)
        at java.util.HashMap.resize(HashMap.java:581)
        at java.util.HashMap.addEntry(HashMap.java:879)
        at java.util.HashMap.put(HashMap.java:505)
        at org.apache.hadoop.conf.Configuration.set(Configuration.java:803)
        at org.apache.hadoop.conf.Configuration.set(Configuration.java:783)
        at org.apache.hadoop.conf.Configuration.setClass(Configuration.java:1662)

The fix for this issue is hidden behind a flag because it might have performance implications, but if it is this problem then you can set spark.hadoop.cloneConf=true and see if that fixes things.

Good luck!
Andrew

On Tue, Dec 23, 2014 at 9:40 AM, Phil Wills <otherp...@gmail.com> wrote:

> I've been attempting to run a job based on MLlib's ALS implementation for
> a while now and have hit an issue I'm having a lot of difficulty getting to
> the bottom of.
>
> On a moderate size set of input data it works fine, but against larger
> (still well short of what I'd think of as big) sets of data, I'll see one
> or two workers get stuck spinning at 100% CPU and the job unable to
> recover.
>
> I don't believe this is down to memory pressure as I seem to get the same
> behaviour at about the same size of input data, even if the cluster is
> twice as large. GC logs also suggest things are proceeding reasonably with
> some Full GC's occurring, but no suggestion of the process being GC locked.
>
> After rebooting the instance that got into trouble, I can see the stderr
> log for the task truncated in the middle of a log-line at the time CPU
> shoots to and sticks at 100%, but no other signs of a problem.
>
> I've run into the same issue on 1.1.0 and 1.2.0 in standalone mode and
> running on YARN.
>
> Any suggestions on further steps I could try to get a clearer diagnosis of
> the issue would be much appreciated.
>
> Thanks,
>
> Phil
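
P.S. In case it saves a lookup, here is a minimal sketch of one way to set the flag mentioned in my reply. It assumes you keep cluster-wide settings in conf/spark-defaults.conf; that file location is my assumption about your setup, and only the property name and value come from the note above.

```properties
# conf/spark-defaults.conf (assumed location; adjust to your deployment)
# Enables the SPARK-2546 workaround; may cost some performance.
spark.hadoop.cloneConf   true
```

If you would rather try it on a single job first, passing --conf spark.hadoop.cloneConf=true to spark-submit should have the same effect without touching the shared defaults.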