Turns out that I was just being idiotic and had assigned so much memory to Spark that the O/S ended up continually swapping. Apologies for the noise.
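(For anyone who hits the same symptom later: the fix was simply to leave the O/S some headroom rather than handing nearly all physical RAM to Spark. A rough sketch of the kind of sizing I mean, where the figures are illustrative only and depend entirely on your machines:

    // Illustrative only: on a worker with ~16 GB of physical RAM, keep the
    // executor and driver allocations well below that so the O/S never swaps.
    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.executor.memory", "12g")   // made-up figure, not a recommendation
      .set("spark.driver.memory", "4g")      // likewise
)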
Phil

On Wed, Dec 24, 2014 at 1:16 AM, Andrew Ash <and...@andrewash.com> wrote:
> Hi Phil,
>
> This sounds a lot like a deadlock in Hadoop's Configuration object that I
> ran into a while back. If you jstack the JVM and see a thread that looks
> like the one below, it could be
> https://issues.apache.org/jira/browse/SPARK-2546
>
> "Executor task launch worker-6" daemon prio=10 tid=0x00007f91f01fe000
> nid=0x54b1 runnable [0x00007f92d74f1000]
>    java.lang.Thread.State: RUNNABLE
>         at java.util.HashMap.transfer(HashMap.java:601)
>         at java.util.HashMap.resize(HashMap.java:581)
>         at java.util.HashMap.addEntry(HashMap.java:879)
>         at java.util.HashMap.put(HashMap.java:505)
>         at org.apache.hadoop.conf.Configuration.set(Configuration.java:803)
>         at org.apache.hadoop.conf.Configuration.set(Configuration.java:783)
>         at org.apache.hadoop.conf.Configuration.setClass(Configuration.java:1662)
>
> The fix for this issue is hidden behind a flag because it might have
> performance implications, but if it is this problem then you can set
> spark.hadoop.cloneConf=true and see if that fixes things.
>
> Good luck!
> Andrew
>
> On Tue, Dec 23, 2014 at 9:40 AM, Phil Wills <otherp...@gmail.com> wrote:
>
>> I've been attempting to run a job based on MLlib's ALS implementation for
>> a while now and have hit an issue I'm having a lot of difficulty getting
>> to the bottom of.
>>
>> On a moderate-sized set of input data it works fine, but against larger
>> (still well short of what I'd think of as big) sets of data, I'll see one
>> or two workers get stuck spinning at 100% CPU and the job unable to
>> recover.
>>
>> I don't believe this is down to memory pressure, as I seem to get the same
>> behaviour at about the same size of input data even if the cluster is
>> twice as large. GC logs also suggest things are proceeding reasonably, with
>> some Full GCs occurring but no suggestion of the process being GC-locked.
>>
>> After rebooting the instance that got into trouble, I can see the stderr
>> log for the task truncated in the middle of a log line at the time the CPU
>> shoots to and sticks at 100%, but no other signs of a problem.
>>
>> I've run into the same issue on 1.1.0 and 1.2.0, both in standalone mode
>> and running on YARN.
>>
>> Any suggestions on further steps I could try to get a clearer diagnosis
>> of the issue would be much appreciated.
>>
>> Thanks,
>>
>> Phil
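(Adding below the quote in case it helps anyone searching later: a minimal sketch of setting the spark.hadoop.cloneConf flag Andrew mentions. Only the config key comes from his reply; the app name and the rest of the setup here are hypothetical.

    // Sketch only: enable the SPARK-2546 workaround Andrew describes above.
    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("als-job")                   // hypothetical app name
      .set("spark.hadoop.cloneConf", "true")   // clone the Hadoop Configuration per task
    val sc = new SparkContext(conf)

    // Or equivalently on the command line:
    //   spark-submit --conf spark.hadoop.cloneConf=true ...
)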