Turns out that I was just being idiotic and had assigned so much memory to
Spark that the OS ended up continually swapping. Apologies for the
noise.
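
(For anyone who finds this thread later: the mistake was giving the executors
more memory than the machines could actually back with physical RAM, so the
kernel spent its time swapping. The setting involved was roughly the sketch
below, with a purely illustrative value rather than my real one:

    import org.apache.spark.SparkConf

    // illustrative only: executor memory set well above the node's physical RAM
    val conf = new SparkConf()
      .set("spark.executor.memory", "60g")

The executor memory needs to fit comfortably within each node's physical RAM.)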
Phil
On Wed, Dec 24, 2014 at 1:16 AM, Andrew Ash wrote:
> Hi Phil,
>
> This sounds a lot like a deadlock in Hadoop's Configuration object that I
> ran into a while back. If you jstack the JVM and see a thread that looks
> like the below, it could be
> https://issues.apache.org/jira/browse/SPARK-2546
>
> "Executor task launch worker-6" daemon prio=10 tid=0x7f91f01fe000
> nid=0x54b1 runnable [0x7f92d74f1000]
>    java.lang.Thread.State: RUNNABLE
> at java.util.HashMap.transfer(HashMap.java:601)
> at java.util.HashMap.resize(HashMap.java:581)
> at java.util.HashMap.addEntry(HashMap.java:879)
> at java.util.HashMap.put(HashMap.java:505)
> at org.apache.hadoop.conf.Configuration.set(Configuration.java:803)
> at org.apache.hadoop.conf.Configuration.set(Configuration.java:783)
> at org.apache.hadoop.conf.Configuration.setClass(Configuration.java:1662)
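>
> (For reference, a dump like the one above can be captured with the JDK's
> jstack tool, e.g. by running "jstack <pid>" against the executor process,
> where <pid> is the executor JVM's process id.)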
>
>
> The fix for this issue is hidden behind a flag because it might have
> performance implications, but if it is this problem then you can set
> spark.hadoop.cloneConf=true and see if that fixes things.
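>
> As a sketch (assuming you build the SparkConf in your driver rather than
> relying only on spark-defaults.conf), the flag can be set like this:
>
>   import org.apache.spark.{SparkConf, SparkContext}
>
>   val conf = new SparkConf()
>     .setAppName("als-job") // placeholder name
>     .set("spark.hadoop.cloneConf", "true") // clone the Hadoop Configuration per task
>   val sc = new SparkContext(conf)
>
> or, equivalently, pass --conf spark.hadoop.cloneConf=true to spark-submit.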
>
> Good luck!
> Andrew
>
> On Tue, Dec 23, 2014 at 9:40 AM, Phil Wills wrote:
>
>> I've been attempting to run a job based on MLlib's ALS implementation for
>> a while now and have hit an issue I'm having a lot of difficulty getting to
>> the bottom of.
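>>
>> (For context, the job itself is nothing exotic; it is essentially the
>> standard MLlib call, roughly along the lines of the sketch below, where the
>> input parsing and the parameter values are schematic stand-ins for the real
>> ones:
>>
>>   import org.apache.spark.mllib.recommendation.{ALS, Rating}
>>
>>   // schematic: real input format and parameter values differ
>>   val rank = 10
>>   val numIterations = 20
>>   val lambda = 0.01
>>   val ratings = sc.textFile("hdfs:///path/to/ratings").map { line =>
>>     val Array(user, product, rating) = line.split(',')
>>     Rating(user.toInt, product.toInt, rating.toDouble)
>>   }
>>   val model = ALS.train(ratings, rank, numIterations, lambda)
>>
>> with sc being the usual SparkContext.)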
>>
>> On a moderately sized set of input data it works fine, but against larger
>> (still well short of what I'd think of as big) sets of data, I'll see one
>> or two workers get stuck spinning at 100% CPU and the job unable to
>> recover.
>>
>> I don't believe this is down to memory pressure as I seem to get the same
>> behaviour at about the same size of input data, even if the cluster is
>> twice as large. GC logs also suggest things are proceeding reasonably, with
>> some Full GCs occurring but no indication that the process is locked up in GC.
>>
>> After rebooting the instance that got into trouble, I can see that the
>> stderr log for the task is truncated in the middle of a log line at the
>> point the CPU shoots to and sticks at 100%, but there are no other signs
>> of a problem.
>>
>> I've run into the same issue on 1.1.0 and 1.2.0 in standalone mode and
>> running on YARN.
>>
>> Any suggestions on further steps I could try to get a clearer diagnosis
>> of the issue would be much appreciated.
>>
>> Thanks,
>>
>> Phil
>>
>
>