Hi Spark users,

I just started learning to run a standalone Spark application on a standalone cluster, and I am very impressed by how easy it is to program with Spark. But when I run it on a large input (about 30 GB) on my cluster, I hit errors like "Removing BlockManager" and "worker lost".

The application runs one iteration of the K-means algorithm with initial K = 16.

The cluster has 15 nodes with 4 GB RAM and 4 cores each (one of the nodes acts as both master and slave).

I am running Spark 0.8.0, built against Hadoop 1.1.1 for accessing HDFS.

In the spark-env.sh (on all the nodes and in the same directory):

export SPARK_WORKER_MEMORY=8g
export HADOOP_CONF_DIR="/share/hadoop-1.1.1/conf"
export SPARK_JAVA_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps 
-XX:+UseCompressedOops"

In the driver program:

System.setProperty("spark.default.parallelism", "160");
System.setProperty("spark.storage.memoryFraction", "0.1");
System.setProperty("spark.executor.memory", "8g");
System.setProperty("spark.worker.timeout", "6000");
System.setProperty("spark.akka.frameSize", "10000");
System.setProperty("spark.akka.timeout", "6000");
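
One detail that may matter (this is my understanding of 0.8, not something I can point to in the docs): these properties only take effect if they are set before the SparkContext is created. A trimmed sketch of my driver setup (the class name and the commented-out context line are placeholders, not my actual code):

```java
// Trimmed sketch of the driver setup; class name is a placeholder.
public class KMeansDriver {
    public static void main(String[] args) {
        // Set all spark.* properties BEFORE constructing the SparkContext;
        // as far as I understand, properties set afterwards are ignored
        // by the context that was already created.
        System.setProperty("spark.default.parallelism", "160");
        System.setProperty("spark.executor.memory", "8g");
        // ... the remaining spark.* properties listed above ...

        // JavaSparkContext sc = new JavaSparkContext(masterUrl, "KMeans");

        // Sanity check that the property is visible in this JVM.
        System.out.println(System.getProperty("spark.default.parallelism"));
    }
}
```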

In the program I use groupByKey(), which groups all the input by a cluster id (the key). It turns out that one key alone has 7.8 GB of data, which is why I set spark.executor.memory to 8g; if I lower it, I get an OOM. I also need to write all the data back to HDFS after clustering.
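
To make the hot spot concrete: groupByKey materializes every value for a key on a single executor, so a 7.8 GB key means 7.8 GB landing on one machine. The centroid update itself only needs a running (sum, count) per cluster id (though I still need the grouped data for the HDFS write). A plain-Java sketch of that merge logic, with made-up names and 1-D points to keep it short; this is only to illustrate the shape of the aggregation, not my actual job:

```java
import java.util.HashMap;
import java.util.Map;

public class CentroidSketch {
    // Fold each (clusterId, value) pair into a running {sum, count} per
    // cluster, instead of collecting every value for a key at once
    // (which is what groupByKey does).
    static Map<Integer, double[]> aggregate(int[] clusterIds, double[] values) {
        Map<Integer, double[]> acc = new HashMap<>();
        for (int i = 0; i < values.length; i++) {
            double[] sc = acc.computeIfAbsent(clusterIds[i],
                                              k -> new double[]{0.0, 0.0});
            sc[0] += values[i]; // running sum
            sc[1] += 1.0;       // running count
        }
        return acc;
    }

    public static void main(String[] args) {
        int[] ids = {0, 1, 0, 1, 0};
        double[] vals = {1.0, 2.0, 3.0, 4.0, 5.0};
        Map<Integer, double[]> acc = aggregate(ids, vals);
        // Centroid of a cluster = sum / count.
        System.out.println(acc.get(0)[0] / acc.get(0)[1]); // (1+3+5)/3 = 3.0
        System.out.println(acc.get(1)[0] / acc.get(1)[1]); // (2+4)/2   = 3.0
    }
}
```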

I looked at the Environment tab of the application UI and confirmed that the system properties are all set. But one weird thing is that I still get:

"13/10/31 11:48:57 WARN master.Master: Removing worker-20131031105954-pen13.xmen.eti-34747 because we got no heartbeat in 60 seconds"

Shouldn't this value be 6000, since I have set System.setProperty("spark.worker.timeout", "6000") and System.setProperty("spark.akka.timeout", "6000")?

I also looked at the worker nodes and found a lot of swapping going on, and a lot of GC; maybe that is why the workers get lost?

If anyone can give me a hint on how to configure the system for such an application and cluster, that would be great.

Thanks.

Bo