Hi all,

I am running a 30 GB Wikipedia dataset on a 7-server cluster, using the
WikipediaPageRank example under example/Bagel.

My Spark build is at commit bae07e3 ("fix different versions of
commons-lang dependency", an addendum to apache/spark#746).

The problem is that the job fails after several stages with an
OutOfMemoryError. The reason might be that the default executor memory size
is only *512 MB*.

I tried to modify the executor memory size via export
SPARK_JAVA_OPTS="-Dspark-cores-max=8 -Dspark.executor.memory=8g", but
SPARK_JAVA_OPTS is deprecated in Spark 1.0+, and the log also shows an ERROR
from SparkConf.

   - Does anyone know the difference between executor memory/cores and
   worker memory/cores?
   - How do I set the executor memory in Spark 1.0+? (My current guess is
   sketched below; please correct me.)
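
From the 1.0 docs, my guess is that these settings should now go through
spark-submit instead of SPARK_JAVA_OPTS. Is something like the following the
intended way? (The jar path and the application arguments below are only
placeholders for my setup.)

# sketch of what I plan to try, not verified
./bin/spark-submit \
  --class org.apache.spark.examples.bagel.WikipediaPageRank \
  --master spark://192.168.1.12:7077 \
  --executor-memory 4g \
  --total-executor-cores 8 \
  /path/to/spark-examples.jar <WikipediaPageRank args>

Or should spark.executor.memory rather go into conf/spark-defaults.conf so
that spark-submit picks it up?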

spark-env.sh:

export SPARK_WORKER_MEMORY=2g
export SPARK_MASTER_IP=192.168.1.12
export SPARK_MASTER_PORT=7077
export SPARK_WORKER_CORES=2
export SPARK_WORKER_INSTANCES=2
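
For reference, this is how I currently read my own config (please correct me
if I am wrong): with SPARK_WORKER_INSTANCES=2 and SPARK_WORKER_MEMORY=2g,
each node offers 2 x 2g = 4g to workers, but every executor JVM still gets
the 512 MB default heap unless spark.executor.memory is raised. Something
like the sketch below is what I am considering for my 8 GB nodes (the values
are only my guesses):

# spark-env.sh (sketch, not applied yet)
export SPARK_MASTER_IP=192.168.1.12
export SPARK_MASTER_PORT=7077
export SPARK_WORKER_INSTANCES=1   # one worker per node
export SPARK_WORKER_CORES=8       # offer all 8 cores of the node
export SPARK_WORKER_MEMORY=6g     # leave ~2 GB for the OS and daemons
# then set spark.executor.memory (e.g. 4g) from the application or
# spark-defaults.conf, so each executor heap exceeds the 512 MB default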

Each server has 8 GB of memory and an 8-core CPU. But after several stages,
the job fails with the following log output:

14/05/19 22:29:32 WARN TaskSetManager: Loss was due to
java.lang.OutOfMemoryError
java.lang.OutOfMemoryError: Java heap space
14/05/19 22:29:32 INFO SparkDeploySchedulerBackend: Executor 10
disconnected, so removing it
14/05/19 22:29:32 ERROR TaskSchedulerImpl: Lost executor 10 on
host125: remote Akka client disassociat
...
14/05/19 22:29:33 ERROR SparkDeploySchedulerBackend: Application has
been killed. Reason: Master removed our application: FAILED
14/05/19 22:29:33 WARN TaskSetManager: Loss was due to fetch failure
from BlockManagerId(10, host125,
java.io.IOException: Filesystem closed
    at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:629)
    at 
org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:735)
    at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:793)
    at java.io.DataInputStream.read(DataInputStream.java:100)
    at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:211)
    ...
14/05/19 22:29:33 INFO DAGScheduler: Failed to run foreach at Bagel.scala:251
Exception in thread "main" org.apache.spark.SparkException: Job
aborted due to stage failure: Master removed our application: FAILED
14/05/19 22:29:33 INFO TaskSchedulerImpl: Cancelling stage 4
14/05/19 22:29:33 INFO TaskSchedulerImpl: Stage 4 was cancelled
14/05/19 22:29:33 WARN TaskSetManager: Loss was due to java.io.IOException
java.io.IOException: Failed on local exception:
java.io.InterruptedIOException: Interruped while waiting for IO on
channel java.nio.chan
nels.SocketChannel[connected local=/192.168.1.123:54254
remote=/192.168.1.12:9000]. 59922 millis timeout left.; Host Details :
local hos
t is: "host123/192.168.1.123"; destination host is: "sing12":9000;
        at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:764)
    ...


Regards,
Wang Hao(王灏)

CloudTeam | School of Software Engineering
Shanghai Jiao Tong University
Address: 800 Dongchuan Road, Minhang District, Shanghai, 200240
Email: wh.s...@gmail.com
