Hi Guys

I am trying to run the Spark TeraSort benchmark provided by ehiggs
<https://github.com/ehiggs/spark-terasort> on GitHub. TeraSort on 1 GB,
10 GB, and 100 GB works fine, but at 1000 GB the program runs into
problems. The 1000 GB TeraSort does complete on a single node in about
5 hours, but on multiple nodes it always fails.

The errors show that executors are being lost, and they keep failing
until the job is automatically killed. Again, the 1000 GB TeraSort
completes on a single node; it's the multi-node run that is the problem.
I guess there are some coordination and timeout issues between the nodes.

The command that I am using is:

time $SPARK_HOME/bin/spark-submit \
  --master spark://master-ip:7077 \
  --conf "spark.akka.timeout=2400" \
  --conf "spark.akka.askTimeout=2400" \
  --conf "spark.akka.frameSize=500" \
  --conf "spark.core.connection.ack.wait.timeout=2400" \
  --conf "spark.driver.maxResultSize=16g" \
  --conf "spark.driver.cores=10" \
  --conf "spark.executor.memory=4g" \
  --conf "spark.driver.memory=50g" \
  --driver-memory 50g \
  --conf "spark.eventLog.enabled=true" \
  --conf "spark.eventLog.dir=hdfs://master-ip:54310/sparkevents" \
  --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.properties" \
  --class com.github.ehiggs.spark.terasort.TeraSort \
  ~/spark-terasort/target/spark-terasort-1.0-jar-with-dependencies.jar \
  hdfs://master-ip:54310/teragen-1t hdfs://master-ip:54310/terasort-1t

The 1 TB teragen run that I did prior to this used 20000 partitions
("spark.default.parallelism=20000").
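
For reference, here is a rough sketch of what that partition count works
out to per partition. It assumes "1 TB" means 10^12 bytes (the actual
teragen size may differ slightly) and uses the 100-byte record size of the
TeraGen/TeraSort format; the variable names are just for illustration.

```python
# Back-of-the-envelope partition sizing for the 1 TB teragen output.
# Assumption: "1 TB" is taken as 10**12 bytes (decimal), not 2**40.
TOTAL_BYTES = 10**12
NUM_PARTITIONS = 20_000   # spark.default.parallelism used for teragen
RECORD_BYTES = 100        # TeraGen/TeraSort records are 100 bytes each

bytes_per_partition = TOTAL_BYTES // NUM_PARTITIONS
records_per_partition = bytes_per_partition // RECORD_BYTES

print(f"{bytes_per_partition / 10**6:.0f} MB per partition")
print(f"{records_per_partition:,} records per partition")
```

Under those assumptions each partition is about 50 MB, or 500,000 records.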

And these are the specifications and configurations:

hardware:
1 master, 2 slaves
master -> 96 cores, 137 GB RAM
slaves -> 192 cores, 237 GB RAM

spark configuration:
slaves -> 64 workers, 3 cores for each worker, 3 GB RAM for each worker
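
To make the totals explicit, here is a small sketch of the arithmetic
behind this layout. It assumes the 64 workers are per slave (which matches
192 cores / 3 cores per worker) and that the hardware figures are per
slave; both are assumptions, and the names are only for illustration.

```python
# Totals implied by the worker layout described above.
# Assumptions: 64 workers run on EACH slave, and 192 cores / 237 GB
# are the resources of each slave machine.
SLAVES = 2
WORKERS_PER_SLAVE = 64
CORES_PER_WORKER = 3
MEM_PER_WORKER_GB = 3
SLAVE_CORES = 192
SLAVE_RAM_GB = 237

cores_used_per_slave = WORKERS_PER_SLAVE * CORES_PER_WORKER
mem_used_per_slave_gb = WORKERS_PER_SLAVE * MEM_PER_WORKER_GB
total_worker_cores = SLAVES * cores_used_per_slave
total_worker_mem_gb = SLAVES * mem_used_per_slave_gb

print(f"per slave: {cores_used_per_slave}/{SLAVE_CORES} cores, "
      f"{mem_used_per_slave_gb}/{SLAVE_RAM_GB} GB given to workers")
print(f"cluster:   {total_worker_cores} cores, {total_worker_mem_gb} GB")
```

With those assumptions, the workers take all 192 cores and 192 GB of the
237 GB on each slave, for 384 cores and 384 GB across the cluster.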

I am running the program from another machine which is not part of this
cluster but has SSH access to and from every machine.

I have tried a lot of configurations, but it failed every time. The
one above is the latest one that is failing.

Can anyone help me design a configuration, or suggest properties to set,
so that the executors do not fail and the TeraSort completes?

-- 
Thank You

Regards

Punit Naik
