Hi guys,

I am trying to run the Spark TeraSort benchmark provided by ehiggs on GitHub <https://github.com/ehiggs/spark-terasort>. TeraSort on 1 GB, 10 GB and 100 GB works fine, but at 1000 GB the job runs into problems. On a single node the 1000 GB TeraSort completes in about 5 hours, but on multiple nodes it always fails.
The errors show that executors are being lost, and they keep failing until the job is automatically killed. Again, the 1000 GB TeraSort completes on a single node; it is only the multi-node run that fails. I guess there are some coordination and timeout issues between the nodes.

The command that I am using is:

    time $SPARK_HOME/bin/spark-submit \
      --master spark://master-ip:7077 \
      --conf "spark.akka.timeout=2400" \
      --conf "spark.akka.askTimeout=2400" \
      --conf "spark.akka.frameSize=500" \
      --conf "spark.core.connection.ack.wait.timeout=2400" \
      --conf "spark.driver.maxResultSize=16g" \
      --conf "spark.driver.cores=10" \
      --conf "spark.executor.memory=4g" \
      --conf "spark.driver.memory=50g" \
      --driver-memory 50g \
      --conf "spark.eventLog.enabled=true" \
      --conf "spark.eventLog.dir=hdfs://master-ip:54310/sparkevents" \
      --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.properties" \
      --class com.github.ehiggs.spark.terasort.TeraSort \
      ~/spark-terasort/target/spark-terasort-1.0-jar-with-dependencies.jar \
      hdfs://master-ip:54310/teragen-1t \
      hdfs://master-ip:54310/terasort-1t

The 1 TB teragen that I ran prior to this had 20000 partitions ("spark.default.parallelism=20000").

These are the specifications and configurations:

Hardware: 1 master, 2 slaves
    master -> 96 cores, 137 GB RAM
    slaves -> 192 cores, 237 GB RAM

Spark configuration:
    slaves -> 64 workers, 3 cores for each worker, 3 GB RAM for each worker

I am running the program from another machine which is not part of this cluster but has SSH access to and from every machine.

I have tried a lot of configurations, but every one of them failed. The one above is the latest that is failing. Can anyone help me design a configuration, or suggest properties to set, so that the executors stop failing and the TeraSort completes?

--
Thank You

Regards
Punit Naik
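
P.S. As a quick sanity check on the numbers above, here is a small Python sketch of the arithmetic. It is not part of the benchmark. It assumes that the 64 workers are per slave (64 x 3 cores matches the 192 cores listed for each slave) and that 1 GB = 10^9 bytes; the function name summarize is made up for illustration.

```python
# Figures copied from the post above. Treating the 64 workers as "per slave"
# and 1 GB as 10**9 bytes are assumptions, not something Spark reports.
DATA_BYTES = 1000 * 10**9   # 1000 GB TeraSort input
PARTITIONS = 20000          # spark.default.parallelism used for teragen
SLAVES = 2
WORKERS_PER_SLAVE = 64      # assumption: 64 workers on each slave
CORES_PER_WORKER = 3
WORKER_MEM_GB = 3           # RAM given to each worker
EXECUTOR_MEM_GB = 4         # spark.executor.memory=4g in the command


def summarize():
    """Return the derived cluster figures as a dict (hypothetical helper)."""
    workers = SLAVES * WORKERS_PER_SLAVE
    return {
        "mb_per_partition": DATA_BYTES / PARTITIONS / 10**6,
        "total_worker_cores": workers * CORES_PER_WORKER,
        "total_worker_mem_gb": workers * WORKER_MEM_GB,
        # True only if one executor of the requested size fits in one worker.
        "executor_fits_in_worker": EXECUTOR_MEM_GB <= WORKER_MEM_GB,
    }


if __name__ == "__main__":
    for name, value in summarize().items():
        print(f"{name}: {value}")
```

Under these assumptions each partition holds about 50 MB, and the 4g requested through spark.executor.memory is larger than the 3 GB given to each worker, which may be worth double-checking.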