Spark Executor dies in K8 cluster

Philipp Kraus Wed, 19 May 2021 05:24:17 -0700

Hello,

I have the following initial test setup:




Kubernetes Cluster 1.20 (4 nodes, each node with 120 GB hard disk, 4 cpus, 40 
GB memory)

Spark installed via the Bitnami Helm chart 
https://artifacthub.io/packages/helm/bitnami/spark (chart version 5.4.2 / Spark 
3.1.1)

GeoSpark version 1.3.2-SNAPSHOT (not Apache Sedona, because of migration 
issues) with the settings from https://sedona.apache.org/download/cluster/, i.e. 
spark.driver.memory 10g, spark.network.timeout 1000s, 
spark.driver.maxResultSize 5g

A fat-JAR Spring Boot application which runs some Spark algorithms on 
AdoptOpenJDK 1.8 (latest Docker image) with Spark 3.1.1 and Scala 2.12

NFS Server Provisioner installed via the Helm chart 
https://artifacthub.io/packages/helm/kvaps/nfs-server-provisioner to create a 
ReadWriteMany volume for the Spark workers and the application. This volume is 
mounted under /sparkdir on all pods, and the fat JAR file is stored there

Spark workers are configured with Helm as a ReplicaSet, so a new worker should 
be spawned at 75% CPU usage; by default 2 worker pods are running

The Spark master UI shows the workers with the correct memory and cpu resources 
(4 cores and 10 GB memory for each worker)

Application and Spark are running in the same namespace




In the Spring Boot application (running as a Docker image) I create a Spark 
config as follows (Helm release name "test"):

final String l_jar = "/sparkdir/myspringapp.jar";
new SparkConf().setMaster( "spark://test-spark-master-svc:7077" )
               .setAppName( "mySpringBootApp" )
               .setJars( Stream.of( l_jar ).toArray( String[]::new ) )
               .set( "spark.jars", l_jar )
               .set( "spark.driver.userClassPathFirst", l_jar )
               .set( "spark.kubernetes.container.image", "bitnami/spark:3.1.1" )
               .set( "spark.submit.deployMode", "cluster" )
               .set( "spark.driver.memory", "10G" )
               .set( "spark.executor.memory", "4G" )
               .set( "spark.network.timeout", "1000s" )
               .set( "spark.driver.maxResultSize", "5G" );

When I start the application and run the Spark job, the master receives the job 
and passes it to the workers. That part works fine, but the executors fail on 
the workers. Here is the log of one worker:

This script is deprecated, use start-worker.sh
starting org.apache.spark.deploy.worker.Worker, logging to 
/opt/bitnami/spark/logs/spark--org.apache.spark.deploy.worker.Worker-1-test-spark-worker-0.out
Spark Command: /opt/bitnami/java/bin/java -cp 
/opt/bitnami/spark/conf/:/opt/bitnami/spark/jars/* -Xmx1g 
org.apache.spark.deploy.worker.Worker --webui-port 8081 
spark://test-spark-master-svc:7077
========================================

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
21/05/18 18:56:11 INFO Worker: Started daemon with process name: 
41@test-spark-worker-0
21/05/18 18:56:11 INFO SignalUtils: Registering signal handler for TERM
21/05/18 18:56:11 INFO SignalUtils: Registering signal handler for HUP
21/05/18 18:56:11 INFO SignalUtils: Registering signal handler for INT
21/05/18 18:56:11 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
21/05/18 18:56:11 INFO SecurityManager: Changing view acls to: spark
21/05/18 18:56:11 INFO SecurityManager: Changing modify acls to: spark
21/05/18 18:56:11 INFO SecurityManager: Changing view acls groups to: 
21/05/18 18:56:11 INFO SecurityManager: Changing modify acls groups to: 
21/05/18 18:56:11 INFO SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users  with view permissions: Set(spark); groups 
with view permissions: Set(); users  with modify permissions: Set(spark); 
groups with modify permissions: Set()
21/05/18 18:56:11 INFO Utils: Successfully started service 'sparkWorker' on 
port 35561.
21/05/18 18:56:11 INFO Worker: Worker decommissioning not enabled, SIGPWR will 
result in exiting.
21/05/18 18:56:12 INFO Worker: Starting Spark worker 10.223.130.87:35561 with 4 
cores, 10.0 GiB RAM
21/05/18 18:56:12 INFO Worker: Running Spark version 3.1.1
21/05/18 18:56:12 INFO Worker: Spark home: /opt/bitnami/spark
21/05/18 18:56:12 INFO ResourceUtils: 
==============================================================
21/05/18 18:56:12 INFO ResourceUtils: No custom resources configured for 
spark.worker.
21/05/18 18:56:12 INFO ResourceUtils: 
==============================================================
21/05/18 18:56:12 INFO Utils: Successfully started service 'WorkerUI' on port 
8081.
21/05/18 18:56:12 INFO WorkerWebUI: Bound WorkerWebUI to 0.0.0.0, and started 
at 
http://test-spark-worker-0.test-spark-headless.workflow.svc.cluster.local:8081
21/05/18 18:56:12 INFO Worker: Connecting to master 
test-spark-master-svc:7077...
21/05/18 18:56:12 INFO TransportClientFactory: Successfully created connection 
to test-spark-master-svc/10.233.8.202:7077 after 31 ms (0 ms spent in 
bootstraps)
21/05/18 18:56:12 INFO Worker: Successfully registered with master 
spark://test-spark-master-0.test-spark-headless.workflow.svc.cluster.local:7077



---------- the next lines are shown on all workers in an infinite loop until I 
kill the application on the Spark master ----------

21/05/18 20:46:55 INFO Worker: Asked to launch executor 
app-20210518204655-0000/1 for f212b4b4-05df-4f22-a580-87cbe5fb9356
21/05/18 20:46:55 INFO SecurityManager: Changing view acls to: spark
21/05/18 20:46:55 INFO SecurityManager: Changing modify acls to: spark
21/05/18 20:46:55 INFO SecurityManager: Changing view acls groups to: 
21/05/18 20:46:55 INFO SecurityManager: Changing modify acls groups to: 
21/05/18 20:46:55 INFO SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users  with view permissions: Set(spark); groups 
with view permissions: Set(); users  with modify permissions: Set(spark); 
groups with modify permissions: Set()

21/05/18 20:46:55 INFO ExecutorRunner: Launch command: 
"/opt/bitnami/java/bin/java" "-cp" 
"/opt/bitnami/spark/conf/:/opt/bitnami/spark/jars/*" "-Xmx4096M" 
"-Dspark.network.timeout=1000s" "-Dspark.driver.port=41904" 
"org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" 
"spark://CoarseGrainedScheduler@test-workflowengine-567454667d-6c7p7:41904" 
"--executor-id" "1" "--hostname" "10.223.130.87" "--cores" "4" "--app-id" 
"app-20210518204655-0000" "--worker-url" "spark://Worker@10.223.130.87:35561"

21/05/18 20:46:56 INFO Worker: Executor app-20210518204655-0000/1 finished with 
state EXITED message Command exited with code 1 exitStatus 1
21/05/18 20:46:56 INFO ExternalShuffleBlockResolver: Clean up non-shuffle and 
non-RDD files associated with the finished executor 1
21/05/18 20:46:56 INFO ExternalShuffleBlockResolver: Executor is not registered 
(appId=app-20210518204655-0000, execId=1)
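
One detail of the launch command that could be checked from inside a worker 
pod is whether the host given in --driver-url is resolvable there, since the 
executor has to connect back to the driver. The following is only a minimal 
sketch of such a check in plain Java, not a confirmed diagnosis. The class name 
DriverUrlCheck and its helper methods are hypothetical, and the sketch assumes 
the URL has the form spark://name@host:port as it appears in the log.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

public class DriverUrlCheck {

    /** Returns the host part of a URL shaped like spark://name@host:port. */
    static String hostOf(String sparkUrl) {
        int at = sparkUrl.indexOf('@');
        int colon = sparkUrl.lastIndexOf(':');
        if (!sparkUrl.startsWith("spark://") || at < 0 || colon < at) {
            throw new IllegalArgumentException("Unexpected URL format: " + sparkUrl);
        }
        return sparkUrl.substring(at + 1, colon);
    }

    /** Returns the port part of a URL shaped like spark://name@host:port. */
    static int portOf(String sparkUrl) {
        hostOf(sparkUrl); // reuse the format validation above
        return Integer.parseInt(sparkUrl.substring(sparkUrl.lastIndexOf(':') + 1));
    }

    /** True if the local resolver can map the host name to an address. */
    static boolean resolves(String host) {
        try {
            InetAddress.getByName(host);
            return true;
        } catch (UnknownHostException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Default value copied from the ExecutorRunner launch command in the log.
        String driverUrl = args.length > 0
            ? args[0]
            : "spark://CoarseGrainedScheduler@test-workflowengine-567454667d-6c7p7:41904";
        String host = hostOf(driverUrl);
        System.out.println("driver host: " + host + ", port: " + portOf(driverUrl));
        System.out.println("resolvable from this machine: " + resolves(host));
    }
}
```

The result is only meaningful when the class is run inside a worker pod, 
because name resolution depends on the pod's DNS configuration.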




I have set up an equivalent structure with Spark in docker-compose, with the 
same configuration values (also in cluster mode), but on the K8s setup the 
executor fails, and I don't know how to find out what goes wrong or how to fix 
it. I would appreciate some help getting more information about the failure. I 
don't know whether this is an error in my K8s configuration, in the application 
code that initializes Spark, or in my worker / Spark configuration.
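
As a starting point for getting more information: a standalone worker normally 
keeps each executor's stdout and stderr files under its work directory, laid 
out as <spark home>/work/<app id>/<executor id>/. The sketch below assumes that 
this default location has not been changed (for example through 
SPARK_WORKER_DIR) and uses the Spark home and app id from the worker log. 
ExecutorLogLocator and its methods are hypothetical names.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

public class ExecutorLogLocator {

    /** Builds the assumed default path of an executor's stderr file. */
    static Path stderrPath(String sparkHome, String appId, String executorId) {
        return Paths.get(sparkHome, "work", appId, executorId, "stderr");
    }

    /** Returns at most the last n entries of the given lines. */
    static List<String> lastLines(List<String> lines, int n) {
        return lines.subList(Math.max(0, lines.size() - n), lines.size());
    }

    public static void main(String[] args) throws IOException {
        // Values taken from the worker log; adjust them for another app or executor.
        Path stderr = stderrPath("/opt/bitnami/spark", "app-20210518204655-0000", "1");
        if (!Files.isReadable(stderr)) {
            System.out.println("No readable file at " + stderr);
            return;
        }
        List<String> lines = Files.readAllLines(stderr, StandardCharsets.UTF_8);
        // Print the end of the file, where the failure is most likely reported.
        lastLines(lines, 50).forEach(System.out::println);
    }
}
```

This has to run inside the worker pod that launched the executor, since the 
work directory is local to that pod unless it has been placed on a shared 
volume.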

Thanks for any help

Phil
