Hi Spark Community,

We have a cluster running Spark 3.3.1. All nodes are AWS EC2 instances running Ubuntu 22.04.
One of the workers disconnected from the main node. When we run

$SPARK_HOME/sbin/start-worker.sh spark://{main_host}:{cluster_port}

it appears to succeed: there is no stderr, and stdout reports:

starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark/spark-3.3.1-bin-hadoop3/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-ri-worker-1.out

However, when we look at the log file named in stdout, we see the following:

Spark Command: /usr/lib/jvm/java-11-openjdk-amd64/bin/java -cp /opt/spark/spark-3.3.1-bin-hadoop3/conf/:/opt/spark/spark-3.3.1-bin-hadoop3/jars/* -Xmx1g org.apache.spark.deploy.worker.Worker --webui-port 8081 spark://10.113.62.58:7077
========================================
Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
23/08/23 20:01:02 INFO Worker: Started daemon with process name: 940835@ri-worker-1
23/08/23 20:01:02 INFO SignalUtils: Registering signal handler for TERM
23/08/23 20:01:02 INFO SignalUtils: Registering signal handler for HUP
23/08/23 20:01:02 INFO SignalUtils: Registering signal handler for INT
23/08/23 20:01:03 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
23/08/23 20:01:03 INFO SecurityManager: Changing view acls to: root
23/08/23 20:01:03 INFO SecurityManager: Changing modify acls to: root
23/08/23 20:01:03 INFO SecurityManager: Changing view acls groups to:
23/08/23 20:01:03 INFO SecurityManager: Changing modify acls groups to:
23/08/23 20:01:03 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
23/08/23 20:01:03 INFO Utils: Successfully started service 'sparkWorker' on port 43757.
23/08/23 20:01:03 INFO Worker: Worker decommissioning not enabled.
23/08/23 20:01:03 ERROR LevelDBProvider: error opening leveldb file file:/mnt/data_ebs/infrastructure/spark/tmp/registeredExecutors.ldb. Creating new file, will not be able to recover state for existing applications
org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: /opt/spark/spark-3.3.1-bin-hadoop3/sbin/file:/mnt/data_ebs/infrastructure/spark/tmp/registeredExecutors.ldb/LOCK: No such file or directory
    at org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
    at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
    at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
    at org.apache.spark.network.util.LevelDBProvider.initLevelDB(LevelDBProvider.java:48)
    at org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.<init>(ExternalShuffleBlockResolver.java:126)
    at org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.<init>(ExternalShuffleBlockResolver.java:99)
    at org.apache.spark.network.shuffle.ExternalBlockHandler.<init>(ExternalBlockHandler.java:81)
    at org.apache.spark.deploy.ExternalShuffleService.newShuffleBlockHandler(ExternalShuffleService.scala:82)
    at org.apache.spark.deploy.ExternalShuffleService.<init>(ExternalShuffleService.scala:56)
    at org.apache.spark.deploy.worker.Worker.<init>(Worker.scala:183)
    at org.apache.spark.deploy.worker.Worker$.startRpcEnvAndEndpoint(Worker.scala:966)
    at org.apache.spark.deploy.worker.Worker$.main(Worker.scala:934)
    at org.apache.spark.deploy.worker.Worker.main(Worker.scala)
23/08/23 20:01:03 WARN LevelDBProvider: error deleting file:/mnt/data_ebs/infrastructure/spark/tmp/registeredExecutors.ldb
23/08/23 20:01:03 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[main,5,main]
java.io.IOException: Unable to create state store
    at org.apache.spark.network.util.LevelDBProvider.initLevelDB(LevelDBProvider.java:77)
    at org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.<init>(ExternalShuffleBlockResolver.java:126)
    at org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.<init>(ExternalShuffleBlockResolver.java:99)
    at org.apache.spark.network.shuffle.ExternalBlockHandler.<init>(ExternalBlockHandler.java:81)
    at org.apache.spark.deploy.ExternalShuffleService.newShuffleBlockHandler(ExternalShuffleService.scala:82)
    at org.apache.spark.deploy.ExternalShuffleService.<init>(ExternalShuffleService.scala:56)
    at org.apache.spark.deploy.worker.Worker.<init>(Worker.scala:183)
    at org.apache.spark.deploy.worker.Worker$.startRpcEnvAndEndpoint(Worker.scala:966)
    at org.apache.spark.deploy.worker.Worker$.main(Worker.scala:934)
    at org.apache.spark.deploy.worker.Worker.main(Worker.scala)
Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: /opt/spark/spark-3.3.1-bin-hadoop3/sbin/file:/mnt/data_ebs/infrastructure/spark/tmp/registeredExecutors.ldb/LOCK: No such file or directory
    at org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
    at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
    at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
    at org.apache.spark.network.util.LevelDBProvider.initLevelDB(LevelDBProvider.java:75)
    ... 9 more
23/08/23 20:01:03 INFO ShutdownHookManager: Shutdown hook called

When we check the Spark master, the worker does not appear in the worker list.

Has anyone seen the error in the stack trace above, or does anyone know how to fix this?

All the best,
--
Jeremy Brent
Product Engineering Data Scientist
Data Intelligence & Machine Learning
Office: 732-562-6030
Cell: 732-336-0499
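P.S. One detail that stands out to us: in the IO error, the LevelDB path begins with /opt/spark/spark-3.3.1-bin-hadoop3/sbin/ prepended to the literal string file:/mnt/... That looks like a "file:" URI being handed to code that expects a plain filesystem path, so it is treated as a relative path and resolved against the worker's working directory ($SPARK_HOME/sbin, where start-worker.sh runs). A minimal sketch of that resolution behavior, using the path from our log (this is an illustration, not our actual config):

```python
import os

# The string from the stack trace: a "file:" URI passed where a plain
# filesystem path is expected.
recovery_dir = "file:/mnt/data_ebs/infrastructure/spark/tmp/registeredExecutors.ldb"

# Because the string does not start with "/", it is a *relative* path and
# is resolved against the current working directory -- for start-worker.sh
# that is effectively $SPARK_HOME/sbin, producing the doubled path in the
# error. We chdir to "/" here only to make the demo deterministic.
os.chdir("/")
print(os.path.abspath(recovery_dir))
# -> /file:/mnt/data_ebs/infrastructure/spark/tmp/registeredExecutors.ldb
```

If that is indeed the cause, our first guess would be to specify the directory as a bare path (/mnt/data_ebs/infrastructure/spark/tmp) rather than a file: URI in whichever setting supplies it on our nodes (we believe spark.local.dir / SPARK_LOCAL_DIRS), but we would appreciate confirmation from anyone who has hit this.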