As the error message indicates, the Spark assembly jar is not on the Spark executor classpath.
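As a sketch only (the jar path below is illustrative; use the actual location from your Spark installation), the two fixes look like this:

```shell
# Option 1: ship the Spark assembly jar with the application at submit time.
# The path and app name here are examples, not your actual values.
spark-submit \
  --master yarn \
  --jars /path/to/spark/jars/spark-assembly.jar \
  your_app.py

# Option 2: set it once for all jobs in conf/spark-defaults.conf
# (spark.jars is a comma-separated list of jars added to driver and
#  executor classpaths):
#
#   spark.jars  /path/to/spark/jars/spark-assembly.jar
```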
*Error: Could not find or load main class org.apache.spark.executor.CoarseGrainedExecutorBackend*

There are two ways to fix this:

1. Pass the Spark assembly jar with the --jars option of spark-submit.
2. Add the Spark assembly jar to the spark.jars property in spark-defaults.conf.

See https://spark.apache.org/docs/latest/configuration.html

On Thu, Oct 17, 2019 at 8:33 AM Manuel Sopena Ballesteros <manuel...@garvan.org.au> wrote:

> Dear Hadoop community,
>
> I am testing my yarn cluster with zeppelin notebooks and using spark engine to submit python code. The whole point of this exercise is to analyze how yarn behaves in different situations.
>
> For instance I am trying to run a code that needs relatively big amount of memory compared to the memory allocated to the spark driver.
>
> The job fails but I can’t easily see the reason of why this is happening so I am wondering whether I have other problems (e.g error in cluster configuration):
>
> This is the submitted code (the only thing I am trying to do is to allocate a big amount of memory to see how yarn behaves):
>
> a = "bigword"
> aList = []
> for i in range(1000):
>     aList.append(i**i*a)
> #print aList
>
> for word in aList:
>     print word
>
> The application fails and this is the error I can see in the logs:
>
> Exception 1:
> WARN [2019-10-17 09:28:58,567] ({Reporter} Logging.scala[logWarning]:66) - Container marked as failed: container_e15_1570749574365_0027_01_000003 on host: r940-1-1-mlx.mlx. Exit status: 1. Diagnostics: [2019-10-17 09:28:57.458]Exception from container-launch.
> Container id: container_e15_1570749574365_0027_01_000003
> Exit code: 1
> Exception message: Launch container failed
> Shell output: main : command provided 1
> main : run as user is mansop
> main : requested yarn user is mansop
> Getting exit code file...
> Creating script paths...
> Writing pid file...
> > Writing to tmp file > /d1/hadoop/yarn/local/nmPrivate/application_1570749574365_0027/container_e15_1570749574365_0027_01_000003/container_e15_1570749574365_0027_01_000003.pid.tmp > > Writing to cgroup task files... > > Creating local dirs... > > Launching container... > > Getting exit code file... > > Creating script paths... > > > > > > [2019-10-17 09:28:57.459]Container exited with a non-zero exit code 1. > Error file: prelaunch.err. > > Last 4096 bytes of prelaunch.err : > > Last 4096 bytes of stderr : > > Error: Could not find or load main class > org.apache.spark.executor.CoarseGrainedExecutorBackend > > > > > > [2019-10-17 09:28:57.462]Container exited with a non-zero exit code 1. > Error file: prelaunch.err. > > Last 4096 bytes of prelaunch.err : > > Last 4096 bytes of stderr : > > Error: Could not find or load main class > org.apache.spark.executor.CoarseGrainedExecutorBackend > > > > > > WARN [2019-10-17 09:28:58,571] ({dispatcher-event-loop-19} > Logging.scala[logWarning]:66) - Requesting driver to remove executor 2 for > reason Container marked as failed: > container_e15_1570749574365_0027_01_000003 on host: r940-1-1-mlx.mlx. Exit > status: 1. Diagnostics: [2019-10-17 09:28:57.458]Exception from > container-launch. > > Container id: container_e15_1570749574365_0027_01_000003 > > Exit code: 1 > > Exception message: Launch container failed > > Shell output: main : command provided 1 > > main : run as user is mansop > > main : requested yarn user is mansop > > Getting exit code file... > > Creating script paths... > > Writing pid file... > > Writing to tmp file > /d1/hadoop/yarn/local/nmPrivate/application_1570749574365_0027/container_e15_1570749574365_0027_01_000003/container_e15_1570749574365_0027_01_000003.pid.tmp > > Writing to cgroup task files... > > Creating local dirs... > > Launching container... > > Getting exit code file... > > Creating script paths... > > > > > > [2019-10-17 09:28:57.459]Container exited with a non-zero exit code 1. 
> Error file: prelaunch.err. > > Last 4096 bytes of prelaunch.err : > > Last 4096 bytes of stderr : > > Error: Could not find or load main class > org.apache.spark.executor.CoarseGrainedExecutorBackend > > > > > > [2019-10-17 09:28:57.462]Container exited with a non-zero exit code 1. > Error file: prelaunch.err. > > Last 4096 bytes of prelaunch.err : > > Last 4096 bytes of stderr : > > Error: Could not find or load main class > org.apache.spark.executor.CoarseGrainedExecutorBackend > > > > > > Exception 2: > > DEBUG [2019-10-17 09:29:20,583] ({pool-5-thread-1} > RemoteInterpreterServer.java[interpret]:426) - st: > > a = "bigword" > > aList = [] > > for i in range(1000): > > aList.append(i**i*a) > > #print aList > > > > for word in aList: > > print word > > DEBUG [2019-10-17 09:29:20,584] ({pool-5-thread-1} > RemoteInterpreterEventClient.java[sendEvent]:413) - Send Event: > RemoteInterpreterEvent(type:OUTPUT_UPDATE_ALL, > data:{"messages":[],"noteId":"2ENM9X82N","paragraphId":"20190926-163159_1153559848"}) > > DEBUG [2019-10-17 09:29:20,584] ({pool-5-thread-2} > RemoteInterpreterEventClient.java[pollEvent]:366) - Send event > OUTPUT_UPDATE_ALL > > INFO [2019-10-17 09:29:20,584] ({pool-6-thread-4} > SchedulerFactory.java[jobStarted]:109) - Job 20190926-163159_1153559848 > started by scheduler interpreter_1768035734 > > DEBUG [2019-10-17 09:29:20,585] ({pool-6-thread-4} > RemoteInterpreterServer.java[jobRun]:632) - Script after hooks: a = > "bigword" > > aList = [] > > for i in range(1000): > > aList.append(i**i*a) > > #print aList > > > > for word in aList: > > print word > > __zeppelin__._displayhook() > > DEBUG [2019-10-17 09:29:20,585] ({pool-6-thread-4} > RemoteInterpreterEventClient.java[sendEvent]:413) - Send Event: > RemoteInterpreterEvent(type:META_INFOS, data:{"message":"Spark UI > enabled","url":"http://r640-1-10-mlx.mlx:42304"}) > > DEBUG [2019-10-17 09:29:20,586] ({pool-5-thread-2} > RemoteInterpreterEventClient.java[pollEvent]:366) - Send event 
META_INFOS > > DEBUG [2019-10-17 09:29:20,684] ({pool-5-thread-3} > Interpreter.java[getProperty]:222) - key: zeppelin.spark.concurrentSQL, > value: false > > DEBUG [2019-10-17 09:29:20,684] ({pool-5-thread-3} > Interpreter.java[getProperty]:222) - key: zeppelin.spark.concurrentSQL, > value: false > > ERROR [2019-10-17 09:29:25,301] ({SIGTERM handler} > SignalUtils.scala[apply$mcZ$sp]:43) - RECEIVED SIGNAL TERM > > INFO [2019-10-17 09:29:25,308] ({shutdown-hook-0} > Logging.scala[logInfo]:54) - Invoking stop() from shutdown hook > > INFO [2019-10-17 09:29:25,327] ({shutdown-hook-0} > AbstractConnector.java[doStop]:318) - Stopped Spark@21f71c23 > {HTTP/1.1,[http/1.1]}{0.0.0.0:0} > > INFO [2019-10-17 09:29:25,331] ({shutdown-hook-0} > Logging.scala[logInfo]:54) - Stopped Spark web UI at > http://r640-1-10-mlx.mlx:42304 > > INFO [2019-10-17 09:29:25,387] ({dispatcher-event-loop-44} > Logging.scala[logInfo]:54) - Driver requested a total number of 0 > executor(s). > > INFO [2019-10-17 09:29:25,389] ({shutdown-hook-0} > Logging.scala[logInfo]:54) - Shutting down all executors > > INFO [2019-10-17 09:29:25,392] ({dispatcher-event-loop-51} > Logging.scala[logInfo]:54) - Asking each executor to shut down > > INFO [2019-10-17 09:29:25,398] ({shutdown-hook-0} > Logging.scala[logInfo]:54) - Stopping SchedulerExtensionServices > > (serviceOption=None, > > services=List(), > > started=false) > > ERROR [2019-10-17 09:29:25,407] ({Reporter} Logging.scala[logError]:91) - > Exception from Reporter thread. > > org.apache.hadoop.yarn.exceptions.ApplicationAttemptNotFoundException: > Application attempt appattempt_1570749574365_0027_000001 doesn't exist in > ApplicationMasterService cache. 
> > at > org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:404) > > at > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60) > > at > org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99) > > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524) > > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025) > > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876) > > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822) > > at java.security.AccessController.doPrivileged(Native Method) > > at javax.security.auth.Subject.doAs(Subject.java:422) > > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730) > > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682) > > > > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native > Method) > > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > > at java.lang.reflect.Constructor.newInstance(Constructor.java:423) > > at > org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53) > > at > org.apache.hadoop.yarn.ipc.RPCUtil.instantiateYarnException(RPCUtil.java:75) > > at > org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:116) > > at > org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:79) > > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > > at > 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > > at java.lang.reflect.Method.invoke(Method.java:498) > > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422) > > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165) > > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157) > > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95) > > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359) > > at com.sun.proxy.$Proxy21.allocate(Unknown Source) > > at > org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.allocate(AMRMClientImpl.java:320) > > at > org.apache.spark.deploy.yarn.YarnAllocator.allocateResources(YarnAllocator.scala:268) > > at > org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:556) > > Caused by: > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.yarn.exceptions.ApplicationAttemptNotFoundException): > Application attempt appattempt_1570749574365_0027_000001 doesn't exist in > ApplicationMasterService cache. 
> > at > org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:404) > > at > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60) > > at > org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99) > > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524) > > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025) > > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876) > > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822) > > at java.security.AccessController.doPrivileged(Native Method) > > at javax.security.auth.Subject.doAs(Subject.java:422) > > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730) > > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682) > > > > at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1497) > > at org.apache.hadoop.ipc.Client.call(Client.java:1443) > > at org.apache.hadoop.ipc.Client.call(Client.java:1353) > > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228) > > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116) > > at com.sun.proxy.$Proxy20.allocate(Unknown Source) > > at > org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:77) > > ... 13 more > > INFO [2019-10-17 09:29:25,411] ({dispatcher-event-loop-55} > Logging.scala[logInfo]:54) - MapOutputTrackerMasterEndpoint stopped! > > INFO [2019-10-17 09:29:25,412] ({Reporter} Logging.scala[logInfo]:54) - > Final app status: FAILED, exitCode: 12, (reason: Application attempt > appattempt_1570749574365_0027_000001 doesn't exist in > ApplicationMasterService cache. 
> > at > org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:404) > > at > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60) > > at > org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99) > > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524) > > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025) > > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876) > > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822) > > at java.security.AccessController.doPrivileged(Native Method) > > at javax.security.auth.Subject.doAs(Subject.java:422) > > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730) > > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682) > > ) > > INFO [2019-10-17 09:29:25,439] ({shutdown-hook-0} > Logging.scala[logInfo]:54) - MemoryStore cleared > > INFO [2019-10-17 09:29:25,440] ({shutdown-hook-0} > Logging.scala[logInfo]:54) - BlockManager stopped > > INFO [2019-10-17 09:29:25,441] ({shutdown-hook-0} > Logging.scala[logInfo]:54) - BlockManagerMaster stopped > > INFO [2019-10-17 09:29:25,445] ({dispatcher-event-loop-63} > Logging.scala[logInfo]:54) - OutputCommitCoordinator stopped! 
> > INFO [2019-10-17 09:29:25,459] ({shutdown-hook-0} > Logging.scala[logInfo]:54) - Successfully stopped SparkContext > > INFO [2019-10-17 09:29:25,461] ({shutdown-hook-0} > Logging.scala[logInfo]:54) - Shutdown hook called > > INFO [2019-10-17 09:29:25,463] ({shutdown-hook-0} > Logging.scala[logInfo]:54) - Deleting directory > /d0/hadoop/yarn/local/usercache/mansop/appcache/application_1570749574365_0027/spark-8328975e-fd10-4676-9b51-c800253c1016/pyspark-55a3ee94-1620-423b-8564-a0f2a7b1da76 > > INFO [2019-10-17 09:29:25,463] ({shutdown-hook-0} > Logging.scala[logInfo]:54) - Deleting directory > /d0/hadoop/yarn/local/usercache/mansop/appcache/application_1570749574365_0027/spark-8328975e-fd10-4676-9b51-c800253c1016 > > INFO [2019-10-17 09:29:25,463] ({shutdown-hook-0} > Logging.scala[logInfo]:54) - Deleting directory > /d1/hadoop/yarn/local/usercache/mansop/appcache/application_1570749574365_0027/spark-f0f652c9-2a8a-4235-818d-9f9165890c52 > > > > End of LogType:stdout > > > > > > Exception 3: > > INFO [2019-10-17 09:29:28,077] ({main} Logging.scala[logInfo]:54) - > Registered signal handler for TERM > > INFO [2019-10-17 09:29:28,079] ({main} Logging.scala[logInfo]:54) - > Registered signal handler for HUP > > INFO [2019-10-17 09:29:28,080] ({main} Logging.scala[logInfo]:54) - > Registered signal handler for INT > > INFO [2019-10-17 09:29:28,193] ({main} Logging.scala[logInfo]:54) - > Changing view acls to: mansop > > INFO [2019-10-17 09:29:28,193] ({main} Logging.scala[logInfo]:54) - > Changing modify acls to: mansop > > INFO [2019-10-17 09:29:28,194] ({main} Logging.scala[logInfo]:54) - > Changing view acls groups to: > > INFO [2019-10-17 09:29:28,194] ({main} Logging.scala[logInfo]:54) - > Changing modify acls groups to: > > INFO [2019-10-17 09:29:28,195] ({main} Logging.scala[logInfo]:54) - > SecurityManager: authentication disabled; ui acls disabled; users with > view permissions: Set(mansop); groups with view permissions: Set(); users > with modify 
permissions: Set(mansop); groups with modify permissions: Set() > > INFO [2019-10-17 09:29:28,493] ({main} Logging.scala[logInfo]:54) - > Preparing Local resources > > INFO [2019-10-17 09:29:29,195] ({main} Logging.scala[logInfo]:54) - > ApplicationAttemptId: appattempt_1570749574365_0027_000002 > > INFO [2019-10-17 09:29:29,219] ({main} Logging.scala[logInfo]:54) - > Starting the user application in a separate Thread > > INFO [2019-10-17 09:29:29,241] ({main} Logging.scala[logInfo]:54) - > Waiting for spark context initialization... > > INFO [2019-10-17 09:29:29,243] ({Driver} > RemoteInterpreterServer.java[main]:260) - > URL:jar:file:/d1/hadoop/yarn/local/usercache/mansop/appcache/application_1570749574365_0027/container_e15_1570749574365_0027_02_000001/__app__.jar!/org/apache/zeppelin/interpreter/remote/RemoteInterpreterServer.class > > INFO [2019-10-17 09:29:29,311] ({Driver} > RemoteInterpreterServer.java[<init>]:161) - Launching ThriftServer at > 10.0.1.12:41438 > > INFO [2019-10-17 09:29:29,319] ({Driver} > RemoteInterpreterServer.java[<init>]:165) - Starting remote interpreter > server on port 41438 > > INFO [2019-10-17 09:29:29,322] ({Thread-8} > RemoteInterpreterServer.java[run]:202) - Starting remote interpreter server > on port 41438 > > INFO [2019-10-17 09:29:30,340] ({Thread-9} > RemoteInterpreterUtils.java[registerInterpreter]:165) - callbackHost: > 10.0.11.18, callbackPort: 42835, callbackInfo: CallbackInfo(host:10.0.1.12, > port:41438) > > ERROR [2019-10-17 09:31:09,266] ({main} Logging.scala[logError]:91) - > Uncaught exception: > > java.util.concurrent.TimeoutException: Futures timed out after [100000 > milliseconds] > > at > scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) > > at > scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) > > at > org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:201) > > at > org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:498) > 
> at org.apache.spark.deploy.yarn.ApplicationMaster.org > $apache$spark$deploy$yarn$ApplicationMaster$$runImpl(ApplicationMaster.scala:345) > > at > org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$2.apply$mcV$sp(ApplicationMaster.scala:260) > > at > org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$2.apply(ApplicationMaster.scala:260) > > at > org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$2.apply(ApplicationMaster.scala:260) > > at > org.apache.spark.deploy.yarn.ApplicationMaster$$anon$5.run(ApplicationMaster.scala:815) > > at java.security.AccessController.doPrivileged(Native Method) > > at javax.security.auth.Subject.doAs(Subject.java:422) > > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730) > > at > org.apache.spark.deploy.yarn.ApplicationMaster.doAsUser(ApplicationMaster.scala:814) > > at > org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:259) > > at > org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:839) > > at > org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala) > > INFO [2019-10-17 09:31:09,273] ({main} Logging.scala[logInfo]:54) - Final > app status: FAILED, exitCode: 13, (reason: Uncaught exception: > java.util.concurrent.TimeoutException: Futures timed out after [100000 > milliseconds]) > > INFO [2019-10-17 09:31:09,284] ({shutdown-hook-0} > Logging.scala[logInfo]:54) - Deleting staging directory > hdfs://gl-hdp-ctrl01-mlx.mlx:8020/user/mansop/.sparkStaging/application_1570749574365_0027 > > INFO [2019-10-17 09:31:09,299] ({shutdown-hook-0} > Logging.scala[logInfo]:54) - Shutdown hook called > > > > End of LogType:stdout > > > > Could someone please help to understand what these errors means? 
>
> Manuel Sopena Ballesteros
> Big Data Engineer | Kinghorn Centre for Clinical Genomics
> e: manuel...@garvan.org.au