Hello, I'm running Spark 2.3 jobs on a Kubernetes cluster.

kubectl version:

Client Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.3", GitCommit:"d2835416544f298c919e2ead3be3d0864b52323b", GitTreeState:"clean", BuildDate:"2018-02-09T21:51:06Z", GoVersion:"go1.9.4", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.3", GitCommit:"f0efb3cb883751c5ffdbe6d515f3cb4fbe7b7acd", GitTreeState:"clean", BuildDate:"2017-11-08T18:27:48Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}

When I run spark-submit against the k8s master, the driver pod gets stuck in the Waiting: PodInitializing state. I have to manually kill the driver pod and submit the job again, and then it works. How can this be handled in production?
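For now the only mitigation I have is automating that manual kill. Below is a rough watchdog sketch I run from cron; the spark namespace and the spark-role=driver label match my setup (see the describe output further down), and the 10-minute cutoff plus GNU date are my own assumptions:

#!/bin/bash
# Delete Spark driver pods that have been Pending for too long (a pod stuck
# in PodInitializing reports phase Pending). Namespace, label selector and
# cutoff are specific to my cluster -- adjust as needed.
NS=spark
CUTOFF=600   # seconds
NOW=$(date +%s)

kubectl -n "$NS" get pods -l spark-role=driver \
  -o jsonpath='{range .items[?(@.status.phase=="Pending")]}{.metadata.name} {.metadata.creationTimestamp}{"\n"}{end}' |
while read -r POD CREATED; do
  [ -z "$POD" ] && continue
  AGE=$(( NOW - $(date -d "$CREATED" +%s) ))   # GNU date parses the ISO timestamp
  if [ "$AGE" -gt "$CUTOFF" ]; then
    echo "driver pod $POD pending for ${AGE}s, deleting"
    kubectl -n "$NS" delete pod "$POD"
  fi
done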
https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-25128

This happens when I submit jobs almost in parallel, i.e. submit 5 jobs one after the other in quick succession.

I'm running the Spark jobs on 20 nodes, each with the configuration below. I ran kubectl describe node on the node where the driver pod is running, and this is what I got. I do see that resources are overcommitted, but I expected the Kubernetes scheduler not to schedule onto a node whose resources are overcommitted or that is in the Not Ready state. In this case the node is Ready, but I observe the same behaviour when a node is Not Ready.

Name:               **********
Roles:              worker
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/hostname=****
                    node-role.kubernetes.io/worker=true
Annotations:        node.alpha.kubernetes.io/ttl=0
                    volumes.kubernetes.io/controller-managed-attach-detach=true
Taints:             <none>
CreationTimestamp:  Tue, 31 Jul 2018 09:59:24 -0400
Conditions:
  Type            Status  LastHeartbeatTime                 LastTransitionTime                Reason                      Message
  ----            ------  -----------------                 ------------------                ------                      -------
  OutOfDisk       False   Tue, 14 Aug 2018 09:31:20 -0400   Tue, 31 Jul 2018 09:59:24 -0400   KubeletHasSufficientDisk    kubelet has sufficient disk space available
  MemoryPressure  False   Tue, 14 Aug 2018 09:31:20 -0400   Tue, 31 Jul 2018 09:59:24 -0400   KubeletHasSufficientMemory  kubelet has sufficient memory available
  DiskPressure    False   Tue, 14 Aug 2018 09:31:20 -0400   Tue, 31 Jul 2018 09:59:24 -0400   KubeletHasNoDiskPressure    kubelet has no disk pressure
  Ready           True    Tue, 14 Aug 2018 09:31:20 -0400   Sat, 11 Aug 2018 00:41:27 -0400   KubeletReady                kubelet is posting ready status. AppArmor enabled
Addresses:
  InternalIP:  *****
  Hostname:    ******
Capacity:
  cpu:     16
  memory:  125827288Ki
  pods:    110
Allocatable:
  cpu:     16
  memory:  125724888Ki
  pods:    110
System Info:
  Machine ID:                 *************
  System UUID:                **************
  Boot ID:                    1493028d-0a80-4f2f-b0f1-48d9b8910e9f
  Kernel Version:             4.4.0-1062-aws
  OS Image:                   Ubuntu 16.04.4 LTS
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  docker://Unknown
  Kubelet Version:            v1.8.3
  Kube-Proxy Version:         v1.8.3
PodCIDR:     ******
ExternalID:  **************
Non-terminated Pods:  (11 in total)
  Namespace    Name                                                          CPU Requests  CPU Limits  Memory Requests  Memory Limits
  ---------    ----                                                          ------------  ----------  ---------------  -------------
  kube-system  calico-node-gj5mb                                             250m (1%)     0 (0%)      0 (0%)           0 (0%)
  kube-system  kube-proxy-****************************************           100m (0%)     0 (0%)      0 (0%)           0 (0%)
  kube-system  prometheus-prometheus-node-exporter-9cntq                     100m (0%)     200m (1%)   30Mi (0%)        50Mi (0%)
  logging      elasticsearch-elasticsearch-data-69df997486-gqcwg             400m (2%)     1 (6%)      8Gi (6%)         16Gi (13%)
  logging      fluentd-fluentd-elasticsearch-tj7nd                           200m (1%)     0 (0%)      612Mi (0%)       0 (0%)
  rook         rook-agent-6jtzm                                              0 (0%)        0 (0%)      0 (0%)           0 (0%)
  rook         rook-ceph-osd-10-6-42-250.accel.aws-cardda.cb4good.com-gwb8j  0 (0%)        0 (0%)      0 (0%)           0 (0%)
  spark        accelerate-test-5-a3bfb8a597e83d459193a183e17f13b5-exec-1     2 (12%)       0 (0%)      10Gi (8%)        12Gi (10%)
  spark        accelerate-testing-1-8ed0482f3bfb3c0a83da30bb7d433dff-exec-5  2 (12%)       0 (0%)      10Gi (8%)        12Gi (10%)
  spark        accelerate-testing-2-8cecc18bb42f31a386c6304bd63e9eba-driver  1 (6%)        0 (0%)      2Gi (1%)         2432Mi (1%)
  spark        accelerate-testing-2-e8bd0607cc693bc8ae25cc6dc300b2c7-driver  1 (6%)        0 (0%)      2Gi (1%)         2432Mi (1%)
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  CPU Requests  CPU Limits  Memory Requests  Memory Limits
  ------------  ----------  ---------------  -------------
  7050m (44%)   1200m (7%)  33410Mi (27%)    45874Mi (37%)
Events:  <none>
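One thing I noticed while digging: the scheduler only accounts for resource requests, not limits, and CPU requests here are at 7050m (44%), so from the scheduler's point of view this node is not actually overcommitted; only the limits are, which Kubernetes allows. To keep a burst of parallel submissions from piling onto busy nodes I'm experimenting with the sketch below; all numbers are placeholders for my sizing, not recommendations:

# Spark 2.3 sets CPU/memory *requests* on driver and executor pods but no
# CPU limit by default, and the scheduler only counts requests, so the
# "overcommitted" limits above are legal. Two knobs I'm trying:

# 1) pin explicit CPU limits on the pods at submit time
spark-submit \
  --conf spark.kubernetes.driver.limit.cores=1 \
  --conf spark.kubernetes.executor.limit.cores=2 \
  ...   # rest of the usual submit arguments

# 2) cap the total requests the spark namespace can hold, so a burst of
#    parallel submissions is rejected at admission instead of stacking up
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: ResourceQuota
metadata:
  name: spark-quota
  namespace: spark
spec:
  hard:
    requests.cpu: "12"
    requests.memory: 48Gi
EOF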
kubectl describe pod on the stuck driver gives the message below:

Name:           accelerate-testing-2-8cecc18bb42f31a386c6304bd63e9eba-driver
Namespace:      spark
Node:           ****
Start Time:     Mon, 13 Aug 2018 16:18:34 -0400
Labels:         launch-id=k8s-submit-service-cddf45ff-0d88-4681-af85-d8ed0359ce73
                spark-app-selector=spark-63f536fd87f8457796802767922ef7d9
                spark-role=driver
Annotations:    spark-app-name=accelerate-testing-2
Status:         Pending
IP:
Init Containers:
  spark-init:
    Container ID:
    Image:          ****:v2.3.0
    Image ID:
    Port:           <none>
    Args:
      init
      /etc/spark-init/spark-init.properties
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /etc/spark-init from spark-init-properties (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from spark-token-mj86g (ro)
      /var/spark-data/spark-files from download-files-volume (rw)
      /var/spark-data/spark-jars from download-jars-volume (rw)
Containers:
  spark-kubernetes-driver:
    Container ID:
    Image:          ******:v2.3.0
    Image ID:
    Port:           <none>
    Args:
      driver
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Limits:
      memory:  2432Mi
    Requests:
      cpu:     1
      memory:  2Gi
    Environment:
      SPARK_DRIVER_MEMORY:        2g
      SPARK_DRIVER_CLASS:         com.myclass
      SPARK_DRIVER_BIND_ADDRESS:  (v1:status.podIP)
      SPARK_MOUNTED_CLASSPATH:    /var/spark-data/spark-jars/quantum-workflow-2.2.24.0-SNAPSHOT-assembly.jar:/var/spark-data/spark-jars/my.jar
      SPARK_MOUNTED_FILES_DIR:    /var/spark-data/spark-files
      SPARK_JAVA_OPT_0:           -Dspark.kubernetes.container.image=***
      SPARK_JAVA_OPT_1:           -Dspark.jars=s3a://my/my.jar,s3a://my/my1.jar
      SPARK_JAVA_OPT_2:           -Dspark.submit.deployMode=cluster
      SPARK_JAVA_OPT_3:           -Dspark.driver.blockManager.port=7079
      SPARK_JAVA_OPT_4:           -Dspark.executor.memory=10g
      SPARK_JAVA_OPT_5:           -Dspark.app.id=spark-63f536fd87f8457796802767922ef7d9
      SPARK_JAVA_OPT_6:           -Dspark.kubernetes.authenticate.driver.serviceAccountName=spark
      SPARK_JAVA_OPT_7:           -Dspark.master=k8s://https://kubernetes.default
      SPARK_JAVA_OPT_8:           -Dspark.driver.host=spark-1534191513364-driver-svc.spark.svc
      SPARK_JAVA_OPT_9:           -Dspark.executor.cores=2
      SPARK_JAVA_OPT_10:          -Dspark.kubernetes.executor.podNamePrefix=accelerate-testing-2-8cecc18bb42f31a386c6304bd63e9eba
      SPARK_JAVA_OPT_11:          -Dspark.driver.port=7078
      SPARK_JAVA_OPT_12:          -Dspark.kubernetes.namespace=spark
      SPARK_JAVA_OPT_13:          -Dspark.executor.memoryOverhead=2g
      SPARK_JAVA_OPT_14:          -Dspark.kubernetes.initContainer.configMapName=accelerate-testing-2-8cecc18bb42f31a386c6304bd63e9eba-init-config
      SPARK_JAVA_OPT_15:          -Dspark.kubernetes.initContainer.configMapKey=spark-init.properties
      SPARK_JAVA_OPT_16:          -Dspark.executor.instances=10
      SPARK_JAVA_OPT_17:          -Dspark.memory.fraction=0.6
      SPARK_JAVA_OPT_18:          -Dspark.driver.memory=2g
      SPARK_JAVA_OPT_19:          -Dspark.kubernetes.driver.pod.name=accelerate-testing-2-8cecc18bb42f31a386c6304bd63e9eba-driver
      SPARK_JAVA_OPT_20:          -Dspark.app.name=accelerate-testing-2
      SPARK_JAVA_OPT_21:          -Dspark.kubernetes.driver.label.launch-id=********
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from spark-token-mj86g (ro)
      /var/spark-data/spark-files from download-files-volume (rw)
      /var/spark-data/spark-jars from download-jars-volume (rw)
Conditions:
  Type           Status
  Initialized    False
  Ready          False
  PodScheduled   True
Volumes:
  spark-init-properties:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      accelerate-testing-2-8cecc18bb42f31a386c6304bd63e9eba-init-config
    Optional:  false
  download-jars-volume:
    Type:    EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
  download-files-volume:
    Type:    EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
  spark-token-mj86g:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  spark-token-mj86g
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     <none>
Events:
  Type     Reason          Age                  From                                  Message
  ----     ------          ----                 ----                                  -------
  Normal   SandboxChanged  44m (x518 over 18h)  kubelet, ****************************  Pod sandbox changed, it will be killed and re-created.
  Warning  FailedSync      19s (x540 over 18h)  kubelet, ****************************  Error syncing pod
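For what it's worth, before killing a stuck driver I try to see whether the init container ever did anything. The pod name below is the one from the describe output above; the journalctl step assumes a systemd-managed kubelet on the node:

POD=accelerate-testing-2-8cecc18bb42f31a386c6304bd63e9eba-driver

# the spark-init container is what downloads the jars from s3a, so any
# progress at all would show up in its log
kubectl -n spark logs "$POD" -c spark-init

# the SandboxChanged / FailedSync loop is kubelet-side; on the node itself
# the kubelet journal usually carries the underlying error
journalctl -u kubelet | grep "$POD" | tail -n 50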