Re: spark driver pod stuck in Waiting: PodInitializing state in Kubernetes
Resurfacing the question to get more attention.

Hello,

> I'm running a Spark 2.3 job on a Kubernetes cluster (client v1.9.3, server v1.8.3).
>
> When I run spark-submit against the k8s master, the driver pod gets stuck in the Waiting: PodInitializing state. I had to manually kill the driver pod and submit a new job; then it works. How can this be handled in production?

This happens with executor pods as well:
https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-25128

> This happens if I submit the jobs almost in parallel, i.e. 5 jobs one after the other simultaneously.
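On the production question: until the underlying issue (SPARK-25128) is fixed, the manual "kill and resubmit" step can at least be automated. Below is a minimal sketch, assuming Spark 2.3's default `spark-role=driver` pod label; the timeout value and function names are my own, and a pod stuck in PodInitializing still reports phase Pending, which is what the check keys on:

```shell
# Workaround sketch: find Spark driver pods that have sat unscheduled or
# initializing for too long, so they can be deleted and resubmitted.
# spark-role=driver matches Spark 2.3's default labels (an assumption here).

STUCK_TIMEOUT_SECS=600

# List driver pods as "name phase creationTimestamp", one per line.
list_driver_pods() {
  kubectl get pods -l spark-role=driver -o \
    jsonpath='{range .items[*]}{.metadata.name} {.status.phase} {.metadata.creationTimestamp}{"\n"}{end}'
}

# A pod stuck in PodInitializing still reports phase Pending; treat it as
# stuck once it is older than the timeout.
is_stuck() {
  phase=$1; created=$2; now=$3
  created_secs=$(date -u -d "$created" +%s 2>/dev/null \
    || date -u -j -f '%Y-%m-%dT%H:%M:%SZ' "$created" +%s)
  [ "$phase" = "Pending" ] && [ $((now - created_secs)) -gt "$STUCK_TIMEOUT_SECS" ]
}

# Against a live cluster one would then loop:
#   now=$(date -u +%s)
#   list_driver_pods | while read -r name phase created; do
#     is_stuck "$phase" "$created" "$now" && kubectl delete pod "$name"
#   done
```

This only papers over the symptom, of course; the submitter still has to retry the job after the delete.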
Re: spark driver pod stuck in Waiting: PodInitializing state in Kubernetes
Hello,

> I'm running a Spark 2.3 job on a Kubernetes cluster.
>
> kubectl version
>
> Client Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.3", GitCommit:"d2835416544f298c919e2ead3be3d0864b52323b", GitTreeState:"clean", BuildDate:"2018-02-09T21:51:06Z", GoVersion:"go1.9.4", Compiler:"gc", Platform:"darwin/amd64"}
>
> Server Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.3", GitCommit:"f0efb3cb883751c5ffdbe6d515f3cb4fbe7b7acd", GitTreeState:"clean", BuildDate:"2017-11-08T18:27:48Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
>
> When I run spark-submit against the k8s master, the driver pod gets stuck in the Waiting: PodInitializing state. I had to manually kill the driver pod and submit a new job; then it works. How can this be handled in production?

https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-25128

> This happens if I submit the jobs almost in parallel, i.e. 5 jobs one after the other simultaneously.
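Worth noting for debugging: PodInitializing means the pod's init containers have not finished, and the pod's events usually name the actual blocker (image pull, volume mount, CNI/IP allocation). A small sketch (the function name is mine) that filters the Warning events out of `kubectl describe pod` output:

```shell
# Print the Warning events from `kubectl describe pod` output read on stdin;
# these usually name what is blocking initialization (FailedMount,
# ErrImagePull, network plugin errors, ...).
blocking_reason() {
  sed -n '/^Events:/,$p' | grep -i 'warning' || echo 'no warning events found'
}

# Intended usage against a live cluster (pod name is a placeholder):
#   kubectl describe pod my-spark-driver | blocking_reason
```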
spark driver pod stuck in Waiting: PodInitializing state in Kubernetes
I'm running a Spark 2.3 job on a Kubernetes cluster.

kubectl version

Client Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.3", GitCommit:"d2835416544f298c919e2ead3be3d0864b52323b", GitTreeState:"clean", BuildDate:"2018-02-09T21:51:06Z", GoVersion:"go1.9.4", Compiler:"gc", Platform:"darwin/amd64"}

Server Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.3", GitCommit:"f0efb3cb883751c5ffdbe6d515f3cb4fbe7b7acd", GitTreeState:"clean", BuildDate:"2017-11-08T18:27:48Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}

When I run spark-submit against the k8s master, the driver pod gets stuck in the Waiting: PodInitializing state. I have to manually kill the driver pod and submit a new job; then it works. This happens if I submit the jobs almost in parallel, i.e. 5 jobs one after the other simultaneously.

I'm running Spark jobs on 20 nodes, each with the configuration below.

I ran kubectl describe node on the node where the driver pod is running; this is what I got. I do see overcommit on resources, but I expected the Kubernetes scheduler not to schedule if resources on a node are overcommitted or the node is in NotReady state. In this case the node is in Ready state, but I observe the same behaviour when a node is NotReady.

Name:               **
Roles:              worker
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/hostname=
                    node-role.kubernetes.io/worker=true
Annotations:        node.alpha.kubernetes.io/ttl=0
                    volumes.kubernetes.io/controller-managed-attach-detach=true
Taints:
CreationTimestamp:  Tue, 31 Jul 2018 09:59:24 -0400
Conditions:
  Type            Status  LastHeartbeatTime                 LastTransitionTime                Reason                      Message
  ----            ------  -----------------                 ------------------                ------                      -------
  OutOfDisk       False   Tue, 14 Aug 2018 09:31:20 -0400   Tue, 31 Jul 2018 09:59:24 -0400   KubeletHasSufficientDisk    kubelet has sufficient disk space available
  MemoryPressure  False   Tue, 14 Aug 2018 09:31:20 -0400   Tue, 31 Jul 2018 09:59:24 -0400   KubeletHasSufficientMemory  kubelet has sufficient memory available
  DiskPressure    False   Tue, 14 Aug 2018 09:31:20 -0400   Tue, 31 Jul 2018 09:59:24 -0400   KubeletHasNoDiskPressure    kubelet has no disk pressure
  Ready           True    Tue, 14 Aug 2018 09:31:20 -0400   Sat, 11 Aug 2018 00:41:27 -0400   KubeletReady                kubelet is posting ready status. AppArmor enabled
Addresses:
  InternalIP:  *
  Hostname:    **
Capacity:
  cpu:     16
  memory:  125827288Ki
  pods:    110
Allocatable:
  cpu:     16
  memory:  125724888Ki
  pods:    110
System Info:
  Machine ID:                 *
  System UUID:                **
  Boot ID:                    1493028d-0a80-4f2f-b0f1-48d9b8910e9f
  Kernel Version:             4.4.0-1062-aws
  OS Image:                   Ubuntu 16.04.4 LTS
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  docker://Unknown
  Kubelet Version:            v1.8.3
  Kube-Proxy Version:         v1.8.3
PodCIDR:     **
ExternalID:  **
Non-terminated Pods:  (11 in total)
  Namespace    Name                                                          CPU Requests  CPU Limits  Memory Requests  Memory Limits
  ---------    ----                                                          ------------  ----------  ---------------  -------------
  kube-system  calico-node-gj5mb                                             250m (1%)     0 (0%)      0 (0%)           0 (0%)
  kube-system  kube-proxy-                                                   100m (0%)     0 (0%)      0 (0%)           0 (0%)
  kube-system  prometheus-prometheus-node-exporter-9cntq                     100m (0%)     200m (1%)   30Mi (0%)        50Mi (0%)
  logging      elasticsearch-elasticsearch-data-69df997486-gqcwg             400m (2%)     1 (6%)      8Gi (6%)         16Gi (13%)
  logging      fluentd-fluentd-elasticsearch-tj7nd                           200m (1%)     0 (0%)      612Mi (0%)       0 (0%)
  rook         rook-agent-6jtzm                                              0 (0%)        0 (0%)      0 (0%)           0 (0%)
  rook         rook-ceph-osd-10-6-42-250.accel.aws-cardda.cb4good.com-gwb8j  0 (0%)        0 (0%)      0 (0%)           0 (0%)
  spark        accelerate-test-5-a3bfb8a597e83d459193a183e17f13b5-exec-1     2 (12%)       0 (0%)      10Gi (8%)        12Gi (10%)
  spark        accelerate-testing-1-8ed0482f3bfb3c0a83da30bb7d433dff-exec-5  2 (12%)       0 (0%)      10Gi (8%)        12Gi (1
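On the overcommit expectation above: kube-scheduler only checks that the sum of pod *requests* fits within the node's Allocatable; *limits* are not considered at scheduling time, so limit percentages can add up past 100% without blocking placement. A sketch (the function name is mine) that totals CPU request quantities the way the scheduler compares them, in millicores:

```shell
# Sum Kubernetes CPU quantities (one per line, e.g. "250m" or "2") into
# millicores, the unit the scheduler checks against the node's Allocatable.
sum_millicores() {
  awk '{
         v = $1
         if (v ~ /m$/) { sub(/m$/, "", v); total += v }   # already millicores
         else          { total += v * 1000 }              # whole cores
       }
       END { print total "m" }'
}

# The CPU Requests column from the node above:
#   250m 100m 100m 400m 200m 0 0 2 2  ->  5050m, well under the 16-core
# Allocatable, so the scheduler sees no overcommit even though the memory
# *limits* column adds up past what the node could deliver under load.
```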