[jira] [Commented] (SPARK-26342) Support for NFS mount for Kubernetes
[ https://issues.apache.org/jira/browse/SPARK-26342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16721997#comment-16721997 ]

Yinan Li commented on SPARK-26342:
----------------------------------

Yes, that's true. Feel free to create a PR to add nfs and flex.

> Support for NFS mount for Kubernetes
> ------------------------------------
>
> Key: SPARK-26342
> URL: https://issues.apache.org/jira/browse/SPARK-26342
> Project: Spark
> Issue Type: Improvement
> Components: Kubernetes
> Affects Versions: 2.4.0
> Reporter: Eric Carlson
> Priority: Minor
>
> Currently only hostPath, emptyDir, and PVC volume types are accepted for
> Kubernetes-deployed drivers and executors. Possibility to mount NFS paths
> would allow access to a common and easy-to-deploy shared storage solution.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
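For anyone picking this up: if the new types follow the existing volume property scheme, an nfs mount might be requested as below. This is a sketch of what such a PR could add; the {{nfs}} volume type and its {{server}}/{{path}} option keys are assumptions mirroring the Kubernetes NFS volume source, not released configuration.

```shell
# Hypothetical nfs volume type under the existing
# spark.kubernetes.[driver|executor].volumes.<type>.<name>.* scheme.
# Server and export path here are placeholder values.
/opt/spark/bin/spark-submit \
  --deploy-mode cluster \
  --conf spark.kubernetes.driver.volumes.nfs.shared.mount.path=/mnt/shared \
  --conf spark.kubernetes.driver.volumes.nfs.shared.mount.readOnly=false \
  --conf spark.kubernetes.driver.volumes.nfs.shared.options.server=nfs.example.com \
  --conf spark.kubernetes.driver.volumes.nfs.shared.options.path=/export/shared \
  --conf spark.kubernetes.executor.volumes.nfs.shared.mount.path=/mnt/shared \
  --conf spark.kubernetes.executor.volumes.nfs.shared.options.server=nfs.example.com \
  --conf spark.kubernetes.executor.volumes.nfs.shared.options.path=/export/shared \
  ... other options ...
```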
[jira] [Resolved] (SPARK-26290) [K8s] Driver Pods no mounted volumes on submissions from older spark versions
[ https://issues.apache.org/jira/browse/SPARK-26290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yinan Li resolved SPARK-26290.
------------------------------
    Resolution: Not A Bug

> [K8s] Driver Pods no mounted volumes on submissions from older spark versions
> -----------------------------------------------------------------------------
>
> Key: SPARK-26290
> URL: https://issues.apache.org/jira/browse/SPARK-26290
> Project: Spark
> Issue Type: Bug
> Components: Kubernetes
> Affects Versions: 2.4.0
> Environment: Kubernetes: 1.10.6
> Container: Spark 2.4.0
> Spark containers are built from the archive served by [www.apache.org/dist/spark/|http://www.apache.org/dist/spark/]
> Submission is done by older Spark versions integrated e.g. in Livy
> Reporter: Martin Buchleitner
> Priority: Major
>
> I want to use the volume feature to mount an existing PVC as a read-only
> volume into the driver and also the executor.
> The executor gets the PVC mounted, but the driver is missing the mount.
> {code:java}
> /opt/spark/bin/spark-submit \
>   --deploy-mode cluster \
>   --class org.apache.spark.examples.SparkPi \
>   --conf spark.app.name=spark-pi \
>   --conf spark.executor.instances=4 \
>   --conf spark.kubernetes.namespace=spark-demo \
>   --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
>   --conf spark.kubernetes.container.image.pullPolicy=Always \
>   --conf spark.kubernetes.container.image=kube-spark:2.4.0 \
>   --conf spark.master=k8s://https:// \
>   --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.ddata.mount.path=/srv \
>   --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.ddata.mount.readOnly=true \
>   --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.ddata.options.claimName=nfs-pvc \
>   --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.data.mount.path=/srv \
>   --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.data.mount.readOnly=true \
>   --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.claimName=nfs-pvc \
>   /srv/spark-examples_2.11-2.4.0.jar
> {code}
> When I use the jar included in the container
> {code:java}
> local:///opt/spark/examples/jars/spark-examples_2.11-2.4.0.jar
> {code}
> the call works and I can review the pod descriptions to review the behavior.
>
> *Driver description*
> {code:java}
> Name: spark-pi-1544018157391-driver
> [...]
> Containers:
>   spark-kubernetes-driver:
>     Container ID: docker://3a31d867c140183247cb296e13a8b35d03835f7657dd7e625c59083024e51e28
>     Image: kube-spark:2.4.0
>     Image ID: [...]
>     Port:
>     Host Port:
>     State: Terminated
>       Reason: Completed
>       Exit Code: 0
>       Started: Wed, 05 Dec 2018 14:55:59 +0100
>       Finished: Wed, 05 Dec 2018 14:56:08 +0100
>     Ready: False
>     Restart Count: 0
>     Limits:
>       memory: 1408Mi
>     Requests:
>       cpu: 1
>       memory: 1Gi
>     Environment:
>       SPARK_DRIVER_MEMORY: 1g
>       SPARK_DRIVER_CLASS: org.apache.spark.examples.SparkPi
>       SPARK_DRIVER_ARGS:
>       SPARK_DRIVER_BIND_ADDRESS: (v1:status.podIP)
>       SPARK_MOUNTED_CLASSPATH: /opt/spark/examples/jars/spark-examples_2.11-2.4.0.jar
>       SPARK_JAVA_OPT_1: -Dspark.kubernetes.executor.volumes.persistentVolumeClaim.data.mount.path=/srv
>       SPARK_JAVA_OPT_3: -Dspark.app.name=spark-pi
>       SPARK_JAVA_OPT_4: -Dspark.kubernetes.driver.volumes.persistentVolumeClaim.ddata.mount.path=/srv
>       SPARK_JAVA_OPT_5: -Dspark.submit.deployMode=cluster
>       SPARK_JAVA_OPT_6: -Dspark.driver.blockManager.port=7079
>       SPARK_JAVA_OPT_7: -Dspark.kubernetes.driver.volumes.persistentVolumeClaim.ddata.mount.readOnly=true
>       SPARK_JAVA_OPT_8: -Dspark.kubernetes.authenticate.driver.serviceAccountName=spark
>       SPARK_JAVA_OPT_9: -Dspark.driver.host=spark-pi-1544018157391-driver-svc.spark-demo.svc.cluster.local
>       SPARK_JAVA_OPT_10: -Dspark.kubernetes.driver.pod.name=spark-pi-1544018157391-driver
>       SPARK_JAVA_OPT_11: -Dspark.kubernetes.driver.volumes.persistentVolumeClaim.ddata.options.claimName=nfs-pvc
>       SPARK_JAVA_OPT_12: -Dspark.kubernetes.executor.volumes.persistentVolumeClaim.data.mount.readOnly=true
>       SPARK_JAVA_OPT_13: -Dspark.driver.port=7078
>       SPARK_JAVA_OPT_14: -Dspark.jars=/opt/spark/examples/jars/spark-examples_2.11-2.4.0.jar
>       SPARK_JAVA_OPT_15: -Dspark.kubernetes.executor.podNamePrefix=spark-pi-1544018157391
>       SPARK_JAVA_OPT_16: -Dspark.local.dir=/tmp/spark-local
>       SPARK_JAVA_OPT_17: -Dspark.master=k8s://https://
> {code}
[jira] [Commented] (SPARK-26342) Support for NFS mount for Kubernetes
[ https://issues.apache.org/jira/browse/SPARK-26342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16721894#comment-16721894 ]

Yinan Li commented on SPARK-26342:
----------------------------------

So basically what you want is a generic way to mount arbitrary types of volumes. This is covered by SPARK-24434, which enables using a pod template to configure the driver and/or executor pods.

> Support for NFS mount for Kubernetes
> ------------------------------------
>
> Key: SPARK-26342
> URL: https://issues.apache.org/jira/browse/SPARK-26342
> Project: Spark
> Issue Type: Improvement
> Components: Kubernetes
> Affects Versions: 2.4.0
> Reporter: Eric Carlson
> Priority: Minor
>
> Currently only hostPath, emptyDir, and PVC volume types are accepted for
> Kubernetes-deployed drivers and executors. Possibility to mount NFS paths
> would allow access to a common and easy-to-deploy shared storage solution.
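For concreteness, the pod-template approach could look roughly like this once SPARK-24434 lands. The property name ({{spark.kubernetes.driver.podTemplateFile}}) and the merge behavior are assumptions based on the in-progress design, not released behavior; the template body is ordinary Kubernetes pod spec YAML, here mounting an NFS volume without any dedicated Spark volume properties.

```shell
# Hypothetical sketch: express an arbitrary volume (here NFS) in a pod
# template instead of dedicated spark.kubernetes.*.volumes.* properties.
cat > driver-template.yaml <<'EOF'
apiVersion: v1
kind: Pod
spec:
  volumes:
    - name: shared-data
      nfs:
        server: nfs.example.com   # assumed NFS server
        path: /export/shared
  containers:
    - name: spark-kubernetes-driver
      volumeMounts:
        - name: shared-data
          mountPath: /mnt/shared
EOF

/opt/spark/bin/spark-submit \
  --deploy-mode cluster \
  --conf spark.kubernetes.driver.podTemplateFile=driver-template.yaml \
  ... other options ...
```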
[jira] [Commented] (SPARK-26344) Support for flexVolume mount for Kubernetes
[ https://issues.apache.org/jira/browse/SPARK-26344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16721893#comment-16721893 ]

Yinan Li commented on SPARK-26344:
----------------------------------

This is covered by SPARK-24434, which enables using a pod template to configure the driver and/or executor pods.

> Support for flexVolume mount for Kubernetes
> -------------------------------------------
>
> Key: SPARK-26344
> URL: https://issues.apache.org/jira/browse/SPARK-26344
> Project: Spark
> Issue Type: Improvement
> Components: Kubernetes
> Affects Versions: 2.4.0
> Reporter: Eric Carlson
> Priority: Minor
>
> Currently only hostPath, emptyDir, and PVC volume types are accepted for
> Kubernetes-deployed drivers and executors.
> flexVolume types allow for pluggable volume drivers to be used in Kubernetes
> - a widely used example of this is the Rook deployment of CephFS, which
> provides a POSIX-compliant distributed filesystem integrated into K8s.
[jira] [Resolved] (SPARK-25515) Add a config property for disabling auto deletion of PODS for debugging.
[ https://issues.apache.org/jira/browse/SPARK-25515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yinan Li resolved SPARK-25515.
------------------------------
       Resolution: Fixed
    Fix Version/s: 3.0.0

> Add a config property for disabling auto deletion of PODS for debugging.
> ------------------------------------------------------------------------
>
> Key: SPARK-25515
> URL: https://issues.apache.org/jira/browse/SPARK-25515
> Project: Spark
> Issue Type: Improvement
> Components: Kubernetes
> Affects Versions: 3.0.0
> Reporter: Prashant Sharma
> Priority: Major
> Fix For: 3.0.0
>
> Currently, if a pod fails to start due to some failure, it gets removed and a
> new one is attempted. This sequence of events goes on until the app is
> killed. Given the speed of creation and deletion, it becomes difficult to
> debug the reason for failure.
> So adding a configuration parameter to disable auto-deletion of pods will be
> helpful for debugging.
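For readers landing here: the resolved change adds a boolean submission property to keep failed/terminated executor pods around instead of deleting and recreating them in a loop. The property name below ({{spark.kubernetes.executor.deleteOnTermination}}) is taken from the change merged for 3.0.0; verify it against your build before relying on it.

```shell
# Keep executor pods around after they terminate so they can be inspected.
# Property name assumed from the 3.0.0 change; the default deletes pods.
/opt/spark/bin/spark-submit \
  --deploy-mode cluster \
  --conf spark.kubernetes.executor.deleteOnTermination=false \
  ... other options ...

# The lingering pods can then be examined at leisure:
kubectl describe pod <failed-executor-pod>
kubectl logs <failed-executor-pod>
```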
[jira] [Commented] (SPARK-25922) [K8] Spark Driver/Executor "spark-app-selector" label mismatch
[ https://issues.apache.org/jira/browse/SPARK-25922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16677276#comment-16677276 ]

Yinan Li commented on SPARK-25922:
----------------------------------

The application ID used to set the {{spark-app-selector}} label for the driver pod is from this line: [https://github.com/apache/spark/blob/3404a73f4cf7be37e574026d08ad5cf82cfac871/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/KubernetesClientApplication.scala#L217]. The application ID used to set the {{spark-app-selector}} label for the executor pods is from this line: [https://github.com/apache/spark/blob/5264164a67df498b73facae207eda12ee133be7d/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/KubernetesClusterSchedulerBackend.scala#L87]. Agreed that it's problematic that two different labels are used.

> [K8] Spark Driver/Executor "spark-app-selector" label mismatch
> --------------------------------------------------------------
>
> Key: SPARK-25922
> URL: https://issues.apache.org/jira/browse/SPARK-25922
> Project: Spark
> Issue Type: Bug
> Components: Kubernetes
> Affects Versions: 2.4.0
> Environment: Spark 2.4.0 RC4
> Reporter: Anmol Khurana
> Priority: Major
>
> Hi,
> I have been testing Spark 2.4.0 RC4 on Kubernetes to run Python Spark
> applications and am running into an issue where the AppId label on the driver
> and executors mismatches. I am using
> [https://github.com/GoogleCloudPlatform/spark-on-k8s-operator] to run these
> applications.
> I see a spark.app.id of the form spark-* as the "spark-app-selector" label on
> the driver as well as in the K8s config-map which gets created for the driver
> via spark-submit. My guess is this is coming from
> [https://github.com/apache/spark/blob/f6cc354d83c2c9a757f9b507aadd4dbdc5825cca/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/KubernetesClientApplication.scala#L211]
> But when the driver actually comes up and brings up executors etc., I see
> that the "spark-app-selector" label on the executors as well as the
> spark.app.id config within the user code on the driver is something of the
> form spark-application-* (probably from
> [https://github.com/apache/spark/blob/b19a28dea098c7d6188f8540429c50f42952d678/core/src/main/scala/org/apache/spark/SparkContext.scala#L511]
> and
> [https://github.com/apache/spark/blob/bfb74394a5513134ea1da9fcf4a1783b77dd64e4/core/src/main/scala/org/apache/spark/scheduler/SchedulerBackend.scala#L26]).
> We were consuming this "spark-app-selector" label on the driver pod to get
> the app ID and use it to look up the app in the Spark History Server (among
> other use cases), but due to this mismatch that logic no longer works. This
> was working fine in the Spark 2.2 fork for Kubernetes which I was using
> earlier. Is this expected behavior, and if yes, what's the correct way to
> fetch the application ID from outside the application?
> Let me know if I can provide any more details or if I am doing something
> wrong.
> Here is an example run with different *spark-app-selector* labels on the
> driver/executor:
>
> {code:java}
> Name: pyfiles-driver
> Namespace: default
> Priority: 0
> PriorityClassName:
> Start Time: Thu, 01 Nov 2018 18:19:46 -0700
> Labels: spark-app-selector=spark-b78bb10feebf4e2d98c11d7b6320e18f
>         spark-role=driver
>         sparkoperator.k8s.io/app-name=pyfiles
>         sparkoperator.k8s.io/launched-by-spark-operator=true
>         version=2.4.0
> Status: Running
>
> Name: pyfiles-1541121585642-exec-1
> Namespace: default
> Priority: 0
> PriorityClassName:
> Start Time: Thu, 01 Nov 2018 18:24:02 -0700
> Labels: spark-app-selector=spark-application-1541121829445
>         spark-exec-id=1
>         spark-role=executor
>         sparkoperator.k8s.io/app-name=pyfiles
>         sparkoperator.k8s.io/launched-by-spark-operator=true
>         version=2.4.0
> Status: Pending
> {code}
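A quick way to reproduce the comparison above is to read the label straight off both pods. This is plain kubectl JSONPath, using the pod names from this report; with the mismatch described here, the two commands print different IDs.

```shell
# Read the spark-app-selector label from the driver and an executor pod.
# Pod names taken from the example run in this report.
kubectl -n default get pod pyfiles-driver \
  -o jsonpath='{.metadata.labels.spark-app-selector}'
kubectl -n default get pod pyfiles-1541121585642-exec-1 \
  -o jsonpath='{.metadata.labels.spark-app-selector}'
```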
[jira] [Commented] (SPARK-25787) [K8S] Spark can't use data locality information
[ https://issues.apache.org/jira/browse/SPARK-25787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16659593#comment-16659593 ]

Yinan Li commented on SPARK-25787:
----------------------------------

Support for data locality on k8s has not been ported to the upstream Spark repo yet.

> [K8S] Spark can't use data locality information
> -----------------------------------------------
>
> Key: SPARK-25787
> URL: https://issues.apache.org/jira/browse/SPARK-25787
> Project: Spark
> Issue Type: Bug
> Components: Kubernetes
> Affects Versions: 2.4.0
> Reporter: Maciej Bryński
> Priority: Major
>
> I started experimenting with Spark based on this presentation:
> https://www.slideshare.net/databricks/hdfs-on-kuberneteslessons-learned-with-kimoon-kim
> I'm using the excellent https://github.com/apache-spark-on-k8s/kubernetes-HDFS
> charts to deploy HDFS.
> Unfortunately, reading from HDFS gives ANY locality for every task.
> Is data locality working on a Kubernetes cluster?
[jira] [Commented] (SPARK-25796) Enable external shuffle service for kubernetes mode.
[ https://issues.apache.org/jira/browse/SPARK-25796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16659579#comment-16659579 ]

Yinan Li commented on SPARK-25796:
----------------------------------

See https://issues.apache.org/jira/browse/SPARK-24432.

> Enable external shuffle service for kubernetes mode.
> ----------------------------------------------------
>
> Key: SPARK-25796
> URL: https://issues.apache.org/jira/browse/SPARK-25796
> Project: Spark
> Issue Type: Improvement
> Components: Kubernetes
> Affects Versions: 3.0.0
> Reporter: Prashant Sharma
> Priority: Major
>
> This is required to support dynamic scaling for spark jobs.
[jira] [Updated] (SPARK-24432) Add support for dynamic resource allocation
[ https://issues.apache.org/jira/browse/SPARK-24432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yinan Li updated SPARK-24432:
-----------------------------
    Affects Version/s: 3.0.0

> Add support for dynamic resource allocation
> -------------------------------------------
>
> Key: SPARK-24432
> URL: https://issues.apache.org/jira/browse/SPARK-24432
> Project: Spark
> Issue Type: New Feature
> Components: Kubernetes
> Affects Versions: 2.4.0, 3.0.0
> Reporter: Yinan Li
> Priority: Major
>
> This is an umbrella ticket for work on adding support for dynamic resource
> allocation into the Kubernetes mode. This requires a Kubernetes-specific
> external shuffle service. The feature is available in our fork at
> github.com/apache-spark-on-k8s/spark.
[jira] [Commented] (SPARK-25742) Is there a way to pass the Azure blob storage credentials to the spark for k8s init-container?
[ https://issues.apache.org/jira/browse/SPARK-25742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16652064#comment-16652064 ]

Yinan Li commented on SPARK-25742:
----------------------------------

The k8s secrets you add through the {{spark.kubernetes.driver.secrets.}} config option will also get mounted into the init-container in the driver pod. You can use that to pass credentials for pulling dependencies into the driver init-container.

> Is there a way to pass the Azure blob storage credentials to the spark for
> k8s init-container?
> --------------------------------------------------------------------------
>
> Key: SPARK-25742
> URL: https://issues.apache.org/jira/browse/SPARK-25742
> Project: Spark
> Issue Type: Question
> Components: Kubernetes
> Affects Versions: 2.3.2
> Reporter: Oscar Bonilla
> Priority: Minor
>
> I'm trying to run spark on a kubernetes cluster in Azure. The idea is to
> store the Spark application jars and dependencies in a container in Azure
> Blob Storage.
> I've tried to do this with a public container and this works OK, but when
> having a private Blob Storage container, the spark-init init container
> doesn't download the jars.
> The equivalent in AWS S3 is as simple as adding the key_id and secret as
> environment variables, but I don't see how to do this for Azure Blob Storage.
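As a concrete sketch of the suggestion above: the secret name {{azure-blob-creds}}, its keys, and the mount path are illustrative choices here, not anything prescribed by Spark.

```shell
# Pre-create a Kubernetes secret holding the storage credentials.
kubectl -n spark create secret generic azure-blob-creds \
  --from-literal=storage-account=<account-name> \
  --from-literal=storage-key=<account-key>

# Mount it into the driver and executors; in 2.3.x the same volume is also
# mounted into the driver's dependency-fetching init-container.
/opt/spark/bin/spark-submit \
  --deploy-mode cluster \
  --conf spark.kubernetes.driver.secrets.azure-blob-creds=/mnt/secrets \
  --conf spark.kubernetes.executor.secrets.azure-blob-creds=/mnt/secrets \
  ... other options ...

# The credentials then appear as files /mnt/secrets/storage-account and
# /mnt/secrets/storage-key inside the containers.
```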
[jira] [Commented] (SPARK-25682) Docker images generated from dev build and from dist tarball are different
[ https://issues.apache.org/jira/browse/SPARK-25682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16644198#comment-16644198 ]

Yinan Li commented on SPARK-25682:
----------------------------------

Cool, thanks!

> Docker images generated from dev build and from dist tarball are different
> --------------------------------------------------------------------------
>
> Key: SPARK-25682
> URL: https://issues.apache.org/jira/browse/SPARK-25682
> Project: Spark
> Issue Type: Bug
> Components: Kubernetes
> Affects Versions: 2.4.0
> Reporter: Marcelo Vanzin
> Priority: Minor
>
> There's at least one difference I noticed, because of this line:
> {noformat}
> COPY examples /opt/spark/examples
> {noformat}
> In a dev build, "examples" contains your usual source code and maven-style
> directories, whereas in the dist version, it's this:
> {code}
> cp "$SPARK_HOME"/examples/target/scala*/jars/* "$DISTDIR/examples/jars"
> {code}
> So the path to the actual jar files ends up being different depending on how
> you built the image.
[jira] [Commented] (SPARK-25682) Docker images generated from dev build and from dist tarball are different
[ https://issues.apache.org/jira/browse/SPARK-25682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16644157#comment-16644157 ]

Yinan Li commented on SPARK-25682:
----------------------------------

That looks to me like the only difference. {{bin}}, {{sbin}}, and {{data}} are also hard-coded, but they appear to be the same between the source and a distribution. Are you working on a fix?

> Docker images generated from dev build and from dist tarball are different
> --------------------------------------------------------------------------
>
> Key: SPARK-25682
> URL: https://issues.apache.org/jira/browse/SPARK-25682
> Project: Spark
> Issue Type: Bug
> Components: Kubernetes
> Affects Versions: 2.4.0
> Reporter: Marcelo Vanzin
> Priority: Minor
>
> There's at least one difference I noticed, because of this line:
> {noformat}
> COPY examples /opt/spark/examples
> {noformat}
> In a dev build, "examples" contains your usual source code and maven-style
> directories, whereas in the dist version, it's this:
> {code}
> cp "$SPARK_HOME"/examples/target/scala*/jars/* "$DISTDIR/examples/jars"
> {code}
> So the path to the actual jar files ends up being different depending on how
> you built the image.
[jira] [Comment Edited] (SPARK-25500) Specify configmap and secrets in Spark driver and executor pods in Kubernetes
[ https://issues.apache.org/jira/browse/SPARK-25500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16625404#comment-16625404 ]

Yinan Li edited comment on SPARK-25500 at 9/24/18 5:51 AM:
-----------------------------------------------------------

We don't plan to add more configuration properties for pod customization as we move to a pod template model. See https://issues.apache.org/jira/browse/SPARK-24434. It supports all the use cases you mentioned above.
BTW: we already have {{spark.kubernetes.[driver|executor].secrets.[SecretName]=[MountPath]}} since Spark 2.3.

was (Author: liyinan926):
We don't plan to add more configuration properties for pod customization as we move to a pod template model. See https://issues.apache.org/jira/browse/SPARK-24434. It supports all the use cases you mentioned above.
BTW: we already have {{spark.kubernetes.\{driver|executor}.secrets.[SecretName]=[MountPath] }}since Spark 2.3{{.}}

> Specify configmap and secrets in Spark driver and executor pods in Kubernetes
> -----------------------------------------------------------------------------
>
> Key: SPARK-25500
> URL: https://issues.apache.org/jira/browse/SPARK-25500
> Project: Spark
> Issue Type: Improvement
> Components: Kubernetes
> Affects Versions: 2.3.1
> Reporter: Abhishek Rao
> Priority: Minor
>
> This uses SPARK-23529. Support for specifying configmaps and secrets as
> spark-configuration is requested.
> Using PR #22146, the above functionality can be achieved by passing a
> template file. However, for Spark properties (like log4j.properties,
> fairscheduler.xml and metrics.properties), we are proposing this approach as
> it is native to other configuration option specifications in Spark.
> The configmaps and secrets have to be pre-created before using this as Spark
> configuration.
[jira] [Commented] (SPARK-25500) Specify configmap and secrets in Spark driver and executor pods in Kubernetes
[ https://issues.apache.org/jira/browse/SPARK-25500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16625404#comment-16625404 ]

Yinan Li commented on SPARK-25500:
----------------------------------

We don't plan to add more configuration properties for pod customization as we move to a pod template model. See https://issues.apache.org/jira/browse/SPARK-24434. It supports all the use cases you mentioned above.
BTW: we already have {{spark.kubernetes.\{driver|executor}.secrets.[SecretName]=[MountPath] }}since Spark 2.3{{.}}

> Specify configmap and secrets in Spark driver and executor pods in Kubernetes
> -----------------------------------------------------------------------------
>
> Key: SPARK-25500
> URL: https://issues.apache.org/jira/browse/SPARK-25500
> Project: Spark
> Issue Type: Improvement
> Components: Kubernetes
> Affects Versions: 2.3.1
> Reporter: Abhishek Rao
> Priority: Minor
>
> This uses SPARK-23529. Support for specifying configmaps and secrets as
> spark-configuration is requested.
> Using PR #22146, the above functionality can be achieved by passing a
> template file. However, for Spark properties (like log4j.properties,
> fairscheduler.xml and metrics.properties), we are proposing this approach as
> it is native to other configuration option specifications in Spark.
> The configmaps and secrets have to be pre-created before using this as Spark
> configuration.
[jira] [Resolved] (SPARK-23200) Reset configuration when restarting from checkpoints
[ https://issues.apache.org/jira/browse/SPARK-23200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yinan Li resolved SPARK-23200.
------------------------------
       Resolution: Fixed
    Fix Version/s: 2.4.0

Issue resolved by pull request 22392
[https://github.com/apache/spark/pull/22392]

> Reset configuration when restarting from checkpoints
> ----------------------------------------------------
>
> Key: SPARK-23200
> URL: https://issues.apache.org/jira/browse/SPARK-23200
> Project: Spark
> Issue Type: Bug
> Components: Kubernetes
> Affects Versions: 2.4.0
> Reporter: Anirudh Ramanathan
> Priority: Major
> Fix For: 2.4.0
>
> Streaming workloads and restarting from checkpoints may need additional
> changes, i.e. resetting properties - see
> https://github.com/apache-spark-on-k8s/spark/pull/516
[jira] [Resolved] (SPARK-25291) Flakiness of tests in terms of executor memory (SecretsTestSuite)
[ https://issues.apache.org/jira/browse/SPARK-25291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yinan Li resolved SPARK-25291.
------------------------------
       Resolution: Fixed
    Fix Version/s: 2.4.0

> Flakiness of tests in terms of executor memory (SecretsTestSuite)
> -----------------------------------------------------------------
>
> Key: SPARK-25291
> URL: https://issues.apache.org/jira/browse/SPARK-25291
> Project: Spark
> Issue Type: Bug
> Components: Kubernetes
> Affects Versions: 2.4.0
> Reporter: Ilan Filonenko
> Priority: Major
> Fix For: 2.4.0
>
> SecretsTestSuite shows flakiness in terms of correct setting of executor
> memory:
> Run SparkPi with env and mount secrets. *** FAILED ***
> "[884]Mi" did not equal "[1408]Mi" (KubernetesSuite.scala:272)
> when run with default settings.
[jira] [Resolved] (SPARK-25295) Pod names conflicts in client mode, if previous submission was not a clean shutdown.
[ https://issues.apache.org/jira/browse/SPARK-25295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yinan Li resolved SPARK-25295.
------------------------------
       Resolution: Fixed
    Fix Version/s: 2.4.0

> Pod names conflicts in client mode, if previous submission was not a clean
> shutdown.
> --------------------------------------------------------------------------
>
> Key: SPARK-25295
> URL: https://issues.apache.org/jira/browse/SPARK-25295
> Project: Spark
> Issue Type: Bug
> Components: Kubernetes
> Affects Versions: 2.4.0
> Reporter: Prashant Sharma
> Priority: Major
> Fix For: 2.4.0
>
> If the previous job was killed somehow, e.g. by disconnecting the client, it
> leaves behind executor pods named spark-exec-#, which cause naming conflicts
> and failures for the next job submission.
> io.fabric8.kubernetes.client.KubernetesClientException: Failure executing:
> POST at: https://:6443/api/v1/namespaces/default/pods. Message: pods
> "spark-exec-4" already exists. Received status: Status(apiVersion=v1,
> code=409, details=StatusDetails(causes=[], group=null, kind=pods,
> name=spark-exec-4, retryAfterSeconds=null, uid=null,
> additionalProperties={}), kind=Status, message=pods "spark-exec-4" already
> exists, metadata=ListMeta(resourceVersion=null, selfLink=null,
> additionalProperties={}), reason=AlreadyExists, status=Failure,
> additionalProperties={}).
[jira] [Issue Comment Deleted] (SPARK-25295) Pod names conflicts in client mode, if previous submission was not a clean shutdown.
[ https://issues.apache.org/jira/browse/SPARK-25295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yinan Li updated SPARK-25295:
-----------------------------
    Comment: was deleted

(was: We made it clear in the documentation of the Kubernetes mode at [https://github.com/apache/spark/blob/master/docs/running-on-kubernetes.md#client-mode-executor-pod-garbage-collection] that when running the client mode, executor pods may be left behind. This is by design. If you want to have the executor pods deleted automatically, run the driver in a pod inside the cluster and set {{spark.driver.pod.name}} to the name of the driver pod so an {{OwnerReference}} pointing to the driver pod gets added to the executor pods. This way the executor pods get garbage collected when the driver pod is gone.)

> Pod names conflicts in client mode, if previous submission was not a clean
> shutdown.
> --------------------------------------------------------------------------
>
> Key: SPARK-25295
> URL: https://issues.apache.org/jira/browse/SPARK-25295
> Project: Spark
> Issue Type: Bug
> Components: Kubernetes
> Affects Versions: 2.4.0
> Reporter: Prashant Sharma
> Priority: Major
>
> If the previous job was killed somehow, e.g. by disconnecting the client, it
> leaves behind executor pods named spark-exec-#, which cause naming conflicts
> and failures for the next job submission.
> io.fabric8.kubernetes.client.KubernetesClientException: Failure executing:
> POST at: https://:6443/api/v1/namespaces/default/pods. Message: pods
> "spark-exec-4" already exists. Received status: Status(apiVersion=v1,
> code=409, details=StatusDetails(causes=[], group=null, kind=pods,
> name=spark-exec-4, retryAfterSeconds=null, uid=null,
> additionalProperties={}), kind=Status, message=pods "spark-exec-4" already
> exists, metadata=ListMeta(resourceVersion=null, selfLink=null,
> additionalProperties={}), reason=AlreadyExists, status=Failure,
> additionalProperties={}).
[jira] [Commented] (SPARK-25282) Fix support for spark-shell with K8s
[ https://issues.apache.org/jira/browse/SPARK-25282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16599310#comment-16599310 ]

Yinan Li commented on SPARK-25282:
----------------------------------

I'm not sure this is a bug, nor how this could be enforced systematically. When you use client mode and run the driver outside the cluster on a host, you are using the Spark distribution on the host, which may or may not have the same version as the Spark jars in the image. I guess this is not even a problem unique to Spark on Kubernetes.

> Fix support for spark-shell with K8s
> ------------------------------------
>
> Key: SPARK-25282
> URL: https://issues.apache.org/jira/browse/SPARK-25282
> Project: Spark
> Issue Type: Bug
> Components: Kubernetes
> Affects Versions: 2.4.0
> Reporter: Prashant Sharma
> Priority: Major
>
> Spark shell, when run with a Kubernetes master, gives the following errors.
> {noformat}
> java.io.InvalidClassException: org.apache.spark.storage.BlockManagerId;
> local class incompatible: stream classdesc serialVersionUID =
> -3720498261147521051, local class serialVersionUID = -6655865447853211720
>         at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:616)
>         at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1630)
>         at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1521)
>         at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1781)
>         at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)
>         at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2018)
>         at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1942)
> {noformat}
> Special care was taken to ensure the same compiled jar was used both in the
> images and on the host system running the driver.
> This issue affects the pyspark and R interfaces as well.
[jira] [Commented] (SPARK-25295) Pod names conflicts in client mode, if previous submission was not a clean shutdown.
[ https://issues.apache.org/jira/browse/SPARK-25295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16599308#comment-16599308 ] Yinan Li commented on SPARK-25295: -- We made it clear in the documentation of the Kubernetes mode at [https://github.com/apache/spark/blob/master/docs/running-on-kubernetes.md#client-mode-executor-pod-garbage-collection] that when running in client mode, executor pods may be left behind. This is by design. If you want the executor pods deleted automatically, run the driver in a pod inside the cluster and set {{spark.kubernetes.driver.pod.name}} to the name of the driver pod, so an {{OwnerReference}} pointing to the driver pod gets added to the executor pods. This way the executor pods get garbage collected when the driver pod is gone. > Pod names conflicts in client mode, if previous submission was not a clean > shutdown. > > > Key: SPARK-25295 > URL: https://issues.apache.org/jira/browse/SPARK-25295 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Prashant Sharma >Priority: Major > > If the previous job was killed somehow, e.g. by disconnecting the client, it > leaves behind the executor pods named spark-exec-#, which cause naming > conflicts and failures for the next job submission. > io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: > POST at: https://:6443/api/v1/namespaces/default/pods. Message: pods > "spark-exec-4" already exists. Received status: Status(apiVersion=v1, > code=409, details=StatusDetails(causes=[], group=null, kind=pods, > name=spark-exec-4, retryAfterSeconds=null, uid=null, > additionalProperties={}), kind=Status, message=pods "spark-exec-4" already > exists, metadata=ListMeta(resourceVersion=null, selfLink=null, > additionalProperties={}), reason=AlreadyExists, status=Failure, > additionalProperties={}).
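The {{OwnerReference}} mechanism described in the comment above can be sketched in a few lines. This is a hypothetical illustration, not Spark's actual code: the dict shapes mirror the Kubernetes Pod API's {{metadata.ownerReferences}} field, but the helper function and pod names are invented for the example.

```python
def add_driver_owner_reference(executor_pod: dict, driver_pod: dict) -> dict:
    """Attach an ownerReference pointing at the driver pod so Kubernetes
    cascades deletion: removing the driver pod garbage-collects the executor."""
    owner_ref = {
        "apiVersion": "v1",
        "kind": "Pod",
        "name": driver_pod["metadata"]["name"],
        "uid": driver_pod["metadata"]["uid"],
    }
    executor_pod["metadata"].setdefault("ownerReferences", []).append(owner_ref)
    return executor_pod

# Hypothetical pods, shaped like Kubernetes API objects.
driver = {"metadata": {"name": "spark-pi-driver", "uid": "abc-123"}}
executor = {"metadata": {"name": "spark-exec-1"}}
add_driver_owner_reference(executor, driver)
```

Without such a reference (the out-of-cluster client-mode case), nothing owns the executor pods, which is exactly why they survive an unclean shutdown.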
[jira] [Commented] (SPARK-24434) Support user-specified driver and executor pod templates
[ https://issues.apache.org/jira/browse/SPARK-24434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16599304#comment-16599304 ] Yinan Li commented on SPARK-24434: -- [~skonto] we understand your feelings and frustration about this, and we really appreciate your work driving the design. AFAIK, the PR created by [~onursatici] follows the design (you are helping review it, so you can judge if this is the case). I think the situation was that people wanted to move this forward (granted that you were driving it) while you were on vacation, and thought it would be good to get the ball rolling with a WIP PR that everyone could comment on and give early feedback on. The fact that no one knew how far you had gone on the implementation before you started your vacation is probably also a factor here. Anyway, with that being said, we really appreciate your work driving the design and reviewing the PR! If you want to have further discussion on this and have ideas on how to better coordinate big features in the future, let us know and we can bring it up at the next SIG meeting. > Support user-specified driver and executor pod templates > > > Key: SPARK-24434 > URL: https://issues.apache.org/jira/browse/SPARK-24434 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Yinan Li >Priority: Major > > With more requests for customizing the driver and executor pods coming, the > current approach of adding new Spark configuration options has some serious > drawbacks: 1) it means more Kubernetes specific configuration options to > maintain, and 2) it widens the gap between the declarative model used by > Kubernetes and the configuration model used by Spark. We should start > designing a solution that allows users to specify pod templates as central > places for all customization needs for the driver and executor pods.
[jira] [Commented] (SPARK-24434) Support user-specified driver and executor pod templates
[ https://issues.apache.org/jira/browse/SPARK-24434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16594038#comment-16594038 ] Yinan Li commented on SPARK-24434: -- It seemed I couldn't change the assignee. > Support user-specified driver and executor pod templates > > > Key: SPARK-24434 > URL: https://issues.apache.org/jira/browse/SPARK-24434 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Yinan Li >Priority: Major > > With more requests for customizing the driver and executor pods coming, the > current approach of adding new Spark configuration options has some serious > drawbacks: 1) it means more Kubernetes specific configuration options to > maintain, and 2) it widens the gap between the declarative model used by > Kubernetes and the configuration model used by Spark. We should start > designing a solution that allows users to specify pod templates as central > places for all customization needs for the driver and executor pods. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25162) Kubernetes 'in-cluster' client mode and value of spark.driver.host
[ https://issues.apache.org/jira/browse/SPARK-25162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16589286#comment-16589286 ] Yinan Li commented on SPARK-25162: -- > Where the driver is running _outside-cluster client_ mode, would you >recommend a default behavior of deriving the IP address of the host on which >the driver is running (provided that IP address is routable from inside the >cluster) and giving the user the option to override and supply a FQDN or >routable IP address for the driver? The philosophy behind the client mode in the Kubernetes deployment mode is to not be opinionated on how users setup network connectivity from the executors to the driver. So it's really up to the users to decide what's the best way to provide such connectivity. Please check out https://github.com/apache/spark/blob/master/docs/running-on-kubernetes.md#client-mode. > Kubernetes 'in-cluster' client mode and value of spark.driver.host > -- > > Key: SPARK-25162 > URL: https://issues.apache.org/jira/browse/SPARK-25162 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.0 > Environment: A java program, deployed to kubernetes, that establishes > a Spark Context in client mode. > Not using spark-submit. > Kubernetes 1.10 > AWS EKS > > >Reporter: James Carter >Priority: Minor > > When creating Kubernetes scheduler 'in-cluster' using client mode, the value > for spark.driver.host can be derived from the IP address of the driver pod. > I observed that the value of _spark.driver.host_ defaulted to the value of > _spark.kubernetes.driver.pod.name_, which is not a valid hostname. This > caused the executors to fail to establish a connection back to the driver. > As a work around, in my configuration I pass the driver's pod name _and_ the > driver's ip address to ensure that executors can establish a connection with > the driver. 
> _spark.kubernetes.driver.pod.name_ := env.valueFrom.fieldRef.fieldPath: > metadata.name > _spark.driver.host_ := env.valueFrom.fieldRef.fieldPath: status.podIp > e.g. > Deployment: > {noformat} > env: > - name: DRIVER_POD_NAME > valueFrom: > fieldRef: > fieldPath: metadata.name > - name: DRIVER_POD_IP > valueFrom: > fieldRef: > fieldPath: status.podIP > {noformat} > > Application Properties: > {noformat} > config[spark.kubernetes.driver.pod.name]: ${DRIVER_POD_NAME} > config[spark.driver.host]: ${DRIVER_POD_IP} > {noformat} > > BasicExecutorFeatureStep.scala: > {code:java} > private val driverUrl = RpcEndpointAddress( > kubernetesConf.get("spark.driver.host"), > kubernetesConf.sparkConf.getInt("spark.driver.port", DEFAULT_DRIVER_PORT), > CoarseGrainedSchedulerBackend.ENDPOINT_NAME).toString > {code} > > Ideally only _spark.kubernetes.driver.pod.name_ would need be provided in > this deployment scenario. > > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25194) Kubernetes - Define cpu and memory limit to init container
[ https://issues.apache.org/jira/browse/SPARK-25194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16589283#comment-16589283 ] Yinan Li commented on SPARK-25194: -- The upcoming Spark 2.4 gets rid of the init-container and switches to running {{spark-submit}} in client mode in the driver to download remote dependencies. Given that 2.4 is coming soon, I would suggest waiting for and using it instead. > Kubernetes - Define cpu and memory limit to init container > -- > > Key: SPARK-25194 > URL: https://issues.apache.org/jira/browse/SPARK-25194 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 2.3.1 >Reporter: Daniel Majano >Priority: Major > Labels: features > > Hi, > > Recently I have started to work with Spark under Kubernetes. All our > kubernetes clusters have resource quotas, so to deploy you need to define > container CPU and memory limits. > > For the driver and executors this is OK, because with spark-submit props you > can define these limits. But today, for one of my projects, I needed to load > an external dependency. I tried to define the dependency with --jars and an > https link; the init container then pops up, there is no way to define > limits for it, and the submission fails because it can't start the pod with > driver + init container. > > > Thanks.
[jira] [Commented] (SPARK-25162) Kubernetes 'in-cluster' client mode and value of spark.driver.host
[ https://issues.apache.org/jira/browse/SPARK-25162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16588014#comment-16588014 ] Yinan Li commented on SPARK-25162: -- We actually moved away from using the IP address of the driver pod to set {{spark.driver.host}}, to using a headless service to give the driver pod a FQDN name and set {{spark.driver.host}} to the FQDN name. Internally, we set {{spark.driver.bindAddress}} to the value of environment variable {{SPARK_DRIVER_BIND_ADDRESS}} which gets its value from the IP address of the pod using the downward API. We could do the same for {{spark.kubernetes.driver.pod.name}} as you suggested for in-cluster client mode. > Kubernetes 'in-cluster' client mode and value of spark.driver.host > -- > > Key: SPARK-25162 > URL: https://issues.apache.org/jira/browse/SPARK-25162 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.0 > Environment: A java program, deployed to kubernetes, that establishes > a Spark Context in client mode. > Not using spark-submit. > Kubernetes 1.10 > AWS EKS > > >Reporter: James Carter >Priority: Minor > > When creating Kubernetes scheduler 'in-cluster' using client mode, the value > for spark.driver.host can be derived from the IP address of the driver pod. > I observed that the value of _spark.driver.host_ defaulted to the value of > _spark.kubernetes.driver.pod.name_, which is not a valid hostname. This > caused the executors to fail to establish a connection back to the driver. > As a work around, in my configuration I pass the driver's pod name _and_ the > driver's ip address to ensure that executors can establish a connection with > the driver. > _spark.kubernetes.driver.pod.name_ := env.valueFrom.fieldRef.fieldPath: > metadata.name > _spark.driver.host_ := env.valueFrom.fieldRef.fieldPath: status.podIp > e.g. 
> Deployment: > {noformat} > env: > - name: DRIVER_POD_NAME > valueFrom: > fieldRef: > fieldPath: metadata.name > - name: DRIVER_POD_IP > valueFrom: > fieldRef: > fieldPath: status.podIP > {noformat} > > Application Properties: > {noformat} > config[spark.kubernetes.driver.pod.name]: ${DRIVER_POD_NAME} > config[spark.driver.host]: ${DRIVER_POD_IP} > {noformat} > > BasicExecutorFeatureStep.scala: > {code:java} > private val driverUrl = RpcEndpointAddress( > kubernetesConf.get("spark.driver.host"), > kubernetesConf.sparkConf.getInt("spark.driver.port", DEFAULT_DRIVER_PORT), > CoarseGrainedSchedulerBackend.ENDPOINT_NAME).toString > {code} > > Ideally only _spark.kubernetes.driver.pod.name_ would need be provided in > this deployment scenario. > > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
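The addressing scheme described in the comment above (headless-service FQDN for {{spark.driver.host}}, pod IP from the downward API for {{spark.driver.bindAddress}}) can be sketched as follows. This is only an illustration of the split between advertised and bind addresses; the service and namespace names are hypothetical, and only the {{SPARK_DRIVER_BIND_ADDRESS}} variable name comes from the comment itself.

```python
def driver_addresses(service: str, namespace: str, env: dict) -> dict:
    """Advertise a headless-service FQDN to executors, but bind to the
    pod IP that the Kubernetes downward API injects into the environment."""
    return {
        "spark.driver.host": f"{service}.{namespace}.svc",
        "spark.driver.bindAddress": env.get("SPARK_DRIVER_BIND_ADDRESS", "0.0.0.0"),
    }

# Hypothetical driver pod: the downward API has injected the pod IP.
conf = driver_addresses("spark-pi-driver-svc", "spark-demo",
                        {"SPARK_DRIVER_BIND_ADDRESS": "10.0.3.17"})
```

The point of the split is that the FQDN is stable and resolvable cluster-wide, while the pod IP is only known at runtime.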
[jira] [Commented] (SPARK-24434) Support user-specified driver and executor pod templates
[ https://issues.apache.org/jira/browse/SPARK-24434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16586314#comment-16586314 ] Yinan Li commented on SPARK-24434: -- [~skonto] I will make sure the assignee gets properly set for future JIRAs. [~onursatici], if you would like to work on the implementation, please make sure you read through the design doc from [~skonto] and make the implementation follow what the design proposes. Thanks! > Support user-specified driver and executor pod templates > > > Key: SPARK-24434 > URL: https://issues.apache.org/jira/browse/SPARK-24434 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Yinan Li >Priority: Major > > With more requests for customizing the driver and executor pods coming, the > current approach of adding new Spark configuration options has some serious > drawbacks: 1) it means more Kubernetes specific configuration options to > maintain, and 2) it widens the gap between the declarative model used by > Kubernetes and the configuration model used by Spark. We should start > designing a solution that allows users to specify pod templates as central > places for all customization needs for the driver and executor pods. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25066) Provide Spark R image for deploying Spark on kubernetes.
[ https://issues.apache.org/jira/browse/SPARK-25066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16576555#comment-16576555 ] Yinan Li commented on SPARK-25066: -- R support is still being worked on and will likely go into 2.4. Is this Jira for that work? > Provide Spark R image for deploying Spark on kubernetes. > > > Key: SPARK-25066 > URL: https://issues.apache.org/jira/browse/SPARK-25066 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.3.1 >Reporter: Prashant Sharma >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24724) Discuss necessary info and access in barrier mode + Kubernetes
[ https://issues.apache.org/jira/browse/SPARK-24724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16560436#comment-16560436 ] Yinan Li commented on SPARK-24724: -- Sorry, I haven't got a chance to look into this yet. What pieces of info and what kind of access do we need to provide? I saw some comments on the similar Jira for YARN, particularly the one quoted below: "The main problem is how to provide necessary information for barrier tasks to start MPI job in a password-less manner". Is the main problem the same for Kubernetes? > Discuss necessary info and access in barrier mode + Kubernetes > -- > > Key: SPARK-24724 > URL: https://issues.apache.org/jira/browse/SPARK-24724 > Project: Spark > Issue Type: Story > Components: Kubernetes, ML, Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Yinan Li >Priority: Major > > In barrier mode, to run hybrid distributed DL training jobs, we need to > provide users sufficient info and access so they can set up a hybrid > distributed training job, e.g., using MPI. > This ticket limits the scope of discussion to Spark + Kubernetes. There were > some past and on-going attempts from the Kubernetes community. So we should > find someone with good knowledge to lead the discussion here.
[jira] [Commented] (SPARK-24894) Invalid DNS name due to hostname truncation
[ https://issues.apache.org/jira/browse/SPARK-24894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16554602#comment-16554602 ] Yinan Li commented on SPARK-24894: -- [~mcheah]. We need to make sure the truncation leads to a valid hostname. > Invalid DNS name due to hostname truncation > > > Key: SPARK-24894 > URL: https://issues.apache.org/jira/browse/SPARK-24894 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.1 >Reporter: Dharmesh Kakadia >Priority: Major > > The hostname truncation happening here > [https://github.com/apache/spark/blob/5ff1b9ba1983d5601add62aef64a3e87d07050eb/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicExecutorFeatureStep.scala#L77] > is problematic and can lead to DNS names starting with "-". > Originally filed here: > https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/issues/229 > ``` > {{2018-07-23 21:21:42 ERROR Utils:91 - Uncaught exception in thread > kubernetes-pod-allocator > io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: > POST at: https://kubernetes.default.svc/api/v1/namespaces/default/pods. > Message: Pod > "user-archetypes-all-weekly-1532380861251850404-1532380862321-exec-9" is > invalid: spec.hostname: Invalid value: > "-archetypes-all-weekly-1532380861251850404-1532380862321-exec-9": a DNS-1123 > label must consist of lower case alphanumeric characters or '-', and must > start and end with an alphanumeric character (e.g. 'my-name', or '123-abc', > regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?'). Received > status: Status(apiVersion=v1, code=422, > details=StatusDetails(causes=[StatusCause(field=spec.hostname, > message=Invalid value: > "-archetypes-all-weekly-1532380861251850404-1532380862321-exec-9": a DNS-1123 > label must consist of lower case alphanumeric characters or '-', and must > start and end with an alphanumeric character (e.g. 
'my-name', or '123-abc', > regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?'), > reason=FieldValueInvalid, additionalProperties={})], group=null, kind=Pod, > name=user-archetypes-all-weekly-1532380861251850404-1532380862321-exec-9, > retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, > message=Pod > "user-archetypes-all-weekly-1532380861251850404-1532380862321-exec-9" is > invalid: spec.hostname: Invalid value: > "-archetypes-all-weekly-1532380861251850404-1532380862321-exec-9": a DNS-1123 > label must consist of lower case alphanumeric characters or '-', and must > start and end with an alphanumeric character (e.g. 'my-name', or '123-abc', > regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?'), > metadata=ListMeta(resourceVersion=null, selfLink=null, > additionalProperties={}), reason=Invalid, status=Failure, > additionalProperties={}). at > io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:470) > at > io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:409) > at > io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:379) > at > io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:343) > at > io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleCreate(OperationSupport.java:226) > at > io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleCreate(BaseOperation.java:769) > at > io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:356) > at > org.apache.spark.scheduler.cluster.k8s.KubernetesClusterSchedulerBackend$$anon$1$$anonfun$3$$anonfun$apply$3.apply(KubernetesClusterSchedulerBackend.scala:140) > at > org.apache.spark.scheduler.cluster.k8s.KubernetesClusterSchedulerBackend$$anon$1$$anonfun$3$$anonfun$apply$3.apply(KubernetesClusterSchedulerBackend.scala:140) > at org.apache.spark.util.Utils$.tryLog(Utils.scala:1922) at > 
org.apache.spark.scheduler.cluster.k8s.KubernetesClusterSchedulerBackend$$anon$1$$anonfun$3.apply(KubernetesClusterSchedulerBackend.scala:139) > at > org.apache.spark.scheduler.cluster.k8s.KubernetesClusterSchedulerBackend$$anon$1$$anonfun$3.apply(KubernetesClusterSchedulerBackend.scala:138) > at > scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245) > at > scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245) > at > scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733) > at > scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99) > at > scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99) > at
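One way the truncation could be made safe is sketched below in Python for brevity (Spark's code is Scala, and this is a hypothetical illustration rather than the fix that was actually merged). Plain truncation keeps the tail of the name to preserve the unique executor suffix, so the resulting label can start with "-", exactly as in the error above; stripping the offending characters after truncation yields a valid DNS-1123 label.

```python
import re

DNS1123_LABEL_MAX = 63
DNS1123_LABEL_RE = re.compile(r"^[a-z0-9]([-a-z0-9]*[a-z0-9])?$")

def safe_dns1123_label(name: str) -> str:
    """Keep the last 63 characters (preserving the unique -exec-N suffix),
    then strip leading/trailing '-' so the result is a valid DNS-1123 label."""
    label = name[-DNS1123_LABEL_MAX:].strip("-")
    if not DNS1123_LABEL_RE.match(label):
        raise ValueError(f"cannot derive a valid DNS-1123 label from {name!r}")
    return label
```

Applied to the 67-character hostname from the log, simple tail truncation gives "-archetypes-...-exec-9" (the invalid value the API server rejected), while the sanitized version drops the leading hyphen.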
[jira] [Updated] (SPARK-24724) Discuss necessary info and access in barrier mode + Kubernetes
[ https://issues.apache.org/jira/browse/SPARK-24724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yinan Li updated SPARK-24724: - Component/s: Kubernetes > Discuss necessary info and access in barrier mode + Kubernetes > -- > > Key: SPARK-24724 > URL: https://issues.apache.org/jira/browse/SPARK-24724 > Project: Spark > Issue Type: Story > Components: Kubernetes, ML, Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Yinan Li >Priority: Major > > In barrier mode, to run hybrid distributed DL training jobs, we need to > provide users sufficient info and access so they can set up a hybrid > distributed training job, e.g., using MPI. > This ticket limits the scope of discussion to Spark + Kubernetes. There were > some past and on-going attempts from the Kubenetes community. So we should > find someone with good knowledge to lead the discussion here. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-24793) Make spark-submit more useful with k8s
[ https://issues.apache.org/jira/browse/SPARK-24793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16542109#comment-16542109 ] Yinan Li edited comment on SPARK-24793 at 7/12/18 7:11 PM: --- Oh, yeah, {{kill}} and {{status}} are existing options of {{spark-submit}}. Agreed we should add support for them into the k8s submission client. But options that are k8s-backend-specific probably need a better place. was (Author: liyinan926): Oh, yeah, {{kill}} and {{status}} are existing options of {{spark-submit}}. Agreed we should add support for them into the k8s submission client. > Make spark-submit more useful with k8s > -- > > Key: SPARK-24793 > URL: https://issues.apache.org/jira/browse/SPARK-24793 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Anirudh Ramanathan >Assignee: Anirudh Ramanathan >Priority: Major > > Support controlling the lifecycle of Spark Application through spark-submit. > For example: > {{ > --kill app_name If given, kills the driver specified. > --status app_name If given, requests the status of the driver > specified. > }} > Potentially also --list to list all spark drivers running. > Given that our submission client can actually launch jobs into many different > namespaces, we'll need an additional specification of the namespace through a > --namespace flag potentially. > I think this is pretty useful to have instead of forcing a user to use > kubectl to manage the lifecycle of any k8s Spark Application.
[jira] [Commented] (SPARK-24793) Make spark-submit more useful with k8s
[ https://issues.apache.org/jira/browse/SPARK-24793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16542109#comment-16542109 ] Yinan Li commented on SPARK-24793: -- Oh, yeah, {{kill}} and {{status}} are existing options of {{spark-submit}}. Agreed we should add support for them into the k8s submission client. > Make spark-submit more useful with k8s > -- > > Key: SPARK-24793 > URL: https://issues.apache.org/jira/browse/SPARK-24793 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Anirudh Ramanathan >Assignee: Anirudh Ramanathan >Priority: Major > > Support controlling the lifecycle of Spark Application through spark-submit. > For example: > {{ > --kill app_name If given, kills the driver specified. > --status app_name If given, requests the status of the driver > specified. > }} > Potentially also --list to list all spark drivers running. > Given that our submission client can actually launch jobs into many different > namespaces, we'll need an additional specification of the namespace through a > --namespace flag potentially. > I think this is pretty useful to have instead of forcing a user to use > kubectl to manage the lifecycle of any k8s Spark Application. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24793) Make spark-submit more useful with k8s
[ https://issues.apache.org/jira/browse/SPARK-24793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16542003#comment-16542003 ] Yinan Li commented on SPARK-24793: -- Good points, Erik. I think [sparkctl|https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/tree/master/sparkctl] is also a good alternative for supporting the set of functionalities proposed. It is positioned better to operate on the driver pods than kubectl, while still looks familiar to users who are used to kubectl. > Make spark-submit more useful with k8s > -- > > Key: SPARK-24793 > URL: https://issues.apache.org/jira/browse/SPARK-24793 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Anirudh Ramanathan >Assignee: Anirudh Ramanathan >Priority: Major > > Support controlling the lifecycle of Spark Application through spark-submit. > For example: > {{ > --kill app_name If given, kills the driver specified. > --status app_name If given, requests the status of the driver > specified. > }} > Potentially also --list to list all spark drivers running. > Given that our submission client can actually launch jobs into many different > namespaces, we'll need an additional specification of the namespace through a > --namespace flag potentially. > I think this is pretty useful to have instead of forcing a user to use > kubectl to manage the lifecycle of any k8s Spark Application. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24432) Add support for dynamic resource allocation
[ https://issues.apache.org/jira/browse/SPARK-24432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16540252#comment-16540252 ] Yinan Li commented on SPARK-24432: -- No one is working on this right now, but I think foxish planned to work on it, although I'm not sure where he's at. The existing implementation in the fork has some issues that we need to solve. A redesign might be needed. > Add support for dynamic resource allocation > --- > > Key: SPARK-24432 > URL: https://issues.apache.org/jira/browse/SPARK-24432 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Yinan Li >Priority: Major > > This is an umbrella ticket for work on adding support for dynamic resource > allocation into the Kubernetes mode. This requires a Kubernetes-specific > external shuffle service. The feature is available in our fork at > github.com/apache-spark-on-k8s/spark.
[jira] [Commented] (SPARK-24765) Add custom Kubernetes scheduler config parameter to spark-submit
[ https://issues.apache.org/jira/browse/SPARK-24765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16538999#comment-16538999 ] Yinan Li commented on SPARK-24765: -- Check out https://issues.apache.org/jira/browse/SPARK-24434 and https://docs.google.com/document/d/1pcyH5f610X2jyJW9WbWHnj8jktQPLlbbmmUwdeK4fJk/edit#. > Add custom Kubernetes scheduler config parameter to spark-submit > - > > Key: SPARK-24765 > URL: https://issues.apache.org/jira/browse/SPARK-24765 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 2.3.1 >Reporter: Nihal Harish >Priority: Minor > > spark submit currently does not accept any config parameter that can enable > the driver and executor pods to be scheduled by a custom scheduler as opposed > to just the default-scheduler. > I propose the addition of a new config parameter: > spark.kubernetes.schedulerName > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24434) Support user-specified driver and executor pod templates
[ https://issues.apache.org/jira/browse/SPARK-24434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16513064#comment-16513064 ] Yinan Li commented on SPARK-24434: -- [~skonto] Thanks! Will take a look at the design doc once I'm back from vacation. > Support user-specified driver and executor pod templates > > > Key: SPARK-24434 > URL: https://issues.apache.org/jira/browse/SPARK-24434 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Yinan Li >Priority: Major > > With more requests for customizing the driver and executor pods coming, the > current approach of adding new Spark configuration options has some serious > drawbacks: 1) it means more Kubernetes specific configuration options to > maintain, and 2) it widens the gap between the declarative model used by > Kubernetes and the configuration model used by Spark. We should start > designing a solution that allows users to specify pod templates as central > places for all customization needs for the driver and executor pods. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24434) Support user-specified driver and executor pod templates
[ https://issues.apache.org/jira/browse/SPARK-24434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16499220#comment-16499220 ] Yinan Li commented on SPARK-24434: -- [~skonto] Thanks for the detailed thoughts! I agree with you that we can start with allowing users to pass in a YAML file that stores the pod template. YAML is more familiar to K8s users and this is key to make the experience as idiomatic as possible for k8s users, who are the ones that are aware of what a pod template is and what purpose it serves, and know what they would like to put into the template. > Support user-specified driver and executor pod templates > > > Key: SPARK-24434 > URL: https://issues.apache.org/jira/browse/SPARK-24434 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Yinan Li >Priority: Major > > With more requests for customizing the driver and executor pods coming, the > current approach of adding new Spark configuration options has some serious > drawbacks: 1) it means more Kubernetes specific configuration options to > maintain, and 2) it widens the gap between the declarative model used by > Kubernetes and the configuration model used by Spark. We should start > designing a solution that allows users to specify pod templates as central > places for all customization needs for the driver and executor pods. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-24434) Support user-specified driver and executor pod templates
[ https://issues.apache.org/jira/browse/SPARK-24434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16498307#comment-16498307 ] Yinan Li edited comment on SPARK-24434 at 6/1/18 5:39 PM: -- The pod template is basically a pod specification and can contain every possible piece of information about a pod. It should look similar to what the core workload types (Deployments and StatefulSets, for example) use, which contain a {{[PodSpec|https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/api/core/v1/types.go#L2636]}}. The problem is unique to the Kubernetes mode because there are so many things to customize on a pod. Currently we introduce a new Spark config property for each new customization aspect of a pod; given the number of things to customize, this will soon become hard to maintain if we keep introducing new config properties.
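To make the idea concrete, a pod template would be an ordinary pod manifest carrying only the fields the user wants to customize. The sketch below is illustrative only: the file contents, labels, and the {{spark.kubernetes.driver.podTemplateFile}} property name are assumptions about one possible shape of the feature, not anything decided in this thread.

```shell
# Illustrative sketch of a driver pod template (all field values made up).
# The user writes a plain pod manifest containing only the customizations;
# Spark would merge it into the driver pod it generates.
cat > /tmp/driver-template.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  labels:
    team: data-eng            # arbitrary extra label on the driver pod
spec:
  nodeSelector:
    disktype: ssd             # e.g. schedule the driver onto SSD nodes
  tolerations:
    - key: "spark"
      operator: "Exists"
      effect: "NoSchedule"
EOF

# Hypothetical submission-time reference to the template, e.g.:
#   --conf spark.kubernetes.driver.podTemplateFile=/tmp/driver-template.yaml
grep -c 'nodeSelector' /tmp/driver-template.yaml
```

The point of the template approach is visible in the sketch: none of these customizations would need a dedicated Spark config property.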
[jira] [Commented] (SPARK-24434) Support user-specified driver and executor pod templates
[ https://issues.apache.org/jira/browse/SPARK-24434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16496967#comment-16496967 ] Yinan Li commented on SPARK-24434: -- [~foxish] That sounds like the way to go.
[jira] [Commented] (SPARK-24434) Support user-specified driver and executor pod templates
[ https://issues.apache.org/jira/browse/SPARK-24434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16495637#comment-16495637 ] Yinan Li commented on SPARK-24434: -- [~eje] That's a good question. I think we need to compare both and have a thorough discussion in the community once the design is out. There are pros and cons to each.
[jira] [Created] (SPARK-24434) Support user-specified driver and executor pod templates
Yinan Li created SPARK-24434: Summary: Support user-specified driver and executor pod templates Key: SPARK-24434 URL: https://issues.apache.org/jira/browse/SPARK-24434 Project: Spark Issue Type: New Feature Components: Kubernetes Affects Versions: 2.4.0 Reporter: Yinan Li With more requests for customizing the driver and executor pods coming, the current approach of adding new Spark configuration options has some serious drawbacks: 1) it means more Kubernetes specific configuration options to maintain, and 2) it widens the gap between the declarative model used by Kubernetes and the configuration model used by Spark. We should start designing a solution that allows users to specify pod templates as central places for all customization needs for the driver and executor pods.
[jira] [Created] (SPARK-24433) Add Spark R support
Yinan Li created SPARK-24433: Summary: Add Spark R support Key: SPARK-24433 URL: https://issues.apache.org/jira/browse/SPARK-24433 Project: Spark Issue Type: New Feature Components: Kubernetes Affects Versions: 2.4.0 Reporter: Yinan Li This is the ticket to track work on adding support for the R binding in the Kubernetes mode. The feature is available in our fork at github.com/apache-spark-on-k8s/spark and needs to be upstreamed.
[jira] [Created] (SPARK-24432) Support for dynamic resource allocation
Yinan Li created SPARK-24432: Summary: Support for dynamic resource allocation Key: SPARK-24432 URL: https://issues.apache.org/jira/browse/SPARK-24432 Project: Spark Issue Type: New Feature Components: Kubernetes Affects Versions: 2.4.0 Reporter: Yinan Li This is an umbrella ticket for work on adding support for dynamic resource allocation into the Kubernetes mode. This requires a Kubernetes-specific external shuffle service. The feature is available in our fork at github.com/apache-spark-on-k8s/spark.
[jira] [Updated] (SPARK-24432) Add support for dynamic resource allocation
[ https://issues.apache.org/jira/browse/SPARK-24432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yinan Li updated SPARK-24432: - Summary: Add support for dynamic resource allocation (was: Support for dynamic resource allocation)
[jira] [Commented] (SPARK-24122) Allow automatic driver restarts on K8s
[ https://issues.apache.org/jira/browse/SPARK-24122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16491338#comment-16491338 ] Yinan Li commented on SPARK-24122: -- The operator does cover automatic restart of an application with a configurable restart policy. For batch ETL jobs, this is probably sufficient for the common need of restarting jobs on failure. For streaming jobs, checkpointing is also needed. https://issues.apache.org/jira/browse/SPARK-23980 is also relevant. > Allow automatic driver restarts on K8s > -- > > Key: SPARK-24122 > URL: https://issues.apache.org/jira/browse/SPARK-24122 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Oz Ben-Ami >Priority: Minor > > [~foxish] > Right now SparkSubmit creates the driver as a bare pod, rather than a managed > controller like a Deployment or a StatefulSet. This means there is no way to > guarantee automatic restarts, e.g. in case a node has an issue. Note that Pod > RestartPolicy does not apply if a node fails. A StatefulSet would allow us to > guarantee that, and keep the ability for executors to find the driver using > DNS. > This is particularly helpful for long-running streaming workloads, where we > currently use {{yarn.resourcemanager.am.max-attempts}} with YARN. I can > confirm that Spark Streaming and Structured Streaming applications can be > made to recover from such a restart, with the help of checkpointing. The > executors will have to be started again by the driver, but this should not be > a problem. > For batch processing, we could alternatively use Kubernetes {{Job}} objects, > which restart pods on failure but not success.
For example, note the > semantics provided by the {{kubectl run}} > [command|https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#run] > * {{--restart=Never}}: bare Pod > * {{--restart=Always}}: Deployment > * {{--restart=OnFailure}}: Job > https://github.com/apache-spark-on-k8s/spark/issues/288
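The Job-based alternative described above can be sketched as follows. This is a hand-written illustration, not anything Spark generates: the Job name and container args are hypothetical, and the image name is borrowed from the submission example earlier in this thread.

```shell
# Hypothetical sketch: wrapping a Spark driver in a Kubernetes Job so the
# pod is recreated on failure but left alone on success (Job semantics).
cat > /tmp/spark-driver-job.yaml <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  name: spark-pi-driver             # hypothetical name
spec:
  backoffLimit: 3                   # retry a failed driver up to 3 times
  template:
    spec:
      restartPolicy: OnFailure      # restart on failure only, never on success
      containers:
        - name: spark-driver
          image: kube-spark:2.4.0   # image from the submission example above
          args: ["driver", "--class", "org.apache.spark.examples.SparkPi"]
EOF
grep -q 'restartPolicy: OnFailure' /tmp/spark-driver-job.yaml
```

Unlike a bare pod, a Job survives node failure: the Job controller creates a replacement pod, which is exactly the guarantee the reporter is asking for in the batch case.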
[jira] [Commented] (SPARK-24091) Internally used ConfigMap prevents use of user-specified ConfigMaps carrying Spark configs files
[ https://issues.apache.org/jira/browse/SPARK-24091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16491279#comment-16491279 ] Yinan Li commented on SPARK-24091: -- Thanks [~tmckay]! I think the first approach is a good way of handling override and customization. > Internally used ConfigMap prevents use of user-specified ConfigMaps carrying > Spark config files > > > Key: SPARK-24091 > URL: https://issues.apache.org/jira/browse/SPARK-24091 > Project: Spark > Issue Type: Brainstorming > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Yinan Li >Priority: Major > > The recent PR [https://github.com/apache/spark/pull/20669] for removing the > init-container introduced an internally used ConfigMap carrying Spark > configuration properties in a file for the driver. This ConfigMap gets > mounted under {{$SPARK_HOME/conf}} and the environment variable > {{SPARK_CONF_DIR}} is set to point to the mount path. This pretty much > prevents users from mounting their own ConfigMaps that carry custom Spark > configuration files, e.g., {{log4j.properties}} and {{spark-env.sh}}, and > leaves users with only the option of building custom images. IMO, it is very > useful to support mounting user-specified ConfigMaps for custom Spark > configuration files. This warrants further discussion.
[jira] [Commented] (SPARK-24383) spark on k8s: "driver-svc" are not getting deleted
[ https://issues.apache.org/jira/browse/SPARK-24383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16491106#comment-16491106 ] Yinan Li commented on SPARK-24383: -- OK, then garbage collection should kick in and delete the service when the driver pod is gone, unless there's some issue with the GC. > spark on k8s: "driver-svc" are not getting deleted > -- > > Key: SPARK-24383 > URL: https://issues.apache.org/jira/browse/SPARK-24383 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Lenin >Priority: Major > > When the driver pod exits, the "*driver-svc" services created for the driver > are not cleaned up. This causes an accumulation of services in the k8s layer; at > some point no more services can be created.
[jira] [Commented] (SPARK-24383) spark on k8s: "driver-svc" are not getting deleted
[ https://issues.apache.org/jira/browse/SPARK-24383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489942#comment-16489942 ] Yinan Li commented on SPARK-24383: -- You can use {{kubectl get service -o=yaml}} to get a YAML-formatted representation of the service and check whether the {{metadata}} section contains an {{OwnerReference}} pointing to the driver pod.
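To make that check concrete, the manifest below is a fabricated example of what a correctly created driver service should contain (the names and uid are made up, not output from the reporter's cluster); the {{ownerReferences}} entry naming the driver pod is what makes garbage collection possible.

```shell
# Fabricated example of what `kubectl get service <name> -o=yaml` should show
# for a driver service that will be garbage-collected with its driver pod.
cat > /tmp/driver-svc.yaml <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: spark-pi-1527000000000-driver-svc        # made-up service name
  ownerReferences:
    - apiVersion: v1
      kind: Pod
      name: spark-pi-1527000000000-driver        # the driver pod owning this service
      uid: 00000000-0000-0000-0000-000000000000  # made-up uid
EOF

# If this grep finds nothing, the owner reference is missing and the service
# will never be garbage-collected when the driver pod goes away.
grep -q 'ownerReferences' /tmp/driver-svc.yaml
```

In practice one would run the grep against real {{kubectl get service -o=yaml}} output rather than a saved file.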
[jira] [Commented] (SPARK-24383) spark on k8s: "driver-svc" are not getting deleted
[ https://issues.apache.org/jira/browse/SPARK-24383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489776#comment-16489776 ] Yinan Li commented on SPARK-24383: -- Can you double-check that the services have an {{OwnerReference}} pointing to a driver pod?
[jira] [Commented] (SPARK-24383) spark on k8s: "driver-svc" are not getting deleted
[ https://issues.apache.org/jira/browse/SPARK-24383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489601#comment-16489601 ] Yinan Li commented on SPARK-24383: -- The Kubernetes-specific submission client adds an {{OwnerReference}} referencing the driver pod to the service, so if you delete the driver pod, the corresponding service should be garbage-collected.
[jira] [Commented] (SPARK-24248) [K8S] Use the Kubernetes cluster as the backing store for the state of pods
[ https://issues.apache.org/jira/browse/SPARK-24248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16477825#comment-16477825 ] Yinan Li commented on SPARK-24248: -- Re-sync is neither a fallback nor a replacement, but a complement to the watcher. Re-sync runs periodically. There won't be race conditions if we use a concurrent queue. > [K8S] Use the Kubernetes cluster as the backing store for the state of pods > --- > > Key: SPARK-24248 > URL: https://issues.apache.org/jira/browse/SPARK-24248 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Matt Cheah >Priority: Major > > We have a number of places in KubernetesClusterSchedulerBackend right now > that maintain the state of pods in memory. However, the Kubernetes API can > always give us the most up-to-date and correct view of what our executors are > doing. We should consider moving away from in-memory state as much as we can, in > favor of using the Kubernetes cluster as the source of truth for pod status. > Maintaining less state in memory lowers the chance that > we accidentally miss updating one of these data structures and break the > lifecycle of executors.
[jira] [Commented] (SPARK-24232) Allow referring to kubernetes secrets as env variable
[ https://issues.apache.org/jira/browse/SPARK-24232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16472574#comment-16472574 ] Yinan Li commented on SPARK-24232: -- As long as we clearly document what it is for, I think it's OK, particularly given that `secretKeyRef` is a well-known field name in k8s. > Allow referring to kubernetes secrets as env variable > - > > Key: SPARK-24232 > URL: https://issues.apache.org/jira/browse/SPARK-24232 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Dharmesh Kakadia >Priority: Major > > Allow referring to kubernetes secrets in the driver process via environment > variables. This will allow developers to use secrets without leaking them in > the code, and at the same time secrets can be decoupled and managed > separately. This can be used to refer to passwords, certificates, etc. while > talking to other services (jdbc passwords, storage keys, etc.). > So, at deployment time, something like > ``spark.kubernetes.driver.secretKeyRef.[EnvName]=`` can be specified, > which will make [EnvName].[key] available as an environment variable, and in > the code it is always referred to as the env variable [key].
[jira] [Comment Edited] (SPARK-24232) Allow referring to kubernetes secrets as env variable
[ https://issues.apache.org/jira/browse/SPARK-24232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16472561#comment-16472561 ] Yinan Li edited comment on SPARK-24232 at 5/11/18 7:55 PM: --- We should keep the current semantics of `spark.kubernetes.driver.secrets.=`. The proposal above is likely to confuse existing users of `spark.kubernetes.driver.secrets.=`, and it also makes the code unnecessarily complicated. As I said on Slack, it's better to do this through a new property prefix, e.g., `spark.kubernetes.driver.secretKeyRef.`. We also need the same for executors. See [http://spark.apache.org/docs/latest/running-on-kubernetes.html#secret-management].
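The property shape under discussion can be sketched at submission time as below. The secret name (db-creds) and key (password) are made up for illustration, and the exact property names follow the prefix proposed in this thread, not a finalized API.

```shell
# Sketch of the proposed secretKeyRef-style properties (secret and key names
# are made up). Each property exposes one key of a Kubernetes Secret as one
# environment variable in the driver or executor pod.
CONFS=(
  --conf spark.kubernetes.driver.secretKeyRef.DB_PASSWORD=db-creds:password
  --conf spark.kubernetes.executor.secretKeyRef.DB_PASSWORD=db-creds:password
)
# The existing volume-mount form would keep its current semantics unchanged:
CONFS+=(--conf spark.kubernetes.driver.secrets.db-creds=/etc/secrets)

# These would be passed to spark-submit, e.g.:
#   spark-submit "${CONFS[@]}" ... local:///opt/spark/examples/jars/app.jar
printf '%s\n' "${CONFS[@]}"
```

Keeping the two prefixes separate is the point of the comment above: `secrets.` continues to mean "mount the whole secret as a volume", while `secretKeyRef.` would mean "expose one key as one env variable".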
[jira] [Commented] (SPARK-24248) [K8S] Use the Kubernetes cluster as the backing store for the state of pods
[ https://issues.apache.org/jira/browse/SPARK-24248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16471479#comment-16471479 ] Yinan Li commented on SPARK-24248: -- I think it's both more robust and easier to implement with a periodic resync, which is what most of the core controllers use. With this setup, you can use a queue to hold executor pod updates to be processed. The resync and the watcher both enqueue pod updates, while a thread dequeues and processes each update sequentially. This avoids the need for explicit synchronization. The queue also serves as a cache.
[jira] [Commented] (SPARK-24248) [K8S] Use the Kubernetes cluster as the backing store for the state of pods
[ https://issues.apache.org/jira/browse/SPARK-24248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16471288#comment-16471288 ] Yinan Li commented on SPARK-24248: -- Just realized one thing: relying solely on the watcher poses the risk of losing executor pod updates. This can happen, for example, if the API server gets restarted or the watch connection is interrupted temporarily while the pods are running. So periodic polling is still needed. This is referred to as resync in controller terms. Enabling resync is almost always a good thing.
[jira] [Commented] (SPARK-24248) [K8S] Use the Kubernetes cluster as the backing store for the state of pods
[ https://issues.apache.org/jira/browse/SPARK-24248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16471259#comment-16471259 ] Yinan Li commented on SPARK-24248: -- Actually, even if the fabric8 client does not support caching, we can effectively achieve it and greatly simplify our code logic by doing the following: # Get rid of the existing in-memory data structures and replace them with a single in-memory cache of all live executor pod objects. # Update the cache on every watch event: a new-pod event adds an entry to the cache, a modification event updates the existing object, and a deletion event removes it. # Always get the status of an executor pod by retrieving the pod object from the cache, falling back to talking to the API server on a cache miss (due to the delay of watch events). Thoughts?
[jira] [Commented] (SPARK-24248) [K8S] Use the Kubernetes cluster as the backing store for the state of pods
[ https://issues.apache.org/jira/browse/SPARK-24248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16471244#comment-16471244 ] Yinan Li commented on SPARK-24248: -- It may be possible to get rid of the in-memory state in favor of getting pod state from the pod objects directly, if we are fine with the performance penalty of communicating with the API server for each state check. One optimization is to cache executor pod objects so that retrieving them doesn't involve network communication. This is possible with the golang client library, but I'm not sure about the Java client we use.
[jira] [Updated] (SPARK-24137) [K8s] Mount temporary directories in emptydir volumes
[ https://issues.apache.org/jira/browse/SPARK-24137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yinan Li updated SPARK-24137: - Fix Version/s: (was: 2.3.1) > [K8s] Mount temporary directories in emptydir volumes > - > > Key: SPARK-24137 > URL: https://issues.apache.org/jira/browse/SPARK-24137 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Matt Cheah >Assignee: Matt Cheah >Priority: Major > Fix For: 3.0.0 > > > Currently the Spark local directories do not get any volumes and volume > mounts, which means we're writing Spark shuffle and cache contents to the > file system mounted by Docker. This can be terribly inefficient. We should > use emptydir volumes for these directories instead for significant > performance improvements.
[jira] [Updated] (SPARK-24137) [K8s] Mount temporary directories in emptydir volumes
[ https://issues.apache.org/jira/browse/SPARK-24137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yinan Li updated SPARK-24137: - Fix Version/s: 2.3.1
[jira] [Comment Edited] (SPARK-24135) [K8s] Executors that fail to start up because of init-container errors are not retried and limit the executor pool size
[ https://issues.apache.org/jira/browse/SPARK-24135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16460066#comment-16460066 ] Yinan Li edited comment on SPARK-24135 at 5/1/18 7:53 PM: -- I agree that we should add detection for initialization errors. But I'm not sure that requesting new executors to replace the ones that failed initialization is a good idea. External webhooks or initializers are typically installed by cluster admins, and there is always a risk of bugs in the webhooks or initializers that cause pods to fail initialization. In the case of initializers, things are worse, as pods will not be able to get out of pending status if, for whatever reason, the controller handling a particular initializer is down. For the reasons [~mcheah] mentioned above, it's not obvious whether initialization errors should count towards job failures. I think keeping track of how many initialization errors are seen and stopping requesting new executors after a certain threshold might be a good idea. was (Author: liyinan926): I agree that we should add detection for initialization errors. But I'm not sure if requesting new executors to replace the ones that failed initialization is a good idea. External webhooks or initializers are typically installed by cluster admins and there's always risks of bugs in the webhooks or initializers that cause pods to fail initialization. In case of initializers, things are worse as pods will not be able to get out of pending status if for whatever reasons the controller that's handling a particular initializer is down. For the reasons [~mcheah] mentioned above, it's not obvious if initialization errors should count towards job failures. I think keeping track of how many initialization errors are seen and stopping requesting new executors might be a good idea.
> [K8s] Executors that fail to start up because of init-container errors are > not retried and limit the executor pool size > --- > > Key: SPARK-24135 > URL: https://issues.apache.org/jira/browse/SPARK-24135 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Matt Cheah >Priority: Major > > In KubernetesClusterSchedulerBackend, we detect if executors disconnect after > having been started or if executors hit the {{ERROR}} or {{DELETED}} states. > When executors fail in these ways, they are removed from the pending > executors pool and the driver should retry requesting these executors. > However, the driver does not handle a different class of error: when the pod > enters the {{Init:Error}} state. This state comes up when the executor fails > to launch because one of its init-containers fails. Spark itself doesn't > attach any init-containers to the executors. However, custom webhooks can > run on the cluster and attach init-containers to the executor pods. > Additionally, pod presets can specify init-containers to run on these pods. > Therefore Spark should handle the {{Init:Error}} cases regardless of whether > Spark itself is aware of init-containers or not. > This class of error is particularly bad because when we hit this state, the > failed executor will never start, but it's still seen as pending by the > executor allocator. The executor allocator won't request more rounds of > executors because its current batch hasn't been resolved to either running or > failed. Therefore we end up stuck with the number of executors > that successfully started before the faulty one failed to start, potentially > creating a fake resource bottleneck.
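The thresholding idea from the comment above (count init errors, stop requesting replacements past a limit) can be sketched as follows. This is a hypothetical allocator fragment, not Spark's actual KubernetesClusterSchedulerBackend code, and the threshold value is illustrative:

```python
# Sketch: stop requesting replacement executors once too many pods have
# failed initialization (Init:Error). Names and default are illustrative.

class ExecutorAllocator:
    def __init__(self, max_init_errors=5):
        self._max_init_errors = max_init_errors
        self._init_errors_seen = 0

    def on_pod_init_error(self):
        """Called when an executor pod is observed in the Init:Error state."""
        self._init_errors_seen += 1

    def should_request_replacement(self):
        """Replace failed executors only while below the error threshold."""
        return self._init_errors_seen < self._max_init_errors


alloc = ExecutorAllocator(max_init_errors=2)
assert alloc.should_request_replacement()
alloc.on_pod_init_error()
assert alloc.should_request_replacement()   # one error: still replacing
alloc.on_pod_init_error()
assert not alloc.should_request_replacement()  # threshold hit: stop
```

The point of the threshold is to avoid an endless replace-and-fail loop when a broken webhook or initializer dooms every new pod.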
[jira] [Commented] (SPARK-24135) [K8s] Executors that fail to start up because of init-container errors are not retried and limit the executor pool size
[ https://issues.apache.org/jira/browse/SPARK-24135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16460066#comment-16460066 ] Yinan Li commented on SPARK-24135: -- I agree that we should add detection for initialization errors. But I'm not sure that requesting new executors to replace the ones that failed initialization is a good idea. External webhooks or initializers are typically installed by cluster admins, and there is always a risk of bugs in the webhooks or initializers that cause pods to fail initialization. In the case of initializers, things are worse, as pods will not be able to get out of pending status if, for whatever reason, the controller handling a particular initializer is down. For the reasons [~mcheah] mentioned above, it's not obvious whether initialization errors should count towards job failures. I think keeping track of how many initialization errors are seen and stopping requesting new executors might be a good idea.
[jira] [Commented] (SPARK-24137) [K8s] Mount temporary directories in emptydir volumes
[ https://issues.apache.org/jira/browse/SPARK-24137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16459900#comment-16459900 ] Yinan Li commented on SPARK-24137: -- Yeah, {{LocalDirectoryMountConfigurationStep}} was missed in the upstream PRs. We should probably try to get it into 2.3.1.
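For context, mounting a Spark local directory as an emptyDir volume boils down to adding a volume/volumeMount pair to the pod spec. A simplified sketch of the pod-spec fragment such a step would emit (field names follow the Kubernetes Pod API; the path, index, and function name are illustrative, not the actual LocalDirectoryMountConfigurationStep code):

```python
# Sketch: build the emptyDir volume and matching mount for one Spark local
# directory, in the spirit of LocalDirectoryMountConfigurationStep.

def local_dir_volume(index, path):
    name = f"spark-local-dir-{index}"
    volume = {"name": name, "emptyDir": {}}     # goes in pod.spec.volumes
    mount = {"name": name, "mountPath": path}   # goes in container.volumeMounts
    return volume, mount


volume, mount = local_dir_volume(0, "/var/data/spark-local")
assert volume == {"name": "spark-local-dir-0", "emptyDir": {}}
assert mount["mountPath"] == "/var/data/spark-local"
```

With this in place, shuffle and cache writes land on the node's backing storage for emptyDir rather than the container's writable layer.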
[jira] [Commented] (SPARK-24135) [K8s] Executors that fail to start up because of init-container errors are not retried and limit the executor pool size
[ https://issues.apache.org/jira/browse/SPARK-24135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16459892#comment-16459892 ] Yinan Li commented on SPARK-24135: -- I think it's fine to detect and delete the executor pods that failed initialization. But I'm not sure how much this buys us, because the newly requested executors will very likely also fail to be initialized, in particular if the init-container is added by an external webhook or an initializer. In that case the job won't be able to proceed, and the bottleneck effectively still exists.
[jira] [Updated] (SPARK-24091) Internally used ConfigMap prevents use of user-specified ConfigMaps carrying Spark configs files
[ https://issues.apache.org/jira/browse/SPARK-24091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yinan Li updated SPARK-24091: - Affects Version/s: (was: 2.3.0) 2.4.0 > Internally used ConfigMap prevents use of user-specified ConfigMaps carrying > Spark configs files > > > Key: SPARK-24091 > URL: https://issues.apache.org/jira/browse/SPARK-24091 > Project: Spark > Issue Type: Brainstorming > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Yinan Li >Priority: Major > > The recent PR [https://github.com/apache/spark/pull/20669] for removing the > init-container introduced an internally used ConfigMap carrying Spark > configuration properties in a file for the driver. This ConfigMap gets > mounted under {{$SPARK_HOME/conf}} and the environment variable > {{SPARK_CONF_DIR}} is set to point to the mount path. This pretty much > prevents users from mounting their own ConfigMaps that carry custom Spark > configuration files, e.g., {{log4j.properties}} and {{spark-env.sh}}, and > leaves users with only the option of building custom images. IMO, it is very > useful to support mounting user-specified ConfigMaps for custom Spark > configuration files. This warrants further discussion.
[jira] [Created] (SPARK-24091) Internally used ConfigMap prevents use of user-specified ConfigMaps carrying Spark configs files
Yinan Li created SPARK-24091: Summary: Internally used ConfigMap prevents use of user-specified ConfigMaps carrying Spark configs files Key: SPARK-24091 URL: https://issues.apache.org/jira/browse/SPARK-24091 Project: Spark Issue Type: Brainstorming Components: Kubernetes Affects Versions: 2.3.0 Reporter: Yinan Li The recent PR [https://github.com/apache/spark/pull/20669] for removing the init-container introduced an internally used ConfigMap carrying Spark configuration properties in a file for the driver. This ConfigMap gets mounted under {{$SPARK_HOME/conf}} and the environment variable {{SPARK_CONF_DIR}} is set to point to the mount path. This pretty much prevents users from mounting their own ConfigMaps that carry custom Spark configuration files, e.g., {{log4j.properties}} and {{spark-env.sh}}, and leaves users with only the option of building custom images. IMO, it is very useful to support mounting user-specified ConfigMaps for custom Spark configuration files. This warrants further discussion.
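The conflict described above is ultimately a mount-path collision: the driver pod already has a volume mounted at the conf directory, so a user-supplied ConfigMap cannot target the same path. A rough sketch of the collision check (all names and paths here are illustrative, not Spark's actual values):

```python
# Sketch: a user ConfigMap mount collides with Spark's internal conf
# ConfigMap because both want the SPARK_CONF_DIR path. Names/paths are
# illustrative only.

SPARK_CONF_DIR = "/opt/spark/conf"  # assumed mount path of the internal ConfigMap

def can_mount(existing_mounts, new_mount):
    """A new volume mount is rejected if its mountPath is already taken."""
    taken = {m["mountPath"] for m in existing_mounts}
    return new_mount["mountPath"] not in taken


driver_mounts = [{"name": "spark-conf-volume", "mountPath": SPARK_CONF_DIR}]
user_mount = {"name": "user-log4j", "mountPath": SPARK_CONF_DIR}

# The user's ConfigMap carrying log4j.properties cannot land in SPARK_CONF_DIR:
assert not can_mount(driver_mounts, user_mount)
# It could only be mounted elsewhere, where Spark would not pick it up:
assert can_mount(driver_mounts, {"name": "user-log4j", "mountPath": "/etc/user-conf"})
```

This is why, without support for merging user-specified ConfigMaps, building a custom image is the only way to customize these files.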
[jira] [Resolved] (SPARK-23638) Spark on k8s: spark.kubernetes.initContainer.image has no effect
[ https://issues.apache.org/jira/browse/SPARK-23638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yinan Li resolved SPARK-23638. -- Resolution: Not A Problem > Spark on k8s: spark.kubernetes.initContainer.image has no effect > > > Key: SPARK-23638 > URL: https://issues.apache.org/jira/browse/SPARK-23638 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0 > Environment: K8 server: Ubuntu 16.04 > Submission client: macOS Sierra 10.12.x > Client Version: version.Info\{Major:"1", Minor:"9", GitVersion:"v1.9.3", > GitCommit:"d2835416544f298c919e2ead3be3d0864b52323b", GitTreeState:"clean", > BuildDate:"2018-02-07T12:22:21Z", GoVersion:"go1.9.2", Compiler:"gc", > Platform:"darwin/amd64"} > Server Version: version.Info\{Major:"1", Minor:"8", GitVersion:"v1.8.3", > GitCommit:"f0efb3cb883751c5ffdbe6d515f3cb4fbe7b7acd", GitTreeState:"clean", > BuildDate:"2017-11-08T18:27:48Z", GoVersion:"go1.8.3", Compiler:"gc", > Platform:"linux/amd64"} >Reporter: maheshvra >Priority: Major > > Hi all - I am trying to use an initContainer to download remote dependencies. To > begin with, I ran a test with an initContainer that basically runs "echo hello > world". However, when I triggered the pod deployment via spark-submit, I did > not see any trace of initContainer execution in my Kubernetes cluster.
> > {code:java} > SPARK_DRIVER_MEMORY: 1g > SPARK_DRIVER_CLASS: com.bigdata.App SPARK_DRIVER_ARGS: -c > /opt/spark/work-dir/app/main/environments/int -w > ./../../workflows/workflow_main.json -e prod -n features -v off > SPARK_DRIVER_BIND_ADDRESS: > SPARK_JAVA_OPT_0: -Dspark.submit.deployMode=cluster > SPARK_JAVA_OPT_1: -Dspark.driver.blockManager.port=7079 > SPARK_JAVA_OPT_2: -Dspark.app.name=fg-am00-raw12 > SPARK_JAVA_OPT_3: > -Dspark.kubernetes.container.image=docker.com/cmapp/fg-am00-raw:1.0.0 > SPARK_JAVA_OPT_4: -Dspark.app.id=spark-4fa9a5ce1b1d401fa9c1e413ff030d44 > SPARK_JAVA_OPT_5: > -Dspark.jars=/opt/spark/jars/aws-java-sdk-1.7.4.jar,/opt/spark/jars/hadoop-aws-2.7.3.jar,/opt/spark/jars/guava-14.0.1.jar,/opt/spark/jars/SparkApp.jar,/opt/spark/jars/datacleanup-component-1.0-SNAPSHOT.jar > > SPARK_JAVA_OPT_6: -Dspark.driver.port=7078 > SPARK_JAVA_OPT_7: > -Dspark.kubernetes.initContainer.image=docker.com/cmapp/custombusybox:1.0.0 > SPARK_JAVA_OPT_8: > -Dspark.kubernetes.executor.podNamePrefix=fg-am00-raw12-b1c8112b8536304ab0fc64fcc41e0615 > > SPARK_JAVA_OPT_9: > -Dspark.kubernetes.driver.pod.name=fg-am00-raw12-b1c8112b8536304ab0fc64fcc41e0615-driver > > SPARK_JAVA_OPT_10: > -Dspark.driver.host=fg-am00-raw12-b1c8112b8536304ab0fc64fcc41e0615-driver-svc.experimental.svc > SPARK_JAVA_OPT_11: -Dspark.executor.instances=5 > SPARK_JAVA_OPT_12: > -Dspark.hadoop.fs.s3a.server-side-encryption-algorithm=AES256 > SPARK_JAVA_OPT_13: -Dspark.kubernetes.namespace=experimental > SPARK_JAVA_OPT_14: > -Dspark.kubernetes.authenticate.driver.serviceAccountName=experimental-service-account > SPARK_JAVA_OPT_15: -Dspark.master=k8s://https://bigdata > {code} > > Further, I did not see spec.initContainers section in the generated pod. 
> Please see the details below > > {code:java} > > { > "kind": "Pod", > "apiVersion": "v1", > "metadata": { > "name": "fg-am00-raw12-b1c8112b8536304ab0fc64fcc41e0615-driver", > "namespace": "experimental", > "selfLink": > "/api/v1/namespaces/experimental/pods/fg-am00-raw12-b1c8112b8536304ab0fc64fcc41e0615-driver", > "uid": "adc5a50a-2342-11e8-87dc-12c5b3954044", > "resourceVersion": "299054", > "creationTimestamp": "2018-03-09T02:36:32Z", > "labels": { > "spark-app-selector": "spark-4fa9a5ce1b1d401fa9c1e413ff030d44", > "spark-role": "driver" > }, > "annotations": { > "spark-app-name": "fg-am00-raw12" > } > }, > "spec": { > "volumes": [ > { > "name": "experimental-service-account-token-msmth", > "secret": { > "secretName": "experimental-service-account-token-msmth", > "defaultMode": 420 > } > } > ], > "containers": [ > { > "name": "spark-kubernetes-driver", > "image": "docker.com/cmapp/fg-am00-raw:1.0.0", > "args": [ > "driver" > ], > "env": [ > { > "name": "SPARK_DRIVER_MEMORY", > "value": "1g" > }, > { > "name": "SPARK_DRIVER_CLASS", > "value": "com.myapp.App" > }, > { > "name": "SPARK_DRIVER_ARGS", > "value": "-c /opt/spark/work-dir/app/main/environments/int -w > ./../../workflows/workflow_main.json -e prod -n features -v off" > }, > { > "name": "SPARK_DRIVER_BIND_ADDRESS", > "valueFrom": { > "fieldRef": { > "apiVersion": "v1", > "fieldPath": "status.podIP" > } > } > }, > { > "name": "SPARK_MOUNTED_CLASSPATH", > "value": > "/opt/spark/jars/aws-java-sdk-1.7.4.jar:/opt/spark/jars/hadoop-aws-2.7.3.jar:/opt/spark/jars/guava-14.0.1.jar:/opt/spark/jars/datacleanup-component-1.0-SNAPSHOT.jar:/opt/spark/jars/SparkApp.jar" > }, > {
[jira] [Commented] (SPARK-24028) [K8s] Creating secrets and config maps before creating the driver pod has unpredictable behavior
[ https://issues.apache.org/jira/browse/SPARK-24028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16444899#comment-16444899 ] Yinan Li commented on SPARK-24028: -- 2.3.0 does create a configmap for the init-container if one is used. See [https://github.com/apache/spark/blob/branch-2.3/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/steps/DriverInitContainerBootstrapStep.scala#L54]. The content of this configmap is used when the init-container starts. > [K8s] Creating secrets and config maps before creating the driver pod has > unpredictable behavior > > > Key: SPARK-24028 > URL: https://issues.apache.org/jira/browse/SPARK-24028 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Matt Cheah >Priority: Critical > > Currently we create the Kubernetes resources the driver depends on - such as > the properties config map and secrets to mount into the pod - only after we > create the driver pod. This is because we want these extra objects to > immediately have an owner reference tying them to the driver pod. > On our Kubernetes 1.9.4 cluster, we're seeing that sometimes this works > fine, but other times the driver ends up being started with empty volumes > instead of volumes with the contents of the secrets we expect. The result is > that sometimes the driver will start without these files mounted, which leads > to various failures if the driver requires these files to be present early on > in its code. Missing the properties-file config map, for example, would > mean spark-submit doesn't have a properties file to read at all. See the > warning on [https://kubernetes.io/docs/concepts/storage/volumes/#secret]. > Unfortunately we cannot link owner references to non-existent objects, so we > have to do this instead: > # Create the auxiliary resources without any owner references. > # Create the driver pod mounting these resources into volumes, as before. > # If #2 fails, clean up the resources created in #1. > # Edit the auxiliary resources to have an owner reference for the driver pod. > The multi-step approach leaves a small chance for us to leak resources - for > example, if we fail to make the resource edits in #4 for some reason. This > also changes the permissioning mode required for spark-submit - credentials > provided to spark-submit need to be able to edit resources in addition to > creating them.
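The four-step submission sequence in the issue description can be sketched as follows. The `client` API here is a hypothetical stand-in (the real code drives the fabric8 Kubernetes client), and the FakeClient exists only to make the sketch self-contained:

```python
# Sketch of the create-resources-then-adopt workflow from the issue:
# 1) create auxiliary resources without owners, 2) create the driver pod,
# 3) clean up on failure, 4) patch owner references afterwards.

def submit(client, aux_resources, driver_pod):
    created = [client.create(r) for r in aux_resources]   # step 1
    try:
        pod = client.create(driver_pod)                   # step 2
    except Exception:
        for r in created:                                 # step 3
            client.delete(r)
        raise
    for r in created:                                     # step 4
        client.set_owner(r, pod)
    return pod


class FakeClient:
    """In-memory stand-in for a Kubernetes client, for illustration only."""
    def __init__(self, fail_pod=False):
        self.objects, self.owners, self.fail_pod = [], {}, fail_pod
    def create(self, obj):
        if self.fail_pod and obj.get("kind") == "Pod":
            raise RuntimeError("pod creation failed")
        self.objects.append(obj)
        return obj
    def delete(self, obj):
        self.objects.remove(obj)
    def set_owner(self, obj, owner):
        self.owners[obj["name"]] = owner["name"]


ok = FakeClient()
submit(ok, [{"kind": "ConfigMap", "name": "spark-conf"}], {"kind": "Pod", "name": "driver"})
assert ok.owners["spark-conf"] == "driver"   # step 4 adopted the resource

failing = FakeClient(fail_pod=True)
try:
    submit(failing, [{"kind": "Secret", "name": "creds"}], {"kind": "Pod", "name": "driver"})
except RuntimeError:
    pass
assert failing.objects == []                 # step 3 cleaned up the orphan
```

The sketch also makes the two caveats concrete: a crash between steps 2 and 4 leaks unowned resources, and step 4 requires edit (not just create) permissions.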
[jira] [Comment Edited] (SPARK-24028) [K8s] Creating secrets and config maps before creating the driver pod has unpredictable behavior
[ https://issues.apache.org/jira/browse/SPARK-24028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16444890#comment-16444890 ] Yinan Li edited comment on SPARK-24028 at 4/19/18 10:14 PM: -- I run a 1.9.6 cluster. No, I was using the 2.3.0 release. The configmap I was referring to was for the init-container. was (Author: liyinan926): I run a 1.9.6 cluster. No, I was using the 2.3.0 release.
[jira] [Commented] (SPARK-24028) [K8s] Creating secrets and config maps before creating the driver pod has unpredictable behavior
[ https://issues.apache.org/jira/browse/SPARK-24028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16444890#comment-16444890 ] Yinan Li commented on SPARK-24028: -- I run a 1.9.6 cluster. No, I was using the 2.3.0 release.
[jira] [Commented] (SPARK-24028) [K8s] Creating secrets and config maps before creating the driver pod has unpredictable behavior
[ https://issues.apache.org/jira/browse/SPARK-24028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16444856#comment-16444856 ] Yinan Li commented on SPARK-24028: -- I am also running a 1.9 cluster on GKE and I have never run into the issue you mentioned above. I do often see events on the driver pod showing that the configmap failed to mount, but eventually retries just succeeded. I believe a pod won't start running if any of the specified volumes (being it a secret volume, a configmap volume, or something else) fail to mount, and Kubernetes also retries mounting volumes that it failed to mount when the pod first started. > [K8s] Creating secrets and config maps before creating the driver pod has > unpredictable behavior > > > Key: SPARK-24028 > URL: https://issues.apache.org/jira/browse/SPARK-24028 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Matt Cheah >Priority: Critical > > Currently we create the Kubernetes resources the driver depends on - such as > the properties config map and secrets to mount into the pod - only after we > create the driver pod. This is because we want these extra objects to > immediately have an owner reference to be tied to the driver pod. > On our Kubernetes 1.9.4. cluster, we're seeing that sometimes this works > fine, but other times the driver ends up being started with empty volumes > instead of volumes with the contents of the secrets we expect. The result is > that sometimes the driver will start without these files mounted, which leads > to various failures if the driver requires these files to be present early on > in their code. Missing the properties file config map, for example, would > mean spark-submit doesn't have a properties file to read at all. See the > warning on [https://kubernetes.io/docs/concepts/storage/volumes/#secret.] 
> Unfortunately we cannot link owner references to non-existent objects, so we > have to do this instead: > # Create the auxiliary resources without any owner references. > # Create the driver pod mounting these resources into volumes, as before. > # If #2 fails, clean up the resources created in #1. > # Edit the auxiliary resources to have an owner reference for the driver pod. > The multi-step approach leaves a small chance for us to leak resources - for > example, if we fail to make the resource edits in #4 for some reason. This > also changes the permission model required for spark-submit - credentials > provided to spark-submit need to be able to edit resources in addition to > creating them. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
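The four-step flow described above can be sketched in pure data-construction terms. This is a hypothetical illustration, not Spark's actual submission-client code; the function names and the callable parameters standing in for Kubernetes API calls are assumptions, while the `ownerReferences` field names follow the Kubernetes API.

```python
def owner_reference_patch(pod_name: str, pod_uid: str) -> dict:
    """Metadata patch applied in step 4: tie an auxiliary resource
    (properties config map, secret, ...) to the driver pod so that
    Kubernetes garbage-collects it when the pod is deleted."""
    return {
        "metadata": {
            "ownerReferences": [
                {
                    "apiVersion": "v1",
                    "kind": "Pod",
                    "name": pod_name,
                    "uid": pod_uid,
                    "controller": True,
                    "blockOwnerDeletion": True,
                }
            ]
        }
    }


def submit(create_resources, create_driver_pod, delete_resources, patch_resource):
    """Hypothetical driver of steps 1-4; the four callables stand in for
    Kubernetes API calls. Step 3 cleans up if pod creation fails; a failure
    in step 4 is where a resource could leak, as the issue notes."""
    resources = create_resources()           # 1: no owner references yet
    try:
        pod = create_driver_pod(resources)   # 2: mount resources as volumes
    except Exception:
        delete_resources(resources)          # 3: clean up on failure
        raise
    for r in resources:                      # 4: requires edit permission
        patch_resource(r, owner_reference_patch(pod["name"], pod["uid"]))
    return pod
```

Step 4 is also why the credentials given to spark-submit need edit (patch) permission on these resource kinds, not just create permission.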
[jira] [Commented] (SPARK-23638) Spark on k8s: spark.kubernetes.initContainer.image has no effect
[ https://issues.apache.org/jira/browse/SPARK-23638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16440067#comment-16440067 ] Yinan Li commented on SPARK-23638: -- Can this be closed? > Spark on k8s: spark.kubernetes.initContainer.image has no effect > > > Key: SPARK-23638 > URL: https://issues.apache.org/jira/browse/SPARK-23638 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0 > Environment: K8 server: Ubuntu 16.04 > Submission client: macOS Sierra 10.12.x > Client Version: version.Info\{Major:"1", Minor:"9", GitVersion:"v1.9.3", > GitCommit:"d2835416544f298c919e2ead3be3d0864b52323b", GitTreeState:"clean", > BuildDate:"2018-02-07T12:22:21Z", GoVersion:"go1.9.2", Compiler:"gc", > Platform:"darwin/amd64"} > Server Version: version.Info\{Major:"1", Minor:"8", GitVersion:"v1.8.3", > GitCommit:"f0efb3cb883751c5ffdbe6d515f3cb4fbe7b7acd", GitTreeState:"clean", > BuildDate:"2017-11-08T18:27:48Z", GoVersion:"go1.8.3", Compiler:"gc", > Platform:"linux/amd64"} >Reporter: maheshvra >Priority: Major > > Hi all - I am trying to use initContainer to download remote dependencies. To > begin with, I ran a test with initContainer which basically "echo hello > world". However, when i triggered the pod deployment via spark-submit, I did > not see any trace of initContainer execution in my kubernetes cluster. 
> > {code:java} > SPARK_DRIVER_MEMORY: 1g > SPARK_DRIVER_CLASS: com.bigdata.App SPARK_DRIVER_ARGS: -c > /opt/spark/work-dir/app/main/environments/int -w > ./../../workflows/workflow_main.json -e prod -n features -v off > SPARK_DRIVER_BIND_ADDRESS: > SPARK_JAVA_OPT_0: -Dspark.submit.deployMode=cluster > SPARK_JAVA_OPT_1: -Dspark.driver.blockManager.port=7079 > SPARK_JAVA_OPT_2: -Dspark.app.name=fg-am00-raw12 > SPARK_JAVA_OPT_3: > -Dspark.kubernetes.container.image=docker.com/cmapp/fg-am00-raw:1.0.0 > SPARK_JAVA_OPT_4: -Dspark.app.id=spark-4fa9a5ce1b1d401fa9c1e413ff030d44 > SPARK_JAVA_OPT_5: > -Dspark.jars=/opt/spark/jars/aws-java-sdk-1.7.4.jar,/opt/spark/jars/hadoop-aws-2.7.3.jar,/opt/spark/jars/guava-14.0.1.jar,/opt/spark/jars/SparkApp.jar,/opt/spark/jars/datacleanup-component-1.0-SNAPSHOT.jar > > SPARK_JAVA_OPT_6: -Dspark.driver.port=7078 > SPARK_JAVA_OPT_7: > -Dspark.kubernetes.initContainer.image=docker.com/cmapp/custombusybox:1.0.0 > SPARK_JAVA_OPT_8: > -Dspark.kubernetes.executor.podNamePrefix=fg-am00-raw12-b1c8112b8536304ab0fc64fcc41e0615 > > SPARK_JAVA_OPT_9: > -Dspark.kubernetes.driver.pod.name=fg-am00-raw12-b1c8112b8536304ab0fc64fcc41e0615-driver > > SPARK_JAVA_OPT_10: > -Dspark.driver.host=fg-am00-raw12-b1c8112b8536304ab0fc64fcc41e0615-driver-svc.experimental.svc > SPARK_JAVA_OPT_11: -Dspark.executor.instances=5 > SPARK_JAVA_OPT_12: > -Dspark.hadoop.fs.s3a.server-side-encryption-algorithm=AES256 > SPARK_JAVA_OPT_13: -Dspark.kubernetes.namespace=experimental > SPARK_JAVA_OPT_14: > -Dspark.kubernetes.authenticate.driver.serviceAccountName=experimental-service-account > SPARK_JAVA_OPT_15: -Dspark.master=k8s://https://bigdata > {code} > > Further, I did not see spec.initContainers section in the generated pod. 
> Please see the details below > > {code:java} > > { > "kind": "Pod", > "apiVersion": "v1", > "metadata": { > "name": "fg-am00-raw12-b1c8112b8536304ab0fc64fcc41e0615-driver", > "namespace": "experimental", > "selfLink": > "/api/v1/namespaces/experimental/pods/fg-am00-raw12-b1c8112b8536304ab0fc64fcc41e0615-driver", > "uid": "adc5a50a-2342-11e8-87dc-12c5b3954044", > "resourceVersion": "299054", > "creationTimestamp": "2018-03-09T02:36:32Z", > "labels": { > "spark-app-selector": "spark-4fa9a5ce1b1d401fa9c1e413ff030d44", > "spark-role": "driver" > }, > "annotations": { > "spark-app-name": "fg-am00-raw12" > } > }, > "spec": { > "volumes": [ > { > "name": "experimental-service-account-token-msmth", > "secret": { > "secretName": "experimental-service-account-token-msmth", > "defaultMode": 420 > } > } > ], > "containers": [ > { > "name": "spark-kubernetes-driver", > "image": "docker.com/cmapp/fg-am00-raw:1.0.0", > "args": [ > "driver" > ], > "env": [ > { > "name": "SPARK_DRIVER_MEMORY", > "value": "1g" > }, > { > "name": "SPARK_DRIVER_CLASS", > "value": "com.myapp.App" > }, > { > "name": "SPARK_DRIVER_ARGS", > "value": "-c /opt/spark/work-dir/app/main/environments/int -w > ./../../workflows/workflow_main.json -e prod -n features -v off" > }, > { > "name": "SPARK_DRIVER_BIND_ADDRESS", > "valueFrom": { > "fieldRef": { > "apiVersion": "v1", > "fieldPath": "status.podIP" > } > } > }, > { > "name": "SPARK_MOUNTED_CLASSPATH", > "value": >
[jira] [Commented] (SPARK-23638) Spark on k8s: spark.kubernetes.initContainer.image has no effect
[ https://issues.apache.org/jira/browse/SPARK-23638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16402070#comment-16402070 ] Yinan Li commented on SPARK-23638: -- The Kubernetes-specific submission client will only add an init-container to the driver and executor pods if there are any remote dependencies to download. Otherwise, it won't add one, regardless of whether you specify \{{spark.kubernetes.initContainer.image}}. > Spark on k8s: spark.kubernetes.initContainer.image has no effect > > > Key: SPARK-23638 > URL: https://issues.apache.org/jira/browse/SPARK-23638 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0 > Environment: K8 server: Ubuntu 16.04 > Submission client: macOS Sierra 10.12.x > Client Version: version.Info\{Major:"1", Minor:"9", GitVersion:"v1.9.3", > GitCommit:"d2835416544f298c919e2ead3be3d0864b52323b", GitTreeState:"clean", > BuildDate:"2018-02-07T12:22:21Z", GoVersion:"go1.9.2", Compiler:"gc", > Platform:"darwin/amd64"} > Server Version: version.Info\{Major:"1", Minor:"8", GitVersion:"v1.8.3", > GitCommit:"f0efb3cb883751c5ffdbe6d515f3cb4fbe7b7acd", GitTreeState:"clean", > BuildDate:"2017-11-08T18:27:48Z", GoVersion:"go1.8.3", Compiler:"gc", > Platform:"linux/amd64"} >Reporter: maheshvra >Priority: Major > > Hi all - I am trying to use initContainer to download remote dependencies. To > begin with, I ran a test with initContainer which basically "echo hello > world". However, when i triggered the pod deployment via spark-submit, I did > not see any trace of initContainer execution in my kubernetes cluster. 
> > {code:java} > SPARK_DRIVER_MEMORY: 1g > SPARK_DRIVER_CLASS: com.bigdata.App SPARK_DRIVER_ARGS: -c > /opt/spark/work-dir/app/main/environments/int -w > ./../../workflows/workflow_main.json -e prod -n features -v off > SPARK_DRIVER_BIND_ADDRESS: > SPARK_JAVA_OPT_0: -Dspark.submit.deployMode=cluster > SPARK_JAVA_OPT_1: -Dspark.driver.blockManager.port=7079 > SPARK_JAVA_OPT_2: -Dspark.app.name=fg-am00-raw12 > SPARK_JAVA_OPT_3: > -Dspark.kubernetes.container.image=docker.com/cmapp/fg-am00-raw:1.0.0 > SPARK_JAVA_OPT_4: -Dspark.app.id=spark-4fa9a5ce1b1d401fa9c1e413ff030d44 > SPARK_JAVA_OPT_5: > -Dspark.jars=/opt/spark/jars/aws-java-sdk-1.7.4.jar,/opt/spark/jars/hadoop-aws-2.7.3.jar,/opt/spark/jars/guava-14.0.1.jar,/opt/spark/jars/SparkApp.jar,/opt/spark/jars/datacleanup-component-1.0-SNAPSHOT.jar > > SPARK_JAVA_OPT_6: -Dspark.driver.port=7078 > SPARK_JAVA_OPT_7: > -Dspark.kubernetes.initContainer.image=docker.com/cmapp/custombusybox:1.0.0 > SPARK_JAVA_OPT_8: > -Dspark.kubernetes.executor.podNamePrefix=fg-am00-raw12-b1c8112b8536304ab0fc64fcc41e0615 > > SPARK_JAVA_OPT_9: > -Dspark.kubernetes.driver.pod.name=fg-am00-raw12-b1c8112b8536304ab0fc64fcc41e0615-driver > > SPARK_JAVA_OPT_10: > -Dspark.driver.host=fg-am00-raw12-b1c8112b8536304ab0fc64fcc41e0615-driver-svc.experimental.svc > SPARK_JAVA_OPT_11: -Dspark.executor.instances=5 > SPARK_JAVA_OPT_12: > -Dspark.hadoop.fs.s3a.server-side-encryption-algorithm=AES256 > SPARK_JAVA_OPT_13: -Dspark.kubernetes.namespace=experimental > SPARK_JAVA_OPT_14: > -Dspark.kubernetes.authenticate.driver.serviceAccountName=experimental-service-account > SPARK_JAVA_OPT_15: -Dspark.master=k8s://https://bigdata > {code} > > Further, I did not see spec.initContainers section in the generated pod. 
> Please see the details below > > {code:java} > > { > "kind": "Pod", > "apiVersion": "v1", > "metadata": { > "name": "fg-am00-raw12-b1c8112b8536304ab0fc64fcc41e0615-driver", > "namespace": "experimental", > "selfLink": > "/api/v1/namespaces/experimental/pods/fg-am00-raw12-b1c8112b8536304ab0fc64fcc41e0615-driver", > "uid": "adc5a50a-2342-11e8-87dc-12c5b3954044", > "resourceVersion": "299054", > "creationTimestamp": "2018-03-09T02:36:32Z", > "labels": { > "spark-app-selector": "spark-4fa9a5ce1b1d401fa9c1e413ff030d44", > "spark-role": "driver" > }, > "annotations": { > "spark-app-name": "fg-am00-raw12" > } > }, > "spec": { > "volumes": [ > { > "name": "experimental-service-account-token-msmth", > "secret": { > "secretName": "experimental-service-account-token-msmth", > "defaultMode": 420 > } > } > ], > "containers": [ > { > "name": "spark-kubernetes-driver", > "image": "docker.com/cmapp/fg-am00-raw:1.0.0", > "args": [ > "driver" > ], > "env": [ > { > "name": "SPARK_DRIVER_MEMORY", > "value": "1g" > }, > { > "name": "SPARK_DRIVER_CLASS", > "value": "com.myapp.App" > }, > { > "name": "SPARK_DRIVER_ARGS", > "value": "-c /opt/spark/work-dir/app/main/environments/int -w > ./../../workflows/workflow_main.json -e prod -n features -v off" > }, > { > "name": "SPARK_DRIVER_BIND_ADDRESS", > "valueFrom": { > "fieldRef": { > "apiVersion": "v1", > "fieldPath": "status.podIP" > } > } > }, >
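The decision described in the comment above, that the submission client adds an init-container only when there are remote dependencies to download, can be sketched as below. This is an illustrative simplification, not the actual Spark 2.3 code; treating only no-scheme paths and `local://` URIs as already present in the container image is an assumption.

```python
from urllib.parse import urlparse

def needs_init_container(dependency_uris):
    """Return True iff at least one dependency has to be downloaded.
    Assumption: a bare path or a local:// URI already lives inside the
    container image, so it never triggers an init-container; any other
    scheme (http, https, hdfs, ...) counts as remote."""
    def is_remote(uri: str) -> bool:
        return urlparse(uri).scheme not in ("", "local")
    return any(is_remote(u) for u in dependency_uris)
```

Under this rule, the `spark.jars` list in the log above, which contains only in-image `/opt/spark/jars/...` paths, would never trigger an init-container, which matches the missing `spec.initContainers` section the reporter observed.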
[jira] [Updated] (SPARK-23571) Delete auxiliary Kubernetes resources upon application completion
[ https://issues.apache.org/jira/browse/SPARK-23571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yinan Li updated SPARK-23571: - Affects Version/s: 2.3.1 > Delete auxiliary Kubernetes resources upon application completion > - > > Key: SPARK-23571 > URL: https://issues.apache.org/jira/browse/SPARK-23571 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.3.0, 2.3.1 >Reporter: Yinan Li >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23571) Delete auxiliary Kubernetes resources upon application completion
Yinan Li created SPARK-23571: Summary: Delete auxiliary Kubernetes resources upon application completion Key: SPARK-23571 URL: https://issues.apache.org/jira/browse/SPARK-23571 Project: Spark Issue Type: Improvement Components: Kubernetes Affects Versions: 2.3.0 Reporter: Yinan Li -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-23485) Kubernetes should support node blacklist
[ https://issues.apache.org/jira/browse/SPARK-23485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16374757#comment-16374757 ] Yinan Li edited comment on SPARK-23485 at 2/23/18 6:22 PM: --- It's not that I'm too confident on the capability of Kubernetes to detect node problems. I just don't see it as a good practice of worrying about node problems at application level in a containerized environment running on a container orchestration system. For that reason, yes, I don't think Spark on Kubernetes should really need to worry about blacklisting nodes. was (Author: liyinan926): It's not that I'm too confident on the capability of Kubernetes to detect node problems. I just don't see it as a good practice of worrying about node problems at application level in a containerized environment running on a container orchestration system. Yes, I don't think Spark on Kubernetes should really need to worry about blacklisting nodes. > Kubernetes should support node blacklist > > > Key: SPARK-23485 > URL: https://issues.apache.org/jira/browse/SPARK-23485 > Project: Spark > Issue Type: New Feature > Components: Kubernetes, Scheduler >Affects Versions: 2.3.0 >Reporter: Imran Rashid >Priority: Major > > Spark's BlacklistTracker maintains a list of "bad nodes" which it will not > use for running tasks (eg., because of bad hardware). When running in yarn, > this blacklist is used to avoid ever allocating resources on blacklisted > nodes: > https://github.com/apache/spark/blob/e836c27ce011ca9aef822bef6320b4a7059ec343/resource-managers/yarn/src/main/scala/org/apache/spark/scheduler/cluster/YarnSchedulerBackend.scala#L128 > I'm just beginning to poke around the kubernetes code, so apologies if this > is incorrect -- but I didn't see any references to > {{scheduler.nodeBlacklist()}} in {{KubernetesClusterSchedulerBackend}} so it > seems this is missing. Thought of this while looking at SPARK-19755, a > similar issue on mesos. 
[jira] [Commented] (SPARK-23485) Kubernetes should support node blacklist
[ https://issues.apache.org/jira/browse/SPARK-23485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16374757#comment-16374757 ] Yinan Li commented on SPARK-23485: -- It's not that I'm too confident on the capability of Kubernetes to detect node problems. I just don't see it as a good practice of worrying about node problems at application level in a containerized environment running on a container orchestration system. Yes, I don't think Spark on Kubernetes should really need to worry about blacklisting nodes. > Kubernetes should support node blacklist > > > Key: SPARK-23485 > URL: https://issues.apache.org/jira/browse/SPARK-23485 > Project: Spark > Issue Type: New Feature > Components: Kubernetes, Scheduler >Affects Versions: 2.3.0 >Reporter: Imran Rashid >Priority: Major > > Spark's BlacklistTracker maintains a list of "bad nodes" which it will not > use for running tasks (eg., because of bad hardware). When running in yarn, > this blacklist is used to avoid ever allocating resources on blacklisted > nodes: > https://github.com/apache/spark/blob/e836c27ce011ca9aef822bef6320b4a7059ec343/resource-managers/yarn/src/main/scala/org/apache/spark/scheduler/cluster/YarnSchedulerBackend.scala#L128 > I'm just beginning to poke around the kubernetes code, so apologies if this > is incorrect -- but I didn't see any references to > {{scheduler.nodeBlacklist()}} in {{KubernetesClusterSchedulerBackend}} so it > seems this is missing. Thought of this while looking at SPARK-19755, a > similar issue on mesos. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23485) Kubernetes should support node blacklist
[ https://issues.apache.org/jira/browse/SPARK-23485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16374708#comment-16374708 ] Yinan Li commented on SPARK-23485: -- In the Yarn case, yes, it's possible that a node is missing a jar commonly needed by applications. In the Kubernetes mode, this will never be the case because containers either all have a particular jar locally or none of them has it. An image missing a dependency is problematic by itself. This consistency is one of the benefits of being containerized. As for node problems, detecting them and avoiding scheduling pods onto problematic nodes are the concerns of the kubelets and the scheduler. Applications should not need to worry about whether nodes are healthy. Node problems happening at runtime cause pods to be evicted from the problematic nodes and rescheduled somewhere else. Having applications be responsible for keeping track of problematic nodes and maintaining a blacklist means unnecessarily jumping into the business of the kubelets and the scheduler. [~foxish] > Kubernetes should support node blacklist > > > Key: SPARK-23485 > URL: https://issues.apache.org/jira/browse/SPARK-23485 > Project: Spark > Issue Type: New Feature > Components: Kubernetes, Scheduler >Affects Versions: 2.3.0 >Reporter: Imran Rashid >Priority: Major > > Spark's BlacklistTracker maintains a list of "bad nodes" which it will not > use for running tasks (eg., because of bad hardware). 
When running in yarn, > this blacklist is used to avoid ever allocating resources on blacklisted > nodes: > https://github.com/apache/spark/blob/e836c27ce011ca9aef822bef6320b4a7059ec343/resource-managers/yarn/src/main/scala/org/apache/spark/scheduler/cluster/YarnSchedulerBackend.scala#L128 > I'm just beginning to poke around the kubernetes code, so apologies if this > is incorrect -- but I didn't see any references to > {{scheduler.nodeBlacklist()}} in {{KubernetesClusterSchedulerBackend}} so it > seems this is missing. Thought of this while looking at SPARK-19755, a > similar issue on mesos. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23485) Kubernetes should support node blacklist
[ https://issues.apache.org/jira/browse/SPARK-23485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16373620#comment-16373620 ] Yinan Li commented on SPARK-23485: -- The Kubernetes scheduler backend simply creates executor pods through the Kubernetes API server, and the pods are scheduled by the Kubernetes scheduler to run on the available nodes. The scheduler backend is not interested in, nor should it know about, the mapping from pods to nodes. Affinity and anti-affinity, or taints and tolerations, can be used to influence pod scheduling. But it is the responsibility of the Kubernetes scheduler and the kubelets to keep track of node problems and avoid scheduling pods onto problematic nodes. > Kubernetes should support node blacklist > > > Key: SPARK-23485 > URL: https://issues.apache.org/jira/browse/SPARK-23485 > Project: Spark > Issue Type: New Feature > Components: Kubernetes, Scheduler >Affects Versions: 2.3.0 >Reporter: Imran Rashid >Priority: Major > > Spark's BlacklistTracker maintains a list of "bad nodes" which it will not > use for running tasks (eg., because of bad hardware). When running in yarn, > this blacklist is used to avoid ever allocating resources on blacklisted > nodes: > https://github.com/apache/spark/blob/e836c27ce011ca9aef822bef6320b4a7059ec343/resource-managers/yarn/src/main/scala/org/apache/spark/scheduler/cluster/YarnSchedulerBackend.scala#L128 > I'm just beginning to poke around the kubernetes code, so apologies if this > is incorrect -- but I didn't see any references to > {{scheduler.nodeBlacklist()}} in {{KubernetesClusterSchedulerBackend}} so it > seems this is missing. Thought of this while looking at SPARK-19755, a > similar issue on mesos. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23485) Kubernetes should support node blacklist
[ https://issues.apache.org/jira/browse/SPARK-23485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16373544#comment-16373544 ] Yinan Li commented on SPARK-23485: -- I'm not sure if node blacklisting applies to Kubernetes. In the Kubernetes mode, executors run in containers that in turn run in Kubernetes pods scheduled to run on available cluster nodes by the Kubernetes scheduler. The Kubernetes Spark scheduler backend does not keep track of nor really care about which nodes the pods run on. This is a concern of the Kubernetes scheduler. > Kubernetes should support node blacklist > > > Key: SPARK-23485 > URL: https://issues.apache.org/jira/browse/SPARK-23485 > Project: Spark > Issue Type: New Feature > Components: Kubernetes, Scheduler >Affects Versions: 2.3.0 >Reporter: Imran Rashid >Priority: Major > > Spark's BlacklistTracker maintains a list of "bad nodes" which it will not > use for running tasks (eg., because of bad hardware). When running in yarn, > this blacklist is used to avoid ever allocating resources on blacklisted > nodes: > https://github.com/apache/spark/blob/e836c27ce011ca9aef822bef6320b4a7059ec343/resource-managers/yarn/src/main/scala/org/apache/spark/scheduler/cluster/YarnSchedulerBackend.scala#L128 > I'm just beginning to poke around the kubernetes code, so apologies if this > is incorrect -- but I didn't see any references to > {{scheduler.nodeBlacklist()}} in {{KubernetesClusterSchedulerBackend}} so it > seems this is missing. Thought of this while looking at SPARK-19755, a > similar issue on mesos. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-23285) Allow spark.executor.cores to be fractional
[ https://issues.apache.org/jira/browse/SPARK-23285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16357500#comment-16357500 ] Yinan Li edited comment on SPARK-23285 at 2/8/18 8:22 PM: -- Given the complexity and significant impact of the changes proposed in [https://github.com/apache/spark/pull/20460] to the way Spark handles task scheduling, task parallelism, and dynamic resource allocation, etc., I'm thinking if we should instead introduce a K8s specific configuration property for specifying the executor cores that follows the Kubernetes [convention|https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#meaning-of-cpu]. It seems Mesos fine-grained mode does this with {{spark.mesos.mesosExecutor.cores}}. We can have something like {{spark.kubernetes.executor.cores}} that is only used for specifying the CPU core request for the executor pods. Existing configuration properties {{spark.executor.cores}} and {{spark.task.cpus}} still play their roles in task parallelism, task scheduling, etc. That is, {{spark.kubernetes.executor.cores}} only determines the physical CPU cores available to an executor. An executor can still run multiple tasks simultaneously if {{spark.executor.cores}} is a multiple of {{spark.task.cpus}}. If not set, {{spark.kubernetes.executor.cores}} falls back to {{spark.executor.cores}}. WDYT? [~felixcheung] [~jerryshao] [~jiangxb1987] was (Author: liyinan926): Given the complexity and significant impact of the changes proposed in [https://github.com/apache/spark/pull/20460] to the way Spark handles task scheduling, task parallelism, and dynamic resource allocation, etc., I'm thinking if we should instead introduce a K8s specific configuration property for specifying the executor cores that follows the Kubernetes [convention|https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#meaning-of-cpu]. 
It seems Mesos fine-grained mode does this with {{spark.mesos.mesosExecutor.cores}}. We can have something like {{spark.kubernetes.executor.cores}} that is only used for specifying the CPU core request for the executor pods. Existing configuration properties {{spark.executor.cores}} and {{spark.task.cpus}} still play their roles in task parallelism, task scheduling, etc. That is, {{spark.kubernetes.executor.cores}} only determines the physical CPU cores available to an executor. An executor can still run multiple tasks simultaneously if {{spark.executor.cores}} is a multiple of {{spark.task.cpus}}. If not set, {{spark.kubernetes.executor.cores}} falls back to {{spark.executor.cores}}. WDYT? > Allow spark.executor.cores to be fractional > --- > > Key: SPARK-23285 > URL: https://issues.apache.org/jira/browse/SPARK-23285 > Project: Spark > Issue Type: Improvement > Components: Kubernetes, Scheduler, Spark Submit >Affects Versions: 2.4.0 >Reporter: Anirudh Ramanathan >Priority: Minor > > There is a strong check for an integral number of cores per executor in > [SparkSubmitArguments.scala#L270-L272|https://github.com/apache/spark/blob/3f4060c340d6bac412e8819c4388ccba226efcf3/core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala#L270-L272]. > Given we're reusing that property in K8s, does it make sense to relax it? > > K8s treats CPU as a "compressible resource" and can actually assign millicpus > to individual containers. Also to be noted - spark.driver.cores has no such > check in place. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23285) Allow spark.executor.cores to be fractional
[ https://issues.apache.org/jira/browse/SPARK-23285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16357500#comment-16357500 ] Yinan Li commented on SPARK-23285: -- Given the complexity and significant impact of the changes proposed in [https://github.com/apache/spark/pull/20460] to the way Spark handles task scheduling, task parallelism, and dynamic resource allocation, etc., I'm thinking if we should instead introduce a K8s specific configuration property for specifying the executor cores that follows the Kubernetes [convention|https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#meaning-of-cpu]. It seems Mesos fine-grained mode does this with {{spark.mesos.mesosExecutor.cores}}. We can have something like {{spark.kubernetes.executor.cores}} that is only used for specifying the CPU core request for the executor pods. Existing configuration properties {{spark.executor.cores}} and {{spark.task.cpus}} still play their roles in task parallelism, task scheduling, etc. That is, {{spark.kubernetes.executor.cores}} only determines the physical CPU cores available to an executor. An executor can still run multiple tasks simultaneously if {{spark.executor.cores}} is a multiple of {{spark.task.cpus}}. If not set, {{spark.kubernetes.executor.cores}} falls back to {{spark.executor.cores}}. WDYT? > Allow spark.executor.cores to be fractional > --- > > Key: SPARK-23285 > URL: https://issues.apache.org/jira/browse/SPARK-23285 > Project: Spark > Issue Type: Improvement > Components: Kubernetes, Scheduler, Spark Submit >Affects Versions: 2.4.0 >Reporter: Anirudh Ramanathan >Priority: Minor > > There is a strong check for an integral number of cores per executor in > [SparkSubmitArguments.scala#L270-L272|https://github.com/apache/spark/blob/3f4060c340d6bac412e8819c4388ccba226efcf3/core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala#L270-L272]. 
> Given we're reusing that property in K8s, does it make sense to relax it? > > K8s treats CPU as a "compressible resource" and can actually assign millicpus > to individual containers. Also to be noted - spark.driver.cores has no such > check in place. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
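The fallback semantics proposed in the comment above can be sketched as follows. Note that `spark.kubernetes.executor.cores` is the property being proposed here, not an existing Spark 2.3 setting, and the helper names are hypothetical.

```python
def effective_cpu_request(conf: dict) -> str:
    """CPU request for the executor pod: the proposed K8s-specific
    property wins; otherwise fall back to spark.executor.cores,
    then to Spark's default of 1."""
    return conf.get("spark.kubernetes.executor.cores",
                    conf.get("spark.executor.cores", "1"))

def task_slots(executor_cores: int, task_cpus: int) -> int:
    """Task parallelism stays governed by the existing properties:
    an executor runs floor(spark.executor.cores / spark.task.cpus)
    tasks simultaneously, independent of the pod's CPU request."""
    return executor_cores // task_cpus
```

The design choice is that the pod's physical CPU request and Spark's logical task-scheduling arithmetic are decoupled, which is what avoids the invasive changes of the fractional-cores PR.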
[jira] [Commented] (SPARK-23285) Allow spark.executor.cores to be fractional
[ https://issues.apache.org/jira/browse/SPARK-23285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347484#comment-16347484 ] Yinan Li commented on SPARK-23285: -- Another option is to bypass that check for Kubernetes mode. This minimizes the code changes. Thoughts? > Allow spark.executor.cores to be fractional > --- > > Key: SPARK-23285 > URL: https://issues.apache.org/jira/browse/SPARK-23285 > Project: Spark > Issue Type: Improvement > Components: Kubernetes, Scheduler, Spark Submit >Affects Versions: 2.4.0 >Reporter: Anirudh Ramanathan >Priority: Minor > > There is a strong check for an integral number of cores per executor in > [SparkSubmitArguments.scala#L270-L272|https://github.com/apache/spark/blob/3f4060c340d6bac412e8819c4388ccba226efcf3/core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala#L270-L272]. > Given we're reusing that property in K8s, does it make sense to relax it? > > K8s treats CPU as a "compressible resource" and can actually assign millicpus > to individual containers. Also to be noted - spark.driver.cores has no such > check in place. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
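The Kubernetes millicpu convention that makes fractional cores meaningful can be expressed as a one-line conversion. This is a sketch for illustration; Kubernetes itself also accepts plain decimal strings such as "0.5" for CPU quantities.

```python
def to_millicpu(cores: float) -> str:
    """0.5 -> '500m': Kubernetes CPU quantities can be given in
    thousandths of a core, which is why a fractional
    spark.executor.cores value maps cleanly onto a container request."""
    return f"{round(cores * 1000)}m"
```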
[jira] [Commented] (SPARK-23285) Allow spark.executor.cores to be fractional
[ https://issues.apache.org/jira/browse/SPARK-23285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347267#comment-16347267 ] Yinan Li commented on SPARK-23285: -- FYI: we did this in our fork: https://github.com/apache-spark-on-k8s/spark/pull/361. > Allow spark.executor.cores to be fractional > --- > > Key: SPARK-23285 > URL: https://issues.apache.org/jira/browse/SPARK-23285 > Project: Spark > Issue Type: Improvement > Components: Kubernetes, Scheduler, Spark Submit >Affects Versions: 2.4.0 >Reporter: Anirudh Ramanathan >Priority: Minor > > There is a strong check for an integral number of cores per executor in > [SparkSubmitArguments.scala#L270-L272|https://github.com/apache/spark/blob/3f4060c340d6bac412e8819c4388ccba226efcf3/core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala#L270-L272]. > Given we're reusing that property in K8s, does it make sense to relax it? > > K8s treats CPU as a "compressible resource" and can actually assign millicpus > to individual containers. Also to be noted - spark.driver.cores has no such > check in place. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23257) Implement Kerberos Support in Kubernetes resource manager
[ https://issues.apache.org/jira/browse/SPARK-23257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16345488#comment-16345488 ] Yinan Li commented on SPARK-23257: -- [~RJKeevil] AFAIK, no one is working on upstreaming this yet. However, I think the consensus is that we need to first address https://issues.apache.org/jira/browse/SPARK-22839 before pushing more features upstream. The work in [https://github.com/apache-spark-on-k8s/spark/pull/540] adds more configuration steps to the mix, so probably is not going to be upstreamed until the refactoring is done. > Implement Kerberos Support in Kubernetes resource manager > - > > Key: SPARK-23257 > URL: https://issues.apache.org/jira/browse/SPARK-23257 > Project: Spark > Issue Type: Wish > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Rob Keevil >Priority: Major > > On the forked k8s branch of Spark at > [https://github.com/apache-spark-on-k8s/spark/pull/540] , Kerberos support > has been added to the Kubernetes resource manager. The Kubernetes code > between these two repositories appears to have diverged, so this commit > cannot be merged in easily. Are there any plans to re-implement this work on > the main Spark repository? > > [ifilonenko|https://github.com/ifilonenko] [~liyinan926] I am happy to help > with the development and testing of this, but i wanted to confirm that this > isn't already in progress - I could not find any discussion about this > specific topic online. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23153) Support application dependencies in submission client's local file system
Yinan Li created SPARK-23153: Summary: Support application dependencies in submission client's local file system Key: SPARK-23153 URL: https://issues.apache.org/jira/browse/SPARK-23153 Project: Spark Issue Type: Improvement Components: Kubernetes Affects Versions: 2.4.0 Reporter: Yinan Li -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22962) Kubernetes app fails if local files are used
[ https://issues.apache.org/jira/browse/SPARK-22962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16331132#comment-16331132 ] Yinan Li commented on SPARK-22962: -- I agree that before we upstream the staging server, we should fail the submission if a user uses local resources. [~vanzin], if it's not too late to get into 2.3, I'm gonna file a PR for this. > Kubernetes app fails if local files are used > > > Key: SPARK-22962 > URL: https://issues.apache.org/jira/browse/SPARK-22962 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Priority: Major > > If you try to start a Spark app on kubernetes using a local file as the app > resource, for example, it will fail: > {code} > ./bin/spark-submit [[bunch of arguments]] /path/to/local/file.jar > {code} > {noformat} > + /sbin/tini -s -- /bin/sh -c 'SPARK_CLASSPATH="${SPARK_HOME}/jars/*" && > env | grep SPARK_JAVA_OPT_ | sed '\''s/[^=]*=\(.*\)/\1/g' > \'' > /tmp/java_opts.txt && readarray -t SPARK_DRIVER_JAVA_OPTS < > /tmp/java_opts.txt && if ! [ -z ${SPARK_MOUNTED_CLASSPATH+x} > ]; then SPARK_CLASSPATH="$SPARK_MOUNTED_CLASSPATH:$SPARK_CLASSPATH"; fi && > if ! [ -z ${SPARK_SUBMIT_EXTRA_CLASSPATH+x} ]; then SP > ARK_CLASSPATH="$SPARK_SUBMIT_EXTRA_CLASSPATH:$SPARK_CLASSPATH"; fi && if > ! [ -z ${SPARK_MOUNTED_FILES_DIR+x} ]; then cp -R "$SPARK > _MOUNTED_FILES_DIR/." .; fi && ${JAVA_HOME}/bin/java > "${SPARK_DRIVER_JAVA_OPTS[@]}" -cp "$SPARK_CLASSPATH" -Xms$SPARK_DRIVER_MEMOR > Y -Xmx$SPARK_DRIVER_MEMORY > -Dspark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS $SPARK_DRIVER_CLASS > $SPARK_DRIVER_ARGS' > Error: Could not find or load main class com.cloudera.spark.tests.Sleeper > {noformat} > Using an http server to provide the app jar solves the problem. > The k8s backend should either somehow make these files available to the > cluster or error out with a more user-friendly message if that feature is not > yet available. 
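The submission-time check proposed in the comment above could be sketched as a simple URI-scheme test: fail fast when the app resource lives on the submission client's local filesystem, which the driver pod cannot see. Names here are hypothetical, not Spark's actual validation code.

```java
import java.net.URI;

// Sketch of the proposed check: reject app resources on the submission
// client's local filesystem (no scheme, or an explicit file: scheme),
// since cluster pods cannot read them without a staging mechanism.
public final class ResourceCheck {
    private ResourceCheck() {}

    /** A resource is client-local when it has no scheme or a file: scheme. */
    static boolean isClientLocal(String resource) {
        String scheme = URI.create(resource).getScheme();
        return scheme == null || scheme.equals("file");
    }

    static void validateAppResource(String resource) {
        if (isClientLocal(resource)) {
            throw new IllegalArgumentException(
                "Client-local app resources are not yet supported on Kubernetes; "
                    + "host " + resource + " on an http/hdfs server instead");
        }
    }
}
```

With this in place, `/path/to/local/file.jar` would be rejected at submission time instead of failing inside the driver container.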
[jira] [Commented] (SPARK-23137) spark.kubernetes.executor.podNamePrefix is ignored
[ https://issues.apache.org/jira/browse/SPARK-23137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16329713#comment-16329713 ] Yinan Li commented on SPARK-23137: -- It's actually marked as an {{internal}} config property. So the fix could be either removing it from the docs, or removing the {{internal}} mark and respecting what users set. > spark.kubernetes.executor.podNamePrefix is ignored > -- > > Key: SPARK-23137 > URL: https://issues.apache.org/jira/browse/SPARK-23137 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Anirudh Ramanathan >Priority: Major > > [~liyinan926] is fixing this as we speak. Should be a very minor change. > It's also a non-critical option, so, if we decide that the safer thing is to > just remove it, we can do that as well. Will leave that decision to the > release czar and reviewers. > > [~vanzin] [~felixcheung] [~sameerag] > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22998) Value for SPARK_MOUNTED_CLASSPATH in executor pods is not set
Yinan Li created SPARK-22998: Summary: Value for SPARK_MOUNTED_CLASSPATH in executor pods is not set Key: SPARK-22998 URL: https://issues.apache.org/jira/browse/SPARK-22998 Project: Spark Issue Type: Bug Components: Kubernetes Affects Versions: 2.3.0 Reporter: Yinan Li Fix For: 2.3.0 The environment variable {{SPARK_MOUNTED_CLASSPATH}} is referenced by the executor's Dockerfile, but is not set by the k8s scheduler backend. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22953) Duplicated secret volumes in Spark pods when init-containers are used
Yinan Li created SPARK-22953: Summary: Duplicated secret volumes in Spark pods when init-containers are used Key: SPARK-22953 URL: https://issues.apache.org/jira/browse/SPARK-22953 Project: Spark Issue Type: Bug Components: Kubernetes Affects Versions: 2.3.0 Reporter: Yinan Li Fix For: 2.3.0 User-specified secrets are mounted into both the main container and init-container (when it is used) in a Spark driver/executor pod, using the {{MountSecretsBootstrap}}. Because {{MountSecretsBootstrap}} always adds the secret volumes to the pod, the same secret volumes get added twice, one when mounting the secrets to the main container, and the other when mounting the secrets to the init-container. See https://github.com/apache-spark-on-k8s/spark/issues/594. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
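The fix implied by SPARK-22953 is to make volume addition idempotent: only add a secret volume to the pod spec if a volume with that name is not already present, so mounting the same secrets into both the main container and the init-container does not duplicate pod-level volumes. A plain name list stands in for the Fabric8 pod model in this hypothetical sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of an idempotent volume-add: a second mount of the same secret
// reuses the existing pod-level volume instead of adding a duplicate.
public final class SecretVolumes {
    private SecretVolumes() {}

    static List<String> addVolumeIfAbsent(List<String> volumes, String name) {
        List<String> result = new ArrayList<>(volumes);
        if (!result.contains(name)) {
            result.add(name);
        }
        return result;
    }
}
```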
[jira] [Created] (SPARK-22839) Refactor Kubernetes code for configuring driver/executor pods to use consistent and cleaner abstraction
Yinan Li created SPARK-22839: Summary: Refactor Kubernetes code for configuring driver/executor pods to use consistent and cleaner abstraction Key: SPARK-22839 URL: https://issues.apache.org/jira/browse/SPARK-22839 Project: Spark Issue Type: Improvement Components: Kubernetes Affects Versions: 2.3.0 Reporter: Yinan Li As discussed in https://github.com/apache/spark/pull/19954, the current code for configuring the driver pod and the code for configuring the executor pods are not using the same abstraction. Beyond that, the current code leaves a lot to be desired in terms of the level and cleanness of abstraction. For example, the current code passes many pieces of information around different class hierarchies, which makes code review and maintenance challenging. We need some thorough refactoring of the current code to achieve better, cleaner, and consistent abstraction. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22778) Kubernetes scheduler at master failing to run applications successfully
[ https://issues.apache.org/jira/browse/SPARK-22778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16289907#comment-16289907 ] Yinan Li commented on SPARK-22778: -- Just verified that the fix worked. I'm gonna send a PR soon. > Kubernetes scheduler at master failing to run applications successfully > --- > > Key: SPARK-22778 > URL: https://issues.apache.org/jira/browse/SPARK-22778 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Anirudh Ramanathan >Priority: Critical > > Building images based on master and deploying Spark PI results in the > following error. > 2017-12-13 19:57:19 INFO SparkContext:54 - Successfully stopped SparkContext > Exception in thread "main" org.apache.spark.SparkException: Could not parse > Master URL: 'k8s:https://xx.yy.zz.ww' > at > org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2741) > at org.apache.spark.SparkContext.(SparkContext.scala:496) > at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2490) > at > org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:927) > at > org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:918) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:918) > at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:31) > at org.apache.spark.examples.SparkPi.main(SparkPi.scala) > 2017-12-13 19:57:19 INFO ShutdownHookManager:54 - Shutdown hook called > 2017-12-13 19:57:19 INFO ShutdownHookManager:54 - Deleting directory > /tmp/spark-b47515c2-6750-4a37-aa68-6ee12da5d2bd > This is likely an artifact seen because of changes in master, or our > submission code in the reviews. We haven't seen this on our fork. Hopefully > once integration tests are ported against upstream/master, we will catch > these issues earlier. 
[jira] [Commented] (SPARK-22778) Kubernetes scheduler at master failing to run applications successfully
[ https://issues.apache.org/jira/browse/SPARK-22778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16289876#comment-16289876 ] Yinan Li commented on SPARK-22778: -- Ah, yes, the PR missed that. OK, I'm gonna give that a try and submit a PR to fix it. > Kubernetes scheduler at master failing to run applications successfully > --- > > Key: SPARK-22778 > URL: https://issues.apache.org/jira/browse/SPARK-22778 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Anirudh Ramanathan >Priority: Critical > > Building images based on master and deploying Spark PI results in the > following error. > 2017-12-13 19:57:19 INFO SparkContext:54 - Successfully stopped SparkContext > Exception in thread "main" org.apache.spark.SparkException: Could not parse > Master URL: 'k8s:https://xx.yy.zz.ww' > at > org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2741) > at org.apache.spark.SparkContext.(SparkContext.scala:496) > at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2490) > at > org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:927) > at > org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:918) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:918) > at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:31) > at org.apache.spark.examples.SparkPi.main(SparkPi.scala) > 2017-12-13 19:57:19 INFO ShutdownHookManager:54 - Shutdown hook called > 2017-12-13 19:57:19 INFO ShutdownHookManager:54 - Deleting directory > /tmp/spark-b47515c2-6750-4a37-aa68-6ee12da5d2bd > This is likely an artifact seen because of changes in master, or our > submission code in the reviews. We haven't seen this on our fork. Hopefully > once integration tests are ported against upstream/master, we will catch > these issues earlier. 
[jira] [Comment Edited] (SPARK-22778) Kubernetes scheduler at master failing to run applications successfully
[ https://issues.apache.org/jira/browse/SPARK-22778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16289822#comment-16289822 ] Yinan Li edited comment on SPARK-22778 at 12/13/17 8:24 PM: Just some background on this. The validation and parsing of the k8s master URL has been moved to SparkSubmit, as suggested in the review. The parsed master URL (https://... for example) has a {{k8s}} prefix appended after parsing to satisfy {{KubernetesClusterManager}}, whose {{canCreate}} method checks whether the master URL starts with {{k8s}}. That's why you see the {{k8s:}} prefix. The issue seems to be that in the driver pod {{SparkContext}} could not find {{KubernetesClusterManager}}, based on the debug messages I added. The code that triggered the error (with the debugging I added) is as follows: {code:scala} private def getClusterManager(url: String): Option[ExternalClusterManager] = { val loader = Utils.getContextOrSparkClassLoader val serviceLoaders = ServiceLoader.load(classOf[ExternalClusterManager], loader).asScala serviceLoaders.foreach { loader => logInfo(s"Found the following external cluster manager: $loader") } val filteredServiceLoaders = serviceLoaders.filter(_.canCreate(url)) if (filteredServiceLoaders.size > 1) { throw new SparkException( s"Multiple external cluster managers registered for the url $url: $serviceLoaders") } else if (filteredServiceLoaders.isEmpty) { logWarning(s"No external cluster manager registered for url $url") } filteredServiceLoaders.headOption } {code} And I got the following: {code:java} No external cluster manager registered for url k8s:https://35.226.8.173 {code} was (Author: liyinan926): Just some background on this. The validation and parsing of k8s master url has been moved to SparkSubmit as being suggested in the review. The parsed master URL (https://...
for example) is appended a {{k8s}} prefix after the parsing to satisfy {{KubernetesClusterManager}}, whose {{canCreate}} method is based on if the master URL starts {{k8s}}. That's why you see the {{k8s:}} prefix. The issue seems that in the driver pod {{SparkContext}} could not find {{KubernetesClusterManager}} based on the debug messages I added: {code:scala} private def getClusterManager(url: String): Option[ExternalClusterManager] = { val loader = Utils.getContextOrSparkClassLoader val serviceLoaders = ServiceLoader.load(classOf[ExternalClusterManager], loader).asScala serviceLoaders.foreach { loader => logInfo(s"Found the following external cluster manager: $loader") } val filteredServiceLoaders = serviceLoaders.filter(_.canCreate(url)) if (filteredServiceLoaders.size > 1) { throw new SparkException( s"Multiple external cluster managers registered for the url $url: $serviceLoaders") } else if (filteredServiceLoaders.isEmpty) { logWarning(s"No external cluster manager registered for url $url") } filteredServiceLoaders.headOption } {code} And I got the following: {code:java} No external cluster manager registered for url k8s:https://35.226.8.173 {code} > Kubernetes scheduler at master failing to run applications successfully > --- > > Key: SPARK-22778 > URL: https://issues.apache.org/jira/browse/SPARK-22778 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Anirudh Ramanathan > > Building images based on master and deploying Spark PI results in the > following error. 
> 2017-12-13 19:57:19 INFO SparkContext:54 - Successfully stopped SparkContext > Exception in thread "main" org.apache.spark.SparkException: Could not parse > Master URL: 'k8s:https://xx.yy.zz.ww' > at > org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2741) > at org.apache.spark.SparkContext.(SparkContext.scala:496) > at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2490) > at > org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:927) > at > org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:918) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:918) > at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:31) > at org.apache.spark.examples.SparkPi.main(SparkPi.scala) > 2017-12-13 19:57:19 INFO ShutdownHookManager:54 - Shutdown hook called > 2017-12-13 19:57:19 INFO ShutdownHookManager:54 - Deleting directory > /tmp/spark-b47515c2-6750-4a37-aa68-6ee12da5d2bd > This is likely an artifact seen because of changes in master, or our > submission code in the reviews.
[jira] [Commented] (SPARK-22778) Kubernetes scheduler at master failing to run applications successfully
[ https://issues.apache.org/jira/browse/SPARK-22778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16289822#comment-16289822 ] Yinan Li commented on SPARK-22778: -- Just some background on this. The validation and parsing of the k8s master URL has been moved to SparkSubmit, as suggested in the review. The parsed master URL (https://... for example) has a {{k8s}} prefix appended after parsing to satisfy {{KubernetesClusterManager}}, whose {{canCreate}} method checks whether the master URL starts with {{k8s}}. That's why you see the {{k8s:}} prefix. The issue seems to be that in the driver pod {{SparkContext}} could not find {{KubernetesClusterManager}}, based on the debug messages I added: {code:scala} private def getClusterManager(url: String): Option[ExternalClusterManager] = { val loader = Utils.getContextOrSparkClassLoader val serviceLoaders = ServiceLoader.load(classOf[ExternalClusterManager], loader).asScala serviceLoaders.foreach { loader => logInfo(s"Found the following external cluster manager: $loader") } val filteredServiceLoaders = serviceLoaders.filter(_.canCreate(url)) if (filteredServiceLoaders.size > 1) { throw new SparkException( s"Multiple external cluster managers registered for the url $url: $serviceLoaders") } else if (filteredServiceLoaders.isEmpty) { logWarning(s"No external cluster manager registered for url $url") } filteredServiceLoaders.headOption } {code} And I got the following: {code:java} No external cluster manager registered for url k8s:https://35.226.8.173 {code} > Kubernetes scheduler at master failing to run applications successfully > --- > > Key: SPARK-22778 > URL: https://issues.apache.org/jira/browse/SPARK-22778 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Anirudh Ramanathan > > Building images based on master and deploying Spark PI results in the > following error. 
> 2017-12-13 19:57:19 INFO SparkContext:54 - Successfully stopped SparkContext > Exception in thread "main" org.apache.spark.SparkException: Could not parse > Master URL: 'k8s:https://xx.yy.zz.ww' > at > org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2741) > at org.apache.spark.SparkContext.(SparkContext.scala:496) > at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2490) > at > org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:927) > at > org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:918) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:918) > at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:31) > at org.apache.spark.examples.SparkPi.main(SparkPi.scala) > 2017-12-13 19:57:19 INFO ShutdownHookManager:54 - Shutdown hook called > 2017-12-13 19:57:19 INFO ShutdownHookManager:54 - Deleting directory > /tmp/spark-b47515c2-6750-4a37-aa68-6ee12da5d2bd > This is likely an artifact seen because of changes in master, or our > submission code in the reviews. We haven't seen this on our fork. Hopefully > once integration tests are ported against upstream/master, we will catch > these issues earlier. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
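The `getClusterManager` logic quoted in the comments above can be sketched without Spark's classes: a (name, canCreate) pair stands in for `ExternalClusterManager` and a plain list stands in for the `ServiceLoader` scan. Exactly one manager must accept the master URL; zero matches reproduces the "no external cluster manager registered" warning from the log. This is an illustrative reduction, not Spark's actual code.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Minimal model of SparkContext's cluster-manager lookup: filter registered
// managers by whether they accept the master URL, fail on ambiguity, and
// return empty (caller logs a warning) when nothing matches.
public final class ManagerLookup {
    record Manager(String name, Predicate<String> canCreate) {}

    static Optional<Manager> getClusterManager(List<Manager> managers, String url) {
        List<Manager> matches = managers.stream()
            .filter(m -> m.canCreate().test(url))
            .collect(Collectors.toList());
        if (matches.size() > 1) {
            throw new IllegalStateException(
                "Multiple external cluster managers registered for the url " + url);
        }
        // Empty result corresponds to "No external cluster manager registered".
        return matches.stream().findFirst();
    }
}
```

A manager that only accepts a `k8s://` prefix, handed the malformed `k8s:https://...` URL from the stack trace, returns empty, which is exactly the failure mode debugged in this thread.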
[jira] [Updated] (SPARK-18278) SPIP: Support native submission of spark jobs to a kubernetes cluster
[ https://issues.apache.org/jira/browse/SPARK-18278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yinan Li updated SPARK-18278: - Component/s: Kubernetes > SPIP: Support native submission of spark jobs to a kubernetes cluster > - > > Key: SPARK-18278 > URL: https://issues.apache.org/jira/browse/SPARK-18278 > Project: Spark > Issue Type: Umbrella > Components: Build, Deploy, Documentation, Kubernetes, Scheduler, > Spark Core >Affects Versions: 2.3.0 >Reporter: Erik Erlandson > Labels: SPIP > Attachments: SPARK-18278 Spark on Kubernetes Design Proposal Revision > 2 (1).pdf > > > A new Apache Spark sub-project that enables native support for submitting > Spark applications to a kubernetes cluster. The submitted application runs > in a driver executing on a kubernetes pod, and executors lifecycles are also > managed as pods. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22757) Init-container in the driver/executor pods for downloading remote dependencies
[ https://issues.apache.org/jira/browse/SPARK-22757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16289378#comment-16289378 ] Yinan Li commented on SPARK-22757: -- Yes, this is also targeting 2.3. > Init-container in the driver/executor pods for downloading remote dependencies > -- > > Key: SPARK-22757 > URL: https://issues.apache.org/jira/browse/SPARK-22757 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Yinan Li > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22757) Init-container in the driver/executor pods for downloading remote dependencies
[ https://issues.apache.org/jira/browse/SPARK-22757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yinan Li updated SPARK-22757: - Component/s: Kubernetes > Init-container in the driver/executor pods for downloading remote dependencies > -- > > Key: SPARK-22757 > URL: https://issues.apache.org/jira/browse/SPARK-22757 > Project: Spark > Issue Type: Sub-task > Components: Deploy, Kubernetes >Affects Versions: 2.3.0 >Reporter: Yinan Li > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org