Re: k8s orchestrating Spark service
> We’d like to deploy Spark Workers/Executors and Master (whatever master is
> easiest to talk about since we really don’t care) in pods as we do with the
> other services we use. Replace Spark Master with k8s if you insist. How do
> the executors get deployed?

When running Spark against Kubernetes natively, the Spark library handles requesting executors from the API server. So presumably one would only need to know how to start the driver in the cluster – maybe spark-operator, spark-submit, or just starting the pod and creating a Spark context in client mode with the right parameters. From there, the Spark scheduler code knows how to interface with the API server and request executor pods according to the resource requests configured in the app.

> We have a machine learning server. It submits various jobs through the Spark
> Scala API. The server is run in a pod deployed from a chart by k8s. It later
> uses the Spark API to submit jobs. I guess we find spark-submit to be a
> roadblock to our use of Spark and the k8s support is fine, but how do you run
> our Driver and Executors considering that the Driver is part of the Server
> process?

It depends on how the server runs the jobs:

If each job is meant to be a separate forked driver pod / process: the ML server code can use the SparkLauncher API and configure the Spark driver through that API. Set the master to point to the Kubernetes API server, and set the parameters for credentials according to your setup. SparkLauncher is a thin layer on top of spark-submit; a Spark distribution has to be packaged with the ML server image, and SparkLauncher would point to the spark-submit script in that distribution.

If all jobs run inside the same driver, that being the ML server: one has to start the ML server with the right parameters to point to the Kubernetes master. Since the ML server is a driver, one has the option to use spark-submit or SparkLauncher to deploy the ML server itself. Alternatively, one can use a custom script to start the ML server; in that case the ML server process has to create a SparkContext object parameterized against the Kubernetes server in question.

I hope this helps!

-Matt Cheah

From: Pat Ferrel
Date: Monday, July 1, 2019 at 5:05 PM
To: "user@spark.apache.org", Matt Cheah
Subject: Re: k8s orchestrating Spark service

We have a machine learning server. It submits various jobs through the Spark Scala API. The server is run in a pod deployed from a chart by k8s. It later uses the Spark API to submit jobs. I guess we find spark-submit to be a roadblock to our use of Spark and the k8s support is fine, but how do you run our Driver and Executors considering that the Driver is part of the Server process? Maybe we are talking past each other with some mistaken assumptions (on my part perhaps).

From: Pat Ferrel
Reply: Pat Ferrel
Date: July 1, 2019 at 4:57:20 PM
To: user@spark.apache.org, Matt Cheah
Subject: Re: k8s orchestrating Spark service

k8s as master would be nice, but it doesn’t solve the problem of running the full cluster and is an orthogonal issue. We’d like to deploy Spark Workers/Executors and Master (whatever master is easiest to talk about since we really don’t care) in pods as we do with the other services we use. Replace Spark Master with k8s if you insist. How do the executors get deployed?

We have our own containers that almost work for 2.3.3. We have used this before with older Spark, so we are reasonably sure it makes sense. We just wonder if our own image builds and charts are the best starting point.

Does anyone have something they like?

From: Matt Cheah
Reply: Matt Cheah
Date: July 1, 2019 at 4:45:55 PM
To: Pat Ferrel, user@spark.apache.org
Subject: Re: k8s orchestrating Spark service

Sorry, I don’t quite follow – why use the Spark standalone cluster as an in-between layer when one can just deploy the Spark application directly inside the Helm chart? I’m curious as to what the use case is, since I’m wondering if there’s something we can improve with respect to the native integration with Kubernetes here. Deploying on Spark standalone mode in Kubernetes is, to my understanding, meant to be superseded by the native integration introduced in Spark 2.4.

From: Pat Ferrel
Date: Monday, July 1, 2019 at 4:40 PM
To: "user@spark.apache.org", Matt Cheah
Subject: Re: k8s orchestrating Spark service

Thanks Matt,

Actually I can’t use spark-submit. We submit the Driver programmatically through the API. But this is not the issue, and using k8s as the master is also not the issue – though you may be right about it being easier, it doesn’t quite get to the heart. We want to orchestrate a bunch of services including Spark. The rest work; we are asking if anyone has seen a good starting point for adding Spark as a k8s managed service.

From: Matt Cheah
Reply: Ma
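For concreteness, a minimal sketch of the SparkLauncher approach described at the top of this thread. The Spark home, image name, job class, and jar path below are illustrative assumptions, not values from this thread:

    import org.apache.spark.launcher.SparkLauncher

    // Launches each job as a separate driver pod from inside the ML server process.
    // Assumes a Spark distribution is baked into the server image at /opt/spark.
    val handle = new SparkLauncher()
      .setSparkHome("/opt/spark")                        // assumed install path in the image
      .setMaster("k8s://https://kubernetes.default.svc") // in-cluster API server address
      .setDeployMode("cluster")
      .setAppResource("local:///opt/app/jobs.jar")       // hypothetical job jar in the image
      .setMainClass("com.example.MyJob")                 // hypothetical job entry point
      .setConf("spark.executor.instances", "2")
      .setConf("spark.kubernetes.container.image", "example/spark:2.4.3") // placeholder image
      .setConf("spark.kubernetes.authenticate.driver.serviceAccountName", "spark")
      .startApplication() // returns a SparkAppHandle for monitoring the job

startApplication() hands back a SparkAppHandle, which is what makes this approach friendlier for a long-running server than shelling out to spark-submit directly.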
Re: k8s orchestrating Spark service
Sorry, I don’t quite follow – why use the Spark standalone cluster as an in-between layer when one can just deploy the Spark application directly inside the Helm chart? I’m curious as to what the use case is, since I’m wondering if there’s something we can improve with respect to the native integration with Kubernetes here. Deploying on Spark standalone mode in Kubernetes is, to my understanding, meant to be superseded by the native integration introduced in Spark 2.4.

From: Pat Ferrel
Date: Monday, July 1, 2019 at 4:40 PM
To: "user@spark.apache.org", Matt Cheah
Subject: Re: k8s orchestrating Spark service

Thanks Matt,

Actually I can’t use spark-submit. We submit the Driver programmatically through the API. But this is not the issue, and using k8s as the master is also not the issue – though you may be right about it being easier, it doesn’t quite get to the heart. We want to orchestrate a bunch of services including Spark. The rest work; we are asking if anyone has seen a good starting point for adding Spark as a k8s managed service.

From: Matt Cheah
Reply: Matt Cheah
Date: July 1, 2019 at 3:26:20 PM
To: Pat Ferrel, user@spark.apache.org
Subject: Re: k8s orchestrating Spark service

I would recommend looking into Spark’s native support for running on Kubernetes. One can just start the application against Kubernetes directly, using spark-submit in cluster mode or starting the Spark context with the right parameters in client mode. See https://spark.apache.org/docs/latest/running-on-kubernetes.html

I would think that building Helm around this architecture of running Spark applications would be easier than running a Spark standalone cluster. But admittedly I’m not very familiar with the Helm technology – we just use spark-submit.

-Matt Cheah

From: Pat Ferrel
Date: Sunday, June 30, 2019 at 12:55 PM
To: "user@spark.apache.org"
Subject: k8s orchestrating Spark service

We're trying to set up a system that includes Spark. The rest of the services have good Docker containers and Helm charts to start from. Spark, on the other hand, is proving difficult. We forked a container and have tried to create our own chart, but are having several problems with this. So back to the community…

Can anyone recommend a Docker container + Helm chart for use with Kubernetes to orchestrate:
- a Spark standalone Master
- several Spark Workers/Executors

This is not a request to use k8s to orchestrate Spark jobs, but the service cluster itself.

Thanks
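For reference, a minimal sketch of the cluster-mode submission recommended in this thread, in the same form as the linked docs; the API server address, image name, and jar path are placeholders:

    bin/spark-submit \
      --master k8s://https://<api-server-host>:<port> \
      --deploy-mode cluster \
      --name spark-pi \
      --class org.apache.spark.examples.SparkPi \
      --conf spark.executor.instances=2 \
      --conf spark.kubernetes.container.image=<your-spark-image> \
      local:///opt/spark/examples/jars/spark-examples_2.11-2.4.0.jar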
Re: k8s orchestrating Spark service
I would recommend looking into Spark’s native support for running on Kubernetes. One can just start the application against Kubernetes directly, using spark-submit in cluster mode or starting the Spark context with the right parameters in client mode. See https://spark.apache.org/docs/latest/running-on-kubernetes.html

I would think that building Helm around this architecture of running Spark applications would be easier than running a Spark standalone cluster. But admittedly I’m not very familiar with the Helm technology – we just use spark-submit.

-Matt Cheah

From: Pat Ferrel
Date: Sunday, June 30, 2019 at 12:55 PM
To: "user@spark.apache.org"
Subject: k8s orchestrating Spark service

We're trying to set up a system that includes Spark. The rest of the services have good Docker containers and Helm charts to start from. Spark, on the other hand, is proving difficult. We forked a container and have tried to create our own chart, but are having several problems with this. So back to the community…

Can anyone recommend a Docker container + Helm chart for use with Kubernetes to orchestrate:
- a Spark standalone Master
- several Spark Workers/Executors

This is not a request to use k8s to orchestrate Spark jobs, but the service cluster itself.

Thanks
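And a sketch of the client-mode alternative mentioned above, where an existing process (such as the poster's server pod) is the driver and only executors are requested from the API server. The image, service name, and port are assumptions; executors must be able to reach the driver, typically via a headless service pointing at the driver pod:

    import org.apache.spark.sql.SparkSession

    // Client mode: this process is the driver; Spark requests executor pods itself.
    val spark = SparkSession.builder()
      .master("k8s://https://kubernetes.default.svc")
      .appName("driver-in-pod")
      .config("spark.kubernetes.container.image", "example/spark:2.4.3") // placeholder image
      .config("spark.driver.host", "my-driver-svc.default.svc")          // assumed headless service
      .config("spark.driver.port", "7078")                               // arbitrary open port
      .config("spark.executor.instances", "2")
      .getOrCreate()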
[PSA] Sharing our Experiences With Kubernetes
Hi everyone,

I would like to share the experiences my organization has had with deploying Kubernetes and migrating our Spark applications over to Kubernetes from YARN. We are publishing a series of blog posts that describe what we have learned and what we have built.

Our introduction post is here: https://medium.com/palantir/introducing-rubix-kubernetes-at-palantir-ab0ce16ea42e

We then discuss our custom scheduling work here: https://medium.com/palantir/spark-scheduling-in-kubernetes-4976333235f3

Stay tuned for more posts in this space! I hope that this is helpful for you.

-Matt Cheah
Re: Spark on k8s - map persistentStorage for data spilling
Ah, I see: we always force the local directory to use emptyDir, and it cannot be configured to use any other volume type. See here.

I am a bit conflicted on this. On one hand, it makes sense to allow users to mount their own volumes to handle spill data. On the other hand, I get the impression that emptyDir is the right kind of volume for this in the majority of cases – emptyDir is meant to be used for temporary storage and is meant to be fast, which keeps workloads like Spark performant. Finally, a significant benefit of emptyDir is that Kubernetes will handle the cleanup of the directory for you if the pod exits – if you use a persistent volume claim, you will need to ensure the files are cleaned up in the case that the pod exits abruptly.

I wonder if your organization can consider modifying your Kubernetes setup to make your emptyDir volumes larger and faster?

-Matt Cheah

From: Tomasz Krol
Date: Friday, March 1, 2019 at 10:53 AM
To: Matt Cheah
Cc: "user@spark.apache.org"
Subject: Re: Spark on k8s - map persistentStorage for data spilling

Hi Matt,

Thanks for coming back to me. Yeah, that doesn't work. Basically, in the properties I set the volume and mount point as below:

spark.kubernetes.driver.volumes.persistentVolumeClaim.checkvolume.mount.path=/checkpoint
spark.kubernetes.driver.volumes.persistentVolumeClaim.checkvolume.mount.readOnly=false
spark.kubernetes.driver.volumes.persistentVolumeClaim.checkvolume.mount.claimName=sparkstorage
spark.kubernetes.executor.volumes.persistentVolumeClaim.checkvolume.mount.path=/checkpoint
spark.kubernetes.executor.volumes.persistentVolumeClaim.checkvolume.mount.readOnly=false
spark.kubernetes.executor.volumes.persistentVolumeClaim.checkvolume.mount.claimName=sparkstorage

That works as expected, and the PVC is mounted in the driver and executor pods on the /checkpoint directory. As you suggested, the first thing I tried was setting spark.local.dir (or the env var SPARK_LOCAL_DIRS) to the /checkpoint directory, expecting that Spark would then spill to my PVC. However, this throws the following error:

"spark-kube-driver" is invalid: spec.containers[0].volumeMounts[3].mountPath: Invalid value: "/checkpoint": must be unique

It seems like Spark is trying to mount an emptyDir at the mount point /checkpoint, but it can't, because /checkpoint is the directory where the PVC is already mounted. At the moment it looks to me like an emptyDir is always used for spilling data; the question is how to back it with the PVC, unless I'm missing something here. I can't really run any bigger jobs at the moment because of that.

Appreciate any feedback :)

Thanks

Tom

On Thu, 28 Feb 2019 at 17:23, Matt Cheah wrote:

I think we want to change the value of spark.local.dir to point to where your PVC is mounted. Can you give that a try and let us know if that moves the spills as expected?

-Matt Cheah

From: Tomasz Krol
Date: Wednesday, February 27, 2019 at 3:41 AM
To: "user@spark.apache.org"
Subject: Spark on k8s - map persistentStorage for data spilling

Hey guys,

I hope someone will be able to help me, as I've been stuck on this for a while :)

Basically, I am running some jobs on Kubernetes per the documentation: https://spark.apache.org/docs/latest/running-on-kubernetes.html

All works fine; however, if I run queries on a bigger data volume, the jobs fail because there is not enough space in the /var/data/spark-1xxx directory. Obviously, the reason for this is that the mounted emptyDir doesn't have enough space.

I also mounted a PVC to the driver and executor pods, which I can see during the runtime. I am wondering if someone knows how to make the data spill to a different directory (i.e. my persistent storage directory) instead of the emptyDir with limited space. Or, can I back the emptyDir somehow with my PVC? Basically, at the moment I can't run any jobs, as they are failing due to insufficient space in that /var/data directory.

Thanks

--
Tomasz Krol
patric...@gmail.com
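For readers finding this thread later: newer Spark releases (3.1+) added support for backing the local scratch space with a mounted volume instead of an emptyDir, by giving the volume a name with the spark-local-dir- prefix. A sketch, assuming a claim named sparkstorage as above; check the running-on-kubernetes docs for your release before relying on the exact keys:

spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.path=/spill
spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.readOnly=false
spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.claimName=sparkstorage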
Re: Spark on k8s - map persistentStorage for data spilling
I think we want to change the value of spark.local.dir to point to where your PVC is mounted. Can you give that a try and let us know if that moves the spills as expected?

-Matt Cheah

From: Tomasz Krol
Date: Wednesday, February 27, 2019 at 3:41 AM
To: "user@spark.apache.org"
Subject: Spark on k8s - map persistentStorage for data spilling

Hey guys,

I hope someone will be able to help me, as I've been stuck on this for a while :)

Basically, I am running some jobs on Kubernetes per the documentation: https://spark.apache.org/docs/latest/running-on-kubernetes.html

All works fine; however, if I run queries on a bigger data volume, the jobs fail because there is not enough space in the /var/data/spark-1xxx directory. Obviously, the reason for this is that the mounted emptyDir doesn't have enough space.

I also mounted a PVC to the driver and executor pods, which I can see during the runtime. I am wondering if someone knows how to make the data spill to a different directory (i.e. my persistent storage directory) instead of the emptyDir with limited space. Or, can I back the emptyDir somehow with my PVC? Basically, at the moment I can't run any jobs, as they are failing due to insufficient space in that /var/data directory.

Thanks

--
Tomasz Krol
patric...@gmail.com
Re: Problem running Spark on Kubernetes: Certificate error
Hi Steven,

What I think is happening is that your machine has a CA certificate that is used for communicating with your API server, particularly because you’re using Digital Ocean’s cluster manager. However, it’s unclear if your pod has the same CA certificate, or if the pod needs that certificate file.

You can use configuration to have your pod use a particular CA certificate file to communicate with the API server. If you set spark.kubernetes.authenticate.driver.caCertFile to the path of your CA certificate on your local disk, spark-submit will create a secret that contains that certificate file and use that certificate to configure TLS for the driver pod’s communication with the API server.

It's clear that your driver pod doesn’t have the right TLS certificate to communicate with the API server, so I would try to introspect the driver pod and see what certificate it’s using for that communication. If there’s a fix that needs to happen in Spark, feel free to indicate as such.

-Matt Cheah

From: Steven Stetzler
Date: Thursday, December 13, 2018 at 1:49 PM
To: "user@spark.apache.org"
Subject: Problem running Spark on Kubernetes: Certificate error

Hello,

I am following the tutorial here (https://spark.apache.org/docs/latest/running-on-kubernetes.html) to get Spark running on a Kubernetes cluster. My Kubernetes cluster is hosted with Digital Ocean's Kubernetes cluster manager. I have changed the KUBECONFIG environment variable to point to my cluster access credentials, so both Spark and kubectl can speak with the nodes.

I am running into an issue when trying to run the SparkPi example as described in the Spark on Kubernetes tutorials. The command I am running is:

./bin/spark-submit --master k8s://$CLUSTERIP --deploy-mode cluster --name spark-pi --class org.apache.spark.examples.SparkPi --conf spark.executor.instances=1 --conf spark.kubernetes.container.image=$IMAGEURL --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark local:///opt/spark/examples/jars/spark-examples_2.11-2.4.0.jar

where CLUSTERIP contains the IP of my cluster and IMAGEURL contains the URL of the Spark Docker image I am using (https://hub.docker.com/r/stevenstetzler/spark/). This Docker image was built and pushed with the script included in the Spark 2.4 distribution.

I have created a service account for Spark to ensure that it has proper permissions to create pods etc., which I checked using:

kubectl auth can-i create pods --as=system:serviceaccount:default:spark

When I try to run the SparkPi example using the above command, I get the following output:

2018-12-12 06:26:15 WARN Utils:66 - Your hostname, docker-test resolves to a loopback address: 127.0.1.1; using 10.46.0.6 instead (on interface eth0)
2018-12-12 06:26:15 WARN Utils:66 - Set SPARK_LOCAL_IP if you need to bind to another address
2018-12-12 06:26:19 INFO LoggingPodStatusWatcherImpl:54 - State changed, new state:
     pod name: spark-pi-1544595975520-driver
     namespace: default
     labels: spark-app-selector -> spark-ec5eb54644d348e7a213f8178b8ef61f, spark-role -> driver
     pod uid: d5d6bdc7-fdd6-11e8-b666-8e815d3815b2
     creation time: 2018-12-12T06:26:18Z
     service account name: spark
     volumes: spark-local-dir-1, spark-conf-volume, spark-token-qf9dt
     node name: N/A
     start time: N/A
     container images: N/A
     phase: Pending
     status: []
2018-12-12 06:26:19 INFO LoggingPodStatusWatcherImpl:54 - State changed, new state:
     pod name: spark-pi-1544595975520-driver
     namespace: default
     labels: spark-app-selector -> spark-ec5eb54644d348e7a213f8178b8ef61f, spark-role -> driver
     pod uid: d5d6bdc7-fdd6-11e8-b666-8e815d3815b2
     creation time: 2018-12-12T06:26:18Z
     service account name: spark
     volumes: spark-local-dir-1, spark-conf-volume, spark-token-qf9dt
     node name: flamboyant-darwin-3rhc
     start time: N/A
     container images: N/A
     phase: Pending
     status: []
2018-12-12 06:26:19 INFO LoggingPodStatusWatcherImpl:54 - State changed, new state:
     pod name: spark-pi-1544595975520-driver
     namespace: default
     labels: spark-app-selector -> spark-ec5eb54644d348e7a213f8178b8ef61f, spark-role -> driver
     pod uid: d5d6bdc7-fdd6-11e8-b666-8e815d3815b2
     creation time: 2018-12-12T06:26:18Z
     service account name: spark
     volumes: spark-local-dir-1, spark-conf-volume, spark-token-qf9dt
     node name: flamboyant-darwin-3rhc
     start time: 2018-12-12T06:26:18Z
     container images: docker.io/stevenstetzler/spark:v1
     phase: Pending
     status: [ContainerStatus(containerID=null, image=docker.io/stevenstetzler/spark:v1, imageID=, lastState
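A sketch of the configuration Matt describes, added to the submission above; the certificate path is a placeholder for wherever your cluster's CA bundle lives on the submitting machine (with Digital Ocean it can typically be extracted from the downloaded kubeconfig):

    ./bin/spark-submit \
      --master k8s://$CLUSTERIP \
      --deploy-mode cluster \
      --name spark-pi \
      --class org.apache.spark.examples.SparkPi \
      --conf spark.executor.instances=1 \
      --conf spark.kubernetes.container.image=$IMAGEURL \
      --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
      --conf spark.kubernetes.authenticate.driver.caCertFile=/path/to/cluster-ca.crt \
      local:///opt/spark/examples/jars/spark-examples_2.11-2.4.0.jar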
Re: External shuffle service on K8S
Hi there,

Please see https://issues.apache.org/jira/browse/SPARK-25299 for more discussion around this matter.

-Matt Cheah

From: Li Gao
Date: Friday, October 26, 2018 at 9:10 AM
To: "vincent.gromakow...@gmail.com"
Cc: "caolijun1...@gmail.com", "user@spark.apache.org"
Subject: Re: External shuffle service on K8S

There is an existing 2.2-based external shuffle service on the fork: https://apache-spark-on-k8s.github.io/userdocs/running-on-kubernetes.html

You can modify it to suit your needs.

-Li

On Fri, Oct 26, 2018 at 3:22 AM vincent gromakowski wrote:

No, it's on the roadmap for >2.4.

On Fri, 26 Oct 2018 at 11:15, 曹礼俊 wrote:

Hi all:

Does Spark 2.3.2 support an external shuffle service on Kubernetes? I have looked through the documentation (https://spark.apache.org/docs/latest/running-on-kubernetes.html) but couldn't find related suggestions.

If it is supported, how can I enable it?

Best Regards

Lijun Cao
Re: [Spark for kubernetes] Azure Blob Storage credentials issue
Hi there,

Can you check whether HADOOP_CONF_DIR is being set on the executors to /opt/spark/conf? One should set an executor environment variable for that. A kubectl describe pod output for the executors would be helpful here.

-Matt Cheah

From: Oscar Bonilla
Date: Friday, October 19, 2018 at 1:03 AM
To: "user@spark.apache.org"
Subject: [Spark for kubernetes] Azure Blob Storage credentials issue

Hello,

I'm having the following issue while trying to run Spark for Kubernetes:

2018-10-18 08:48:54 INFO DAGScheduler:54 - Job 0 failed: reduce at SparkPi.scala:38, took 1.743177 s
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 6, 10.244.1.11, executor 2): org.apache.hadoop.fs.azure.AzureException: org.apache.hadoop.fs.azure.AzureException: No credentials found for account datasets83d858296fd0c49b.blob.core.windows.net in the configuration, and its container datasets is not accessible using anonymous credentials. Please check if the container exists first. If it is not publicly available, you have to provide account credentials.
    at org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.createAzureStorageSession(AzureNativeFileSystemStore.java:1086)
    at org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.initialize(AzureNativeFileSystemStore.java:538)
    at org.apache.hadoop.fs.azure.NativeAzureFileSystem.initialize(NativeAzureFileSystem.java:1366)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3242)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:121)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3291)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3259)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:470)
    at org.apache.spark.util.Utils$.getHadoopFileSystem(Utils.scala:1897)
    at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:694)
    at org.apache.spark.util.Utils$.fetchFile(Utils.scala:476)
    at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:755)
    at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:747)
    at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
    at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
    at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
    at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
    at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
    at scala.collection.mutable.HashMap.foreach(HashMap.scala:99)
    at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
    at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:747)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:312)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.hadoop.fs.azure.AzureException: No credentials found for account datasets83d858296fd0c49b.blob.core.windows.net in the configuration, and its container datasets is not accessible using anonymous credentials. Please check if the container exists first. If it is not publicly available, you have to provide account credentials.
    at org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.connectUsingAnonymousCredentials(AzureNativeFileSystemStore.java:863)
    at org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.createAzureStorageSession(AzureNativeFileSystemStore.java:1081)
    ... 24 more

The command I use to launch the job is:

/opt/spark/bin/spark-submit --master k8s://<api-server> --deploy-mode cluster --name spark-pi --class org.apache.spark.examples.SparkPi --conf spark.executor.instances=5 --conf spark.kubernetes.container.image=<image> --conf spark.kubernetes.namespace=<namespace> --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark --conf spark.kubernetes.driver.secrets.spark=/opt/spark/conf --conf spark.kubernetes.executor.secrets.spark=/opt/spark/conf wasb://<container>@<account>.blob.core.windows.net/spark-examples_2.11-2.3.2.jar 1

I have a k8s secret named spark with the following content:

apiVersion: v1
kind: Secret
metadata:
  name: spark
  labels:
    app: spark
    stack: service
type: Opaque
data:
  core-site.xml: |-
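A sketch of the executor environment variable Matt mentions, added to the submission confs above (assuming the Hadoop config secret is mounted at /opt/spark/conf, as in the command shown):

    --conf spark.executorEnv.HADOOP_CONF_DIR=/opt/spark/conf \
    --conf spark.kubernetes.driverEnv.HADOOP_CONF_DIR=/opt/spark/conf \

One can then verify the variable landed with kubectl describe pod against a running executor pod and checking its Environment section.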
Re: Spark on Kubernetes: Kubernetes killing executors because of overallocation of memory
Hi there,

You may want to look at setting the memory overhead settings higher. Spark will then start containers with a higher memory limit (spark.executor.memory + spark.executor.memoryOverhead, to be exact) while the heap is still locked to spark.executor.memory. There’s some memory used by off-heap storage from Spark that won’t be accounted for in just the heap size.

Hope this helps,

-Matt Cheah

From: Jayesh Lalwani
Date: Thursday, August 2, 2018 at 12:35 PM
To: "user@spark.apache.org"
Subject: Spark on Kubernetes: Kubernetes killing executors because of overallocation of memory

We are running Spark 2.3 on a Kubernetes cluster. We have set the following Spark configuration options:

"spark.executor.memory": "7g",
"spark.driver.memory": "2g",
"spark.memory.fraction": "0.75"

What we see is:
a) In the Spark UI, 5G has been allocated to each executor, which makes sense because we set spark.memory.fraction=0.75
b) Kubernetes reports the pod memory usage as 7.6G

When we run a lot of jobs on the Kubernetes cluster, Kubernetes starts killing the executor pods because it thinks the pods are misbehaving. We logged into a running pod and ran the top command, and most of the 7.6G is being allocated to the executor's Java process.

Why is Spark taking 7.6G instead of 7G? Where is the 600MB being allocated? Is there some configuration that controls how much of the executor memory gets allocated to Permgen vs. the memory that gets allocated to the heap?
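The 600MB in the question lines up with the default overhead: the per-executor overhead defaults to max(384MB, 10% of spark.executor.memory), so a 7g executor picks up roughly 716MB of overhead and the container is sized at about 7.7G, which matches the ~7.6G Kubernetes reports. A sketch of raising it explicitly; the 2g value is an arbitrary example:

    # Pod memory = spark.executor.memory + spark.executor.memoryOverhead.
    # Default overhead here would be max(384m, 0.10 * 7g) ~= 716m.
    --conf spark.executor.memory=7g \
    --conf spark.executor.memoryOverhead=2g \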
Re: Structured Streaming on Kubernetes
We don’t provide any Kubernetes-specific mechanisms for streaming, such as checkpointing to persistent volumes. But as long as streaming doesn’t require persisting to the executor’s local disk, streaming ought to work out of the box. E.g. you can checkpoint to HDFS, but not to the pod’s local directories. However, I’m unaware of any specific use of streaming with the Spark on Kubernetes integration right now. Would be curious to get feedback on the failover behavior.

-Matt Cheah

From: Tathagata Das <t...@databricks.com>
Date: Friday, April 13, 2018 at 1:27 AM
To: Krishna Kalyan <krishnakaly...@gmail.com>
Cc: user <user@spark.apache.org>
Subject: Re: Structured Streaming on Kubernetes

Structured Streaming is stable in production! At Databricks, we and our customers collectively process almost hundreds of billions of records per day using SS. However, we are not using Kubernetes :)

Though I don't think it will matter too much, as long as the kubes are correctly provisioned + configured and you are checkpointing to HDFS (for fault-tolerance guarantees).

TD

On Fri, Apr 13, 2018, 12:28 AM Krishna Kalyan <krishnakaly...@gmail.com> wrote:

Hello All,

We were evaluating Spark Structured Streaming on Kubernetes (running on GCP). It would be awesome if the Spark community could share their experience around this. I would like to know more about your production experience and the monitoring tools you are using.

Since Spark on Kubernetes is a relatively new addition to Spark, I was wondering if Structured Streaming is stable in production. We were also evaluating Apache Beam with Flink.

Regards,
Krishna
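A minimal sketch of the checkpointing constraint discussed in this thread: the query checkpoints to HDFS rather than the pod's local disk. The source, sink format, and paths are placeholder assumptions:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("streaming-on-k8s").getOrCreate()

    // A toy rate source; any streaming source behaves the same way here.
    val events = spark.readStream.format("rate").load()

    // The checkpoint must live on durable shared storage (HDFS here), not in the
    // pod's local directories, so state survives executor/driver pod restarts.
    val query = events.writeStream
      .format("parquet")
      .option("path", "hdfs://namenode:8020/data/events")           // assumed sink path
      .option("checkpointLocation", "hdfs://namenode:8020/chk/app") // assumed checkpoint path
      .start()

    query.awaitTermination()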
Re: Spark on K8s resource staging server timeout
Hello Jenna,

Are there any logs from the resource staging server pod? They might show something interesting.

Unfortunately, we haven’t been maintaining the resource staging server, because we’ve moved all of our effort to the main repository instead of the fork. When we consider the submission of local files in the official release, we should probably create a mechanism that’s more resilient. Using a single HTTP server isn’t ideal – we would ideally like something that’s highly available, replicated, etc.

-Matt Cheah

From: Jenna Hoole <jenna.ho...@gmail.com>
Date: Thursday, March 29, 2018 at 10:37 AM
To: "user@spark.apache.org" <user@spark.apache.org>
Subject: Re: Spark on K8s resource staging server timeout

I added overkill-high timeouts to the OkHttpClient.Builder() in RetrofitClientFactory.scala and I don't seem to be timing out anymore.

val okHttpClientBuilder = new OkHttpClient.Builder()
  .dispatcher(dispatcher)
  .proxy(resolvedProxy)
  .connectTimeout(120, TimeUnit.SECONDS)
  .writeTimeout(120, TimeUnit.SECONDS)
  .readTimeout(120, TimeUnit.SECONDS)

-Jenna

On Tue, Mar 27, 2018 at 10:48 AM, Jenna Hoole <jenna.ho...@gmail.com> wrote:

So I'm running into an issue with my resource staging server that's producing a stack trace like Issue 342, but I don't think for the same reasons.

What's happening is that every time after I start up a resource staging server, the first job submitted that uses it fails with a java.net.SocketTimeoutException: timeout, and then every subsequent job runs perfectly – including with different jars and different users. It's only ever the first job that fails, and it always fails. I know I'm also running into Issue 577, in that it takes about three minutes before the resource staging server is accessible, but I'm still failing after waiting over ten minutes, or in one case overnight. And I'm just using the examples jar, so it's not a super large jar like in Issue 342.

This isn't great for our CI process, so has anyone seen anything like this before, or does anyone know how to increase the timeout if it just takes a while on initial contact? Using spark.network.timeout has no effect.

[jhoole@nid6 spark]$ kubectl get pods | grep jhoole-spark
jhoole-spark-resource-staging-server-6475c8-w5cdm   1/1   Running   13m
[jhoole@nid6 spark]$ kubectl get svc | grep jhoole-spark
jhoole-spark-resource-staging-service   NodePort   10.96.143.55   1:30622/TCP   13m
[jhoole@nid6 spark]$ bin/spark-submit --class org.apache.spark.examples.SparkPi --conf spark.app.name=spark-pi --conf spark.kubernetes.resourceStagingServer.uri=http://192.168.0.1:30622 ./examples/target/scala-2.11/jars/spark-examples_2.11-2.2.0-k8s-0.5.0.jar
2018-03-27 12:30:13 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2018-03-27 12:30:13 INFO UserGroupInformation:966 - Login successful for user jhoole@local using keytab file /security/secrets/jhoole.keytab
2018-03-27 12:30:14 INFO HadoopStepsOrchestrator:54 - Hadoop Conf directory: /etc/hadoop/conf
2018-03-27 12:30:14 INFO SecurityManager:54 - Changing view acls to: jhoole
2018-03-27 12:30:14 INFO SecurityManager:54 - Changing modify acls to: jhoole
2018-03-27 12:30:14 INFO SecurityManager:54 - Changing view acls groups to:
2018-03-27 12:30:14 INFO SecurityManager:54 - Changing modify acls groups to:
2018-03-27 12:30:14 INFO SecurityManager:54 - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(jhoole); groups with view permissions: Set(); users with modify permissions: Set(jhoole); groups with modify permissions: Set()
Exception in thread "main" java.net.SocketTimeoutException: timeout
    at okio.Okio$4.newTimeoutException(Okio.java:230)
    at okio.AsyncTimeout.exit(AsyncTimeout.java:285)
    at okio.AsyncTimeout$2.read(AsyncTimeout.java:241)
    at okio.RealBufferedSource.indexOf(RealBufferedSource.java:345)
    at okio.RealBufferedSource.readUtf8LineStrict(RealBufferedSource.java:217)
    at okio.RealBufferedSource.readUtf8LineStrict(RealBufferedSource.java:211)
    at okhttp3.internal.http1.Http1Codec.readResponseHeaders(Http1Codec.java:189)
    at okhttp3.internal.http.CallServerInterceptor.intercept(CallServerInterceptor.java:75)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
    at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:45)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
    at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93)
    at okhttp3.inte
Re: UnresolvedAddressException in Kubernetes Cluster
Hi there,

This closely resembles https://github.com/apache-spark-on-k8s/spark/issues/523, and we’re having some discussion there to find some possible root causes.

However, what release of the fork are you working off of? Are you using the HEAD of branch-2.2-kubernetes, or something else?

-Matt Cheah

From: Suman Somasundar <suman.somasun...@oracle.com>
Sent: Monday, October 9, 2017 3:42:37 PM
To: user@spark.apache.org
Subject: UnresolvedAddressException in Kubernetes Cluster

Hi,

I am trying to deploy a Spark app in a Kubernetes cluster. The cluster consists of 2 machines – 1 master and 1 slave – each of them with the following config: RHEL 7.2, Docker 17.03.1, K8S 1.7.

I am following the steps provided in https://apache-spark-on-k8s.github.io/userdocs/running-on-kubernetes.html

When I submit an application (SparkPi), a driver pod is created on the slave machine of the cluster, but it exits with an exception:

2017-10-09 22:13:24 INFO SecurityManager:54 - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
2017-10-09 22:13:30 ERROR SparkContext:91 - Error initializing SparkContext.
java.nio.channels.UnresolvedAddressException
    at sun.nio.ch.Net.checkAddress(Net.java:101)
    at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:218)
    at io.netty.channel.socket.nio.NioServerSocketChannel.doBind(NioServerSocketChannel.java:127)
    at io.netty.channel.AbstractChannel$AbstractUnsafe.bind(AbstractChannel.java:501)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.bind(DefaultChannelPipeline.java:1218)
    at io.netty.channel.AbstractChannelHandlerContext.invokeBind(AbstractChannelHandlerContext.java:496)
    at io.netty.channel.AbstractChannelHandlerContext.bind(AbstractChannelHandlerContext.java:481)
    at io.netty.channel.DefaultChannelPipeline.bind(DefaultChannelPipeline.java:965)
    at io.netty.channel.AbstractChannel.bind(AbstractChannel.java:210)
    at io.netty.bootstrap.AbstractBootstrap$2.run(AbstractBootstrap.java:353)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:399)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:446)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
    at java.lang.Thread.run(Thread.java:748)
2017-10-09 22:13:30 INFO SparkContext:54 - Successfully stopped SparkContext

Has anyone come across this problem, or does anyone know why this might be happening?

Thanks,
Suman.
Re: spark 2.0 issue with yarn?
@Marcelo: Interesting – why would this manifest on the YARN-client side, though (as Spark is the client to YARN in this case)? Spark as a client shouldn’t care about what auxiliary services are on the YARN cluster.

@Jesse: The change I wrote excludes all artifacts from the com.sun.jersey group. So to allow usage of the timeline service, we specifically need to find the thing that we don’t want to exclude and exclude everything but that from the YARN dependencies. I think the class that’s missing in this particular stack trace comes from com.sun.jersey:jersey-client. Refactoring the exclusions in pom.xml under hadoop-yarn-api to not exclude jersey-client should fix at least this. Again – we want to exclude all the other Jersey things that YARN is pulling in, unless further errors indicate otherwise.

In general, I’m wary of having both Jersey 1 and Jersey 2 jars on the class path at all. However, jersey-client looks relatively harmless, since it does not bundle in JAX-RS classes, nor does it appear to have anything weird in its META-INF folder.

-Matt Cheah

On 5/9/16, 3:10 PM, "Marcelo Vanzin" <van...@cloudera.com> wrote:

>Hi Jesse,
>
>On Mon, May 9, 2016 at 2:52 PM, Jesse F Chen <jfc...@us.ibm.com> wrote:
>> Sean - thanks. Definitely related to SPARK-12154.
>> Is there a way to continue using Jersey 1 for the existing working
>> environment?
>
>The error you're getting is because of a third-party extension that
>tries to talk to the YARN ATS; that's not part of upstream Spark,
>although I believe it's part of HDP. So you may have to talk to
>Hortonworks about that, or disable that extension in Spark 2.0 for the
>moment.
>
>
>--
>Marcelo
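A sketch of the kind of pom.xml refactor being described: replacing a blanket com.sun.jersey group exclusion under hadoop-yarn-api with per-artifact exclusions that leave jersey-client on the class path. The artifact list here is illustrative, not exhaustive:

<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-yarn-api</artifactId>
  <exclusions>
    <!-- Previously the whole com.sun.jersey group was excluded; keep
         jersey-client by excluding only the other Jersey 1 artifacts. -->
    <exclusion>
      <groupId>com.sun.jersey</groupId>
      <artifactId>jersey-core</artifactId>
    </exclusion>
    <exclusion>
      <groupId>com.sun.jersey</groupId>
      <artifactId>jersey-server</artifactId>
    </exclusion>
    <exclusion>
      <groupId>com.sun.jersey</groupId>
      <artifactId>jersey-json</artifactId>
    </exclusion>
  </exclusions>
</dependency>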
Specify number of partitions with which to run DataFrame.join?
Hi everyone,

I’m looking into switching raw RDD operations over to DataFrame operations. When I used JavaPairRDD.join(), I had the option to specify the number of partitions with which to do the join. However, I don’t see an equivalent option in DataFrame.join(). Is there a way to specify the partitioning for a DataFrame join operation as it is being computed? Or do I have to compute the join and repartition separately afterwards?

Thanks,

-Matt Cheah
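For context, a sketch of the two APIs in question: the RDD join takes an explicit partition count, while the DataFrame join's shuffle width is governed by spark.sql.shuffle.partitions (or by an explicit repartition before or after the join). The data and names below are illustrative, using the modern SparkSession entry point for brevity:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("join-partitions").getOrCreate()
    import spark.implicits._

    // RDD API: the partition count is an argument to join().
    val left  = spark.sparkContext.parallelize(Seq((1, "a"), (2, "b")))
    val right = spark.sparkContext.parallelize(Seq((1, "x"), (2, "y")))
    val joinedRdd = left.join(right, 64) // shuffle into 64 partitions

    // DataFrame API: join() has no numPartitions argument; the shuffle
    // width comes from spark.sql.shuffle.partitions instead.
    spark.conf.set("spark.sql.shuffle.partitions", "64")
    val leftDf  = Seq((1, "a"), (2, "b")).toDF("id", "l")
    val rightDf = Seq((1, "x"), (2, "y")).toDF("id", "r")
    val joinedDf = leftDf.join(rightDf, "id")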
Cross-compatibility of YARN shuffle service
Hi everyone,

I am considering moving from Spark Standalone to YARN. The context is that there are multiple Spark applications, using different versions of Spark, that all want to use the same YARN cluster.

My question is: if I use a single Spark YARN shuffle service jar on the NodeManager, will the service work properly with all of the Spark applications, regardless of the specific versions of the applications? Or is it the case that, if I want to use the external shuffle service, I need to have all of my applications using the same version of Spark?

Thanks,

-Matt Cheah
Reading from Kerberos Secured HDFS in Spark?
Hi everyone,

I’ve been trying to set up Spark so that it can read data from HDFS when the HDFS cluster is integrated with Kerberos authentication. I’ve been using the Spark shell to attempt to read from HDFS, in local mode.

I’ve set all of the appropriate properties in core-site.xml and hdfs-site.xml, as I can access and write data using the Hadoop command line utilities. I’ve also set HADOOP_CONF_DIR to point to the directory where core-site.xml and hdfs-site.xml live. I used UserGroupInformation.setConfiguration(conf) and UserGroupInformation.loginUserFromKeytab() to set up the token, and then SparkContext.newAPIHadoopFile(conf) (instead of SparkContext.textFile(), which I would expect not to pass the appropriate configurations with the Kerberos credentials). When I do that, I get the following stack trace:

java.io.IOException: Can't get Master Kerberos principal for use as renewer
    at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:116)
    at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:100)
    at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodes(TokenCache.java:80)
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:242)
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:385)
    at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:94)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)

I was wondering if anyone has had any experience setting up Spark to read from Kerberized HDFS. What configurations need to be set in spark-env.sh? What am I missing? Also, will I have an issue if I try to access HDFS in distributed mode, using a standalone setup?

Thanks,

-Matt Cheah
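For reference, a sketch of the flow described above: a keytab login followed by newAPIHadoopFile. The principal, keytab path, and HDFS URI are placeholders. Note that the "Master Kerberos principal for use as renewer" error typically means the loaded client configuration is missing a renewer principal entry (e.g. yarn.resourcemanager.principal from the cluster's *-site.xml), so it is worth confirming those files are actually being picked up:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
    import org.apache.hadoop.security.UserGroupInformation
    import org.apache.spark.{SparkConf, SparkContext}

    // Must pick up core-site.xml / hdfs-site.xml (e.g. via HADOOP_CONF_DIR);
    // the renewer error above suggests the principal entries were not loaded.
    val hadoopConf = new Configuration()
    hadoopConf.set("hadoop.security.authentication", "kerberos")

    UserGroupInformation.setConfiguration(hadoopConf)
    UserGroupInformation.loginUserFromKeytab(
      "user@EXAMPLE.COM",          // placeholder principal
      "/etc/security/user.keytab") // placeholder keytab path

    val sc = new SparkContext(
      new SparkConf().setMaster("local[*]").setAppName("kerberos-read"))
    val rdd = sc.newAPIHadoopFile(
      "hdfs://namenode:8020/data/input.txt", // placeholder path
      classOf[TextInputFormat], classOf[LongWritable], classOf[Text], hadoopConf)
    println(rdd.count())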