Re: k8s orchestrating Spark service

2019-07-01 Thread Matt Cheah
> We’d like to deploy Spark Workers/Executors and Master (whatever master is 
> easiest to talk about since we really don’t care) in pods as we do with the 
> other services we use. Replace Spark Master with k8s if you insist. How do 
> the executors get deployed?

 

When running Spark against Kubernetes natively, the Spark library handles 
requesting executors from the API server. So presumably one would only need to 
know how to start the driver in the cluster – maybe spark-operator, 
spark-submit, or just starting the pod and making a Spark context in client 
mode with the right parameters. From there, the Spark scheduler code knows how 
to interface with the API server and request executor pods according to the 
resource requests configured in the app.
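For the client-mode route, a minimal sketch (the API server address, container image, and instance count are placeholders; the driver pod must also be reachable by the executors, commonly via a headless service, which is omitted here):

import org.apache.spark.sql.SparkSession

// Client-mode driver running inside a pod: the Spark scheduler asks the
// Kubernetes API server for executor pods directly.
val spark = SparkSession.builder()
  .appName("ml-server-job")
  .master("k8s://https://kubernetes.default.svc:443")
  .config("spark.kubernetes.container.image", "my-registry/spark:2.4.3")
  .config("spark.kubernetes.namespace", "default")
  .config("spark.executor.instances", "3")
  .getOrCreate()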

 

> We have a Machine Learning Server. It submits various jobs through the Spark 
> Scala API. The Server is run in a pod deployed from a chart by k8s. It later 
> uses the Spark API to submit jobs. I guess we find spark-submit to be a 
> roadblock to our use of Spark, and the k8s support is fine, but how do you run 
> our Driver and Executors, considering that the Driver is part of the Server 
> process?

 

It depends on how the server runs the jobs:

- If each job is meant to be a separate forked driver pod / process: the ML 
server code can use the SparkLauncher API and configure the Spark driver 
through that API. Set the master to point to the Kubernetes API server and set 
the parameters for credentials according to your setup. SparkLauncher is a thin 
layer on top of spark-submit; a Spark distribution has to be packaged with the 
ML server image, and SparkLauncher would point to the spark-submit script in 
said distribution (see the sketch below).

- If all jobs run inside the same driver, that driver being the ML server 
itself: one has to start the ML server with the right parameters to point to 
the Kubernetes master. Since the ML server is a driver, one has the option to 
use spark-submit or SparkLauncher to deploy the ML server itself. Alternatively, 
one can use a custom script to start the ML server; the ML server process then 
has to create a SparkContext object parameterized against the Kubernetes API 
server in question.
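For the first option, a minimal SparkLauncher sketch (the class name, jar path, image, and API server address are placeholders; the Spark distribution is assumed to be unpacked at /opt/spark inside the ML server image):

import org.apache.spark.launcher.SparkLauncher

// Fork a separate driver pod per job by wrapping spark-submit.
val handle = new SparkLauncher()
  .setSparkHome("/opt/spark")  // Spark distribution packaged into the ML server image
  .setMaster("k8s://https://kubernetes.default.svc:443")
  .setDeployMode("cluster")
  .setAppResource("local:///opt/jobs/training-job.jar")  // application jar baked into the container image
  .setMainClass("com.example.TrainingJob")
  .setConf("spark.kubernetes.container.image", "my-registry/spark:2.4.3")
  .setConf("spark.kubernetes.authenticate.driver.serviceAccountName", "spark")
  .setConf("spark.executor.instances", "2")
  .startApplication()  // returns a SparkAppHandle the ML server can poll for job state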
 

I hope this helps!

 

-Matt Cheah

From: Pat Ferrel 
Date: Monday, July 1, 2019 at 5:05 PM
To: "user@spark.apache.org" , Matt Cheah 

Subject: Re: k8s orchestrating Spark service

 

We have a Machine Learning Server. It submits various jobs through the Spark 
Scala API. The Server is run in a pod deployed from a chart by k8s. It later 
uses the Spark API to submit jobs. I guess we find spark-submit to be a 
roadblock to our use of Spark, and the k8s support is fine, but how do you run 
our Driver and Executors, considering that the Driver is part of the Server 
process?

 

Maybe we are talking past each other with some mistaken assumptions (on my part 
perhaps).

 

 

 

From: Pat Ferrel 

Reply: Pat Ferrel 
Date: July 1, 2019 at 4:57:20 PM
To: user@spark.apache.org , Matt Cheah 

Subject:  Re: k8s orchestrating Spark service 



k8s as master would be nice but doesn’t solve the problem of running the full 
cluster and is an orthogonal issue.

 

We’d like to deploy Spark Workers/Executors and Master (whatever master is 
easiest to talk about since we really don’t care) in pods as we do with the 
other services we use. Replace Spark Master with k8s if you insist. How do the 
executors get deployed?

 

We have our own containers that almost work for 2.3.3. We have used this before 
with older Spark so we are reasonably sure it makes sense. We just wonder if 
our own image builds and charts are the best starting point.

 

Does anyone have something they like? 

 


From: Matt Cheah 
Reply: Matt Cheah 
Date: July 1, 2019 at 4:45:55 PM
To: Pat Ferrel , user@spark.apache.org 

Subject:  Re: k8s orchestrating Spark service 



Sorry, I don’t quite follow – why use the Spark standalone cluster as an 
in-between layer when one can just deploy the Spark application directly inside 
the Helm chart? I’m curious as to what the use case is, since I’m wondering if 
there’s something we can improve with respect to the native integration with 
Kubernetes here. Deploying on Spark standalone mode in Kubernetes is, to my 
understanding, meant to be superseded by the native integration introduced in 
Spark 2.4.

 

From: Pat Ferrel 
Date: Monday, July 1, 2019 at 4:40 PM
To: "user@spark.apache.org" , Matt Cheah 

Subject: Re: k8s orchestrating Spark service

 

Thanks Matt,

 

Actually I can't use spark-submit. We submit the Driver programmatically 
through the API. But this is not the issue, and using k8s as the master is also 
not the issue; though you may be right about it being easier, it doesn't quite 
get to the heart of the problem.

 

We want to orchestrate a bunch of services including Spark. The rest work; we 
are asking if anyone has seen a good starting point for adding Spark as a 
k8s-managed service.

 


From: Matt Cheah 
Reply: Ma

Re: k8s orchestrating Spark service

2019-07-01 Thread Matt Cheah
Sorry, I don’t quite follow – why use the Spark standalone cluster as an 
in-between layer when one can just deploy the Spark application directly inside 
the Helm chart? I’m curious as to what the use case is, since I’m wondering if 
there’s something we can improve with respect to the native integration with 
Kubernetes here. Deploying on Spark standalone mode in Kubernetes is, to my 
understanding, meant to be superseded by the native integration introduced in 
Spark 2.4.

 

From: Pat Ferrel 
Date: Monday, July 1, 2019 at 4:40 PM
To: "user@spark.apache.org" , Matt Cheah 

Subject: Re: k8s orchestrating Spark service

 

Thanks Matt,

 

Actually I can't use spark-submit. We submit the Driver programmatically 
through the API. But this is not the issue, and using k8s as the master is also 
not the issue; though you may be right about it being easier, it doesn't quite 
get to the heart of the problem.

 

We want to orchestrate a bunch of services including Spark. The rest work; we 
are asking if anyone has seen a good starting point for adding Spark as a 
k8s-managed service.

 


From: Matt Cheah 
Reply: Matt Cheah 
Date: July 1, 2019 at 3:26:20 PM
To: Pat Ferrel , user@spark.apache.org 

Subject:  Re: k8s orchestrating Spark service 



I would recommend looking into Spark’s native support for running on 
Kubernetes. One can just start the application against Kubernetes directly 
using spark-submit in cluster mode or starting the Spark context with the right 
parameters in client mode. See 
https://spark.apache.org/docs/latest/running-on-kubernetes.html

 

I would think that building Helm around this architecture of running Spark 
applications would be easier than running a Spark standalone cluster. But 
admittedly I’m not very familiar with the Helm technology – we just use 
spark-submit.

 

-Matt Cheah

From: Pat Ferrel 
Date: Sunday, June 30, 2019 at 12:55 PM
To: "user@spark.apache.org" 
Subject: k8s orchestrating Spark service

 

We're trying to set up a system that includes Spark. The rest of the services 
have good Docker containers and Helm charts to start from.

 

Spark, on the other hand, is proving difficult. We forked a container and have 
tried to create our own chart, but we are having several problems with it.

 

So back to the community… Can anyone recommend a Docker Container + Helm Chart 
for use with Kubernetes to orchestrate:
- a Spark standalone Master
- several Spark Workers/Executors
This is not a request to use k8s to orchestrate Spark Jobs, but the service 
cluster itself.

 

Thanks

 





Re: k8s orchestrating Spark service

2019-07-01 Thread Matt Cheah
I would recommend looking into Spark’s native support for running on 
Kubernetes. One can just start the application against Kubernetes directly 
using spark-submit in cluster mode or starting the Spark context with the right 
parameters in client mode. See 
https://spark.apache.org/docs/latest/running-on-kubernetes.html

 

I would think that building Helm around this architecture of running Spark 
applications would be easier than running a Spark standalone cluster. But 
admittedly I’m not very familiar with the Helm technology – we just use 
spark-submit.

 

-Matt Cheah

From: Pat Ferrel 
Date: Sunday, June 30, 2019 at 12:55 PM
To: "user@spark.apache.org" 
Subject: k8s orchestrating Spark service

 

We're trying to set up a system that includes Spark. The rest of the services 
have good Docker containers and Helm charts to start from.

 

Spark, on the other hand, is proving difficult. We forked a container and have 
tried to create our own chart, but we are having several problems with it.

 

So back to the community… Can anyone recommend a Docker Container + Helm Chart 
for use with Kubernetes to orchestrate:
- a Spark standalone Master
- several Spark Workers/Executors
This is not a request to use k8s to orchestrate Spark Jobs, but the service 
cluster itself.

 

Thanks

 





[PSA] Sharing our Experiences With Kubernetes

2019-05-17 Thread Matt Cheah
Hi everyone,

 

I would like to share the experiences my organization has had with deploying 
Kubernetes and migrating our Spark applications over to Kubernetes from YARN. 
We are publishing a series of blog posts that describe what we have learned and 
what we have built.

 

Our introduction post is here: 
https://medium.com/palantir/introducing-rubix-kubernetes-at-palantir-ab0ce16ea42e

 

We then discuss our custom scheduling work here: 
https://medium.com/palantir/spark-scheduling-in-kubernetes-4976333235f3

 

Stay tuned for more posts in this space! I hope that this is helpful for you.

 

-Matt Cheah





Re: Spark on k8s - map persistentStorage for data spilling

2019-03-01 Thread Matt Cheah
Ah I see: We always force the local directory to use emptyDir and it cannot be 
configured to use any other volume type. See here.

 

I am a bit conflicted on this. On one hand, it makes sense to allow for users 
to be able to mount their own volumes to handle spill data. On the other hand, 
I get the impression that emptyDir is the right kind of volume for this in a 
majority of cases – emptyDir is meant to be used for temporary storage and is 
meant to be fast to make workflows like Spark performant. Finally, a 
significant benefit of emptyDir is that Kubernetes will handle the cleanup of 
the directory for you if the pod exits – if you use a persistent volume claim 
you will need to ensure the files are cleaned up in the case that the pod exits 
abruptly.

 

I wonder if your organization could consider modifying your Kubernetes setup to 
make your emptyDir volumes larger and faster?

 

-Matt Cheah

 

From: Tomasz Krol 
Date: Friday, March 1, 2019 at 10:53 AM
To: Matt Cheah 
Cc: "user@spark.apache.org" 
Subject: Re: Spark on k8s - map persistentStorage for data spilling

 

Hi Matt, 

 

Thanks for coming back to me. Yeah, that doesn't work. Basically, in the 
properties I set the volume and mount point as below:

 

spark.kubernetes.driver.volumes.persistentVolumeClaim.checkvolume.mount.path=/checkpoint

spark.kubernetes.driver.volumes.persistentVolumeClaim.checkvolume.mount.readOnly=false

spark.kubernetes.driver.volumes.persistentVolumeClaim.checkvolume.mount.claimName=sparkstorage
 

 

spark.kubernetes.executor.volumes.persistentVolumeClaim.checkvolume.mount.path=/checkpoint

spark.kubernetes.executor.volumes.persistentVolumeClaim.checkvolume.mount.readOnly=false

spark.kubernetes.executor.volumes.persistentVolumeClaim.checkvolume.mount.claimName=sparkstorage
 

 

That works as expected and the PVC is mounted in the driver and executor pods on 
the /checkpoint directory.

 

As you suggested, the first thing I tried was to set spark.local.dir or the env 
SPARK_LOCAL_DIRS to the /checkpoint directory, as my expectation was that it 
would spill to my PVC. However, this throws the following error:

 

"spark-kube-driver" is invalid: spec.containers[0].volumeMounts[3].mountPath: 
Invalid value: "/checkpoint": must be unique"

 

It seems like it's trying to mount an emptyDir with mount point "/checkpoint", 
but it can't because "/checkpoint" is the directory where the PVC is already 
mounted.

 

At the moment it looks to me like the emptyDir is always used for spilling 
data. The question is how to mount it on the PVC, unless I'm missing something here.
I can't really run any bigger jobs at the moment because of that. I'd appreciate 
any feedback :)
any feedback :)

 

Thanks 

 

Tom

 

On Thu, 28 Feb 2019 at 17:23, Matt Cheah  wrote:

I think we want to change the value of spark.local.dir to point to where your 
PVC is mounted. Can you give that a try and let us know if that moves the 
spills as expected?

 

-Matt Cheah

 

From: Tomasz Krol 
Date: Wednesday, February 27, 2019 at 3:41 AM
To: "user@spark.apache.org" 
Subject: Spark on k8s - map persistentStorage for data spilling

 

Hey Guys,

 

I hope someone will be able to help me, as I've been stuck with this for a while :) 
Basically I am running some jobs on Kubernetes as per the documentation:

 

https://spark.apache.org/docs/latest/running-on-kubernetes.html

 

All works fine; however, if I run queries on a bigger data volume, then jobs 
fail because there is not enough space in the /var/data/spark-1xxx directory.

 

Obviously the reason for this is that the mounted emptyDir doesn't have enough space.

 

I also mounted a PVC to the driver and executor pods, which I can see during the 
runtime. I am wondering if someone knows how to set things up so that data will 
be spilled to a different directory (i.e. my persistent storage directory) 
instead of the emptyDir with its limited space, or if I can mount the emptyDir 
somehow on my PVC. Basically, at the moment I can't run any jobs as they are 
failing due to insufficient space in that /var/data directory.

 

Thanks

-- 

Tomasz Krol
patric...@gmail.com


 

-- 

Tomasz Krol
patric...@gmail.com





Re: Spark on k8s - map persistentStorage for data spilling

2019-02-28 Thread Matt Cheah
I think we want to change the value of spark.local.dir to point to where your 
PVC is mounted. Can you give that a try and let us know if that moves the 
spills as expected?

 

-Matt Cheah

 

From: Tomasz Krol 
Date: Wednesday, February 27, 2019 at 3:41 AM
To: "user@spark.apache.org" 
Subject: Spark on k8s - map persistentStorage for data spilling

 

Hey Guys,

 

I hope someone will be able to help me, as I've been stuck with this for a while :) 
Basically I am running some jobs on Kubernetes as per the documentation:

 

https://spark.apache.org/docs/latest/running-on-kubernetes.html

 

All works fine; however, if I run queries on a bigger data volume, then jobs 
fail because there is not enough space in the /var/data/spark-1xxx directory.

 

Obviously the reason for this is that the mounted emptyDir doesn't have enough space.

 

I also mounted a PVC to the driver and executor pods, which I can see during the 
runtime. I am wondering if someone knows how to set things up so that data will 
be spilled to a different directory (i.e. my persistent storage directory) 
instead of the emptyDir with its limited space, or if I can mount the emptyDir 
somehow on my PVC. Basically, at the moment I can't run any jobs as they are 
failing due to insufficient space in that /var/data directory.

 

Thanks

-- 

Tomasz Krol
patric...@gmail.com





Re: Problem running Spark on Kubernetes: Certificate error

2018-12-13 Thread Matt Cheah
Hi Steven,

 

What I think is happening is that your machine has a CA certificate that is 
used for communicating with your API server, particularly because you’re using 
Digital Ocean’s cluster manager. However, it’s unclear if your pod has the same 
CA certificate or if the pod needs that certificate file. You can use 
configurations to have your pod use a particular CA certificate file to 
communicate with the API server. If you set 
spark.kubernetes.authenticate.driver.caCertFile to the path of your CA 
certificate on your local disk, spark-submit will create a secret that contains 
that certificate file and use that certificate to configure TLS for the driver 
pod’s communication with the API server.
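For example, with a placeholder path, the extra flag on the spark-submit invocation would look like:

--conf spark.kubernetes.authenticate.driver.caCertFile=/path/to/digitalocean-ca.pem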

 

It's clear that your driver pod doesn’t have the right TLS certificate to 
communicate with the API server, so I would try to introspect the driver pod 
and see what certificate it’s using for that communication. If there’s a fix 
that needs to happen in Spark, feel free to indicate as such.

 

-Matt Cheah

 

From: Steven Stetzler 
Date: Thursday, December 13, 2018 at 1:49 PM
To: "user@spark.apache.org" 
Subject: Problem running Spark on Kubernetes: Certificate error

 

Hello, 

I am following the tutorial here 
(https://spark.apache.org/docs/latest/running-on-kubernetes.html) to get Spark 
running on a Kubernetes cluster. My Kubernetes cluster is hosted with Digital 
Ocean's Kubernetes cluster manager. I have changed the KUBECONFIG environment 
variable to point to my cluster access credentials, so both Spark and kubectl 
can speak with the nodes.

I am running into an issue when trying to run the SparkPi example as described 
in the Spark on Kubernetes tutorials. The command I am running is: 

./bin/spark-submit --master k8s://$CLUSTERIP --deploy-mode cluster --name 
spark-pi --class org.apache.spark.examples.SparkPi --conf 
spark.executor.instances=1 --conf spark.kubernetes.container.image=$IMAGEURL 
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark 
local:///opt/spark/examples/jars/spark-examples_2.11-2.4.0.jar 

where CLUSTERIP contains the IP of my cluster and IMAGEURL contains the URL of 
the Spark Docker image I am using 
(https://hub.docker.com/r/stevenstetzler/spark/). This Docker image was built 
and pushed with the script included in the Spark 2.4 distribution. I have 
created a service account for Spark to ensure that it has proper permissions to 
create pods etc., which I checked using 

kubectl auth can-i create pods --as=system:serviceaccount:default:spark 

When I try to run the SparkPi example using the above command, I get the 
following output: 

2018-12-12 06:26:15 WARN  Utils:66 - Your hostname, docker-test resolves to a 
loopback address: 127.0.1.1; using 10.46.0.6 instead (on interface eth0) 
2018-12-12 06:26:15 WARN  Utils:66 - Set SPARK_LOCAL_IP if you need to bind to 
another address 
2018-12-12 06:26:19 INFO  LoggingPodStatusWatcherImpl:54 - State changed, new 
state: 
 pod name: spark-pi-1544595975520-driver 
 namespace: default 
 labels: spark-app-selector -> spark-ec5eb54644d348e7a213f8178b8ef61f, 
spark-role -> driver 
 pod uid: d5d6bdc7-fdd6-11e8-b666-8e815d3815b2 
 creation time: 2018-12-12T06:26:18Z 
 service account name: spark 
 volumes: spark-local-dir-1, spark-conf-volume, spark-token-qf9dt 
 node name: N/A 
 start time: N/A 
 container images: N/A 
 phase: Pending 
 status: [] 
2018-12-12 06:26:19 INFO  LoggingPodStatusWatcherImpl:54 - State changed, new 
state: 
 pod name: spark-pi-1544595975520-driver 
 namespace: default 
 labels: spark-app-selector -> spark-ec5eb54644d348e7a213f8178b8ef61f, 
spark-role -> driver 
 pod uid: d5d6bdc7-fdd6-11e8-b666-8e815d3815b2 
 creation time: 2018-12-12T06:26:18Z 
 service account name: spark 
 volumes: spark-local-dir-1, spark-conf-volume, spark-token-qf9dt 
 node name: flamboyant-darwin-3rhc 
 start time: N/A 
 container images: N/A 
 phase: Pending 
 status: [] 
2018-12-12 06:26:19 INFO  LoggingPodStatusWatcherImpl:54 - State changed, new 
state: 
 pod name: spark-pi-1544595975520-driver 
 namespace: default 
 labels: spark-app-selector -> spark-ec5eb54644d348e7a213f8178b8ef61f, 
spark-role -> driver 
 pod uid: d5d6bdc7-fdd6-11e8-b666-8e815d3815b2 
 creation time: 2018-12-12T06:26:18Z 
 service account name: spark 
 volumes: spark-local-dir-1, spark-conf-volume, spark-token-qf9dt 
 node name: flamboyant-darwin-3rhc 
 start time: 2018-12-12T06:26:18Z 
 container images: docker.io/stevenstetzler/spark:v1 [docker.io] 
 phase: Pending 
 status: [ContainerStatus(containerID=null, 
image=docker.io/stevenstetzler/spark:v1 [docker.io], imageID=, 
lastState

Re: External shuffle service on K8S

2018-10-26 Thread Matt Cheah
Hi there,

 

Please see https://issues.apache.org/jira/browse/SPARK-25299 for more 
discussion around this matter.

 

-Matt Cheah

 

From: Li Gao 
Date: Friday, October 26, 2018 at 9:10 AM
To: "vincent.gromakow...@gmail.com" 
Cc: "caolijun1...@gmail.com" , "user@spark.apache.org" 

Subject: Re: External shuffle service on K8S

 

There is an existing 2.2-based external shuffle service on the fork: 

https://apache-spark-on-k8s.github.io/userdocs/running-on-kubernetes.html

 

You can modify it to suit your needs.

 

-Li

 

 

On Fri, Oct 26, 2018 at 3:22 AM vincent gromakowski 
 wrote:

No, it's on the roadmap for >2.4

 

Le ven. 26 oct. 2018 à 11:15, 曹礼俊  a écrit :

Hi all: 

 

Does Spark 2.3.2 support the external shuffle service on Kubernetes? 

 

I have looked up the 
documentation (https://spark.apache.org/docs/latest/running-on-kubernetes.html), 
but couldn't find related suggestions.

 

If it is supported, how can I enable it?

 

Best Regards

 

Lijun Cao

 

 





Re: [Spark for kubernetes] Azure Blob Storage credentials issue

2018-10-24 Thread Matt Cheah
Hi there,

 

Can you check if HADOOP_CONF_DIR is being set on the executors to 
/opt/spark/conf? One should set an executor environment variable for that.
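For example, assuming the same /opt/spark/conf path that the secret is mounted at in the submission command below, one extra flag along these lines should do it:

--conf spark.executorEnv.HADOOP_CONF_DIR=/opt/spark/conf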

 

A kubectl describe pod output for the executors would be helpful here.

 

-Matt Cheah

 

From: Oscar Bonilla 
Date: Friday, October 19, 2018 at 1:03 AM
To: "user@spark.apache.org" 
Subject: [Spark for kubernetes] Azure Blob Storage credentials issue

 

Hello,

I'm having the following issue while trying to run Spark for Kubernetes:
2018-10-18 08:48:54 INFO  DAGScheduler:54 - Job 0 failed: reduce at 
SparkPi.scala:38, took 1.743177 s
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to 
stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost 
task 0.3 in stage 0.0 (TID 6, 10.244.1.11, executor 2): 
org.apache.hadoop.fs.azure.AzureException: 
org.apache.hadoop.fs.azure.AzureException: No credentials found for account 
datasets83d858296fd0c49b.blob.core.windows.net 
[datasets83d858296fd0c49b.blob.core.windows.net] in the configuration, and its 
container datasets is not accessible using anonymous credentials. Please check 
if the container exists first. If it is not publicly available, you have to 
provide account credentials.
    at 
org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.createAzureStorageSession(AzureNativeFileSystemStore.java:1086)
    at 
org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.initialize(AzureNativeFileSystemStore.java:538)
    at 
org.apache.hadoop.fs.azure.NativeAzureFileSystem.initialize(NativeAzureFileSystem.java:1366)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3242)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:121)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3291)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3259)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:470)
    at org.apache.spark.util.Utils$.getHadoopFileSystem(Utils.scala:1897)
    at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:694)
    at org.apache.spark.util.Utils$.fetchFile(Utils.scala:476)
    at 
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:755)
    at 
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:747)
    at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
    at 
scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
    at 
scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
    at 
scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
    at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
    at scala.collection.mutable.HashMap.foreach(HashMap.scala:99)
    at 
scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
    at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:747)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:312)
    at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.hadoop.fs.azure.AzureException: No credentials found for 
account datasets83d858296fd0c49b.blob.core.windows.net 
[datasets83d858296fd0c49b.blob.core.windows.net] in the configuration, and its 
container datasets is not accessible using anonymous credentials. Please check 
if the container exists first. If it is not publicly available, you have to 
provide account credentials.
    at 
org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.connectUsingAnonymousCredentials(AzureNativeFileSystemStore.java:863)
    at 
org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.createAzureStorageSession(AzureNativeFileSystemStore.java:1081)
    ... 24 more
The command I use to launch the job is:
/opt/spark/bin/spark-submit
    --master k8s://
    --deploy-mode cluster
    --name spark-pi
    --class org.apache.spark.examples.SparkPi
    --conf spark.executor.instances=5
    --conf spark.kubernetes.container.image=
    --conf spark.kubernetes.namespace=
    --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark
    --conf spark.kubernetes.driver.secrets.spark=/opt/spark/conf
    --conf spark.kubernetes.executor.secrets.spark=/opt/spark/conf
wasb://@.blob.core.windows.net/spark-examples_2.11-2.3.2.jar
 [blob.core.windows.net] 1
I have a k8s secret named spark with the following content:
apiVersion: v1
kind: Secret
metadata:
  name: spark
  labels:
    app: spark
    stack: service
type: Opaque
data:
  core-site.xml: |-
   

Re: Spark on Kubernetes: Kubernetes killing executors because of overallocation of memory

2018-08-02 Thread Matt Cheah
Hi there,

 

You may want to look at setting the memory overhead settings higher. Spark will 
then start containers with a higher memory limit (spark.executor.memory + 
spark.executor.memoryOverhead, to be exact) while the heap is still locked to 
spark.executor.memory. There’s some memory used by offheap storage from Spark 
that won’t be accounted for in just the heap size.
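As a rough worked example, assuming the default overhead of max(0.10 × executor memory, 384 MiB): with spark.executor.memory=7g the overhead comes to about 717 MiB, so the pod memory request/limit is roughly 7168 MiB + 717 MiB ≈ 7.7 GiB, which lines up with the ~7.6G that Kubernetes reports below. To size it explicitly, one could add something like:

"spark.executor.memoryOverhead": "1g"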

 

Hope this helps,

 

-Matt Cheah

 

From: Jayesh Lalwani 
Date: Thursday, August 2, 2018 at 12:35 PM
To: "user@spark.apache.org" 
Subject: Spark on Kubernetes: Kubernetes killing executors because of 
overallocation of memory

 

We are running Spark 2.3 on a Kubernetes cluster. We have set the following 
spark configuration options

"spark.executor.memory": "7g",

"spark.driver.memory": "2g",

"spark.memory.fraction": "0.75"

 

What we see is:

a) In the Spark UI, 5G has been allocated to each executor, which makes sense 
because we set spark.memory.fraction=0.75
b) Kubernetes reports the pod memory usage as 7.6G

 

When we run a lot of jobs on the Kubernetes cluster, Kubernetes starts killing 
the executor pods because it thinks that the pod is misbehaving.

 

We logged into a running pod and ran the top command, and most of the 7.6G is 
allocated to the executor's Java process.

 

Why is Spark taking 7.6G instead of 7G? Where is the 600MB being allocated? Is 
there some configuration that controls how much of the executor memory gets 
allocated to PermGen vs. the memory that gets allocated to the heap?

 

 

 






Re: Structured Streaming on Kubernetes

2018-04-13 Thread Matt Cheah
We don’t provide any Kubernetes-specific mechanisms for streaming, such as 
checkpointing to persistent volumes. But as long as streaming doesn’t require 
persisting to the executor’s local disk, streaming ought to work out of the 
box. E.g. you can checkpoint to HDFS, but not to the pod’s local directories.
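A small sketch of that pattern (the HDFS URIs, topic, and paths are placeholders; the Kafka source additionally needs the spark-sql-kafka package on the classpath):

import org.apache.spark.sql.SparkSession

// Checkpoint offsets and state to HDFS rather than the pod's local disk.
val spark = SparkSession.builder().appName("stream-on-k8s").getOrCreate()

val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "kafka:9092")
  .option("subscribe", "events")
  .load()

events.writeStream
  .format("parquet")
  .option("path", "hdfs://namenode:8020/data/events")
  .option("checkpointLocation", "hdfs://namenode:8020/checkpoints/events")
  .start()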

 

However, I’m unaware of any specific use of streaming with the Spark on 
Kubernetes integration right now. Would be curious to get feedback on the 
failover behavior right now.

 

-Matt Cheah

 

From: Tathagata Das <t...@databricks.com>
Date: Friday, April 13, 2018 at 1:27 AM
To: Krishna Kalyan <krishnakaly...@gmail.com>
Cc: user <user@spark.apache.org>
Subject: Re: Structured Streaming on Kubernetes

 

Structured streaming is stable in production! At Databricks, we and our 
customers collectively process almost 100s of billions of records per day using 
SS. However, we are not using kubernetes :) 

 

Though I don't think it will matter too much as long as kubes are correctly 
provisioned+configured and you are checkpointing to HDFS (for fault-tolerance 
guarantees).

 

TD

 

On Fri, Apr 13, 2018, 12:28 AM Krishna Kalyan <krishnakaly...@gmail.com> wrote:

Hello All, 

We were evaluating Spark Structured Streaming on Kubernetes (running on GCP). 
It would be awesome if the Spark community could share their experience around 
this. I would like to know more about your production experience and the 
monitoring tools you are using.

 

Since spark on kubernetes is a relatively new addition to spark, I was 
wondering if structured streaming is stable in production. We were also 
evaluating Apache Beam with Flink.

 

Regards,

Krishna

 

 





Re: Spark on K8s resource staging server timeout

2018-03-29 Thread Matt Cheah
Hello Jenna,

 

Are there any logs from the resource staging server pod? They might show 
something interesting.

 

Unfortunately, we haven’t been maintaining the resource staging server because 
we’ve moved all of our effort to the main repository instead of the fork. When 
we consider the submission of local files in the official release, we should 
probably create a mechanism that's more resilient. Using a single HTTP server 
isn't ideal – we would ideally want something that's highly available, 
replicated, etc.

 

-Matt Cheah

 

From: Jenna Hoole <jenna.ho...@gmail.com>
Date: Thursday, March 29, 2018 at 10:37 AM
To: "user@spark.apache.org" <user@spark.apache.org>
Subject: Re: Spark on K8s resource staging server timeout

 

I added overkill high timeouts to the OkHttpClient.Builder() in 
RetrofitClientFactory.scala and I don't seem to be timing out anymore. 

 

val okHttpClientBuilder = new OkHttpClient.Builder()

  .dispatcher(dispatcher)

  .proxy(resolvedProxy)

  .connectTimeout(120, TimeUnit.SECONDS)

  .writeTimeout(120, TimeUnit.SECONDS)

  .readTimeout(120, TimeUnit.SECONDS)

 

-Jenna

 

On Tue, Mar 27, 2018 at 10:48 AM, Jenna Hoole <jenna.ho...@gmail.com> wrote:

So I'm running into an issue with my resource staging server that's producing a 
stacktrace like Issue 342, but I don't think for the same reasons. What's 
happening is that every time after I start up a resource staging server, the 
first job submitted that uses it will fail with a 
java.net.SocketTimeoutException: timeout, and then every subsequent job will 
run perfectly, including with different jars and different users. It's only 
ever the first job that fails, and it always fails. I know I'm also running into 
Issue 577 in that it takes about three minutes before the resource staging 
server is accessible, but I'm still failing after waiting over ten minutes, or 
in one case overnight. And I'm just using the examples jar, so it's not a super 
large jar like in Issue 342. 

 

This isn't great for our CI process, so has anyone seen anything like this 
before or know how to increase the timeout if it just takes a while on initial 
contact? Using spark.network.timeout has no effect.

 

[jhoole@nid6 spark]$ kubectl get pods | grep jhoole-spark

jhoole-spark-resource-staging-server-6475c8-w5cdm   1/1   Running   13m

[jhoole@nid6 spark]$ kubectl get svc | grep jhoole-spark

jhoole-spark-resource-staging-service   NodePort   10.96.143.55   1:30622/TCP   13m

[jhoole@nid6 spark]$ bin/spark-submit --class org.apache.spark.examples.SparkPi --conf spark.app.name=spark-pi --conf spark.kubernetes.resourceStagingServer.uri=http://192.168.0.1:30622 ./examples/target/scala-2.11/jars/spark-examples_2.11-2.2.0-k8s-0.5.0.jar 

2018-03-27 12:30:13 WARN  NativeCodeLoader:62 - Unable to load native-hadoop 
library for your platform... using builtin-java classes where applicable

2018-03-27 12:30:13 INFO  UserGroupInformation:966 - Login successful for user 
jhoole@local using keytab file /security/secrets/jhoole.keytab

2018-03-27 12:30:14 INFO  HadoopStepsOrchestrator:54 - Hadoop Conf directory: 
/etc/hadoop/conf

2018-03-27 12:30:14 INFO  SecurityManager:54 - Changing view acls to: jhoole

2018-03-27 12:30:14 INFO  SecurityManager:54 - Changing modify acls to: jhoole

2018-03-27 12:30:14 INFO  SecurityManager:54 - Changing view acls groups to: 

2018-03-27 12:30:14 INFO  SecurityManager:54 - Changing modify acls groups to: 

2018-03-27 12:30:14 INFO  SecurityManager:54 - SecurityManager: authentication 
disabled; ui acls disabled; users  with view permissions: Set(jhoole); groups 
with view permissions: Set(); users  with modify permissions: Set(jhoole); 
groups with modify permissions: Set()

Exception in thread "main" java.net.SocketTimeoutException: timeout

at okio.Okio$4.newTimeoutException(Okio.java:230)

at okio.AsyncTimeout.exit(AsyncTimeout.java:285)

at okio.AsyncTimeout$2.read(AsyncTimeout.java:241)

at okio.RealBufferedSource.indexOf(RealBufferedSource.java:345)

at okio.RealBufferedSource.readUtf8LineStrict(RealBufferedSource.java:217)

at okio.RealBufferedSource.readUtf8LineStrict(RealBufferedSource.java:211)

at okhttp3.internal.http1.Http1Codec.readResponseHeaders(Http1Codec.java:189)

at 
okhttp3.internal.http.CallServerInterceptor.intercept(CallServerInterceptor.java:75)

at 
okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)

at 
okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:45)

at 
okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)

at 
okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)

at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93)

at 
okhttp3.inte

Re: UnresolvedAddressException in Kubernetes Cluster

2017-10-12 Thread Matt Cheah
Hi there,



This closely resembles https://github.com/apache-spark-on-k8s/spark/issues/523, 
and we’re having some discussion there to find some possible root causes. 
However, what release of the fork are you working off of? Are you using the 
HEAD of branch-2.2-kubernetes, or something else?



-Matt Cheah


From: Suman Somasundar <suman.somasun...@oracle.com>
Sent: Monday, October 9, 2017 3:42:37 PM
To: user@spark.apache.org
Subject: UnresolvedAddressException in Kubernetes Cluster

Hi,

I am trying to deploy a Spark app in a Kubernetes Cluster. The cluster consists 
of 2 machines - 1 master and 1 slave, each of them with the following config:
RHEL 7.2
Docker 17.03.1
K8S 1.7.

I am following the steps provided in 
https://apache-spark-on-k8s.github.io/userdocs/running-on-kubernetes.html

When I submit an application (SparkPi), a driver pod is created on the slave 
machine of the cluster. But it exits with an exception:

2017-10-09 22:13:24 INFO  SecurityManager:54 - SecurityManager: authentication 
disabled; ui acls disabled; users  with view permissions: Set(root); groups 
with view permissions: Set(); users  with modify permissions: Set(root); groups 
with modify permissions: Set()
2017-10-09 22:13:30 ERROR SparkContext:91 - Error initializing SparkContext.
java.nio.channels.UnresolvedAddressException
at sun.nio.ch.Net.checkAddress(Net.java:101)
at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:218)
at 
io.netty.channel.socket.nio.NioServerSocketChannel.doBind(NioServerSocketChannel.java:127)
at 
io.netty.channel.AbstractChannel$AbstractUnsafe.bind(AbstractChannel.java:501)
at 
io.netty.channel.DefaultChannelPipeline$HeadContext.bind(DefaultChannelPipeline.java:1218)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeBind(AbstractChannelHandlerContext.java:496)
at 
io.netty.channel.AbstractChannelHandlerContext.bind(AbstractChannelHandlerContext.java:481)
at io.netty.channel.DefaultChannelPipeline.bind(DefaultChannelPipeline.java:965)
at io.netty.channel.AbstractChannel.bind(AbstractChannel.java:210)
at io.netty.bootstrap.AbstractBootstrap$2.run(AbstractBootstrap.java:353)
at 
io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:399)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:446)
at 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
at 
io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
at java.lang.Thread.run(Thread.java:748)
2017-10-09 22:13:30 INFO  SparkContext:54 - Successfully stopped SparkContext

Has anyone come across this problem or know why this might be happening?

Thanks,
Suman.


Re: spark 2.0 issue with yarn?

2016-05-09 Thread Matt Cheah
@Marcelo: Interesting - why would this manifest on the YARN-client side
though (as Spark is the client to YARN in this case)? Spark as a client
shouldn’t care about what auxiliary services are on the YARN cluster.

@Jesse: The change I wrote excludes all artifacts from the com.sun.jersey
group. So to allow usage of the timeline service, we specifically need to
find the thing that we don’t want to exclude and exclude everything but
that from the YARN dependencies.

I think the class that’s missing in this particular stack trace comes from
com.sun.jersey:jersey-client. Refactoring the exclusions in pom.xml under
hadoop-yarn-api to not exclude jersey-client should fix at least this.
Again – we want to exclude all the other Jersey things that YARN is
pulling in though, unless further errors indicate otherwise.

In general I’m wary of having both Jersey 1 and Jersey 2 jars on the class
path at all. However jersey-client looks relatively harmless since it does
not bundle in JAX-RS classes, nor does it appear to have anything weird in
its META-INF folder.

-Matt Cheah



On 5/9/16, 3:10 PM, "Marcelo Vanzin" <van...@cloudera.com> wrote:

>Hi Jesse,
>
>On Mon, May 9, 2016 at 2:52 PM, Jesse F Chen <jfc...@us.ibm.com> wrote:
>> Sean - thanks. definitely related to SPARK-12154.
>> Is there a way to continue use Jersey 1 for existing working
>>environment?
>
>The error you're getting is because of a third-party extension that
>tries to talk to the YARN ATS; that's not part of upstream Spark,
>although I believe it's part of HDP. So you may have to talk to
>Hortonworks about that, or disable that extension in Spark 2.0 for the
>moment.
>
>
>-- 
>Marcelo


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Specify number of partitions with which to run DataFrame.join?

2015-06-18 Thread Matt Cheah
Hi everyone,

I'm looking into switching raw RDD operations to DataFrame operations. When
I used JavaPairRDD.join(), I had the option to specify the number of
partitions with which to do the join. However, I don't see an equivalent
option in DataFrame.join(). Is there a way to specify the partitioning for a
DataFrame join operation as it is being computed? Or do I have to compute
the join and repartition separately afterwards?
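For reference, a minimal sketch of the usual workaround (assuming a spark-shell-style session where sqlContext, df1, and df2 already exist; this is not an exact equivalent of the JavaPairRDD overload):

// spark.sql.shuffle.partitions controls how many partitions shuffle operations,
// including DataFrame joins, produce.
sqlContext.setConf("spark.sql.shuffle.partitions", "400")
val joined = df1.join(df2, df1("id") === df2("id"))

// Or repartition the joined result explicitly afterwards.
val reshaped = joined.repartition(50)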

Thanks,

-Matt Cheah






Cross-compatibility of YARN shuffle service

2015-03-25 Thread Matt Cheah
Hi everyone,

I am considering moving from Spark-Standalone to YARN. The context is that
there are multiple Spark applications that are using different versions of
Spark that all want to use the same YARN cluster.

My question is: if I use a single Spark YARN shuffle service jar on the Node
Manager, will the service work properly with all of the Spark applications,
regardless of the specific versions of the applications? Or, is it it the
case that, if I want to use the external shuffle service, I need to have all
of my applications using the same version of Spark?

Thanks,

-Matt Cheah






Reading from Kerberos Secured HDFS in Spark?

2014-12-02 Thread Matt Cheah
Hi everyone,

I've been trying to set up Spark so that it can read data from HDFS, when
the HDFS cluster is integrated with Kerberos authentication.

I've been using the Spark shell to attempt to read from HDFS, in local mode.
I've set all of the appropriate properties in core-site.xml and
hdfs-site.xml, as I can access and write data using the Hadoop command line
utilities. I've also set HADOOP_CONF_DIR to point to the directory where
core-site.xml and hdfs-site.xml live.

I used UserGroupInformation.setConfiguration(conf) and
UserGroupInformation.loginUserFromKeytab(…) to set up the token, and then
SparkContext.newAPIHadoopFile(… conf) (instead of SparkContext.textFile(),
which I would think does not pass the appropriate configuration with the
Kerberos credentials). When I do that, I get the stack trace (sorry about
the color):

java.io.IOException: Can't get Master Kerberos principal for use as renewer

at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:116)

at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:100)

at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodes(TokenCache.java:80)

at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:242)

at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:385)

at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:94)

at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)

at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)


I was wondering if anyone has had any experience setting up Spark to read
from Kerberized HDFS. What configurations need to be set in spark-env.sh?
What am I missing?

Also, will I have an issue if I try to access HDFS in distributed mode,
using a standalone setup?

Thanks,

-Matt Cheah



