Re: Spark on k8s - map persistentStorage for data spilling

2019-03-01 Thread Tomasz Krol
Yeah, it seems like making the emptyDir larger is an option we need to consider.

Cheers

Tomasz Krol

On Fri, 1 Mar 2019 at 19:30, Matt Cheah wrote:

> [...]

Re: Spark on k8s - map persistentStorage for data spilling

2019-03-01 Thread Matt Cheah
Ah I see: we always force the local directory to use emptyDir, and it cannot be
configured to use any other volume type. See LocalDirsFeatureStep.scala:
https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/LocalDirsFeatureStep.scala

 

I am a bit conflicted on this. On one hand, it makes sense to allow for users 
to be able to mount their own volumes to handle spill data. On the other hand, 
I get the impression that emptyDir is the right kind of volume for this in a 
majority of cases – emptyDir is meant to be used for temporary storage and is 
meant to be fast to make workflows like Spark performant. Finally, a 
significant benefit of emptyDir is that Kubernetes will handle the cleanup of 
the directory for you if the pod exits – if you use a persistent volume claim 
you will need to ensure the files are cleaned up in the case that the pod exits 
abruptly.

 

I wonder if your organization could consider modifying your Kubernetes setup to
make your emptyDir volumes larger and faster?
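For reference, this is roughly what a larger/faster emptyDir looks like in a raw pod spec. It is an illustrative sketch only: Spark 2.4 generates the pod and its emptyDir mounts itself, so in practice this shape would come from node-level configuration (emptyDir capacity comes from the node's ephemeral storage) or from a pod template in later Spark releases. All names below are placeholders, not values from this thread.

```yaml
# Illustrative pod fragment (placeholder names): an emptyDir scratch volume.
# sizeLimit only caps usage; actual capacity comes from node ephemeral storage.
apiVersion: v1
kind: Pod
metadata:
  name: spark-exec-example
spec:
  containers:
    - name: spark-executor
      image: my-spark:2.4.0            # placeholder image
      volumeMounts:
        - name: spark-local-dir-1
          mountPath: /var/data/spark-local
  volumes:
    - name: spark-local-dir-1
      emptyDir:
        sizeLimit: 100Gi               # pod is evicted if scratch use exceeds this
        # medium: Memory               # optional: tmpfs-backed for faster spills, at the cost of RAM
```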

 

-Matt Cheah

 






Re: Spark on k8s - map persistentStorage for data spilling

2019-03-01 Thread Tomasz Krol
Hi Matt,

Thanks for coming back to me. Yeah, that doesn't work. Basically, in the
properties I set the volume and mount point as below:

spark.kubernetes.driver.volumes.persistentVolumeClaim.checkvolume.mount.path=/checkpoint
spark.kubernetes.driver.volumes.persistentVolumeClaim.checkvolume.mount.readOnly=false
spark.kubernetes.driver.volumes.persistentVolumeClaim.checkvolume.mount.claimName=sparkstorage

spark.kubernetes.executor.volumes.persistentVolumeClaim.checkvolume.mount.path=/checkpoint
spark.kubernetes.executor.volumes.persistentVolumeClaim.checkvolume.mount.readOnly=false
spark.kubernetes.executor.volumes.persistentVolumeClaim.checkvolume.mount.claimName=sparkstorage

That works as expected, and the PVC is mounted in the driver and executor pods
at the /checkpoint directory.
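For concreteness, a full spark-submit carrying those properties might look like the sketch below; only the volume properties are taken from this thread, while the master URL, image, and example jar are placeholders.

```shell
# Hypothetical spark-submit; volume properties as quoted in this thread,
# everything else is a placeholder.
spark-submit \
  --master k8s://https://kube-apiserver.example:6443 \
  --deploy-mode cluster \
  --name spill-test \
  --conf spark.kubernetes.container.image=my-spark:2.4.0 \
  --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.checkvolume.mount.path=/checkpoint \
  --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.checkvolume.mount.readOnly=false \
  --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.checkvolume.mount.claimName=sparkstorage \
  --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.checkvolume.mount.path=/checkpoint \
  --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.checkvolume.mount.readOnly=false \
  --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.checkvolume.mount.claimName=sparkstorage \
  local:///opt/spark/examples/jars/spark-examples_2.11-2.4.0.jar
```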

As you suggested, the first thing I tried was setting spark.local.dir (or the
SPARK_LOCAL_DIRS env var) to /checkpoint, expecting the spills to go to my PVC.
However, this throws the following error:

"spark-kube-driver" is invalid:
spec.containers[0].volumeMounts[3].mountPath: Invalid value: "/checkpoint":
must be unique

It seems like it's trying to mount an emptyDir at mount point "/checkpoint",
but it can't, because "/checkpoint" is where the PVC is already mounted.

At the moment it looks to me like the emptyDir is always used for spilling
data. The question is how to mount it on the PVC, unless I'm missing something
here.
I can't really run any bigger jobs at the moment because of that. Appreciate
any feedback :)

Thanks

Tom

On Thu, 28 Feb 2019 at 17:23, Matt Cheah wrote:

> [...]


-- 
Tomasz Krol
patric...@gmail.com


Re: Spark on k8s - map persistentStorage for data spilling

2019-02-28 Thread Matt Cheah
I think we want to change the value of spark.local.dir to point to where your 
PVC is mounted. Can you give that a try and let us know if that moves the 
spills as expected?
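Concretely, that suggestion amounts to pointing Spark's scratch directory at the PVC mount path, e.g. via either of the following (the /checkpoint path matches the PVC mount discussed in this thread; everything else is a sketch, not a tested invocation):

```shell
# Sketch of the suggestion: redirect spill/shuffle scratch space to the PVC.
spark-submit \
  --conf spark.local.dir=/checkpoint \
  ...

# or equivalently via the environment variable:
export SPARK_LOCAL_DIRS=/checkpoint
```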

 

-Matt Cheah

 






Spark on k8s - map persistentStorage for data spilling

2019-02-27 Thread Tomasz Krol
Hey Guys,

I hope someone will be able to help me, as I've been stuck on this for a
while :) Basically, I am running some jobs on Kubernetes as per the documentation:

https://spark.apache.org/docs/latest/running-on-kubernetes.html

All works fine; however, if I run queries on a bigger data volume, the jobs
fail because there is not enough space in the /var/data/spark-1xxx directory.

Obviously the reason for this is that the mounted emptyDir doesn't have enough
space.

I also mounted a PVC to the driver and executor pods, which I can see during
runtime. I am wondering if someone knows how to configure data to be spilled
to a different directory (i.e. my persistent storage directory) instead of the
emptyDir with limited space. Or whether I can mount the emptyDir somehow on my
PVC. Basically, at the moment I can't run any jobs, as they are failing due to
insufficient space in that /var/data directory.

Thanks
-- 
Tomasz Krol
patric...@gmail.com