Re: Spark 3.0.1 | Volume to use For Spark Kubernetes Executor Part Files Storage

Mich Talebzadeh Mon, 08 Mar 2021 07:01:53 -0800

Hi Ranju,

In your statement:


"What is the best shared storage that can be used to collate all executors'
part files in one place?"

Are you looking for performance or durability?

In general, every executor on every node should have access to the GCP
buckets created under the project (assuming you are using a service account
to run the Spark job), for example:

gs://tmp_storage_bucket/


So create the bucket first, then try it and see if it works. Of course,
Spark needs to be aware of it (for example, the GCS connector has to be
available to the driver and the executors).
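
As a rough sketch of what I mean, assuming PySpark with the GCS connector
available: the helper names gcs_uri and write_part_files and the choice of
Parquet are my own assumptions for illustration, not something Spark
provides under those names.

```python
def gcs_uri(bucket: str, *parts: str) -> str:
    """Build a gs:// URI from a bucket name and optional path components."""
    # Drop stray slashes so "bucket/" + "/tmp/" does not produce "//".
    segments = [bucket.strip("/")]
    segments += [p.strip("/") for p in parts if p.strip("/")]
    return "gs://" + "/".join(segments)


def write_part_files(df, bucket: str, prefix: str) -> str:
    """Write a DataFrame's part files under one shared GCS prefix.

    `df` is assumed to be a pyspark.sql.DataFrame. Each executor writes
    its own part-* files under the same target directory.
    """
    target = gcs_uri(bucket, prefix)
    # Hypothetical choice: overwrite mode and Parquet output.
    df.write.mode("overwrite").parquet(target)
    return target
```

The driver can then read the same prefix back, for example with
spark.read.parquet(target), and produce the single output file from there.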


HTH


LinkedIn:
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw





*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Mon, 8 Mar 2021 at 14:46, Ranju Jain <ranju.j...@ericsson.com> wrote:

> Hi Mich,
>
>
>
> The purpose is for all Spark executors running on K8s worker nodes to write
> their processed task data [part files] to some shared storage. The Driver
> pod running on the same Kubernetes cluster will then access that shared
> storage and convert all those part files into a single file.
>
>
>
> So I am looking for Shared Storage Options available to persist the part
> files.
>
> What is the best shared storage that can be used to collate all executors'
> part files in one place?
>
>
>
> Regards
>
> Ranju
>
>
>
> *From:* Mich Talebzadeh <mich.talebza...@gmail.com>
> *Sent:* Monday, March 8, 2021 8:06 PM
> *To:* Ranju Jain <ranju.j...@ericsson.com.invalid>
> *Cc:* Attila Zsolt Piros <piros.attila.zs...@gmail.com>;
> user@spark.apache.org
> *Subject:* Re: Spark 3.0.1 | Volume to use For Spark Kubernetes Executor
> Part Files Storage
>
>
>
> If the purpose is temporary work, write it to a temporary sub-directory
> under a given bucket:
>
>
>
> spark.conf.set("temporaryGcsBucket", config['GCPVariables']['tmp_bucket'])
>
>
>
> That dict reference points to this entry in the yml file:
>
> GCPVariables:
>    tmp_bucket: "tmp_storage_bucket/tmp"
>
>
>
>
>
> Just create a temporary bucket and a sub-directory tmp underneath it:
>
>
>
> tmp_storage_bucket/tmp
>
>
>
>
>
> HTH
>
>
>
>
>
>
>
>
>
>
>
>
> On Sun, 7 Mar 2021 at 16:23, Ranju Jain <ranju.j...@ericsson.com.invalid>
> wrote:
>
> Hi,
>
>
>
> I need to save the Executors' processed data in the form of part files,
> but I think a persistent Volume is not an option for this, as Executors
> terminate after their work completes.
>
> So I am thinking of using a shared volume across executor pods.
>
>
>
> Should I go with NFS or is there any other Volume option as well to
> explore?
>
>
>
> Regards
>
> Ranju
>
>
