Hello.

Would this page help? I hope it does.

https://beam.apache.org/documentation/runners/spark/

> Running on a pre-deployed Spark cluster

1- What's spark-master-url in case of a remote cluster on Dataproc? Is 7077
the master url port?
* Yes.
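For reference, the Spark master URL takes the form spark://<host>:<port>, where 7077 is the standalone master's default port. A sketch (the host name is a placeholder for your Dataproc master node):

```shell
# Spark standalone master URL format; 7077 is the default master port.
# <master-host> is a placeholder for the Dataproc master node's
# hostname or internal IP.
--spark-master-url=spark://<master-host>:7077
```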

2- Should we ssh tunnel to sparkMasterUrl port using gcloud compute ssh?
* The job server needs to be able to reach the Spark master node on port
7077, so I believe the answer is yes.
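A hedged sketch of such a tunnel using gcloud (cluster name and zone are placeholders; Dataproc master nodes are conventionally suffixed with "-m"):

```shell
# Forward local port 7077 to the Spark master on the Dataproc master
# node. <cluster-name> and <zone> are placeholders. -N keeps the
# session open for forwarding only, without a remote shell.
gcloud compute ssh <cluster-name>-m \
  --zone=<zone> \
  -- -N -L 7077:localhost:7077
```

With the tunnel up, the job server could then be pointed at spark://localhost:7077.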

3- What's the environment_type? Can we use DOCKER? Then what's the SDK
Harness Configuration?
* This configures how your SDK harness container spins up.

https://beam.apache.org/documentation/runtime/sdk-harness-config/

For DOCKER, you will need Docker installed on all Spark worker nodes.
> User code is executed within a container started on each worker node

I used EXTERNAL when I ran this against a Flink cluster before.

e.g.
https://github.com/yuwtennis/apache-beam/blob/master/flink-session-cluster/docker/samples/src/sample.py#L14
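For illustration, a hedged sketch of submitting a Python pipeline through the portable job server. The script name is a placeholder, localhost:8099 is the job server's default endpoint, and the SDK harness image tag is an assumption that should match your installed Beam version:

```shell
# Submit via the portable job server (default endpoint localhost:8099).
# my_pipeline.py is a placeholder for your pipeline script.
# For DOCKER, environment_config names the SDK harness image; pick the
# tag matching your Beam SDK version.
python my_pipeline.py \
  --runner=PortableRunner \
  --job_endpoint=localhost:8099 \
  --environment_type=DOCKER \
  --environment_config=apache/beam_python3.8_sdk:2.31.0
```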

4- Should we run the job-server outside of the Dataproc cluster or should
we run it in the master node?
* It depends. It could run inside or outside the master node, but if you
are connecting to a fully managed service, running it outside might be
better.

https://beam.apache.org/documentation/runners/spark/

> Start JobService that will connect with the Spark master
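One way to run the job server outside the cluster is the Docker image Beam publishes. A sketch, where the master host is a placeholder (it could be the tunneled localhost endpoint from question 2):

```shell
# Run the Beam Spark job server outside the cluster and point it at
# the Spark master. <master-host> is a placeholder.
docker run --net=host apache/beam_spark_job_server:latest \
  --spark-master-url=spark://<master-host>:7077
```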

Thanks,
Yu

On Tue, Aug 10, 2021 at 7:53 PM Mahan Hosseinzadeh <[email protected]>
wrote:

> Hi,
>
> I have a Python Beam job that works on Dataflow but we would like to
> submit it on a Spark Dataproc cluster with no Flink involvement.
> I already spent days but failed to figure out how to use PortableRunner
> with the beam_spark_job_server to submit my Python Beam job to Spark
> Dataproc. All the Beam docs are about Flink and there is no guideline about
> Spark with Dataproc.
> Some relevant questions might be:
> 1- What's spark-master-url in case of a remote cluster on Dataproc? Is
> 7077 the master url port?
> 2- Should we ssh tunnel to sparkMasterUrl port using gcloud compute ssh?
> 3- What's the environment_type? Can we use DOCKER? Then what's the SDK
> Harness Configuration?
> 4- Should we run the job-server outside of the Dataproc cluster or should
> we run it in the master node?
>
> Thanks,
> Mahan
>


-- 
Yu Watanabe

linkedin: www.linkedin.com/in/yuwatanabe1/
twitter:   twitter.com/yuwtennis
