Hello. Would this page help? I hope it does.
https://beam.apache.org/documentation/runners/spark/
> Running on a pre-deployed Spark cluster

1- What's spark-master-url in case of a remote cluster on Dataproc? Is 7077 the master url port?
* Yes, 7077 is the default port of the Spark standalone master.

2- Should we ssh tunnel to sparkMasterUrl port using gcloud compute ssh?
* The job server needs to be able to communicate with the Spark master node on port 7077, so I believe the answer is yes.

3- What's the environment_type? Can we use DOCKER? Then what's the SDK Harness Configuration?
* This configures how your SDK harness container spins up:
https://beam.apache.org/documentation/runtime/sdk-harness-config/
For DOCKER, you will need Docker deployed on all Spark worker nodes:
> User code is executed within a container started on each worker node
I used EXTERNAL when I did this with a Flink cluster before, e.g.:
https://github.com/yuwtennis/apache-beam/blob/master/flink-session-cluster/docker/samples/src/sample.py#L14

4- Should we run the job-server outside of the Dataproc cluster or should we run it in the master node?
* It depends. It could be inside or outside the master node, but if you are connecting to a fully managed service, then outside might be better.
https://beam.apache.org/documentation/runners/spark/
> Start JobService that will connect with the Spark master

Thanks,
Yu

On Tue, Aug 10, 2021 at 7:53 PM Mahan Hosseinzadeh <[email protected]> wrote:

> Hi,
>
> I have a Python Beam job that works on Dataflow but we would like to
> submit it on a Spark Dataproc cluster with no Flink involvement.
> I already spent days but failed to figure out how to use PortableRunner
> with the beam_spark_job_server to submit my Python Beam job to Spark
> Dataproc. All the Beam docs are about Flink and there is no guideline about
> Spark with Dataproc.
> Some relevant questions might be:
> 1- What's spark-master-url in case of a remote cluster on Dataproc? Is
> 7077 the master url port?
> 2- Should we ssh tunnel to sparkMasterUrl port using gcloud compute ssh?
> 3- What's the environment_type? Can we use DOCKER? Then what's the SDK
> Harness Configuration?
> 4- Should we run the job-server outside of the Dataproc cluster or should
> we run it in the master node?
>
> Thanks,
> Mahan

--
Yu Watanabe
linkedin: www.linkedin.com/in/yuwatanabe1/
twitter: twitter.com/yuwtennis
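P.S. For question 2, a rough sketch of the tunnel. This is untested; `my-cluster` and the zone are placeholders for your setup, and I am relying on Dataproc naming the master node `<cluster-name>-m` by default:

```shell
# Forward local port 7077 to the Spark master port on the Dataproc
# master node (named "<cluster-name>-m" by default).
# "my-cluster" and "us-central1-a" are placeholders.
# Arguments after "--" are passed straight to ssh:
#   -N  do not run a remote command, just forward ports
#   -L  local port forwarding
gcloud compute ssh my-cluster-m \
    --zone=us-central1-a \
    -- -N -L 7077:localhost:7077
```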
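For questions 1 and 4, this is how the Beam docs show starting the job server with Docker; a sketch assuming you run it on your own machine and have made the Spark master reachable as localhost:7077 (e.g. via an ssh tunnel):

```shell
# Run the Beam Spark job server, pointing it at the Spark master.
# --net=host lets the container reach a tunnel bound on the host,
# and exposes the job server's default ports (8099 job endpoint).
docker run --net=host apache/beam_spark_job_server:latest \
    --spark-master-url=spark://localhost:7077
```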
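And for question 3, a hedged sketch of submitting the pipeline through a locally running job server with environment_type DOCKER. `my_pipeline.py` and the SDK image tag are placeholders; pick the beam_python image matching your Python and Beam versions:

```shell
# Submit through the job server (its default job endpoint port is 8099).
# With DOCKER, --environment_config is the SDK harness container image
# that each Spark worker will pull and run.
python my_pipeline.py \
    --runner=PortableRunner \
    --job_endpoint=localhost:8099 \
    --environment_type=DOCKER \
    --environment_config=apache/beam_python3.8_sdk:2.31.0
```

With EXTERNAL instead, `--environment_config` would point at an already-running worker pool endpoint, as in the sample.py linked above.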
