Hello Mahan. Sorry for the late reply.
> Still waiting for startup of environment from localhost:50000 for worker id 1-1

From the message, it seems that something is wrong with the connection between the worker node in the Spark cluster and the SDK harness. According to this slide, the runner worker (in your context, the Spark worker) should also have connectivity with the SDK harness container:

https://docs.google.com/presentation/d/1Cso0XP9dmj77OD9Bd53C1M3W1sPJF0ZnA20gzb2BPhE/edit#slide=id.g42e4c9aad6_1_0

Could you please also try setting up SSH tunneling to the Spark worker node as well? A sketch of what that could look like is below.
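Since the Spark executor on the Dataproc worker needs to reach the worker pool running on your Mac, I believe the tunnel has to go in the reverse direction (-R rather than -L), so that localhost:50000 on the worker node forwards back to your machine. A minimal sketch, assuming a worker node named <my-worker-node-w-0> (hypothetical, adjust to your cluster):

# hypothetical worker node name; reverse-forward port 50000 on the
# worker node back to the worker pool on this machine
gcloud compute ssh <my-worker-node-w-0> \
    --project <my-gcp-project> \
    --zone <my-zone> \
    -- -NR 50000:localhost:50000

Note that the SDK harness may also need to reach back to the executor's control and logging endpoints, so additional forwarding could be required, but this is the first thing I would try.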
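One more thing I noticed: your Gradle command starts the job server without a Spark master URL, so it may be defaulting to a local embedded master instead of submitting to Dataproc. The Spark runner page (https://beam.apache.org/documentation/runners/spark/) shows passing it explicitly; with your 7077 tunnel in place, that would look like:

# point the job server at the tunneled Spark master on localhost:7077
./gradlew :runners:spark:3:job-server:runShadow -PsparkMasterUrl=spark://localhost:7077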
Thanks,
Yu

On Thu, Aug 12, 2021 at 9:07 PM Mahan Hosseinzadeh <[email protected]> wrote:

> Thanks Yu for the help and the tips.
>
> I ran the following steps, but my job is stuck and can't get submitted to
> Dataproc, and I keep getting this message in the job server:
> Still waiting for startup of environment from localhost:50000 for worker
> id 1-1
>
> ---------------------------------------------------------------------------------------------------------
> *Beam code:*
> pipeline_options = PipelineOptions([
>     "--runner=PortableRunner",
>     "--job_endpoint=localhost:8099",
>     "--environment_type=EXTERNAL",
>     "--environment_config=localhost:50000"
> ])
> ---------------------------------------------------------------------------------------------------------
> *Job Server:*
> I couldn't use Docker because host networking doesn't work on Mac OS, so I
> used Gradle instead:
>
> ./gradlew :runners:spark:3:job-server:runShadow
> ---------------------------------------------------------------------------------------------------------
> *Beam Worker Pool:*
> docker run -p=50000:50000 apache/beam_python3.7_sdk --worker_pool
> ---------------------------------------------------------------------------------------------------------
> *SSH tunnel to the master node:*
> gcloud compute ssh <my-master-node-m> \
>     --project <my-gcp-project> \
>     --zone <my-zone> \
>     -- -NL 7077:localhost:7077
> ---------------------------------------------------------------------------------------------------------
>
> Thanks,
> Mahan
>
> On Tue, Aug 10, 2021 at 3:53 PM Yu Watanabe <[email protected]> wrote:
>
>> Hello,
>>
>> Would this page help? I hope it does:
>>
>> https://beam.apache.org/documentation/runners/spark/
>>
>> > Running on a pre-deployed Spark cluster
>>
>> 1- What's spark-master-url in case of a remote cluster on Dataproc? Is
>> 7077 the master url port?
>> * Yes.
>>
>> 2- Should we ssh tunnel to sparkMasterUrl port using gcloud compute ssh?
>> * The job server should be able to communicate with the Spark master
>> node on port 7077, so I believe the answer is yes.
>>
>> 3- What's the environment_type? Can we use DOCKER? Then what's the SDK
>> Harness Configuration?
>> * This is the configuration of how you want your harness container to
>> spin up:
>>
>> https://beam.apache.org/documentation/runtime/sdk-harness-config/
>>
>> For DOCKER, you will need Docker deployed on all Spark worker nodes.
>> > User code is executed within a container started on each worker node
>>
>> I used EXTERNAL when I did this with a Flink cluster before, e.g.:
>>
>> https://github.com/yuwtennis/apache-beam/blob/master/flink-session-cluster/docker/samples/src/sample.py#L14
>>
>> 4- Should we run the job-server outside of the Dataproc cluster or
>> should we run it in the master node?
>> * It depends. It could be inside or outside the master node, but if you
>> are connecting to a fully managed service, then outside might be better:
>>
>> https://beam.apache.org/documentation/runners/spark/
>>
>> > Start JobService that will connect with the Spark master
>>
>> Thanks,
>> Yu
>>
>> On Tue, Aug 10, 2021 at 7:53 PM Mahan Hosseinzadeh <[email protected]>
>> wrote:
>>
>>> Hi,
>>>
>>> I have a Python Beam job that works on Dataflow, but we would like to
>>> submit it to a Spark Dataproc cluster with no Flink involvement.
>>> I already spent days but failed to figure out how to use PortableRunner
>>> with the beam_spark_job_server to submit my Python Beam job to Spark on
>>> Dataproc. All the Beam docs are about Flink, and there is no guideline
>>> about Spark with Dataproc.
>>> Some relevant questions might be:
>>> 1- What's spark-master-url in case of a remote cluster on Dataproc? Is
>>> 7077 the master url port?
>>> 2- Should we ssh tunnel to sparkMasterUrl port using gcloud compute ssh?
>>> 3- What's the environment_type? Can we use DOCKER? Then what's the SDK
>>> Harness Configuration?
>>> 4- Should we run the job-server outside of the Dataproc cluster or
>>> should we run it in the master node?
>>>
>>> Thanks,
>>> Mahan
>>>
>>
>> --
>> Yu Watanabe
>>
>> linkedin: www.linkedin.com/in/yuwatanabe1/
>> twitter: twitter.com/yuwtennis
>>

--
Yu Watanabe

linkedin: www.linkedin.com/in/yuwatanabe1/
twitter: twitter.com/yuwtennis
