Hello Mahan. Sorry for the late reply.
> Still waiting for startup of environment from localhost:50000 for worker id 1-1

From the message, it seems that something is wrong with the connection between the worker node in the Spark cluster and the SDK harness. According to this slide, the runner worker (in your context, the Spark worker) should also have connectivity with the SDK harness container:

https://docs.google.com/presentation/d/1Cso0XP9dmj77OD9Bd53C1M3W1sPJF0ZnA20gzb2BPhE/edit#slide=id.g42e4c9aad6_1_0

Could you please also try setting up SSH tunneling to the Spark worker node as well? A sketch of what that could look like is below.
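Since the Spark executor on the Dataproc worker needs to reach the worker pool running on your Mac, I believe the tunnel has to go in the reverse direction (-R rather than -L), so that localhost:50000 on the worker node forwards back to your machine. A minimal sketch, assuming a worker node named <my-worker-node-w-0> (hypothetical, adjust to your cluster):

# hypothetical worker node name; reverse-forward port 50000 on the
# worker node back to the worker pool on this machine
gcloud compute ssh <my-worker-node-w-0> \
    --project <my-gcp-project> \
    --zone <my-zone> \
    -- -NR 50000:localhost:50000

Note that the SDK harness may also need to reach back to the executor's control and logging endpoints, so additional forwarding could be required, but this is the first thing I would try.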
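One more thing I noticed: your Gradle command starts the job server without a Spark master URL, so it may be defaulting to a local embedded master instead of submitting to Dataproc. The Spark runner page (https://beam.apache.org/documentation/runners/spark/) shows passing it explicitly; with your 7077 tunnel in place, that would look like:

# point the job server at the tunneled Spark master on localhost:7077
./gradlew :runners:spark:3:job-server:runShadow -PsparkMasterUrl=spark://localhost:7077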
Thanks,
Yu

On Thu, Aug 12, 2021 at 9:07 PM Mahan Hosseinzadeh <[email protected]> wrote:

> Thanks Yu for the help and the tips.
>
> I ran the following steps, but my job is stuck and can't get submitted to
> Dataproc, and I keep getting this message in the job server:
> Still waiting for startup of environment from localhost:50000 for worker
> id 1-1
>
> ---------------------------------------------------------------------------------------------------------
> *Beam code:*
> pipeline_options = PipelineOptions([
>     "--runner=PortableRunner",
>     "--job_endpoint=localhost:8099",
>     "--environment_type=EXTERNAL",
>     "--environment_config=localhost:50000"
> ])
> ---------------------------------------------------------------------------------------------------------
> *Job Server:*
> I couldn't use Docker because host networking doesn't work on Mac OS, so I
> used Gradle instead:
>
> ./gradlew :runners:spark:3:job-server:runShadow
> ---------------------------------------------------------------------------------------------------------
> *Beam Worker Pool:*
> docker run -p=50000:50000 apache/beam_python3.7_sdk --worker_pool
> ---------------------------------------------------------------------------------------------------------
> *SSH tunnel to the master node:*
> gcloud compute ssh <my-master-node-m> \
>     --project <my-gcp-project> \
>     --zone <my-zone> \
>     -- -NL 7077:localhost:7077
> ---------------------------------------------------------------------------------------------------------
>
> Thanks,
> Mahan
>
> On Tue, Aug 10, 2021 at 3:53 PM Yu Watanabe <[email protected]> wrote:
>
>> Hello,
>>
>> Would this page help? I hope it does:
>>
>> https://beam.apache.org/documentation/runners/spark/
>>
>> > Running on a pre-deployed Spark cluster
>>
>> 1- What's spark-master-url in case of a remote cluster on Dataproc? Is
>> 7077 the master url port?
>> * Yes.
>>
>> 2- Should we ssh tunnel to sparkMasterUrl port using gcloud compute ssh?
>> * The job server should be able to communicate with the Spark master
>> node on port 7077, so I believe the answer is yes.
>>
>> 3- What's the environment_type? Can we use DOCKER? Then what's the SDK
>> Harness Configuration?
>> * This is the configuration of how you want your harness container to
>> spin up:
>>
>> https://beam.apache.org/documentation/runtime/sdk-harness-config/
>>
>> For DOCKER, you will need Docker deployed on all Spark worker nodes.
>> > User code is executed within a container started on each worker node
>>
>> I used EXTERNAL when I did this with a Flink cluster before, e.g.:
>>
>> https://github.com/yuwtennis/apache-beam/blob/master/flink-session-cluster/docker/samples/src/sample.py#L14
>>
>> 4- Should we run the job-server outside of the Dataproc cluster or
>> should we run it in the master node?
>> * It depends. It could be inside or outside the master node, but if you
>> are connecting to a fully managed service, then outside might be better:
>>
>> https://beam.apache.org/documentation/runners/spark/
>>
>> > Start JobService that will connect with the Spark master
>>
>> Thanks,
>> Yu
>>
>> On Tue, Aug 10, 2021 at 7:53 PM Mahan Hosseinzadeh <[email protected]>
>> wrote:
>>
>>> Hi,
>>>
>>> I have a Python Beam job that works on Dataflow, but we would like to
>>> submit it to a Spark Dataproc cluster with no Flink involvement.
>>> I already spent days but failed to figure out how to use PortableRunner
>>> with the beam_spark_job_server to submit my Python Beam job to Spark on
>>> Dataproc. All the Beam docs are about Flink, and there is no guideline
>>> about Spark with Dataproc.
>>> Some relevant questions might be:
>>> 1- What's spark-master-url in case of a remote cluster on Dataproc? Is
>>> 7077 the master url port?
>>> 2- Should we ssh tunnel to sparkMasterUrl port using gcloud compute ssh?
>>> 3- What's the environment_type? Can we use DOCKER? Then what's the SDK
>>> Harness Configuration?
>>> 4- Should we run the job-server outside of the Dataproc cluster or
>>> should we run it in the master node?
>>>
>>> Thanks,
>>> Mahan
>>>
>>
>> --
>> Yu Watanabe
>>
>> linkedin: www.linkedin.com/in/yuwatanabe1/
>> twitter: twitter.com/yuwtennis
>>

--
Yu Watanabe

linkedin: www.linkedin.com/in/yuwatanabe1/
twitter: twitter.com/yuwtennis
