> 2. Is it possible to pre-run SDK Harness containers and reuse them for
> every Portable Runner pipeline? I could save quite a lot of time on this
> for more complicated pipelines.

Yes, you can start Docker containers beforehand using the worker_pool
option:

docker run -p 50000:50000 apachebeam/python3.7_sdk --worker_pool  # or some other port publishing

and then in your pipeline options set:

--environment_type=EXTERNAL --environment_config=localhost:50000
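
If you prefer to set this from Java code rather than on the command line,
something like the following should work (a rough sketch; it assumes the
standard PortablePipelineOptions interface, and localhost:50000 is just the
example address from above):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.PortablePipelineOptions;

// Point the runner at the pre-started worker pool instead of letting it
// launch a new Docker container for every stage.
PortablePipelineOptions options =
    PipelineOptionsFactory.fromArgs(args).as(PortablePipelineOptions.class);
options.setDefaultEnvironmentType("EXTERNAL");
options.setDefaultEnvironmentConfig("localhost:50000");
Pipeline pipeline = Pipeline.create(options);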

On Fri, May 15, 2020 at 11:47 AM Alexey Romanenko <[email protected]>
wrote:

> Hello,
>
> I’m trying to optimize my pipeline’s runtime when running it with the
> Portable Runner, and I have some related questions.
>
> This is a cross-language pipeline, written with the Java SDK, which executes
> some Python code through the “External.of()” transform and my custom Python
> Expansion Service. I use Docker-based SDK Harnesses for Java and Python. In
> a primitive form, the pipeline looks like this:
>
>
> [Source (Java)] -> [MyTransform1 (Java)] -> [External (Execute Python
> code with Python SDK)] -> [MyTransform2 (Java SDK)]
>
>
>
> While running this pipeline with the Portable Spark Runner, I see that we
> spend quite a lot of time on artifact staging (our real pipeline has quite
> a lot of artifacts) and on launching a Docker container for every Spark
> stage. So, my questions are the following:
>
> 1. Is there any internal Beam functionality to pre-stage, or at least
> cache, already-staged artifacts? Since the same pipeline will be executed
> many times in a row, there is no reason to stage the same artifacts on
> every run.
>
> 2. Is it possible to pre-run SDK Harness containers and reuse them for
> every Portable Runner pipeline? I could save quite a lot of time on this
> for more complicated pipelines.
>
>
>
> Well, I guess I can find some workarounds for this, but I wanted to ask
> first in case there is a better way to do it in Beam.
>
>
> Regards,
> Alexey
