> 2. Is it possible to pre-run SDK Harness containers and reuse them for every Portable Runner pipeline? I could save quite a lot of time on this for more complicated pipelines.
Yes, you can start Docker containers beforehand using the worker_pool option:

    docker run -p=50000:50000 apache/beam_python3.7_sdk --worker_pool
    # or some other port publishing

and then set the following in your pipeline options:

    --environment_type=EXTERNAL --environment_config=localhost:50000

(A minimal client-side sketch showing these flags in use is at the bottom of this message.)

On Fri, May 15, 2020 at 11:47 AM Alexey Romanenko <[email protected]> wrote:

> Hello,
>
> I'm trying to optimize my pipeline's runtime with the Portable Runner,
> and I have some related questions.
>
> This is a cross-language pipeline, written in the Java SDK, which
> executes some Python code through the "External.of()" transform and my
> custom Python Expansion Service. I use Docker-based SDK Harnesses for
> Java and Python. In a primitive form, the pipeline looks like this:
>
> [Source (Java)] -> [MyTransform1 (Java)] -> [External (execute Python
> code with the Python SDK)] -> [MyTransform2 (Java SDK)]
>
> While running this pipeline with the Portable Spark Runner, I see that
> quite a lot of time is spent staging artifacts (our real pipeline has
> quite a lot of them) and launching a Docker container for every Spark
> stage. So, my questions are the following:
>
> 1. Is there any internal Beam functionality to pre-stage, or at least
> cache, already staged artifacts? Since the same pipeline will be executed
> many times in a row, there is no reason to stage the same artifacts on
> every run.
>
> 2. Is it possible to pre-run SDK Harness containers and reuse them for
> every Portable Runner pipeline? I could save quite a lot of time on this
> for more complicated pipelines.
>
> Well, I guess I can find some workarounds for this, but I wanted to ask
> first in case there is a better way to do it in Beam.
>
> Regards,
> Alexey
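For reference, here is a minimal sketch of the client side, assuming a pure-Python driver for illustration (in your cross-language Java pipeline, the same two environment flags apply to the Python environment). The job_endpoint address and the toy transforms are placeholders, not part of your setup:

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # Attach to the pre-started worker pool instead of letting the
    # runner launch a new SDK container for every stage.
    # localhost:8099 is a placeholder for your job server endpoint.
    options = PipelineOptions([
        "--runner=PortableRunner",
        "--job_endpoint=localhost:8099",
        "--environment_type=EXTERNAL",
        "--environment_config=localhost:50000",
    ])

    with beam.Pipeline(options=options) as p:
        (p
         | "Create" >> beam.Create(["hello", "world"])
         | "Print" >> beam.Map(print))

Since the worker pool outlives any one job, every pipeline submitted with these flags reuses the same harness processes, which is what saves the per-stage container startup you're seeing.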
