Re: [DISCUSS] Client SDK/Job Server/Worker Pool Lifecycle Management on Kubernetes

2021-06-03 Thread Ke Wu
There is also a talk [1] that introduces dynamically scaling a stream processing job at LinkedIn with the Samza runner as well. [1] https://www.usenix.org/conference/hotcloud20/presentation/singh

Re: [DISCUSS] Client SDK/Job Server/Worker Pool Lifecycle Management on Kubernetes

2021-06-03 Thread Kyle Weaver
> However, not all runners follow the pattern where a predefined number of workers are brought up before job submission. For example, for the Samza runner, the number of workers needed for a job is determined after job submission happens, in which case, in the Samza worker Pod, which is

Re: [DISCUSS] Client SDK/Job Server/Worker Pool Lifecycle Management on Kubernetes

2021-06-02 Thread Ke Wu
Very good point. We are actually talking about the same high-level approach, where the Task Manager Pod has two containers running inside: one is the task manager container while the other is the worker pool service container. I believe the disconnect probably lies in how a job is being deployed/started. In
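(Sketch, not from the thread: a Java portable pipeline pointed at a worker pool sidecar in the same Pod. The EXTERNAL environment type and the PortablePipelineOptions setters are existing Beam APIs; the localhost:50000 address is an assumed example.)

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.PortablePipelineOptions;

public class ExternalEnvironmentExample {
  public static void main(String[] args) {
    PortablePipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).as(PortablePipelineOptions.class);
    // Do not let the runner spawn SDK workers itself; instead connect to the
    // worker pool service container running alongside it in the same Pod.
    options.setDefaultEnvironmentType("EXTERNAL");
    options.setDefaultEnvironmentConfig("localhost:50000"); // assumed sidecar address
    Pipeline pipeline = Pipeline.create(options);
    // ... build the pipeline and run it against the portable runner as usual ...
  }
}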

Re: [DISCUSS] Client SDK/Job Server/Worker Pool Lifecycle Management on Kubernetes

2021-06-02 Thread Kyle Weaver
> Therefore, if we bring up the external worker pool container together with the runner container, which is one of the supported approaches by the Flink Runner on K8s Exactly which approach are you talking about in the doc? I feel like there could be some misunderstanding here. Here is the configura

Re: [DISCUSS] Client SDK/Job Server/Worker Pool Lifecycle Management on Kubernetes

2021-06-02 Thread Ke Wu
I do agree that it usually takes longer for the runner to get to the point of connecting than for the external worker to become available; I suppose that is probably why the external worker pool works the current way. However, I am not 100% confident saying it won’t happen in practice, because the design does s

Re: [DISCUSS] Client SDK/Job Server/Worker Pool Lifecycle Management on Kubernetes

2021-06-02 Thread Kyle Weaver
As far as I'm aware there's nothing strictly guaranteeing the worker pool has been started. But in practice it takes a while for the job to start up - the pipeline needs to be constructed, sent to the job server, translated, and then the runner needs to start the job, etc. before the external envir

Re: [DISCUSS] Client SDK/Job Server/Worker Pool Lifecycle Management on Kubernetes

2021-06-02 Thread Ke Wu
Hi Kyle, Thanks for reviewing https://github.com/apache/beam/pull/14923. I would like to follow up on the deadline & waitForReady on ExternalEnvironment here. In Kubernetes, if my understanding is correct, there is no ordering support when bringing
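(For illustration only: what deadline & waitForReady look like on a grpc-java stub. This uses the standard gRPC health-check service as a stand-in for the worker pool RPC and is not Beam's ExternalEnvironment code; the address and timeout are assumed values.)

import java.util.concurrent.TimeUnit;
import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;
import io.grpc.health.v1.HealthCheckRequest;
import io.grpc.health.v1.HealthGrpc;

public class WaitForReadyExample {
  public static void main(String[] args) {
    // Address of the worker pool sidecar; localhost:50000 is only an example.
    ManagedChannel channel =
        ManagedChannelBuilder.forAddress("localhost", 50000).usePlaintext().build();
    HealthGrpc.HealthBlockingStub stub =
        HealthGrpc.newBlockingStub(channel)
            .withWaitForReady()                      // queue the RPC while the sidecar is still starting
            .withDeadlineAfter(2, TimeUnit.MINUTES); // but fail instead of hanging forever
    // Blocks until the endpoint becomes ready or the deadline expires.
    stub.check(HealthCheckRequest.newBuilder().setService("").build());
    channel.shutdown();
  }
}

With waitForReady, an RPC issued before the sidecar container is up is held until the channel becomes READY rather than failing fast, so container startup ordering inside the Pod stops mattering as long as a deadline bounds the wait.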

Re: [DISCUSS] Client SDK/Job Server/Worker Pool Lifecycle Management on Kubernetes

2021-05-27 Thread Ke Wu
Good to know. We are working on running a Java portable pipeline for the Samza runner, and I believe we could take on the task of enhancing the Java workflow to support timeout/retry etc. on gRPC calls. Created BEAM-12419 to track the work. Best, Ke
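(A rough sketch of the kind of timeout/retry wrapper such an enhancement could add around the client-to-job-server gRPC calls; this is a hypothetical helper for illustration, not existing Beam code.)

import java.util.concurrent.Callable;
import java.util.concurrent.TimeUnit;

public final class Retries {
  /** Retries the call with a fixed backoff until it succeeds or attempts run out. */
  public static <T> T withRetry(Callable<T> call, int maxAttempts, long backoffMillis)
      throws Exception {
    Exception last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return call.call();
      } catch (Exception e) {
        last = e;
        if (attempt < maxAttempts) {
          TimeUnit.MILLISECONDS.sleep(backoffMillis);
        }
      }
    }
    throw last;
  }
}

Combined with a per-attempt gRPC deadline on the stub (e.g. withDeadlineAfter(60, TimeUnit.SECONDS)), no single call can block indefinitely, which mirrors the 60s default mentioned elsewhere in the thread.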

Re: [DISCUSS] Client SDK/Job Server/Worker Pool Lifecycle Management on Kubernetes

2021-05-27 Thread Kyle Weaver
I don't think there's any specific reason we don't set a timeout; I'm guessing it was just never worth the effort of implementing. If it's stuck, it should be pretty obvious from the logs: "Still waiting for startup of environment from {} for worker id {}"

Re: [DISCUSS] Client SDK/Job Server/Worker Pool Lifecycle Management on Kubernetes

2021-05-27 Thread Ke Wu
Hi Kyle, Thank you for the prompt response, and apologies for the late reply. [1] seems to be only available in the Python portable_runner but not the Java PortableRunner; is this intended, or could we add similar changes in Java as well? [2] makes sense to block since the wait/retry is handled in the pre

Re: [DISCUSS] Client SDK/Job Server/Worker Pool Lifecycle Management on Kubernetes

2021-05-14 Thread Kyle Weaver
1. and 2. are both facilitated by GRPC, which takes care of most of the retry/wait logic. In some places we have a configurable timeout (which defaults to 60s) [1], while in other places we block [2][3]. [1] https://issues.apache.org/jira/browse/BEAM-7933 [2] https://github.com/apache/beam/blob/51

[DISCUSS] Client SDK/Job Server/Worker Pool Lifecycle Management on Kubernetes

2021-05-14 Thread Ke Wu
Hello All, I came across this question when I was reading about Beam on Flink on Kubernetes and flink-on-k8s-operator