Hi Eugene!
I’m struggling to find complete documentation on how to do this. There
seems to be lots of conflicting or incomplete information: several ways to
deploy Flink, several ways to get Beam working with it, bizarre
StackOverflow questions, and no documentation explaining a complete working
example.
This *is* possible, and I went through all the same frustrations of sparse
and confusing documentation. I'm glossing over a lot of details, but the
key thing was setting up the Flink taskmanager(s) to run Docker. This
requires running docker-in-docker, since the taskmanager itself is a Docker
container in k8s.
First, create a custom Flink image with Docker installed:
# docker-flink Dockerfile
FROM flink:1.10
# install docker
RUN apt-get ...
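To make that concrete, here's a fuller sketch of that Dockerfile (the exact
packages are my assumption; only the docker CLI client is strictly needed,
since the daemon itself runs in the dind sidecar described next):

# docker-flink Dockerfile (sketch; flink:1.10 is Debian-based, so apt works)
FROM flink:1.10
# install the docker client only; the docker daemon runs in the
# docker-in-docker sidecar
RUN apt-get update && \
    apt-get install -y --no-install-recommends docker.io && \
    rm -rf /var/lib/apt/lists/*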
Then set up the taskmanager deployment with a docker-in-docker (dind)
sidecar service. This dind service is where the Python SDK harness
container actually runs:
kind: Deployment
...
containers:
- name: docker
image: docker:19.03.5-dind
...
- name: taskmanager
image: myregistry:5000/docker-flink:1.10
env:
- name: DOCKER_HOST
value: tcp://localhost:2375
...
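Fleshed out a bit, the relevant part of the deployment looks roughly like
this (a sketch; the privileged flag and the DOCKER_TLS_CERTDIR override are
assumptions from my setup - they are what make the plain
tcp://localhost:2375 endpoint work with the 19.03 dind image):

kind: Deployment
apiVersion: apps/v1
metadata:
  name: flink-taskmanager
spec:
  template:
    spec:
      containers:
      # docker-in-docker sidecar that hosts the python sdk harness containers
      - name: docker
        image: docker:19.03.5-dind
        securityContext:
          privileged: true        # dind must run as a privileged container
        env:
        - name: DOCKER_TLS_CERTDIR
          value: ""               # disable TLS so dind listens on tcp://:2375
      # the flink taskmanager, pointed at the sidecar's docker daemon
      - name: taskmanager
        image: myregistry:5000/docker-flink:1.10
        args: ["taskmanager"]
        env:
        - name: DOCKER_HOST
          value: tcp://localhost:2375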
I quickly threw all these pieces together in a repo here:
https://github.com/sambvfx/beam-flink-k8s
I added a step-by-step walkthrough (verified via minikube) to the README to
prove to myself that I didn't miss anything, but feel free to submit PRs if
you want to add anything useful.
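With the cluster up, submitting a pipeline is then just the usual
FlinkRunner invocation; something along these lines (the SDK image name is
a placeholder, the flags are the standard Beam portable runner options, and
localhost:8081 assumes you've port-forwarded the jobmanager's REST port):

python -m apache_beam.examples.wordcount \
  --runner=FlinkRunner \
  --flink_master=localhost:8081 \
  --environment_type=DOCKER \
  --environment_config=myregistry:5000/beam-python-sdk:latest \
  --output=/tmp/counts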
The documents you linked are very informative. It would be great to
aggregate all this into digestible documentation. Let me know if you have
any further questions!
Cheers,
Sam
On Thu, Aug 27, 2020 at 10:25 AM Eugene Kirpichov <[email protected]>
wrote:
> Hi Kyle,
>
> Thanks for the response!
>
> On Wed, Aug 26, 2020 at 5:28 PM Kyle Weaver <[email protected]> wrote:
>
>> > - With the Flink operator, I was able to submit a Beam job, but hit the
>> issue that I need Docker installed on my Flink nodes. I haven't yet tried
>> changing the operator's yaml files to add Docker inside them.
>>
>> Running Beam workers via Docker on the Flink nodes is not recommended
>> (and probably not even possible), since the Flink nodes are themselves
>> already running inside Docker containers. Running workers as sidecars
>> avoids that problem. For example:
>> https://github.com/GoogleCloudPlatform/flink-on-k8s-operator/blob/master/examples/beam/with_job_server/beam_flink_cluster.yaml#L17-L20
>>
>> The main problem with the sidecar approach is that I can't use the Flink
> cluster as a "service" for anybody to submit their jobs with custom
> containers - the container version is fixed.
> Do I understand it correctly?
> Seems like the Docker-in-Docker approach is viable, and is mentioned in
> the Beam Flink K8s design doc
> <https://docs.google.com/document/d/1z3LNrRtr8kkiFHonZ5JJM_L4NWNBBNcqRc_yAf6G0VI/edit#heading=h.dtj1gnks47dq>
> .
>
>
>> > I also haven't tried this
>> <https://github.com/GoogleCloudPlatform/flink-on-k8s-operator/blob/master/docs/beam_guide.md>
>> yet
>> because it implies submitting jobs using "kubectl apply" which is weird -
>> why not just submit it through the Flink job server?
>>
>> I'm guessing it goes through k8s for monitoring purposes. I see no reason
>> it shouldn't be possible to submit to the job server directly through
>> Python, network permitting, though I haven't tried this.
>>
>>
>>
>> On Wed, Aug 26, 2020 at 4:10 PM Eugene Kirpichov <[email protected]>
>> wrote:
>>
>>> Hi folks,
>>>
>>> I'm still working with Pachama <https://pachama.com/> right now; we
>>> have a Kubernetes Engine cluster on GCP and want to run Beam Python batch
>>> pipelines with custom containers against it.
>>> Flink and Cloud Dataflow are the two options; Cloud Dataflow doesn't
>>> support custom containers for batch pipelines yet, so we're going with Flink.
>>>
>>> I'm struggling to find complete documentation on how to do this. There
>>> seems to be lots of conflicting or incomplete information: several ways to
>>> deploy Flink, several ways to get Beam working with it, bizarre
>>> StackOverflow questions, and no documentation explaining a complete working
>>> example.
>>>
>>> == My requests ==
>>> * Could people briefly share their working setup? Would be good to know
>>> which directions are promising.
>>> * It would be particularly helpful if someone could volunteer an hour of
>>> their time to talk to me about their working Beam/Flink/k8s setup. It's for
>>> a good cause (fixing the planet :) ) and on my side I volunteer to write up
>>> the findings to share with the community so others suffer less.
>>>
>>> == Appendix: My findings so far ==
>>> There are multiple ways to deploy Flink on k8s:
>>> - The GCP marketplace Flink operator
>>> <https://cloud.google.com/blog/products/data-analytics/open-source-processing-engines-for-kubernetes>
>>> (couldn't
>>> get it to work) and the respective CLI version
>>> <https://github.com/GoogleCloudPlatform/flink-on-k8s-operator> (buggy,
>>> but I got it working)
>>> - https://github.com/lyft/flinkk8soperator (haven't tried)
>>> - Flink's native k8s support
>>> <https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/deployment/native_kubernetes.html>
>>> (super
>>> easy to get working)
>>>
>>> I confirmed that my Flink cluster was operational by running a simple
>>> Wordcount job, initiated from my machine. However I wasn't yet able to get
>>> Beam working:
>>>
>>> - With the Flink operator, I was able to submit a Beam job, but hit the
>>> issue that I need Docker installed on my Flink nodes. I haven't yet tried
>>> changing the operator's yaml files to add Docker inside them. I also
>>> haven't tried this
>>> <https://github.com/GoogleCloudPlatform/flink-on-k8s-operator/blob/master/docs/beam_guide.md>
>>> yet because it implies submitting jobs using "kubectl apply" which is
>>> weird - why not just submit it through the Flink job server?
>>>
>>> - With Flink's native k8s support, I tried two things:
>>> - Creating a fat portable jar using --output_executable_path. The jar
>>> is huge (200+MB) and takes forever to upload to my Flink cluster - this is
>>> a non-starter. But if I actually upload it, then I hit the same issue with
>>> lacking Docker. Haven't tried fixing it yet.
>>> - Simply running my pipeline --runner=FlinkRunner
>>> --environment_type=DOCKER --flink_master=$PUBLIC_IP:8081. The Java process
>>> appears to send 1+GB of data to somewhere, but the job never even starts.
>>>
>>> I looked at a few conference talks:
>>> -
>>> https://www.cncf.io/wp-content/uploads/2020/02/CNCF-Webinar_-Apache-Flink-on-Kubernetes-Operator-1.pdf
>>> - seems to imply that I need to add a Beam worker "sidecar" to the Flink
>>> workers; and that I need to submit my job using "kubectl apply".
>>> - https://www.youtube.com/watch?v=8k1iezoc5Sc which mentions the
>>> sidecar as well as the fat jar option
>>>
>>> --
>>> Eugene Kirpichov
>>> http://www.linkedin.com/in/eugenekirpichov
>>>
>>
>
> --
> Eugene Kirpichov
> http://www.linkedin.com/in/eugenekirpichov
>