Hi all,
We are currently using Beam to create a few pipelines. We deploy some of them 
on our on-prem Flink cluster and some on GCP Dataflow, and we have a few 
questions regarding automating the pipeline deployments:

Flink deployment:
We are currently running Beam as a k8s pod which starts a Java process for 
each pipeline with the FlinkRunner parameters; the Java processes deploy the 
pipelines as Flink jobs (running on a Flink cluster deployed on k8s as well).
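
For reference, a sketch of the per-pipeline launch command described above 
(the class name, jar path, and Flink address are placeholders, not our actual 
values):

```
# Sketch: one such Java process, submitting a pipeline to Flink via FlinkRunner.
# --flinkMaster is Beam's FlinkPipelineOptions option pointing at the JobManager.
java -cp "/opt/apache/beam/jars/*" com.example.MyPipeline \
  --runner=FlinkRunner \
  --flinkMaster=flink-jobmanager:8081 \
  --parallelism=4
```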
The main issue we have with the current deployment is that every time the 
Beam pod gets restarted, it re-creates the Flink jobs, which causes 
duplications and errors.
We were wondering what the recommended way is to automate deploying Beam 
pipelines to Flink. Is there any documentation on this?
We also tried generating jars from the Beam pipelines and then running them 
on Flink, but we had issues with the fat jars: they included some Flink 
libraries, so we were not able to run them as-is on Flink. Is there a way to 
get a 'leaner' fat jar?
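
To illustrate the kind of 'leaner' jar we mean, a sketch assuming a Maven 
build: Flink's own libraries scoped as provided so they stay out of the shaded 
jar and the cluster's copies are used instead (artifact id and version are 
illustrative, not our actual build):

```xml
<!-- Sketch, assuming a Maven build: keep Flink's runtime libraries out of the
     fat jar by marking them provided; the Flink cluster supplies them. -->
<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-streaming-java_2.12</artifactId>
  <version>${flink.version}</version>
  <scope>provided</scope>
</dependency>
```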


Dataflow deployment:
We are trying to deploy and run a pipeline in Dataflow. We have a jar with a 
pipeline that runs locally and we are deploying it using:

java $JVM_OPT -Dlog4j.configuration=file:log4j.properties -cp 
"/opt/apache/beam/jars/*" XXX.PipelinePubSubToBigQuery --runner=DataflowRunner 
--project=<our project> --region=<our region>  --stagingLocation=<CS staging 
bucket> --tempLocation=<CS temp bucket> --templateLocation=<CS location path> &

We noticed that it uploads over 200 jars to CS, and even a minor change in 
one jar results in a ~15-minute deployment.

We have a few questions:

  1.  What is the recommended way of handling the deployment and job creation? 
Should we start the deployment from a microservice, or is there another 
preferred way?
  2.  How do we programmatically start running the job after the deployment?
  3.  Prior to using Dataflow, our pipeline ended with 
result.waitUntilFinish(), but with Dataflow we get an exception about a 
missing jobId. Should we still call this method?
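
To clarify what we mean in question 2 by starting the job: as we understand 
it, the template staged by --templateLocation would then be launched as a job 
with something like the following (job name, template path, and pipeline 
parameters are placeholders):

```
# Sketch: launching the staged template as a Dataflow job.
gcloud dataflow jobs run my-job \
  --gcs-location=<CS location path> \
  --region=<our region> \
  --parameters=inputTopic=...,outputTable=...
```

Is this the intended flow, or should the launch be done through the Dataflow 
API instead?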
Any help regarding this subject would be appreciated.
Thanks,
Noa

