Hi Folks,

I'm using a patched version of Apache Beam 2.35 (Python) and running on Flink
on Kubernetes using the PortableRunner.
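
For context, this is roughly how we submit the pipeline (a simplified sketch;
the endpoint addresses and environment settings below are placeholders, not
our exact config):

    from apache_beam import Pipeline
    from apache_beam.options.pipeline_options import PipelineOptions

    # Sketch of our submission path; host/port values are placeholders.
    options = PipelineOptions([
        '--runner=PortableRunner',
        '--job_endpoint=flink-jobserver:8099',  # the JobServer we already run
        '--environment_type=EXTERNAL',          # placeholder, not our exact setup
        '--environment_config=localhost:50000',
    ])

    with Pipeline(options=options) as p:
        ...  # pipeline definition elided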

It looks like when submitting the job, the runner tries to upload a large
(274 MB) Flink job server jar to the staging service.

This doesn't seem right: I already have an instance of the JobServer
running, and my program is talking to it, so why is the runner trying to
stage the JobServer jar?

I believe this problem started when I upgraded to 2.35.

When I debugged this, I found the runner was getting stuck in offer_artifacts:
https://github.com/apache/beam/blob/38b21a71a76aad902bd903d525b25a5ff464df55/sdks/python/apache_beam/runners/portability/artifact_service.py#L235

When I looked at the rolePayload of the request in question, it was:
rolePayload=beam-runners-flink-job-server-quH8FRP1-liJ7K9es-qBV9wguz4oBRlVegwqlW3OpqU.jar

How does the runner decide which artifacts to upload? Could this be caused
by using a patched version of the Python SDK with an unpatched job server
jar, i.e. the Python version (2.35.0.dev2) doesn't match the JobServer
version (2.35.0), so the portable runner thinks it needs to upload the jar?
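
For anyone trying to reproduce, this is roughly how I listed the artifacts
the SDK declares as environment dependencies before submission (a sketch;
"pipeline" here is the beam.Pipeline object, and the role URN string is what
I believe 2.35 uses):

    from apache_beam.portability.api import beam_runner_api_pb2

    # Dump every artifact each environment declares, decoding the staged
    # name where the artifact role is "staging_to".
    proto = pipeline.to_runner_api()
    for env_id, env in proto.components.environments.items():
        for dep in env.dependencies:
            if dep.role_urn == 'beam:artifact:role:staging_to:v1':
                payload = beam_runner_api_pb2.ArtifactStagingToRolePayload()
                payload.ParseFromString(dep.role_payload)
                print(env_id, dep.type_urn, payload.staged_name)

I'd expect the job server jar to show up in this list if the SDK itself is
adding it as a dependency, rather than the JobServer requesting it.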

We build our own version of the Python SDK because we need a fix for
https://issues.apache.org/jira/browse/BEAM-12244.

When we were using 2.33, we were also building our own Flink jars in order
to pull in Kafka patches that hadn't been released yet.

Thanks
J
