Hi,
I know that there are various options for configuring Flink using Beam
with the Java SDK, but are there any options to do the same with the
Python SDK? The FlinkRunnerOptions class offers only a fraction of what
the Java FlinkPipelineOptions class provides. I would like to be able to
set the parallelism when I submit an Uber JAR, and I would also like to
be able to set the task retry count.
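For context, this is roughly how I construct the options from Python at the
moment; the Flink address is a placeholder and the pipeline is just a stand-in
for my real job:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder values; this only shows the shape of what I am doing.
options = PipelineOptions([
    '--runner=FlinkRunner',
    '--flink_master=flink-jobmanager:8081',  # placeholder address
    '--flink_submit_uber_jar',
    '--parallelism=100',                     # the value I would like to take effect
])

with beam.Pipeline(options=options) as p:
    p | beam.Create([1, 2, 3]) | beam.Map(print)

# I could not find any Python flag corresponding to the task retry count
# (numberOfExecutionRetries in the Java FlinkPipelineOptions).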
On 30/11/2021 17:49, Janek Bevendorff wrote:
Again, one step closer to getting this thing running:
To submit a job to a remote job server, I have to set --artifact_endpoint as
well, not just --job_endpoint. It would be great if the docs at least
mentioned that.
I also don't really understand why there is no single option for setting both
the job and artifact endpoint address (without the port number). They must
both run in the same container, otherwise I get errors about invalid job
staging IDs, so having two separate options is somewhat redundant.
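For reference, this is roughly the combination that works for me now; the host
name is a placeholder, and 8099/8098 are simply the job server's default job
and artifact ports:

from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    '--runner=PortableRunner',
    '--job_endpoint=beam-jobserver:8099',       # placeholder host
    '--artifact_endpoint=beam-jobserver:8098',  # same host, the second flag I had to add
    '--environment_type=EXTERNAL',
    '--environment_config=localhost:50000',     # Beam Python SDK sidecar
])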
Janek
On 30/11/2021 16:08, Janek Bevendorff wrote:
Any ideas here? I am a few steps further now, but not quite there yet.
The main deployment issue can be solved by sharing only
/tmp/beam-artifact-staging among the job and task managers, not the whole
/tmp directory (knowledge gained from countless hours of googling and trying
things out; there is no documentation on this whatsoever). It does not solve
the remote deployment issue, but so far I am at least able to submit jobs
with a locally running Beam job server.
Unfortunately, I am getting random gRPC terminations with this
exception:
File "venv/lib/python3.7/site-packages/apache_beam/pipeline.py",
line 597, in __exit__
self.result.wait_until_finish()
File
"venv/lib/python3.7/site-packages/apache_beam/runners/portability/portable_runner.py",
line 600, in wait_until_finish
raise self._runtime_exception
RuntimeError: Pipeline
BeamApp-root-1130144051-d0476877_4d32d3e8-1fbb-4f2d-88d9-c2e05fd624eb
failed in state FAILED:
org.apache.beam.vendor.grpc.v1p36p0.io.grpc.StatusRuntimeException:
CANCELLED: client cancelled
The error occurs randomly, without any indication why. Do you have any idea
what may be wrong with the gRPC connection, or what I may be missing for the
remote job server deployment?
Besides these questions, I also have a little rant (sorry for that,
but I have to get this off my chest):
I am getting extremely frustrated with the Python documentation, which is
often incomplete, sometimes outdated and occasionally plain wrong. I can tell
that many examples were never actually run, because they contain invalid
Python code: function names differ from what the API actually offers,
parentheses are in the wrong places, etc. One particular example is the
splittable DoFn documentation. The original blog post is entirely outdated
and also contains invalid Python code (missing self parameters of methods and
such), but the online manual is wrong as well (missing constructor parameters
or required method overrides here, wrong parentheses there...). To understand
how everything works, I am basically reverse engineering the code, taking
into account the little API documentation there is. This is beyond annoying.
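To give a concrete example of what I mean: the skeleton below is what I
eventually pieced together for a splittable DoFn from reading the source.
It is heavily simplified, the provider name and element type are made up for
illustration, and the element is assumed to be a (name, record_count) tuple:

import apache_beam as beam
from apache_beam.io.restriction_trackers import OffsetRange, OffsetRestrictionTracker
from apache_beam.transforms.core import RestrictionProvider

class RecordRestrictionProvider(RestrictionProvider):
    def initial_restriction(self, element):
        # element assumed to be a (name, record_count) tuple
        return OffsetRange(0, element[1])

    def create_tracker(self, restriction):
        return OffsetRestrictionTracker(restriction)

    def restriction_size(self, element, restriction):
        return restriction.size()

class ReadRecordsFn(beam.DoFn):
    def process(
            self,  # one of the "self" parameters the docs keep leaving out
            element,
            tracker=beam.DoFn.RestrictionParam(RecordRestrictionProvider())):
        restriction = tracker.current_restriction()
        for pos in range(restriction.start, restriction.stop):
            if not tracker.try_claim(pos):
                return
            yield (element[0], pos)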
I also noticed that stateful processing apparently is completely broken. With
the local runner it doesn't run at all (various exceptions are thrown), and
with the FlinkRunner or PortableRunner, BagStateSpecs and other state
parameters are always empty and watermark timers fire after each invocation
of process(). I tried countless potential solutions, but nothing worked, so I
gave up and resorted to a CombineFn-based PTransform instead.
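For completeness, this is the general shape of what I was trying (heavily
simplified; keys and values are made up). On the FlinkRunner/PortableRunner
the bag state stays empty and the watermark timer fires after every process()
call:

import apache_beam as beam
from apache_beam.coders import VarIntCoder
from apache_beam.transforms.timeutil import TimeDomain
from apache_beam.transforms.userstate import BagStateSpec, TimerSpec, on_timer

class BufferFn(beam.DoFn):
    BUFFER = BagStateSpec('buffer', VarIntCoder())
    FLUSH = TimerSpec('flush', TimeDomain.WATERMARK)

    def process(self,
                element,                              # (key, int) pairs
                window=beam.DoFn.WindowParam,
                buffer=beam.DoFn.StateParam(BUFFER),
                flush=beam.DoFn.TimerParam(FLUSH)):
        buffer.add(element[1])
        flush.set(window.end)                         # flush at the end of the window

    @on_timer(FLUSH)
    def on_flush(self, buffer=beam.DoFn.StateParam(BUFFER)):
        yield sum(buffer.read())
        buffer.clear()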
Janek
On 26/11/2021 17:01, Janek Bevendorff wrote:
I am one step further, but also not really.
When I mount the shared drive that serves /tmp on the Flink job and task
managers on my local machine as well, and then spin up a local Beam job
server with this volume mounted at /tmp, I can get my job to start. This is
ugly as hell, because it requires so many extra steps, but at least it's
progress.
Unfortunately, the job doesn't run properly and fails with
File "/usr/local/lib/python3.7/site-packages/dill/_dill.py", line
462, in find_class
return StockUnpickler.find_class(self, module, name)
ModuleNotFoundError: No module named 'XXXX'
where XXXX is my application module that I deploy with --setup_file. When I
download the workflow.tar.gz from the staging directory, I can confirm that
the module is present.
This isn't working as intended at all. Also, what happens if multiple users
submit applications at the same time? All the Beam artifacts in /tmp have
random names, but the stages/workflow.tar.gz file that is provided to the
Python SDK sidecar container has the same name for each job. Hence it would
be impossible to serve multiple users with this setup.
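For reference, the setup.py I pass via --setup_file is nothing special,
roughly like this (the package name is a placeholder for my actual module):

import setuptools

setuptools.setup(
    name='my_app',                         # placeholder for the real module name
    version='0.1',
    packages=setuptools.find_packages(),
    install_requires=[],                   # actual dependencies omitted here
)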
Janek
On 26/11/2021 15:32, Janek Bevendorff wrote:
Hi,
Currently, I am struggling with getting Beam to run on a
Kubernetes-hosted Flink cluster and there is very little to no
documentation on how to resolve my deployment issues (besides a few
Stackoverflow threads without solutions).
I have a Flink job server running on Kubernetes that creates new
taskmanager pods from a pod template when I submit a job. Each
taskmanager pod has a sidecar container running the Beam Python SDK
image.
With this setup in place, I tried multiple methods to submit a
Python Beam job and all of them fail for different reasons:
1) Run my Python job via the FlinkRunner and set
--environment_type=EXTERNAL
This works perfectly fine locally, but fails when I set
--flink_master to the Kubernetes load balancer IP to submit to the
remote Kubernetes cluster. It allows me to submit the job itself
successfully, but not the necessary staging data. The Flink
container shows
java.io.FileNotFoundException: /tmp/beam-temp7hxxe2gs/artifacts2liu9b8y/779b17e6efab2bbfcba170762d1096fe2451e0e76c4361af9a68296f23f4a4ec/1-ref_Environment_default_e-workflow.tar.gz (No such file or directory)
and the Python worker shows
2021/11/26 14:16:24 Failed to retrieve staged files: failed to retrieve /tmp/staged in 3 attempts: failed to retrieve chunk for /tmp/staged/workflow.tar.gz
caused by:
rpc error: code = Unknown desc = ; failed to retrieve chunk for
/tmp/staged/workflow.tar.gz
caused by:
rpc error: code = Unknown desc = ; failed to retrieve chunk for
/tmp/staged/workflow.tar.gz
caused by:
rpc error: code = Unknown desc = ; failed to retrieve chunk for
/tmp/staged/workflow.tar.gz
caused by:
rpc error: code = Unknown desc =
I found a Stack Overflow thread describing the exact same problem, but
without a solution. The file seems to exist only under /tmp on my local
client machine, which is useless. The options I pass for this attempt are
sketched after 3) below.
2) Submit the job with --flink_submit_uber_jar=True
This will submit the staging information correctly, but I cannot set the
parallelism. Instead, I get the following warning:
WARNING:apache_beam.options.pipeline_options:Discarding invalid overrides: {'parallelism': 100}
and the job runs with only a single worker (useless as well).
3) Spawn another job manager sidecar container running the Beam job
server and submit via the PortableRunner
This works (somewhat) when I run the job server image locally with
--network=host, but I cannot get it to work on Kubernetes. I
exposed the ports 8097-8099 on the load balancer IP, but when I
submit a job, I only get
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous
of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "failed to connect to all addresses"
debug_error_string = "{"created":"@1637934464.499518882","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3158,"referenced_errors":[{"created":"@1637934464.499518362","description":"failed to connect to all addresses","file":"src/core/lib/transport/error_utils.cc","file_line":147,"grpc_status":14}]}"
This method also seems to suffer from the same issue as 2): I am unable to
control the parallelism.
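For completeness, the options I pass for attempt 1) look roughly like this
(the load balancer address is a placeholder):

from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    '--runner=FlinkRunner',
    '--flink_master=10.0.0.1:8081',          # Kubernetes load balancer IP (placeholder)
    '--environment_type=EXTERNAL',
    '--environment_config=localhost:50000',  # Beam Python SDK sidecar in the taskmanager pod
    '--setup_file=./setup.py',
])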
Is there anything that I am doing fundamentally wrong? I cannot
really imagine that it is this difficult to submit a simple Python
job to a Beam/Flink cluster.
Thanks for any help
Janek
--
Bauhaus-Universität Weimar
Bauhausstr. 9a, R308
99423 Weimar, Germany
Phone: +49 3643 58 3577
www.webis.de