Once again: one step further.

The "Client cancelled" error seems to stem from a Python interpreter crash caused by an error in a native library. The crash itself is not logged at all, so all I get is this totally misleading gRPC error instead. Could this get a better error message, perhaps?

Another big issue that is still unsolved is that with --setup-file, the dependencies get installed globally in the SDK container and stay there for all following jobs. That includes the wheel built from my local Python module, and when I resubmit the job, it doesn't get reinstalled, because pip figures it's already there. Only the main Python file gets resubmitted. This is a big issue, and at the moment I can only work around it by killing the SDK container after each job execution. It would be much better if dependencies got installed only into a temporary venv that is discarded upon job completion or failure.
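The kind of per-job isolation I mean could be sketched like this (a minimal illustration using only the Python standard library; it is not how Beam currently works, and the wheel-install step is shown only as a comment with a placeholder name):

```python
import os
import shutil
import subprocess
import sys
import tempfile

# Create a throwaway venv for one job's dependencies instead of installing
# them into the container's global site-packages.
venv_dir = tempfile.mkdtemp(prefix="beam-job-venv-")
try:
    # --without-pip keeps this sketch fast; a real harness would keep pip
    # and run `pip install <job wheel>` against this environment, so a
    # resubmitted wheel is always installed fresh.
    subprocess.run([sys.executable, "-m", "venv", "--without-pip", venv_dir],
                   check=True)
    print(os.path.isfile(os.path.join(venv_dir, "pyvenv.cfg")))  # True
finally:
    # Discard the environment on completion or failure.
    shutil.rmtree(venv_dir, ignore_errors=True)
print(os.path.isdir(venv_dir))  # False
```

With this lifecycle, nothing from one job can leak into the next, and restarting the SDK container would no longer be necessary.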

Janek


On 09/12/2021 19:45, Janek Bevendorff wrote:
Hi Kyle,

Thank you for your response.


There are a few working Beam+Flink+k8s configurations that have been published, such as [1] [2] and [3]. If these meet your requirements, I would recommend starting from one of them before reinventing your own. Otherwise, you can look to them for clues, since they’ve likely had to solve many of the same problems.

Of course I designed my deployment after those resources, and for the most part it is working (I'm not using the operator at the moment, because the Helm chart is broken and the whole project seems barely maintained). The problems lie with what is neither documented nor solved by any of those example deployments, and with bugs in Beam, Flink, or the interaction between the two.

I’m not sure what progress you have made at this point; did you get everything to work, aside from the options issue? Did setting --artifact_endpoint resolve the “client cancelled” issue?

No, that problem is unsolved. It occurs randomly, but fortunately not very often (and usually within the first few minutes after submission). Unfortunately, there is another, more common problem that keeps cropping up, which looks like some totally random Flink failure with the error “Partition not found” followed by a long partition ID hash. I think it occurs after some other failure from which Flink is unable to reschedule a task properly (not sure what, though). It may have to do with the fact that the Beam Python SDK harness (a sidecar container inside the task manager pod) is persistent for the lifetime of the task manager itself. At least I had issues with that earlier when I submitted a new version of my job before Flink terminated the old task manager (which happens about a minute after a job has finished). The result was that, among other things, the submitted Python wheel wasn’t reinstalled, because the SDK container still had the old version (definitely a Beam bug; pip should be called with --force-reinstall).

I don’t know how to trigger the problem exactly, nor how to solve it. But it usually crashes my job after a few hours of processing time.

If I understand correctly, the “Discarding invalid overrides” warning is a red herring; the option should still be passed on. So I think there may be an issue elsewhere. If you could share as much of your Flink/Beam configuration as possible, it may help to debug the problem.

No, it’s not passed on. I tried. This is only an issue when I submit uber JARs from the client machine. It works with the FlinkRunner without uber JARs, as well as with PortableRunner + Job Server.

I recommend starting a separate thread regarding the incorrect Python documentation, since I fear it might get buried in this thread. The more specific incorrect examples you can point out, the better. I’d also be happy to review PRs if you’re willing to update the documentation yourself (it is all open source [4]).

That’d be quite a few places. One very annoying thing is that most of the time the imports are missing, so I have to grep the Python sources to find the correct imports or guess them from the Java API. I’ll see if I find the time. That would also depend on whether I can solve these issues or whether I have to scrap Beam and use Flink directly (or revert to Spark, yuck!).

Regarding stateful processing, please provide code snippets so we can reproduce the issue(s). Again, it may be better to start a separate thread since stateful processing should be mostly orthogonal to the Flink deployment architecture.

I stopped using it, because I couldn’t get it to run. Perhaps I am missing some sort of configuration for persisting the state, but I am neither getting an error nor can I find any documentation about this part. The error message thrown by the DirectRunner is also totally non-descriptive; a “Not supported” error would have saved some time here. The FlinkRunner doesn’t throw errors, but shows the behaviour I described (timers triggered after each process() call and no state persistence).
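For reference, the pattern I was attempting looks roughly like this (a sketch of Beam’s Python stateful API, not my actual pipeline; state/timer names and the counting logic are made up for illustration):

```python
import apache_beam as beam
from apache_beam.coders import VarIntCoder
from apache_beam.transforms.timeutil import TimeDomain
from apache_beam.transforms.userstate import (
    ReadModifyWriteStateSpec, TimerSpec, on_timer)


class CountPerKeyFn(beam.DoFn):
    """Stateful DoFn; requires a keyed (KV) input PCollection."""

    COUNT = ReadModifyWriteStateSpec('count', VarIntCoder())
    FLUSH = TimerSpec('flush', TimeDomain.WATERMARK)

    def process(self, element,
                count=beam.DoFn.StateParam(COUNT),
                flush=beam.DoFn.TimerParam(FLUSH),
                window=beam.DoFn.WindowParam):
        # Accumulate state across elements of the same key...
        count.write((count.read() or 0) + 1)
        # ...and expect the timer to fire once, at the window end --
        # not after every process() call, as I am seeing.
        flush.set(window.end)

    @on_timer(FLUSH)
    def on_flush(self, count=beam.DoFn.StateParam(COUNT)):
        yield count.read()
```

The behaviour I observe is that the timer callback fires after each process() call and the state read in on_flush does not reflect earlier elements, as if nothing is persisted between bundles.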

Janek
