Hi,

I am currently evaluating PyFlink against Java and ran several tests, mainly 
comparing identical pipelines with a focus on throughput.
It seems to me that PyFlink generally performs worse and reaches its throughput 
limit at a point where Java still has resources to spare (and can easily handle 
twice the amount of data).
After seeing the benchmarks at [0], I also tried larger data sizes, but I could 
not reproduce any of those findings. The only parameter that seemed to help was 
'python.fn-execution.bundle.size', but even there the limits were reached 
rather quickly.
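For reference, this is roughly how I applied the settings (the configuration 
keys are from the PyFlink documentation and the post at [0]; the values here 
are only illustrative, not the ones from my tests):

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Larger bundles amortize the Python/JVM round trip (illustrative value).
t_env.get_config().set("python.fn-execution.bundle.size", "100000")

# Thread mode, as described in [0], to avoid inter-process overhead.
t_env.get_config().set("python.execution-mode", "thread")
```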

I would mainly like to know whether this is expected/normal, or whether there 
are parameters and resources I could adjust to bring PyFlink /somewhat/ on par 
with the pure Java implementation.

I appreciate any feedback on this. Thank you in advance.

Best
  David

[0]: https://flink.apache.org/2022/05/06/exploring-the-thread-mode-in-pyflink/
