I was about to ask whether Cython would work with the Beam SDK. I just started building the pipes to support Cython in modules.
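
To make the "fast path for primitives, pickle for everything else" idea quoted below concrete, here is a toy sketch using only the standard library. It is not Beam's implementation or wire format: the tag values, the `encode`/`decode` names, and the choice of which types count as "primitive" are all made up for illustration.

```python
import pickle
import struct

# Hypothetical one-byte type tags for this toy format (not Beam's encoding).
NONE, INT, FLOAT, STR, BYTES, LIST, TUPLE, DICT, PICKLED = range(9)


def _framed(tag, payload):
    # Tag byte, 4-byte big-endian byte length, then the payload itself.
    return bytes([tag]) + struct.pack(">I", len(payload)) + payload


def encode(value):
    kind = type(value)  # exact type match, so subclasses take the pickle path
    if value is None:
        return bytes([NONE])
    if kind is int and -2**63 <= value < 2**63:
        return bytes([INT]) + struct.pack(">q", value)
    if kind is float:
        return bytes([FLOAT]) + struct.pack(">d", value)
    if kind is str:
        return _framed(STR, value.encode("utf-8"))
    if kind is bytes:
        return _framed(BYTES, value)
    if kind is list or kind is tuple:
        tag = LIST if kind is list else TUPLE
        body = b"".join(encode(item) for item in value)
        # For containers the 4-byte field is an element count, not a byte length.
        return bytes([tag]) + struct.pack(">I", len(value)) + body
    if kind is dict:
        body = b"".join(encode(k) + encode(v) for k, v in value.items())
        return bytes([DICT]) + struct.pack(">I", len(value)) + body
    # Anything not special-cased above falls back to pickle.
    return _framed(PICKLED, pickle.dumps(value))


def _decode_at(buf, pos):
    # Decode one value starting at pos; return (value, position after it).
    tag = buf[pos]
    pos += 1
    if tag == NONE:
        return None, pos
    if tag == INT:
        return struct.unpack_from(">q", buf, pos)[0], pos + 8
    if tag == FLOAT:
        return struct.unpack_from(">d", buf, pos)[0], pos + 8
    (n,) = struct.unpack_from(">I", buf, pos)
    pos += 4
    if tag in (STR, BYTES, PICKLED):
        payload = buf[pos:pos + n]
        pos += n
        if tag == STR:
            return payload.decode("utf-8"), pos
        if tag == BYTES:
            return payload, pos
        return pickle.loads(payload), pos
    if tag in (LIST, TUPLE):
        items = []
        for _ in range(n):
            item, pos = _decode_at(buf, pos)
            items.append(item)
        return (items if tag == LIST else tuple(items)), pos
    if tag == DICT:
        result = {}
        for _ in range(n):
            key, pos = _decode_at(buf, pos)
            val, pos = _decode_at(buf, pos)
            result[key] = val
        return result, pos
    raise ValueError(f"unknown tag {tag}")


def decode(data):
    value, end = _decode_at(data, 0)
    if end != len(data):
        raise ValueError("trailing bytes after encoded value")
    return value
```

Calling `decode(encode(x))` should round-trip nested dicts, lists and tuples without touching pickle, while something like a `set` goes through the fallback branch. Timing `encode` against `pickle.dumps` on your own records would be one rough way to see how much the type dispatch matters for a given data shape, though a pure-Python toy like this says nothing about the Cython-compiled coders in the real SDK.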

On Wed, Oct 30, 2019 at 2:53 PM Robert Bradshaw <rober...@google.com> wrote:

> Python does not allow as much customization of serialization as is
> available in Java, in part due to often not explicitly knowing the
> types at each point in the pipeline (though Udi is working on making
> this better, and there's ongoing work for adding explicit schema
> support as well). Somewhat to compensate for this, we default to using
> a special "Fast Primitives Coder" [1] which special cases generic
> Python types like dictionaries, lists, tuples, etc. Significant
> optimization has gone into this and so it is quite fast and works
> pretty well in practice due to these types being quite versatile and
> common (whereas in Java one would often write a custom POJO and
> serializer). Only when we encounter something not in this list of
> "primitives" do we fall back to pickle which is, as expected, much
> slower.
>
> [1]
> https://github.com/apache/beam/blob/release-2.16.0/sdks/python/apache_beam/coders/coder_impl.py#L319
>
> On Wed, Oct 30, 2019 at 12:36 PM Luke Cwik <lc...@google.com> wrote:
> >
> > To my knowledge we haven't compared the cost of the "dill/pickle/..."
> > coder to Java's SerializableCoder but even then you always have the
> > power to write your own coders if you don't believe the default coders
> > perform well in Python.
> >
> > Note that a lot of the Beam Python coders use Cython to go fast so it
> > may be less of a concern than you think.
> >
> > But please try it out and report any perf issues that you discover
> > since they can be fixed within the Python SDK.
> >
> > On Mon, Oct 14, 2019 at 6:52 AM Shannon Duncan
> > <joseph.dun...@liveramp.com> wrote:
> >>
> >> Has anyone done any testing around the performance difference of
> >> Python SDK vs Java SDK on Google Dataflow?
> >>
> >> We recently dropped our requirement for sequence files in our
> >> pipeline which opens the door to using the Python SDK vs the Java
> >> SDK. But my concern is loss of performance.
> >>
> >> In Java we control our serialization very carefully between pipeline
> >> items and my fear is losing control of that in Python, so I'm curious
> >> about the speed of serialization of generic Python items like
> >> dictionaries, lists, tuples, etc. in the context of Dataflow.
> >>
> >> Thanks!
> >> Shannon Duncan