Python does not allow as much customization of serialization as Java does, in part because we often don't explicitly know the types at each point in the pipeline (though Udi is working on making this better, and there's ongoing work on adding explicit schema support as well). To compensate somewhat, we default to a special "Fast Primitives Coder" [1] which special-cases generic Python types like dictionaries, lists, tuples, etc. Significant optimization has gone into this, so it is quite fast, and it works pretty well in practice because these types are so versatile and common (whereas in Java one would often write a custom POJO and serializer). Only when we encounter something not in this list of "primitives" do we fall back to pickle, which is, as expected, much slower.
[1] https://github.com/apache/beam/blob/release-2.16.0/sdks/python/apache_beam/coders/coder_impl.py#L319

On Wed, Oct 30, 2019 at 12:36 PM Luke Cwik <lc...@google.com> wrote:
>
> To my knowledge we haven't compared the cost of the "dill/pickle/..." coder
> to Java's SerializableCoder but even then you always have the power to write
> your own coders if you don't believe the default coders perform well in
> Python.
>
> Note that a lot of the Beam Python coders use cython to go fast so it may be
> less of a concern than you think.
>
> But please try it out and report any perf issues that you discover since they
> can be fixed within the Python SDK.
>
> On Mon, Oct 14, 2019 at 6:52 AM Shannon Duncan <joseph.dun...@liveramp.com>
> wrote:
>>
>> Has anyone done any testing around the performance difference of Python SDK
>> vs Java SDK on Google Dataflow?
>>
>> We recently dropped our requirement for sequence files in our pipeline which
>> opens the door to using the Python SDK vs the Java SDK. But my concern is
>> loss of performance.
>>
>> In Java we control our serialization very carefully between pipeline items
>> and my fear is losing control of that in Python, so I'm curious about the
>> speed of serialization of generic Python items like dictionaries, lists,
>> tuples, etc. in the context of Dataflow.
>>
>> Thanks!
>> Shannon Duncan