Re: Python vs Java SDK Performance

Shannon Duncan Wed, 30 Oct 2019 14:34:28 -0700

I was about to ask if Cython would work with the Beam SDK. I just started
building the pipes to support Cython in modules.


On Wed, Oct 30, 2019 at 2:53 PM Robert Bradshaw <rober...@google.com> wrote:

> Python does not allow as much customization of serialization as is
> available in Java, in part due to often not explicitly knowing the
> types at each point in the pipeline (though Udi is working on making
> this better, and there's ongoing work for adding explicit schema
> support as well). Somewhat to compensate for this, we default to using
> a special "Fast Primitives Coder" [1], which special-cases generic
> Python types like dictionaries, lists, tuples, etc. Significant
> optimization has gone into this, so it is quite fast and works
> pretty well in practice because these types are versatile and
> common (whereas in Java one would often write a custom POJO and
> serializer). Only when we encounter something not in this list of
> "primitives" do we fall back to pickle, which is, as expected, much
> slower.
>
> [1]
> https://github.com/apache/beam/blob/release-2.16.0/sdks/python/apache_beam/coders/coder_impl.py#L319
>
> On Wed, Oct 30, 2019 at 12:36 PM Luke Cwik <lc...@google.com> wrote:
> >
> > To my knowledge we haven't compared the cost of the "dill/pickle/..."
> > coder to Java's SerializableCoder, but even then you always have the
> > power to write your own coders if you don't believe the default coders
> > perform well in Python.
> >
> > Note that a lot of the Beam Python coders use Cython to go fast, so it
> > may be less of a concern than you think.
> >
> > But please try it out and report any perf issues that you discover, since
> > they can be fixed within the Python SDK.
> >
> > On Mon, Oct 14, 2019 at 6:52 AM Shannon Duncan <
> joseph.dun...@liveramp.com> wrote:
> >>
> >> Has anyone done any testing around the performance difference of the
> >> Python SDK vs the Java SDK on Google Dataflow?
> >>
> >> We recently dropped our requirement for sequence files in our pipeline,
> >> which opens the door to using the Python SDK instead of the Java SDK.
> >> But my concern is loss of performance.
> >>
> >> In Java we control our serialization very carefully between pipeline
> >> items, and my fear is losing control of that in Python, so I'm curious
> >> about the speed of serialization of generic Python items like
> >> dictionaries, lists, tuples, etc. in the context of Dataflow.
> >>
> >> Thanks!
> >> Shannon Duncan
>
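
For anyone reading the archive: the "special-case the common types, fall
back to pickle for everything else" idea Robert describes can be sketched
in plain Python roughly as below. This is only a toy illustration and not
Beam's implementation; the tag values, the wire format, and the
encode/decode helper names are all made up for this example.

```python
import pickle
import struct

# Hypothetical one-byte type tags; Beam's real coder has its own tags and format.
TAG_NONE, TAG_INT, TAG_STR, TAG_LIST, TAG_TUPLE, TAG_DICT, TAG_PICKLE = range(7)


def _with_len(payload):
    """Prefix a byte string with its length as a 4-byte big-endian integer."""
    return struct.pack(">I", len(payload)) + payload


def encode(value):
    """Encode known 'primitive' types directly; pickle anything else."""
    if value is None:
        return bytes([TAG_NONE])
    if type(value) is int and -(2**63) <= value < 2**63:
        return bytes([TAG_INT]) + struct.pack(">q", value)
    if type(value) is str:
        return bytes([TAG_STR]) + _with_len(value.encode("utf-8"))
    if type(value) in (list, tuple):
        tag = TAG_LIST if type(value) is list else TAG_TUPLE
        body = b"".join(encode(v) for v in value)
        return bytes([tag]) + struct.pack(">I", len(value)) + body
    if type(value) is dict:
        body = b"".join(encode(k) + encode(v) for k, v in value.items())
        return bytes([TAG_DICT]) + struct.pack(">I", len(value)) + body
    # Slow path: the type is not in the list of special-cased primitives.
    return bytes([TAG_PICKLE]) + _with_len(pickle.dumps(value))


def _decode_at(buf, pos):
    """Decode one value starting at pos; return (value, next position)."""
    tag = buf[pos]
    pos += 1
    if tag == TAG_NONE:
        return None, pos
    if tag == TAG_INT:
        return struct.unpack_from(">q", buf, pos)[0], pos + 8
    (n,) = struct.unpack_from(">I", buf, pos)  # byte length or element count
    pos += 4
    if tag == TAG_STR:
        return buf[pos:pos + n].decode("utf-8"), pos + n
    if tag == TAG_PICKLE:
        return pickle.loads(buf[pos:pos + n]), pos + n
    if tag == TAG_DICT:
        out = {}
        for _ in range(n):
            key, pos = _decode_at(buf, pos)
            val, pos = _decode_at(buf, pos)
            out[key] = val
        return out, pos
    items = []
    for _ in range(n):
        item, pos = _decode_at(buf, pos)
        items.append(item)
    return (items if tag == TAG_LIST else tuple(items)), pos


def decode(buf):
    value, _ = _decode_at(buf, 0)
    return value
```

A dict of strings, ints, lists and tuples stays on the fast path here,
while something like a float or a set (deliberately left out of this toy
list) takes the pickle branch. Which types Beam actually special-cases is
best checked against the coder_impl.py link in [1].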

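On Luke's point about writing your own coders: a minimal sketch of what
that might look like in the Python SDK is below, assuming a record type
with two integer fields. Point and PointCoder are hypothetical names for
illustration only. The import guard is there just so the snippet also runs
where Beam is not installed; in a real pipeline you would import
apache_beam directly.

```python
import struct

try:  # Guard so the sketch also runs where Beam is not installed.
    import apache_beam as beam
    CoderBase = beam.coders.Coder
except ImportError:
    beam = None
    CoderBase = object


class Point:
    """Hypothetical record type, used only for illustration."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __eq__(self, other):
        return isinstance(other, Point) and (self.x, self.y) == (other.x, other.y)

    def __hash__(self):
        return hash((self.x, self.y))


class PointCoder(CoderBase):
    """Encodes a Point as two big-endian signed 64-bit integers (16 bytes)."""

    def encode(self, value):
        return struct.pack(">qq", value.x, value.y)

    def decode(self, encoded):
        return Point(*struct.unpack(">qq", encoded))

    def is_deterministic(self):
        # Equal points always produce identical bytes.
        return True


if beam is not None:
    # Ask Beam to use PointCoder for values typed as Point.
    beam.coders.registry.register_coder(Point, PointCoder)
```

Whether a hand-written coder like this actually beats the default for your
data is something to measure on a real pipeline rather than assume.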