Re: Python vs Java SDK Performance

Robert Bradshaw Wed, 30 Oct 2019 14:41:50 -0700

Yep.

What we don't (yet?) have is a Cython interface for writing DoFns that
allows us to avoid calling the process method using python calling
semantics. But Cython is used by Beam and installed on the workers
ready to go to work for user code.


On Wed, Oct 30, 2019 at 2:33 PM Shannon Duncan
<joseph.dun...@liveramp.com> wrote:
>
> I was about to ask if cython would work with the Beam SDK. I just started 
> building the pipes to support cython in modules.
>
> On Wed, Oct 30, 2019 at 2:53 PM Robert Bradshaw <rober...@google.com> wrote:
>>
>> Python does not allow as much customization of serialization as is
>> available in Java, in part due to often not explicitly knowing the
>> types at each point in the pipeline (though Udi is working on making
>> this better, and there's ongoing work for adding explicit schema
>> support as well). Somewhat to compensate for this, we default to using
>> a special "Fast Primitives Coder" [1] which special-cases generic
>> python types like dictionaries, lists, tuples, etc. Significant
>> optimization has gone into this and so it is quite fast and works
>> pretty well in practice due to these types being quite versatile and
>> common (whereas in Java one would often write a custom POJO and
>> serializer).  Only when we encounter something not in this list of
>> "primitives" do we fall back to pickle which is, as expected, much
>> slower.
>>
>> [1] https://github.com/apache/beam/blob/release-2.16.0/sdks/python/apache_beam/coders/coder_impl.py#L319
>>
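
To make the fallback behaviour described above concrete, here is a minimal
stdlib-only sketch of the idea: a few common types get a hand-written,
type-tagged encoding, and anything else goes through pickle. This is not
Beam's implementation. The tag bytes, the encode/decode function names, and
the byte layout are all assumptions made up for illustration.

```python
import pickle
import struct

# Hypothetical one-byte type tags for this sketch; Beam's real tags differ.
INT, STR, LIST, DICT, FALLBACK = b"i", b"s", b"l", b"d", b"p"


def _frame(payload: bytes) -> bytes:
    """Prefix a payload with its 4-byte big-endian length."""
    return struct.pack(">I", len(payload)) + payload


def encode(value) -> bytes:
    """Encode a few common types by hand; pickle anything else."""
    # type() rather than isinstance() so bool and subclasses take the fallback.
    kind = type(value)
    if kind is int and -2**63 <= value < 2**63:
        return INT + struct.pack(">q", value)
    if kind is str:
        return STR + _frame(value.encode("utf-8"))
    if kind is list:
        body = b"".join(encode(v) for v in value)
        return LIST + struct.pack(">I", len(value)) + body
    if kind is dict:
        body = b"".join(encode(k) + encode(v) for k, v in value.items())
        return DICT + struct.pack(">I", len(value)) + body
    return FALLBACK + _frame(pickle.dumps(value))


def _decode_at(data: bytes, pos: int):
    """Decode one value starting at pos; return (value, next_pos)."""
    tag = data[pos:pos + 1]
    pos += 1
    if tag == INT:
        return struct.unpack_from(">q", data, pos)[0], pos + 8
    if tag in (STR, FALLBACK):
        (size,) = struct.unpack_from(">I", data, pos)
        raw = data[pos + 4:pos + 4 + size]
        pos += 4 + size
        return (raw.decode("utf-8") if tag == STR else pickle.loads(raw)), pos
    if tag not in (LIST, DICT):
        raise ValueError(f"unknown tag {tag!r}")
    (count,) = struct.unpack_from(">I", data, pos)
    pos += 4
    if tag == LIST:
        items = []
        for _ in range(count):
            item, pos = _decode_at(data, pos)
            items.append(item)
        return items, pos
    result = {}
    for _ in range(count):
        key, pos = _decode_at(data, pos)
        val, pos = _decode_at(data, pos)
        result[key] = val
    return result, pos


def decode(data: bytes):
    """Inverse of encode()."""
    value, _ = _decode_at(data, 0)
    return value


record = {"id": 7, "tags": ["a", "b"], "score": 1.5}
assert decode(encode(record)) == record  # the float went through pickle
```

In this sketch the dict, the strings, the list and the int never touch
pickle; only the float does, which is the shape of the trade-off described
above.
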
>> On Wed, Oct 30, 2019 at 12:36 PM Luke Cwik <lc...@google.com> wrote:
>> >
>> > To my knowledge we haven't compared the cost of the "dill/pickle/..." 
>> > coder to Java's SerializableCoder but even then you always have the power 
>> > to write your own coders if you don't believe the default coders perform 
>> > well in Python.
>> >
>> > Note that a lot of the Beam Python coders use cython to go fast so it may 
>> > be less of a concern than you think.
>> >
>> > But please try it out and report any perf issues that you discover since 
>> > they can be fixed within the Python SDK.
>> >
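
As a rough illustration of the "write your own coder" option mentioned
above, the sketch below packs a record into a fixed binary layout. The
Reading type and ReadingCoder class are hypothetical names invented for this
example. It is a plain class so that it runs without Beam installed; in a
pipeline it would, as far as I know, subclass apache_beam.coders.Coder
(which uses the same encode/decode/is_deterministic methods) and be
registered with beam.coders.registry.register_coder(Reading, ReadingCoder).

```python
import struct
from typing import NamedTuple


class Reading(NamedTuple):
    """Hypothetical record type used only for this illustration."""
    sensor_id: int
    value: float
    label: str


class ReadingCoder:
    """Fixed-layout coder for Reading (assumed layout, not a Beam built-in)."""

    # int64 sensor_id followed by float64 value, both big-endian.
    _HEADER = struct.Struct(">qd")

    def encode(self, reading: Reading) -> bytes:
        header = self._HEADER.pack(reading.sensor_id, reading.value)
        # The label takes the rest of the buffer, so no length prefix is needed.
        return header + reading.label.encode("utf-8")

    def decode(self, encoded: bytes) -> Reading:
        sensor_id, value = self._HEADER.unpack_from(encoded)
        label = encoded[self._HEADER.size:].decode("utf-8")
        return Reading(sensor_id, value, label)

    def is_deterministic(self) -> bool:
        # The same Reading always yields the same bytes.
        return True


coder = ReadingCoder()
original = Reading(sensor_id=42, value=3.25, label="temp")
assert coder.decode(coder.encode(original)) == original
```

A hand-written layout like this avoids per-element pickling, at the cost of
having to keep the layout in sync with the record type.
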
>> > On Mon, Oct 14, 2019 at 6:52 AM Shannon Duncan 
>> > <joseph.dun...@liveramp.com> wrote:
>> >>
>> >> Has anyone done any testing around the performance difference of Python 
>> >> SDK vs Java SDK on Google Dataflow?
>> >>
>> >> We recently dropped our requirement for sequence files in our pipeline 
>> >> which opens the door to using the python SDK vs the Java SDK. But my 
>> >> concern is loss of performance.
>> >>
>> >> In Java we control our serialization very carefully between pipeline 
>> >> items and my fear is losing control of that in Python, so I'm curious 
>> >> about the speed of serialization of generic Python items like 
>> >> dictionaries, lists, tuples, etc. in the context of Dataflow.
>> >>
>> >> Thanks!
>> >> Shannon Duncan
