Hi Jason,
How/what are you going to benchmark? I have been doing it for some time.
Want to make sure I know the objective gaps, if there are any.
Thanks
Amir-
From: Jason Kuster <[email protected]>
To: [email protected]
Sent: Friday, November 18, 2016 11:28 AM
Subject: Re: Apache Beam Java vs Python performance on Google Cloud
Hi Matthias!
Glad to hear you're interested in performance. I've been doing some
investigation into benchmarking Beam over the last couple of weeks and I'm
getting fairly close to having something I think will be workable, probably
next week or the week after. I'm very interested in hearing opinions from the
community (I solicited feedback from the dev list a few weeks ago but neglected
to include user@), so I'd love to hear any thoughts you have.
Best,
Jason
On Fri, Nov 18, 2016 at 11:11 AM, Lukasz Cwik <[email protected]> wrote:
I would like to point out that the Java code has been around a lot longer and
has had more time to be optimized, while the Python code is much more recent
and is still changing a lot, with much larger performance improvements still
landing. The gap between Python and Java has been steadily decreasing over the
past couple of months.
On Fri, Nov 18, 2016 at 11:42 AM, Matthias Baetens <matthias.baetens@datatonic.com> wrote:
Hi Apache Beam users!
Over the last few months I have played around a bit with Google Dataflow/Apache
Beam (first in Java and lately in Python as well).
This week I did a quick implementation of the same pipeline in both Java and
Python involving some processing (String operations and int operations) and a
GroupBy using an Accumulator.
When running the pipeline on Google Cloud, the Java pipeline performed 4-5
times faster than the Python pipeline. Now, this probably makes sense since
Python is in general slower than Java, but I was wondering if there is more to
it and how I could potentially profile the pipelines in a (semi)-scientific
way... Maybe some of you have thoughts/input or had similar experiences? Happy
to hear your input!
Best regards,
Matthias
--
-------
Jason Kuster
Apache Beam (Incubating) / Google Cloud Dataflow
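
As a starting point for the "(semi)-scientific" comparison asked about in the
original message, here is a minimal sketch of a timing harness. It uses only
the Python standard library and is not a Beam or Dataflow API. The names
time_runs, summarize, and slowdown are hypothetical, and run_pipeline is
assumed to be any zero-argument callable that launches one pipeline (Java or
Python) and blocks until it finishes.

```python
import statistics
import time


def time_runs(run_pipeline, runs=5, warmup=1):
    """Call run_pipeline() repeatedly; return wall-clock durations in seconds.

    run_pipeline is a hypothetical zero-argument callable that starts one
    pipeline run and blocks until it completes.
    """
    for _ in range(warmup):
        run_pipeline()  # warm-up runs are executed but not recorded
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        run_pipeline()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Return the median and sample standard deviation of the durations."""
    spread = statistics.stdev(durations) if len(durations) > 1 else 0.0
    return {"median": statistics.median(durations), "stdev": spread}


def slowdown(baseline, candidate):
    """Ratio of median durations: how many times slower candidate is."""
    return summarize(candidate)["median"] / summarize(baseline)["median"]
```

Comparing medians over several runs on the same input and the same worker
configuration, with the spread reported alongside, makes a figure such as
"4-5 times faster" easier to reproduce than a single run would.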