Hi Matthias! Glad to hear you're interested in performance. I've been doing some investigation into benchmarking Beam over the last couple of weeks and I'm getting fairly close to having something I think will be workable, probably next week or the week after. I'm very interested in hearing opinions from the community (I solicited feedback from the dev list a few weeks ago but neglected to include user@), so I'd love to hear any thoughts you have.
Best,

Jason

On Fri, Nov 18, 2016 at 11:11 AM, Lukasz Cwik <[email protected]> wrote:

> I would like to point out that the Java code has been around a lot longer
> and has had more time to be optimized, while the Python code is much more
> recent and is still changing a lot, with much larger improvements in
> performance. The gap between Python and Java has been steadily decreasing
> over the past couple of months.
>
> On Fri, Nov 18, 2016 at 11:42 AM, Matthias Baetens <[email protected]> wrote:
>
>> Hi Apache Beam users!
>>
>> Over the last few months I have played around a bit with Google
>> Dataflow/Apache Beam (first in Java and lately in Python as well).
>>
>> This week I did a quick implementation of the same pipeline in both Java
>> and Python involving some processing (String operations and int operations)
>> and a GroupBy using an Accumulator.
>>
>> When running the pipeline on Google Cloud, the Java pipeline performed
>> 4-5 times faster than the Python pipeline. Now, this probably makes sense
>> since Python is in general slower than Java, but I was wondering if there
>> is more to it and how I could potentially profile the pipelines in a
>> (semi)-scientific way... Maybe some of you have thoughts/input or had
>> similar experiences? Happy to hear your input!
>>
>> Best regards,
>>
>> Matthias
>>

--
-------
Jason Kuster
Apache Beam (Incubating) / Google Cloud Dataflow
