The Java DirectRunner enforces additional strictness checks that are expensive (such as re-encoding all elements to make sure that the coder is compatible).
Retry your run with --enforceImmutability=false --enforceEncodability=false

On Sat, Apr 11, 2020 at 11:45 AM Krystian Kichewko <[email protected]> wrote:

> Hello!
>
> I'm trying to learn Apache Beam, and I was looking into examples, when I
> noticed something unusual:
>
> It seems that "word count" example is much faster using Python than Java.
>
> Python example pipeline on King Lear:
>
> real 0m9.294s
> user 0m2.822s
> sys 0m0.370s
>
> Java example pipeline on King Lear:
>
> real 1m35.780s
> user 4m10.089s
> sys 0m1.743s
>
> As you can see it is 10 sec vs 105 sec real time, and it uses even more
> CPU time because it uses all of CPU cores.
>
> Is this some kind of limitation of Java's direct runner? Or am I doing
> something wrong? Is this intended? Should I file a bug?
>
> Or maybe this difference is eliminated on real life pipelines?
>
> I got similar results when testing using Google Colab:
> https://beam.apache.org/get-started/try-apache-beam/
>
> When you execute on all Shakespeare's books in the bucket
> (gs://dataflow-samples/shakespeare/*) the difference is even greater:
>
> Python:
>
> real 0m47.900s
> user 0m18.350s
> sys 0m0.579s
>
> Java:
>
> real 14m28.201s
> user 28m3.206s
> sys 0m7.597s
>
>
> How to reproduce:
>
> Python 3.7:
>
> docker run -it --rm python:3.7-buster /bin/bash
> pip3 install apache-beam[gcp]
> mkdir -p /tmp/foo
> cd /tmp/foo
> time python -m apache_beam.examples.wordcount --input
> gs://dataflow-samples/shakespeare/kinglear.txt --output ./count
>
> real 0m9.294s
> user 0m2.822s
> sys 0m0.370s
>
>
> Java:
>
> docker run -it --rm ubuntu:16.04 /bin/bash
> apt update
> apt install default-jdk maven
> mkdir -p /tmp/foo
> cd /tmp/foo
> mvn archetype:generate -DarchetypeGroupId=org.apache.beam
> -DarchetypeArtifactId=beam-sdks-java-maven-archetypes-examples
> -DarchetypeVersion=2.19.0 -DgroupId=org.example
> -DartifactId=word-count-beam -Dversion="0.1"
> -Dpackage=org.apache.beam.examples -DinteractiveMode=false
> cd word-count-beam
> mvn compile
> time mvn exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount
> -Dexec.args="--inputFile=gs://apache-beam-samples/shakespeare/kinglear.txt
> --output=counts" -Pdirect-runner
>
> Execute twice because the first time maven will download dependencies:
>
> time mvn exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount
> -Dexec.args="--inputFile=gs://apache-beam-samples/shakespeare/kinglear.txt
> --output=counts" -Pdirect-runner
>
> real 1m35.780s
> user 4m10.089s
> sys 0m1.743s
>
>
> Thanks,
> Krystian Kichewko
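
For reference, a rough sketch of what the suggested rerun could look like, assuming the archetype project and the Java command quoted above are otherwise unchanged. The two flags are simply appended to the pipeline arguments passed through -Dexec.args. The names EXEC_ARGS and run_wordcount are made up here for illustration and are not part of Beam or Maven:

```shell
# Pipeline arguments from the quoted command.
EXEC_ARGS="--inputFile=gs://apache-beam-samples/shakespeare/kinglear.txt --output=counts"
# Append the two DirectRunner flags suggested in the reply.
EXEC_ARGS="$EXEC_ARGS --enforceImmutability=false --enforceEncodability=false"

# Hypothetical helper wrapping the quoted mvn invocation;
# run it from inside the word-count-beam directory.
run_wordcount() {
  mvn exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
    -Dexec.args="$EXEC_ARGS" -Pdirect-runner
}
```

In bash, `time run_wordcount` should then give timings that can be compared with the numbers quoted above. As noted in the quoted message, the first run also includes Maven downloading dependencies, so the second run is the one to compare.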
