Beam has an extension that supports the TPC-DS benchmark [1]; it runs the TPC-DS SQL queries via Beam SQL. However, I’m not sure whether it runs regularly and, IIRC (it has been a while since I last looked at it, so I may be mistaken), it requires some adjustments to run on runners other than Dataflow. Also, when I tried to run it on SparkRunner, many queries failed for various reasons [2].
I believe that if we manage to get most of the queries running on any runner, it will be a good addition to the Nexmark benchmark that we have now, since TPC-DS results can also be used to compare Beam with other data processing systems.

[1] https://github.com/apache/beam/tree/master/sdks/java/testing/tpcds
[2] https://issues.apache.org/jira/browse/BEAM-9891

> On 22 Mar 2021, at 18:00, Tao Li <[email protected]> wrote:
>
> Hi Beam community,
>
> I am wondering if there is a doc to compare perf of Beam (on Spark) and
> native Spark for batch processing? For example using the TPC-DS benchmark.
>
> I did find some relevant links like this
> <https://archive.fosdem.org/2018/schedule/event/nexmark_benchmarking_suite/attachments/slides/2494/export/events/attachments/nexmark_benchmarking_suite/slides/2494/Nexmark_Suite_for_Apache_Beam_(FOSDEM18).pdf>
> but it’s old and it mostly covers the streaming scenarios.
>
> Thanks!
