Comparing the performance of systems is not easy, and the results depend on many factors, such as the configuration, the data, and the jobs.
That being said, the numbers that Bill reported for WordCount make sense, as Stephan pointed out in his response (Flink does not feature hash-based aggregations yet). So there are definitely use cases where Spark outperforms Flink, but there are also cases where both systems perform similarly or Flink is faster. For example, more complex jobs benefit a lot from Flink's pipelined execution, and Flink's built-in iterations are very fast, especially delta iterations.

Best, Fabian

2015-06-10 0:53 GMT+02:00 Hawin Jiang <hawin.ji...@gmail.com>:
> Hey Aljoscha
>
> I also sent an email to Bill asking for the latest test results. From
> Bill's email, Apache Spark's performance looks better than Flink's.
> What are your thoughts?
>
> Best regards
> Hawin
>
> On Tue, Jun 9, 2015 at 2:29 AM, Aljoscha Krettek <aljos...@apache.org> wrote:
>
>> Hi,
>> we don't have any current performance numbers. But the queries mentioned
>> on the benchmark page should be easy to implement in Flink. It could be
>> interesting if someone ported these queries and ran them with exactly the
>> same data on the same machines.
>>
>> Bill Sparks wrote on the mailing list some days ago (
>> http://mail-archives.apache.org/mod_mbox/flink-user/201506.mbox/%3cd1972778.64426%25jspa...@cray.com%3e).
>> He seems to be running some tests to compare Flink, Spark and MapReduce.
>>
>> Regards,
>> Aljoscha
>>
>> On Mon, Jun 8, 2015 at 9:09 PM, Hawin Jiang <hawin.ji...@gmail.com> wrote:
>>
>>> Hi Aljoscha
>>>
>>> I want to know what the Apache Flink performance would be if I ran the
>>> same SQL as below.
>>> Do you have any Apache Flink benchmark information?
>>> Such as: https://amplab.cs.berkeley.edu/benchmark/
>>> Thanks.
>>>
>>> SELECT pageURL, pageRank FROM rankings WHERE pageRank > X
>>>
>>> Query 1A: 32,888 results
>>> Query 1B: 3,331,851 results
>>> Query 1C: 89,974,976 results
>>>
>>> Median response time (s)     Query 1A  Query 1B  Query 1C
>>> Redshift (HDD) - Current     2.49      2.61      9.46
>>> Impala - Disk - 1.2.3        12.015    12.015    37.085
>>> Impala - Mem - 1.2.3         2.17      3.01      36.04
>>> Shark - Disk - 0.8.1         6.6       7         22.4
>>> Shark - Mem - 0.8.1          1.7       1.8       3.6
>>> Hive - 0.12 YARN             50.49     59.93     43.34
>>> Tez - 0.2.0                  28.22     36.35     26.44
>>>
>>> On Mon, Jun 8, 2015 at 2:03 AM, Aljoscha Krettek <aljos...@apache.org> wrote:
>>>
>>>> Hi,
>>>> actually, what do you want to know about Flink SQL?
>>>>
>>>> Aljoscha
>>>>
>>>> On Sat, Jun 6, 2015 at 2:22 AM, Hawin Jiang <hawin.ji...@gmail.com> wrote:
>>>> > Thanks all
>>>> >
>>>> > Actually, I want to know more about Flink SQL and Flink performance.
>>>> > Here is the Spark benchmark. Maybe you already saw it before.
>>>> > https://amplab.cs.berkeley.edu/benchmark/
>>>> >
>>>> > Thanks.
>>>> >
>>>> > Best regards
>>>> > Hawin
>>>> >
>>>> > On Fri, Jun 5, 2015 at 1:35 AM, Fabian Hueske <fhue...@gmail.com> wrote:
>>>> >>
>>>> >> If you want to append data to a data set that is stored as files
>>>> >> (e.g., on HDFS), you can go for a directory structure as follows:
>>>> >>
>>>> >> dataSetRootFolder
>>>> >>   - part1
>>>> >>     - 1
>>>> >>     - 2
>>>> >>     - ...
>>>> >>   - part2
>>>> >>     - 1
>>>> >>     - ...
>>>> >>   - partX
>>>> >>
>>>> >> Flink's file format supports recursive directory scans, so you can
>>>> >> add new subfolders to dataSetRootFolder and read the full data set.
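[Editor's note: Fabian's append-by-subfolder layout can be sketched in plain Python. This is not Flink's API, only an illustration of the idea with hypothetical helper names (`write_part`, `read_data_set`); in Flink's DataSet API the analogous behavior is enabled per input format via a recursive-enumeration setting, so check the docs for your version.]

```python
import os
import tempfile

def write_part(root, part, files):
    """Create a sub-folder (e.g. part1) holding one file per entry."""
    d = os.path.join(root, part)
    os.makedirs(d, exist_ok=True)
    for name, lines in files.items():
        with open(os.path.join(d, name), "w") as f:
            f.write("\n".join(lines))

def read_data_set(root):
    """Recursively collect every record under the data set root,
    regardless of which part/sub-folder it was appended to."""
    records = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            with open(os.path.join(dirpath, name)) as f:
                records.extend(f.read().splitlines())
    return sorted(records)

root = tempfile.mkdtemp(prefix="dataSetRootFolder")
write_part(root, "part1", {"1": ["a", "b"], "2": ["c"]})
write_part(root, "part2", {"1": ["d"]})   # "appended" later
print(read_data_set(root))                # → ['a', 'b', 'c', 'd']
```

The point is that appending data never rewrites existing files: a new `partX` folder is simply dropped in, and a recursive scan of the root still sees the full data set.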
>>>> >>
>>>> >> 2015-06-05 9:58 GMT+02:00 Aljoscha Krettek <aljos...@apache.org>:
>>>> >>>
>>>> >>> Hi,
>>>> >>> I think the example could be made more concise by using the Table API.
>>>> >>> http://ci.apache.org/projects/flink/flink-docs-master/libs/table.html
>>>> >>>
>>>> >>> Please let us know if you have questions about that, it is still
>>>> >>> quite new.
>>>> >>>
>>>> >>> On Fri, Jun 5, 2015 at 9:03 AM, hawin <hawin.ji...@gmail.com> wrote:
>>>> >>> > Hi Aljoscha
>>>> >>> >
>>>> >>> > Thanks for your reply.
>>>> >>> > Do you have any tips for Flink SQL?
>>>> >>> > I know that Spark supports the ORC format. How about Flink SQL?
>>>> >>> > BTW, the TPCHQuery10 example is implemented in 231 lines of code.
>>>> >>> > How could it be made as simple as possible with Flink?
>>>> >>> > I am going to use Flink in my future project. Sorry for so many
>>>> >>> > questions.
>>>> >>> > I believe that you guys will make a world of difference.
>>>> >>> >
>>>> >>> > @Chiwan
>>>> >>> > You made a very good example for me.
>>>> >>> > Thanks a lot
>>>> >>> >
>>>> >>> > --
>>>> >>> > View this message in context:
>>>> >>> > http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Re-Apache-Flink-transactions-tp1457p1494.html
>>>> >>> > Sent from the Apache Flink User Mailing List archive. mailing list
>>>> >>> > archive at Nabble.com.
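[Editor's note: Fabian's remark at the top of the thread, that Flink lacked hash-based aggregation at the time, explains the WordCount gap. The two strategies can be contrasted in a plain Python sketch; this illustrates the general technique only, not the internals of either system.]

```python
from collections import defaultdict

def sort_based_count(words):
    """Sort-based aggregation: sort all records, then count runs of
    equal keys in one pass over the sorted data. O(n log n) overall;
    this is what a sort-merge combine does."""
    counts = []
    for w in sorted(words):
        if counts and counts[-1][0] == w:
            counts[-1] = (w, counts[-1][1] + 1)
        else:
            counts.append((w, 1))
    return dict(counts)

def hash_based_count(words):
    """Hash-based aggregation: a single pass over the input, updating
    a hash table in place. O(n), and typically faster for WordCount-
    style jobs with many repeated keys."""
    counts = defaultdict(int)
    for w in words:
        counts[w] += 1
    return dict(counts)

words = "to be or not to be".split()
assert sort_based_count(words) == hash_based_count(words) == {
    "to": 2, "be": 2, "or": 1, "not": 1
}
```

Both strategies produce identical results; the difference is purely in how much work is done per record, which is why a benchmark dominated by grouped aggregation favors the hash-based engine.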