I would guess the opposite is true for highly iterative benchmarks (common in 
graph processing and data-science).

Spark has a fairly large overhead per iteration, and more optimisation and 
planning only make this worse.

Sure, people have implemented things like Dijkstra's algorithm in Spark
(a problem where the number of iterations is bounded by the diameter of 
the input graph),
but all the datasets I've seen it run on had a very small diameter 
(which is common for e.g. social networks).
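
To make the iteration bound concrete, here is a minimal sketch in plain Python
(not Spark). It assumes an unweighted graph stored as an adjacency dict and
counts the rounds a level-synchronous shortest-path search needs. The function
name and the two toy graphs are made up for illustration. The round count is
the longest shortest path from the source, which is at most the graph's
diameter.

```python
def sssp_rounds(graph, source):
    """Level-synchronous shortest paths on an unweighted graph.

    Returns (dist, rounds), where rounds counts only the iterations
    that discovered at least one new node.
    """
    dist = {source: 0}
    frontier = {source}
    rounds = 0
    while frontier:
        next_frontier = set()
        for node in frontier:
            for neighbour in graph.get(node, ()):
                if neighbour not in dist:
                    dist[neighbour] = dist[node] + 1
                    next_frontier.add(neighbour)
        if next_frontier:
            rounds += 1
        frontier = next_frontier
    return dist, rounds


# Hypothetical inputs: a 6-node path (large diameter for its size) and a
# 6-node star (small diameter, loosely like a hub in a social network).
path = {i: [j for j in (i - 1, i + 1) if 0 <= j < 6] for i in range(6)}
star = {0: [1, 2, 3, 4, 5], 1: [0], 2: [0], 3: [0], 4: [0], 5: [0]}

path_dist, path_rounds = sssp_rounds(path, 0)
star_dist, star_rounds = sssp_rounds(star, 1)
print(path_rounds, star_rounds)
```

With the same number of nodes, the star needs far fewer rounds than the path,
which is why small-diameter datasets hide the per-iteration overhead.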

Take Spark SQL for example. Catalyst is a really good query optimiser, but it 
introduces significant overhead.
Since Spark has no iterative semantics of its own (unlike Flink),
one has to materialise the intermediate DataFrame at each iteration boundary to 
determine whether a termination criterion has been reached.
This causes a huge amount of planning, especially since it looks like Catalyst 
tries to optimise the whole dependency graph regardless of caching,
and that dependency graph grows with the number of iterations and thus with 
the size of the input dataset.

In Flink, on the other hand, you can describe your entire iterative program 
through transformations without ever calling an action.
This means that the optimiser only has to do planning once.
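
A back-of-the-envelope sketch of the planning argument, in plain Python. This
is a toy model, not Spark or Flink code, and it measures nothing. It assumes
(as suspected above, not verified) that an action at every iteration boundary
re-plans the whole lineage built so far, and that a natively iterative plan
plans the loop body once. Both function names and the "plan nodes" unit are
hypothetical.

```python
def driver_loop_planned_nodes(iterations, nodes_per_iteration=1):
    """Toy model: every iteration ends in an action that re-plans the full lineage."""
    lineage = 0   # plan nodes accumulated so far
    planned = 0   # total plan nodes the optimiser has visited
    for _ in range(iterations):
        lineage += nodes_per_iteration  # transformations added this round
        planned += lineage              # the action plans everything again
    return planned


def native_iteration_planned_nodes(iterations, nodes_per_iteration=1):
    """Toy model: the loop is part of the plan, so the body is planned once.

    The iteration count is accepted only to keep the two signatures alike;
    in this model it does not affect the planning work.
    """
    return nodes_per_iteration


for n in (10, 100):
    print(n, driver_loop_planned_nodes(n), native_iteration_planned_nodes(n))
```

Under these assumptions the driver-side loop does planning work quadratic in
the number of iterations, while the single-plan variant stays constant.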

Just my 2 cents :)
Cheers, Jan

> On 06 Jul 2015, at 06:10, n...@reactor8.com wrote:
> 
> Maybe some flink benefits from some pts they outline here:
>  
> http://flink.apache.org/news/2015/05/11/Juggling-with-Bits-and-Bytes.html 
>  
> Probably if re-ran the benchmarks with 1.5/tungsten line would close the gap 
> a bit(or a lot) with spark moving towards similar style off-heap memory mgmt, 
> more planning optimizations
>  
>  
> From: Jerry Lam [mailto:chiling...@gmail.com] 
> Sent: Sunday, July 5, 2015 6:28 PM
> To: Ted Yu
> Cc: Slim Baltagi; user
> Subject: Re: Benchmark results between Flink and Spark
>  
> Hi guys,
>  
> I just read the paper too. There is no much information regarding why Flink 
> is faster than Spark for data science type of workloads in the benchmark. It 
> is very difficult to generalize the conclusion of a benchmark from my point 
> of view. How much experience the author has with Spark is in comparisons to 
> Flink is one of the immediate questions I have. It would be great if they 
> have the benchmark software available somewhere for other people to 
> experiment.
>  
> just my 2 cents,
>  
> Jerry
>  
> On Sun, Jul 5, 2015 at 4:35 PM, Ted Yu <yuzhih...@gmail.com 
> <mailto:yuzhih...@gmail.com>> wrote:
>> There was no mentioning of the versions of Flink and Spark used in 
>> benchmarking.
>>  
>> The size of cluster is quite small.
>>  
>> Cheers
>>  
>> On Sun, Jul 5, 2015 at 10:24 AM, Slim Baltagi <sbalt...@gmail.com 
>> <mailto:sbalt...@gmail.com>> wrote:
>>> Hi
>>> 
>>> Apache Flink outperforms Apache Spark in processing machine learning & graph
>>> algorithms and relational queries but not in batch processing!
>>> 
>>> The results were published in the proceedings of the 18th International
>>> Conference, Business Information Systems 2015, Poznań, Poland, June 24-26,
>>> 2015.
>>> 
>>> Thanks to our friend Google, Chapter 3: 'Evaluating New Approaches of Big
>>> Data Analytics Frameworks' by Norman Spangenberg, Martin Roth and Bogdan
>>> Franczyk is available for preview at http://goo.gl/WocQci on pages 28-37.
>>> 
>>> Enjoy!
>>> 
>>> Slim Baltagi
>>> http://www.SparkBigData.com <http://www.sparkbigdata.com/>
>>> 
>>> 
>>> 
>>> 
>>> --
>>> View this message in context: 
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Benchmark-results-between-Flink-and-Spark-tp23626.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>> 
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
>>> <mailto:user-unsubscr...@spark.apache.org>
>>> For additional commands, e-mail: user-h...@spark.apache.org 
>>> <mailto:user-h...@spark.apache.org>
