Hi Yun, that is a very complete answer. Thanks!
I was also wondering about the micro-batches that Spark creates when we set up a streaming context. Micro-batching still underlies all versions of stream processing in Spark, doesn't it? And because of that, Spark shuffles data to wide-dependency operators every time a micro-batch ends, right? Flink, in contrast, does not have micro-batches, so it shuffles data to wide-dependency operators only when a window is triggered. Am I correct?

http://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/rdd-programming-guide.html#shuffle-operations

Thanks,
Felipe
--
-- Felipe Gutierrez
-- skype: felipe.o.gutierrez
-- https://felipeogutierrez.blogspot.com

On Thu, Oct 10, 2019 at 7:25 PM Yun Tang <myas...@live.com> wrote:

> Hi Felipe
>
> Generally speaking, the key difference that impacts performance is where
> each engine stores the data belonging to a window.
> Flink stores the records and their timestamps for a window in the state
> backend. Once a window is triggered, it pulls all the stored timers with
> their associated record keys, and then uses each record key to query the
> state backend for the next actions.
>
> For Spark, first of all, we should talk about Structured Streaming, as it
> is better than the previous Spark Streaming, especially for window
> scenarios. Unlike Flink, with its built-in RocksDB state backend, Spark
> has only one implementation of a state store provider:
> HDFSBackedStateStoreProvider, which stores all of its data in memory.
> This is a very memory-consuming approach that can run into OOM errors.
>
> To avoid this, Databricks Runtime offers a 'RocksDBStateStoreProvider',
> but it is not open source. We are lucky that open-source Flink already
> offers a built-in RocksDB state backend to avoid the OOM problem.
> Moreover, the Flink community has recently been developing a spillable
> memory state backend.
>
> https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/state_backends.html
> https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#window-operations-on-event-time
> https://medium.com/@chandanbaranwal/state-management-in-spark-structured-streaming-aaa87b6c9d31
> http://apache-spark-user-list.1001560.n3.nabble.com/use-rocksdb-for-spark-structured-streaming-SSS-td34776.html#a34779
> https://github.com/chermenin/spark-states
> https://docs.databricks.com/spark/latest/structured-streaming/production.html#optimize-performance-of-stateful-streaming-queries
> https://issues.apache.org/jira/browse/FLINK-12692
>
> Best
> Yun Tang
>
> ------------------------------
> *From:* Felipe Gutierrez <felipe.o.gutier...@gmail.com>
> *Sent:* Thursday, October 10, 2019 20:39
> *To:* user <email@example.com>
> *Subject:* Difference between windows in Spark and Flink
>
> Hi all,
>
> I am trying to think about the essential differences between operators in
> Flink and Spark, especially when I use keyed windows followed by a reduce
> operation.
> In Flink we can develop an application that logically separates these two
> operators: after a keyed window I can use the
> .reduce/aggregate/fold/apply() functions.
> In Spark we have the window/reduceByKeyAndWindow functions, which to me
> appear less flexible in the options they offer with a keyed window
> operation.
> Moreover, when these two applications are deployed in a Flink and a Spark
> cluster respectively, what are the differences between their physical
> operators running in the cluster?
>
> https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/operators/windows.html#windows
> https://spark.apache.org/docs/latest/streaming-programming-guide.html#window-operations
>
> Thanks,
> Felipe
> --
> -- Felipe Gutierrez
> -- skype: felipe.o.gutierrez
> -- https://felipeogutierrez.blogspot.com
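To make the contrast in my question concrete, here is a small self-contained Python sketch. It is a toy model, not actual Spark or Flink code: one function imitates a micro-batch engine that groups data by key (the "shuffle") at the end of every batch, and the other imitates an event-time tumbling window that accumulates keyed state and only emits a result when an event's timestamp crosses the window boundary. All names and the window/batch sizes are my own illustrative choices.

```python
from collections import defaultdict

# Toy event stream: (key, event_timestamp, value)
events = [
    ("a", 1, 10), ("b", 2, 20), ("a", 3, 30),
    ("b", 6, 40), ("a", 7, 50), ("a", 9, 60),
]

def micro_batch_sums(events, batch_size):
    """Micro-batch style (toy model): every time a batch fills up, data is
    grouped by key (the 'shuffle') and a partial sum is produced,
    regardless of any window boundary."""
    shuffles = []
    for i in range(0, len(events), batch_size):
        batch = events[i:i + batch_size]
        by_key = defaultdict(int)
        for key, _ts, value in batch:
            by_key[key] += value          # group-by-key + reduce per micro-batch
        shuffles.append(dict(by_key))
    return shuffles

def window_trigger_sums(events, window_size):
    """Event-time tumbling-window style (toy model): records accumulate in
    keyed state; a result is emitted only when an event's timestamp crosses
    the current window boundary (the window 'fires')."""
    fired = []
    state = defaultdict(int)
    window_end = window_size
    for key, ts, value in sorted(events, key=lambda e: e[1]):
        while ts >= window_end:           # window triggered: emit and clear state
            fired.append(dict(state))
            state.clear()
            window_end += window_size
        state[key] += value
    fired.append(dict(state))             # flush the last open window
    return fired
```

With `batch_size=2` the micro-batch model performs three group-by-key steps (one per batch: `{"a": 10, "b": 20}`, `{"a": 30, "b": 40}`, `{"a": 110}`), whereas with `window_size=5` the window model emits only twice, at the two window boundaries (`{"a": 40, "b": 20}` and `{"a": 110, "b": 40}`). That is the shape of the difference I am asking about: in the micro-batch model the per-key exchange happens once per batch, while in the window model it happens once per window firing.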