Hi Yun,

That is a very complete answer. Thanks!

I was also wondering about the mini-batches that Spark creates when we
instantiate a StreamingContext. The mini-batch model still applies to all
flavors of stream processing in Spark, doesn't it? And because of that,
Spark shuffles data [1] to wide-dependency operators every time a
mini-batch ends, right? Flink, in contrast, does not have mini-batches, so
it shuffles data to wide-dependency operators only when a window is
triggered. Am I correct?

[1]
http://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/rdd-programming-guide.html#shuffle-operations
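To make sure I am asking the right thing, here is a toy Python sketch of my
understanding (this is not Flink or Spark code; the function names and the
count-based "windows" are purely illustrative): in a micro-batch model data
would move to a wide-dependency operator once per mini-batch, while in a
per-record model with keyed windows it would move only when a window fires.

```python
# Toy simulation (NOT real Flink/Spark code) of when data moves
# downstream to a wide-dependency operator under the two models.

def micro_batch_shuffles(records, batch_size):
    """Micro-batch model: one shuffle round at the end of every
    mini-batch, including a final partial batch."""
    full, remainder = divmod(len(records), batch_size)
    return full + (1 if remainder else 0)

def window_trigger_shuffles(records, window_size):
    """Continuous model with a count-based keyed window: data is
    emitted downstream only each time the window is triggered."""
    return len(records) // window_size

events = list(range(100))  # 100 incoming records

print(micro_batch_shuffles(events, batch_size=10))      # -> 10
print(window_trigger_shuffles(events, window_size=25))  # -> 4
```

So with the same 100 records, the micro-batch model would shuffle ten
times, while a window triggered every 25 records would emit only four
times. Is that the right mental model?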

Thanks,
Felipe
--
-- Felipe Gutierrez
-- skype: felipe.o.gutierrez
-- https://felipeogutierrez.blogspot.com


On Thu, Oct 10, 2019 at 7:25 PM Yun Tang <myas...@live.com> wrote:

> Hi Felipe
>
> Generally speaking, the key difference that impacts performance is where
> each engine stores the data within windows.
> Flink stores the window data and its related timestamps in a state
> backend [1]. Once a window is triggered, Flink pulls the stored timers
> with their coupled record keys, and then uses each record key to query the
> state backend for the next actions.
>
> For Spark, we should talk about Structured Streaming [2], since it
> improves on the older Spark Streaming, especially for window scenarios.
> Unlike Flink with its built-in RocksDB state backend, open-source Spark
> has only one state store provider implementation,
> HDFSBackedStateStoreProvider, which stores all of the data in memory.
> This is a very memory-consuming approach and can run into OOM
> errors [3][4][5].
>
> To avoid this, the Databricks Runtime offers a 'RocksDBStateStoreProvider'
> [6], but it is not open source. We are lucky that open-source Flink
> already offers a built-in RocksDB state backend to avoid the OOM problem.
> Moreover, the Flink community has recently been developing a spill-able
> heap state backend [7].
>
> [1]
> https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/state_backends.html
> [2]
> https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#window-operations-on-event-time
> [3]
> https://medium.com/@chandanbaranwal/state-management-in-spark-structured-streaming-aaa87b6c9d31
> [4]
> http://apache-spark-user-list.1001560.n3.nabble.com/use-rocksdb-for-spark-structured-streaming-SSS-td34776.html#a34779
> [5] https://github.com/chermenin/spark-states
> [6]
> https://docs.databricks.com/spark/latest/structured-streaming/production.html#optimize-performance-of-stateful-streaming-queries
> [7] https://issues.apache.org/jira/browse/FLINK-12692
>
> Best
> Yun Tang
>
>
>
> ------------------------------
> *From:* Felipe Gutierrez <felipe.o.gutier...@gmail.com>
> *Sent:* Thursday, October 10, 2019 20:39
> *To:* user <user@flink.apache.org>
> *Subject:* Difference between windows in Spark and Flink
>
> Hi all,
>
> I am trying to think about the essential differences between operators in
> Flink and Spark, especially when I use a keyed window followed by a reduce
> operation.
> In Flink we can develop an application that logically separates these two
> operators: after a keyed window I can use the
> .reduce/aggregate/fold/apply() functions [1].
> In Spark we have the window/reduceByKeyAndWindow functions, which to me
> appear less flexible in the options they offer for a keyed window
> operation [2].
> Moreover, when these two applications are deployed in a Flink and a Spark
> cluster respectively, what are the differences between their physical
> operators running in the cluster?
>
> [1]
> https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/operators/windows.html#windows
> [2]
> https://spark.apache.org/docs/latest/streaming-programming-guide.html#window-operations
>
> Thanks,
> Felipe
> --
> -- Felipe Gutierrez
> -- skype: felipe.o.gutierrez
> -- https://felipeogutierrez.blogspot.com
>