Hi Burak, my responses are inline.
Thanks a lot!

On Mon, May 22, 2017 at 9:26 AM, Burak Yavuz <brk...@gmail.com> wrote:

> Hi Kant,
>
>> 1. Can we use Spark Structured Streaming for stateless transformations
>> just like we would do with DStreams, or is Spark Structured Streaming
>> only meant for stateful computations?
>
> Of course you can do stateless transformations. Any map, filter, select
> type of transformation is stateless. Aggregations are generally stateful.
> You could also perform arbitrary stateless aggregations with "flatMapGroups
> <https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/KeyValueGroupedDataset.scala#L145>"
> or make them stateful with "flatMapGroupsWithState
> <https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/KeyValueGroupedDataset.scala#L376>".

*Got it, so Spark Structured Streaming does both stateful and stateless
transformations. In that case I am assuming the DStreams API will be
deprecated? How about groupBy? That is stateful, right? (See the first
sketch at the end of this mail for how I understand the distinction.)*

>> 2. When we use groupBy and window operations for event-time processing
>> and specify a watermark, does this mean the timestamp field in each
>> message is compared to the processing time of that machine/node, and
>> events later than the specified threshold are discarded? If we don't
>> specify a watermark, I am assuming processing time won't come into the
>> picture. Is that right? I am just trying to understand the interplay
>> between processing time and event time when we do event-time processing.
>
> Watermarks are tracked with respect to the event time of your data, not
> the processing time of the machine. Please take a look at the blog below
> for more details:
> https://databricks.com/blog/2017/05/08/event-time-aggregation-watermarking-apache-sparks-structured-streaming.html

*Thanks for this article. I am not sure if I am interpreting the article
incorrectly, but it looks like the article shows there is indeed a
relationship between processing time and event time. For example, say I set
a watermark of 10 minutes and:*

*1. I send one message with an event timestamp of May 22 2017 1 PM and a
processing time of May 22 2017 1:02 PM.*
*2. I send another message with an event time of May 22 2017 12:55 PM and a
processing time of May 23 2017 1 PM.*

*Simply put, say I am faking my event timestamps to meet the cutoff
specified by the watermark, but I am actually sending them a day or a week
later. How does Spark Structured Streaming handle this case? (A second
sketch of a watermarked window aggregation follows at the end of this
mail.)*

> Best,
> Burak
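P.S. To check my understanding of the stateless vs. stateful distinction,
here is a minimal sketch (Spark 2.x, Scala). The socket source, host/port,
and console sink are illustrative assumptions on my part, not anything from
your reply:

import org.apache.spark.sql.SparkSession

object StatelessVsStateful {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("StatelessVsStateful").getOrCreate()
    import spark.implicits._

    // Assumed source: a socket stream of text lines (for illustration only).
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()
      .as[String]

    // Stateless: flatMap/filter (like map/select) carry no state across
    // micro-batches; each record is processed independently.
    val words = lines.flatMap(_.split(" ")).filter(_.nonEmpty)

    // Stateful: groupBy + count must remember running counts across
    // micro-batches, so this aggregation keeps state.
    val counts = words.groupBy("value").count()

    val query = counts.writeStream
      .outputMode("complete") // stateful aggregations allow complete/update output
      .format("console")
      .start()

    query.awaitTermination()
  }
}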
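P.P.S. And to make the watermark question concrete, a second sketch of an
event-time windowed count with a 10-minute watermark, along the lines of the
linked blog post. The built-in "rate" test source and the column names are
assumptions for illustration:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.window

object WatermarkedWindowCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("WatermarkedWindowCount").getOrCreate()
    import spark.implicits._

    // Assumed source: the built-in "rate" test source, which emits
    // (timestamp: Timestamp, value: Long); a real job would read Kafka etc.
    val events = spark.readStream
      .format("rate")
      .option("rowsPerSecond", "10")
      .load()
      .withColumnRenamed("value", "word")

    // The watermark threshold is derived from event time, roughly
    // max(event time seen so far) minus 10 minutes, not from the wall clock.
    // A late record is dropped only once the watermark has advanced past
    // the window its event time belongs to.
    val windowedCounts = events
      .withWatermark("timestamp", "10 minutes")
      .groupBy(window($"timestamp", "5 minutes"), $"word")
      .count()

    val query = windowedCounts.writeStream
      .outputMode("update") // emit only the windows updated in each trigger
      .format("console")
      .start()

    query.awaitTermination()
  }
}

If I read the article correctly, this would mean a message arriving a day
late is still accepted as long as its (possibly faked) event time is ahead
of the current watermark, since Spark never compares it to the machine's
clock. Is that right?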