Re: couple naive questions on Spark Structured Streaming

2017-05-22 Thread kant kodali
Hi Burak,

My response is inline.

Thanks a lot!

On Mon, May 22, 2017 at 9:26 AM, Burak Yavuz  wrote:

> Hi Kant,
>
>>
>>
>> 1. Can we use Spark Structured Streaming for stateless transformations
>> just like we would with DStreams, or is Spark Structured Streaming only
>> meant for stateful computations?
>>
>
> Of course you can do stateless transformations. Any map, filter, or select
> type of transformation is stateless. Aggregations are generally stateful.
> You could also perform arbitrary stateless aggregations with
> "flatMapGroups" or make them stateful with "flatMapGroupsWithState".
>

*Got it, so Spark Structured Streaming does both stateful and stateless
transformations. In that case, I am assuming the DStreams API will be
deprecated? How about groupBy? That is stateful, right?*
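To make the stateless/stateful distinction above concrete, here is a minimal pure-Python sketch (not the Spark API): stateless operators like map and filter look at one record at a time, while a stateful aggregation such as a groupBy count must carry memory across records — which is exactly the state Spark has to checkpoint.

```python
def stateless_map_filter(records):
    # Each record is transformed and filtered independently;
    # no information survives from one record to the next.
    return [r.upper() for r in records if r.startswith("a")]

def stateful_group_count(records):
    # A running count per key must be kept between records --
    # this is the state a groupBy().count() aggregation maintains.
    counts = {}
    for r in records:
        counts[r] = counts.get(r, 0) + 1
    return counts

events = ["apple", "banana", "apple", "avocado"]
print(stateless_map_filter(events))   # ['APPLE', 'APPLE', 'AVOCADO']
print(stateful_group_count(events))   # {'apple': 2, 'banana': 1, 'avocado': 1}
```

The function names are illustrative only; in Spark the same split shows up as map/filter/select (stateless) versus groupBy aggregations and flatMapGroupsWithState (stateful).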

>
>
>
>> 2. When we use groupBy and Window operations for event-time processing
>> and specify a watermark, does this mean the timestamp field in each
>> message is compared to the processing time of that machine/node, and
>> events that are later than the specified threshold are discarded? If we
>> don't specify a watermark, I am assuming the processing time won't come
>> into the picture. Is that right? Just trying to understand the interplay
>> between processing time and event time when we do event-time processing.
>>
> Watermarks are tracked with respect to the event time of your data, not
> the processing time of the machine. Please take a look at the blog below
> for more details
> https://databricks.com/blog/2017/05/08/event-time-aggregation-watermarking-apache-sparks-structured-streaming.html
>

*Thanks for this article. I am not sure if I am interpreting the article
incorrectly, but it looks like the article shows there is indeed a
relationship between processing time and event time. For example, say I
set a watermark of 10 minutes and:*

*1. I send one message with an event timestamp of May 22 2017 1 PM and a
processing time of May 22 2017 1:02 PM.*

*2. I send another message with an event timestamp of May 22 2017 12:55 PM
and a processing time of May 23 2017 1 PM.*

*Simply put, say I am just faking my event timestamps to meet the cutoff
specified by the watermark, but I am actually sending them a day or a week
later. How does Spark Structured Streaming handle this case?*
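The scenario above can be played out with a simplified simulation of the rule described in the blog post (this is a sketch of the semantics, not the real Spark engine): the watermark is derived from the maximum event time seen so far minus the allowed delay, and processing (wall-clock) time never enters the comparison.

```python
from datetime import datetime, timedelta

class WatermarkSim:
    """Simplified sketch of the watermark rule (not Spark itself):
    watermark = max event time seen so far - allowed delay.
    Processing (wall-clock) time is never consulted."""

    def __init__(self, delay: timedelta):
        self.delay = delay
        self.max_event_time = None  # highest event time observed so far

    def offer(self, event_time: datetime) -> bool:
        """Return True if the record is accepted, False if dropped as late."""
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time
        watermark = self.max_event_time - self.delay
        return event_time >= watermark

sim = WatermarkSim(delay=timedelta(minutes=10))
# Message 1: event time May 22 1:00 PM (its arrival time is irrelevant).
print(sim.offer(datetime(2017, 5, 22, 13, 0)))    # True
# Message 2: event time May 22 12:55 PM, sent a day later. It is still
# within 10 minutes of the max event time seen (1:00 PM), so it is kept.
print(sim.offer(datetime(2017, 5, 22, 12, 55)))   # True
# A record older than the current watermark (12:50 PM) is dropped.
print(sim.offer(datetime(2017, 5, 22, 12, 45)))   # False
```

Under this model, the watermark only advances when newer event times arrive, so a record with a faked-but-in-range timestamp is accepted no matter how late it arrives in wall-clock terms; the flip side is that the engine must hold window state until new data moves the watermark forward.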

>
>
> Best,
> Burak
>

