Re: Apache Spark - Spark Structured Streaming - Watermark usage

2018-02-06 Thread M Singh
Hi Jacek:
Thanks for your response.
I am just trying to understand the fundamentals of watermarking and how it 
behaves in aggregation vs non-aggregation scenarios.



 


Re: Apache Spark - Spark Structured Streaming - Watermark usage

2018-02-06 Thread Jacek Laskowski
Hi,

What would you expect? The data is simply dropped as that's the purpose of
watermarking it. That's my understanding at least.

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
Follow me at https://twitter.com/jaceklaskowski



Re: Apache Spark - Spark Structured Streaming - Watermark usage

2018-02-05 Thread M Singh
Just checking if anyone has more details on how the watermark works in cases
where the event time is earlier than the processing timestamp.


Re: Apache Spark - Spark Structured Streaming - Watermark usage

2018-02-02 Thread M Singh
Hi Vishu/Jacek:
Thanks for your responses.
Jacek - At the moment, the current time for my use case is processing time.

Vishnu - The Spark documentation
(https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html)
does indicate that it can deduplicate using a watermark. So I believe there are
more use cases for the watermark, and that is what I am trying to find.
Thanks
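
The deduplication use mentioned above can be pictured with a plain-Python sketch (this is not Spark's `dropDuplicates`; all names are illustrative, and real Spark semantics differ in detail). The watermark bounds how long each seen key must be remembered, so the dedup state does not grow without limit:

```python
from datetime import datetime, timedelta

def dedup_with_watermark(records, allowed_delay):
    """records: (key, event_time) pairs in arrival order. Emits each key
    once; keys older than the watermark are purged from state, bounding
    memory. Records behind the watermark are dropped entirely."""
    seen = {}                 # key -> event time retained in state
    max_event_time = None
    out = []
    for key, event_time in records:
        if max_event_time is None or event_time > max_event_time:
            max_event_time = event_time
        watermark = max_event_time - allowed_delay
        # purge state for keys that can no longer receive duplicates
        seen = {k: t for k, t in seen.items() if t >= watermark}
        if event_time >= watermark and key not in seen:
            out.append((key, event_time))
            seen[key] = event_time
    return out

base = datetime(2018, 2, 2, 9, 0)
stream = [
    ("x", base),
    ("x", base),                          # duplicate: suppressed
    ("y", base + timedelta(minutes=20)),  # advances watermark past "x"
    ("x", base),                          # behind the watermark: dropped
]
emitted = dedup_with_watermark(stream, timedelta(minutes=10))
```

Here `emitted` keeps one copy each of "x" and "y"; the late arrival of "x" is dropped because its state was already purged.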
 


Re: Apache Spark - Spark Structured Streaming - Watermark usage

2018-01-31 Thread Vishnu Viswanath
Hi Mans,

The watermark in Spark is used to decide when to clear the state, so if an
event is delayed beyond the point when Spark has already cleared the state,
it will be ignored.
I recently wrote a blog post on this:
http://vishnuviswanath.com/spark_structured_streaming.html#watermark

Yes, this state is applicable to aggregations only. If you have only a map
function and don't want to process late records, you could filter based on
the event-time field, but I guess you will have to compare it against the
processing time, since there is no user-facing API to access the watermark.

-Vishnu
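
The state-clearing behavior described above can be sketched in plain Python (this is not the Spark API; names and the single-threaded model are illustrative). The watermark trails the maximum event time seen by the allowed delay, and records that fall behind it are dropped:

```python
from datetime import datetime, timedelta

def run_with_watermark(records, allowed_delay):
    """records: (key, event_time) pairs in arrival order.
    Returns (kept, dropped): a record is dropped once its event time
    falls behind the watermark, i.e. max event time seen - allowed_delay,
    mimicking state that Spark has already cleared."""
    max_event_time = None
    kept, dropped = [], []
    for key, event_time in records:
        if max_event_time is None or event_time > max_event_time:
            max_event_time = event_time
        watermark = max_event_time - allowed_delay
        if event_time < watermark:
            dropped.append((key, event_time))  # state cleared: ignored
        else:
            kept.append((key, event_time))
    return kept, dropped

base = datetime(2018, 2, 6, 12, 0)
records = [
    ("a", base),
    ("b", base + timedelta(minutes=10)),  # advances the watermark
    ("c", base + timedelta(minutes=1)),   # 9 min behind: past 5-min delay
]
kept, dropped = run_with_watermark(records, timedelta(minutes=5))
```

With a 5-minute delay, "c" arrives after the watermark has advanced to base + 5 minutes, so it is dropped while "a" and "b" are kept.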



Re: Apache Spark - Spark Structured Streaming - Watermark usage

2018-01-26 Thread Jacek Laskowski
Hi,

I'm curious how you would implement the requirement "by a certain amount of
time" without a watermark. How would you know what's current and compute the
lag? Let's forget about the watermark for a moment and see if it pops up as
an inevitable feature :)

"I am trying to filter out records which are lagging behind (based on event
time) by a certain amount of time."

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
Follow me at https://twitter.com/jaceklaskowski
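
One concrete answer to "how would you know what's current" without a watermark, per the discussion in this thread, is to compare each record's event time against the processing time at which it is seen. A minimal plain-Python sketch (illustrative names, not a Spark API):

```python
from datetime import datetime, timedelta

def drop_lagging(records, max_lag, now):
    """records: (key, event_time) pairs. Keep only records whose event
    time lags the given processing time `now` by at most max_lag."""
    return [(k, t) for k, t in records if now - t <= max_lag]

now = datetime(2018, 1, 26, 12, 0)  # processing time at filter evaluation
records = [
    ("fresh", now - timedelta(minutes=2)),  # within the allowed lag
    ("stale", now - timedelta(hours=3)),    # lags too far behind
]
kept = drop_lagging(records, max_lag=timedelta(minutes=30), now=now)
```

Note this uses the machine's clock as the notion of "current", which is exactly the processing-time comparison suggested elsewhere in the thread, rather than a data-driven watermark derived from the maximum event time seen.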



Apache Spark - Spark Structured Streaming - Watermark usage

2018-01-26 Thread M Singh
Hi:
I am trying to filter out records which are lagging behind (based on event 
time) by a certain amount of time.  
Is the watermark API applicable to this scenario (i.e., filtering lagging
records), or is it only applicable with aggregation? I could not get a clear
understanding from the documentation, which only refers to its usage with
aggregation.

Thanks
Mans