[jira] [Commented] (SPARK-24699) Watermark / Append mode should work with Trigger.Once

2018-07-11 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16539784#comment-16539784
 ] 

Apache Spark commented on SPARK-24699:
--

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/21746

> Watermark / Append mode should work with Trigger.Once
> -----------------------------------------------------
>
> Key: SPARK-24699
> URL: https://issues.apache.org/jira/browse/SPARK-24699
> Project: Spark
> Issue Type: Bug
> Components: Structured Streaming
> Affects Versions: 2.3.1
> Reporter: Chris Horn
> Priority: Major
> Attachments: watermark-once.scala, watermark-stream.scala
>
>
> I have a use case where I would like to trigger a structured streaming job 
> from an external scheduler (once every 15 minutes or so) and have it write 
> window aggregates to Kafka.
> I am able to get my code to work when running with `Trigger.ProcessingTime`, 
> but when I switch to `Trigger.Once` the watermark of the structured stream 
> is not persisted to (or is not recovered from) the checkpoint state.
> This causes the stream to never generate output because the watermark is 
> perpetually stuck at `1970-01-01T00:00:00Z`.
> I have created a failing test case in the `EventTimeWatermarkSuite`; I will 
> create a [WIP] pull request on GitHub and link it here.
>  
> It seems that even if it generated the watermark, and given the current 
> streaming behavior, I would have to trigger the job twice to generate any 
> output.
>  
> The micro-batch execution only calculates the watermark from the previous 
> batch's input and emits new aggregates based on that timestamp.
> This state is not available to a newly started `MicroBatchExecution` stream.
> Would it be an appropriate strategy to create a new checkpoint file with the 
> most up-to-date watermark, or with the watermark and query stats?
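
The watermark behavior described above can be modeled without Spark. What follows is a minimal plain-Python sketch, not Spark code: `Checkpoint`, `run_batch`, `WINDOW`, `DELAY`, and the `persist_watermark` flag are hypothetical names chosen for illustration, and the 10-second window and 5-second delay are assumed values. It assumes, as the description suggests, that each batch is evaluated against the watermark derived from earlier batches.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

WINDOW = 10  # window length in seconds (assumed value)
DELAY = 5    # watermark delay threshold in seconds (assumed value)


@dataclass
class Checkpoint:
    """Hypothetical stand-in for the state a query keeps between runs."""
    watermark: int = 0  # epoch seconds; 0 corresponds to 1970-01-01T00:00:00Z
    counts: Dict[int, int] = field(default_factory=dict)  # window start -> count


def run_batch(events: List[int], cp: Checkpoint,
              persist_watermark: bool) -> List[Tuple[int, int]]:
    """Process one batch in append mode and return the finalized windows.

    The batch is evaluated against the watermark left by earlier batches;
    the watermark computed from this batch only applies to the next one.
    """
    for ts in events:
        start = ts - ts % WINDOW
        # Ignore data that belongs to a window that is already finalized.
        if start + WINDOW > cp.watermark:
            cp.counts[start] = cp.counts.get(start, 0) + 1

    # Append mode: emit a window only once the watermark has passed its end.
    closed = sorted(s for s in cp.counts if s + WINDOW <= cp.watermark)
    output = [(s, cp.counts.pop(s)) for s in closed]

    # Advance the watermark for the next batch. With persist_watermark=False
    # the update is lost, modeling a run that restarts from the epoch.
    if events and persist_watermark:
        cp.watermark = max(cp.watermark, max(events) - DELAY)
    return output
```

With `persist_watermark=True`, a first run over events at 1, 2, 12, and 27 seconds emits nothing and leaves the watermark at 22; a second run with one event at 31 seconds then emits the windows starting at 0 and 10. With `persist_watermark=False`, the watermark stays at 0 and no run ever emits output, which matches the symptom reported for `Trigger.Once`.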



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24699) Watermark / Append mode should work with Trigger.Once

2018-06-29 Thread Chris Horn (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16528356#comment-16528356
 ] 

Chris Horn commented on SPARK-24699:


I have attached two Scala REPL scripts for reproducing this behavior. The 
"once" variant fails to produce output or updated watermarks; the "stream" 
variant behaves mostly as expected.







[jira] [Commented] (SPARK-24699) Watermark / Append mode should work with Trigger.Once

2018-06-29 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16528355#comment-16528355
 ] 

Apache Spark commented on SPARK-24699:
--

User 'c-horn' has created a pull request for this issue:
https://github.com/apache/spark/pull/21676



