[ 
https://issues.apache.org/jira/browse/SPARK-26008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tom Bar Yacov updated SPARK-26008:
----------------------------------
       Shepherd: Tathagata Das
    Description: 
Structured streaming Internal {color:#333333}StreamTest{color} class allows to 
test incremental logic and verify outputs between multiple triggers. It support 
changing the internal spark clock to get full deterministic simulation of the 
incremental state and APIs. This is not possible outside tests since 
{color:#333333}DataStreamWriter{color} hides the triggerClock parameter and is 
final.

This can be very useful not only in unit test mode but also for a real running 
query. for example when you have all the Kafka historical data persisted to 
hdfs with its Kafka timestamp and you want to "play"  the data and simulate the 
streaming application output as if  running on this data in live streaming 
including incremental output between triggers.

Currently I can simulate multiple triggers and incremental logic for some of 
the APIs, but for APIs that depend on the execution clock like 
{color:#333333}mapGroupsWithState{color} with execution based timeout I did not 
find a way to do this.

I would like to allow passing an externally controlled clock as parameter to 
DataStreamWriter and to the query itself.

  was:
Structured streaming Internal {color:#333333}StreamTest{color} class allows to 
test incremental logic and verify outputs between multiple triggers. It support 
changing the internal spark clock to get full deterministic simulation of the 
incremental state and APIs. This is not possible outside tests since 
{color:#333333}DataStreamWriter{color} hides the triggerClock parameter and is 
final.

This can be very useful not only in unit test mode but also for a real running 
query. for example when you have all the Kafka historical data persisted to 
hdfs with its Kafka timestamp and you want to "play"  the data and simulate the 
streaming application output as if  running on this data in live streaming 
including incremental output between triggers.

Today I can simulate multiple triggers and incremental logic for some of the 
APIs, but for APIs that depend on the execution clock like 
{color:#333333}mapGroupsWithState{color} with execution based timeout I did not 
find a way to do this.

Question is -  Is it a possible to support a similar solution like in 
StreamTest - Allow passing an external manual clock as parameter to 
DataStreamWriter and allowing the user an external control over this clock? 
what possible failures that can occur if running with manual clock in real 
cluster mode?

Thanks

     Issue Type: Wish  (was: Question)

> Structured Streaming Manual clock for simulation
> ------------------------------------------------
>
>                 Key: SPARK-26008
>                 URL: https://issues.apache.org/jira/browse/SPARK-26008
>             Project: Spark
>          Issue Type: Wish
>          Components: Structured Streaming
>    Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>            Reporter: Tom Bar Yacov
>            Priority: Major
>
> Structured streaming Internal {color:#333333}StreamTest{color} class allows 
> to test incremental logic and verify outputs between multiple triggers. It 
> support changing the internal spark clock to get full deterministic 
> simulation of the incremental state and APIs. This is not possible outside 
> tests since {color:#333333}DataStreamWriter{color} hides the triggerClock 
> parameter and is final.
> This can be very useful not only in unit test mode but also for a real 
> running query. for example when you have all the Kafka historical data 
> persisted to hdfs with its Kafka timestamp and you want to "play"  the data 
> and simulate the streaming application output as if  running on this data in 
> live streaming including incremental output between triggers.
> Currently I can simulate multiple triggers and incremental logic for some of 
> the APIs, but for APIs that depend on the execution clock like 
> {color:#333333}mapGroupsWithState{color} with execution based timeout I did 
> not find a way to do this.
> I would like to allow passing an externally controlled clock as parameter to 
> DataStreamWriter and to the query itself.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to