[jira] [Commented] (SPARK-30657) Streaming limit after streaming dropDuplicates can throw error
[ https://issues.apache.org/jira/browse/SPARK-30657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17028223#comment-17028223 ] Shixiong Zhu commented on SPARK-30657: -- [~tdas] Make sense. Agreed that the risk is high but the benefit is pretty low. We can backport it later whenever needed. > Streaming limit after streaming dropDuplicates can throw error > -- > > Key: SPARK-30657 > URL: https://issues.apache.org/jira/browse/SPARK-30657 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.3.4, 2.4.0, 2.4.1, 2.4.2, > 2.4.3, 2.4.4 >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Critical > Fix For: 3.0.0 > > > {{LocalLimitExec}} does not consume the iterator of the child plan. So if > there is a limit after a stateful operator like streaming dedup in append > mode (e.g. {{streamingdf.dropDuplicates().limit(5}})), the state changes of > streaming duplicate may not be committed (most stateful ops commit state > changes only after the generated iterator is fully consumed). This leads to > the next batch failing with {{java.lang.IllegalStateException: Error reading > delta file .../N.delta does not exist}} as the state store delta file was > never generated. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30657) Streaming limit after streaming dropDuplicates can throw error
[ https://issues.apache.org/jira/browse/SPARK-30657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17027910#comment-17027910 ] Tathagata Das commented on SPARK-30657: --- This fix by itself (separate from the fix for SPARK-30658) may be backported. The solution that I did to always inject StreamingLocalLimitExec is safe from correctness point of view, but is a little risky from the performance point of view (which I tried to minimize using the optimization). With 2.4.4+, unless this is a serious bug that affects many users, I dont think we should backport this. And i dont think limit on streaming is that extensively used such that this is big bug (it has not been reported for 1.5 years). What do you think [~zsxwing] > Streaming limit after streaming dropDuplicates can throw error > -- > > Key: SPARK-30657 > URL: https://issues.apache.org/jira/browse/SPARK-30657 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.3.4, 2.4.0, 2.4.1, 2.4.2, > 2.4.3, 2.4.4 >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Critical > Fix For: 3.0.0 > > > {{LocalLimitExec}} does not consume the iterator of the child plan. So if > there is a limit after a stateful operator like streaming dedup in append > mode (e.g. {{streamingdf.dropDuplicates().limit(5}})), the state changes of > streaming duplicate may not be committed (most stateful ops commit state > changes only after the generated iterator is fully consumed). This leads to > the next batch failing with {{java.lang.IllegalStateException: Error reading > delta file .../N.delta does not exist}} as the state store delta file was > never generated. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30657) Streaming limit after streaming dropDuplicates can throw error
[ https://issues.apache.org/jira/browse/SPARK-30657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17025575#comment-17025575 ] Dongjoon Hyun commented on SPARK-30657: --- Hi, [~tdas]. Can we have `2.4.5` at `Target Version`, too? > Streaming limit after streaming dropDuplicates can throw error > -- > > Key: SPARK-30657 > URL: https://issues.apache.org/jira/browse/SPARK-30657 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.3.4, 2.4.0, 2.4.1, 2.4.2, > 2.4.3, 2.4.4 >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Critical > > {{LocalLimitExec}} does not consume the iterator of the child plan. So if > there is a limit after a stateful operator like streaming dedup in append > mode (e.g. {{streamingdf.dropDuplicates().limit(5}})), the state changes of > streaming duplicate may not be committed (most stateful ops commit state > changes only after the generated iterator is fully consumed). This leads to > the next batch failing with {{java.lang.IllegalStateException: Error reading > delta file .../N.delta does not exist}} as the state store delta file was > never generated. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org