[ https://issues.apache.org/jira/browse/SPARK-42931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17705252#comment-17705252 ]
ASF GitHub Bot commented on SPARK-42931: ---------------------------------------- User 'HeartSaVioR' has created a pull request for this issue: https://github.com/apache/spark/pull/40561 > dropDuplicates within watermark > ------------------------------- > > Key: SPARK-42931 > URL: https://issues.apache.org/jira/browse/SPARK-42931 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming > Affects Versions: 3.5.0 > Reporter: Jungtaek Lim > Priority: Major > Attachments: [External] Mini design doc_ dropDuplicates within > watermark.pdf > > > We got many reports that dropDuplicates does not clean up the state even > though they have set a watermark for the query. We document the behavior > clearly that the event time column should be a part of the subset columns for > deduplication to clean up the state, but it cannot be applied to the > customers as timestamps are not exactly the same for duplicated events in > their use cases. > We propose to deduce a new API of dropDuplicates which has following > different characteristics compared to existing dropDuplicates: > * Weaker constraints on the subset (key) > ** Does not require an event time column on the subset. > * Looser semantics on deduplication > ** Only guarantee to deduplicate events within the watermark. > Since the new API leverages event time, the new API has following new > requirements: > * The input must be streaming DataFrame. > * The watermark must be defined. > * The event time column must be defined in the input DataFrame. > More specifically on the semantic, once the operator processes the first > arrived event, events arriving within the watermark for the first event will > be deduplicated. > (Technically, the expiration time should be the “event time of the first > arrived event + watermark delay threshold”, to match up with future events.) > Users are encouraged to set the delay threshold of watermark longer than max > timestamp differences among duplicated events. (If they are unsure, they can > alternatively set the delay threshold large enough, e.g. 48 hours.) > Longer design doc will be attached. -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org