HeartSaVioR opened a new pull request #28607:
URL: https://github.com/apache/spark/pull/28607


   ### What changes were proposed in this pull request?
   
   Please refer to https://issues.apache.org/jira/browse/SPARK-24634 for the 
rationale behind this issue.
   
   This patch adds a new metric that counts the inputs arriving later than the 
watermark plus allowed delay. To keep the change simple, the patch does not 
count the exact number of input rows that are later than the watermark plus 
allowed delay. Instead, it counts the inputs that are dropped in the operator's 
logic. The two numbers differ in streaming aggregation: as an optimization, 
streaming aggregation "pre-aggregates" the input rows and only then checks 
lateness against the "pre-aggregated" inputs, so the reported number may be 
lower than the number of late input rows.
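
   The gap between the two counts can be illustrated with a small, Spark-free 
simulation. This is only a sketch under stated assumptions: the function names, 
the 5-unit tumbling window, and the integer event times are all hypothetical, 
and the lateness check (window end at or before the watermark) is a 
simplification of what the operator does.

```python
from collections import Counter

WINDOW = 5  # hypothetical window length, in the same units as event time


def window_start(event_time: int) -> int:
    """Return the start of the tumbling window that contains event_time."""
    return event_time - (event_time % WINDOW)


def is_late(win_start: int, watermark: int) -> bool:
    """Treat a window as late once its end is at or before the watermark."""
    return win_start + WINDOW <= watermark


def count_late_rows(event_times, watermark):
    """Exact count: every input row that falls into a late window."""
    return sum(1 for t in event_times if is_late(window_start(t), watermark))


def count_late_after_preaggregation(event_times, watermark):
    """Count after pre-aggregation: rows are grouped per window first,
    so each late window contributes one dropped input, not one per row."""
    pre_aggregated = Counter(window_start(t) for t in event_times)
    return sum(1 for w in pre_aggregated if is_late(w, watermark))


events = [1, 2, 3, 7, 12, 13]  # hypothetical event times
watermark = 10                 # hypothetical watermark, allowed delay included

print(count_late_rows(events, watermark))                  # 4
print(count_late_after_preaggregation(events, watermark))  # 2
```

   In this sketch four rows are late, but they collapse into two pre-aggregated 
inputs, so a metric of the kind described here would report 2 rather than 4.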
   
   The new metric is exposed in two places:
   
   1. On Spark UI: check the metrics on stateful operator nodes in the query 
execution details page of the SQL tab
   2. On Streaming Query Listener: check "numLateInputs" in "stateOperators" in 
QueryProgressEvent.
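
   As a rough illustration of the second option, the sketch below pulls 
"numLateInputs" out of the "stateOperators" array of a progress report in JSON 
form. The JSON excerpt and the helper name `late_inputs_per_operator` are 
hypothetical. Only the field names "stateOperators" and "numLateInputs" come 
from this description, and the surrounding fields are shown for context only.

```python
import json

# Hypothetical excerpt of a progress report; real reports carry more fields.
progress_json = """
{
  "id": "hypothetical-query-id",
  "stateOperators": [
    {"numRowsTotal": 10, "numRowsUpdated": 3, "numLateInputs": 2},
    {"numRowsTotal": 4, "numRowsUpdated": 1, "numLateInputs": 0}
  ]
}
"""


def late_inputs_per_operator(progress: str) -> list:
    """Return numLateInputs for each stateful operator, in report order.

    Operators that do not report the metric are counted as 0.
    """
    report = json.loads(progress)
    return [op.get("numLateInputs", 0) for op in report.get("stateOperators", [])]


print(late_inputs_per_operator(progress_json))  # [2, 0]
```

   A non-zero entry suggests that the corresponding operator dropped inputs in 
that batch, which is the signal to revisit the allowed lateness.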
   
   ### Why are the changes needed?
   
   Dropping late inputs means that end users might not get the outputs they 
expect. End users may be aware of this and tolerate the result (that is what 
allowed lateness is for), but they should still be able to observe whether the 
current value of allowed lateness causes inputs to be dropped, so that they can 
adjust the value.
   
   Also, when a single query contains multiple stateful operators, Spark 
dropping late inputs "between" these operators becomes a "correctness" issue. 
Spark should disallow such a possibility, but given that we already provide the 
flexibility, we should at least provide a way for end users to observe the 
correctness issue and decide whether they should correct their query.
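
   One way such drops "between" operators can arise is sketched below. This is 
a simplified, Spark-free model, not the actual operator code: it assumes both 
operators share a single watermark, that the first operator emits a window only 
after the watermark has passed its end, and that the second operator applies the 
same lateness check to what it receives. All names and values are hypothetical.

```python
WINDOW = 5  # hypothetical window length, in the same units as event time


def first_operator_output(window_starts, watermark):
    """Emit only the windows that the watermark has already closed."""
    return [w for w in window_starts if w + WINDOW <= watermark]


def second_operator_late_inputs(emitted_windows, watermark):
    """Return the inputs the downstream operator would consider late."""
    return [w for w in emitted_windows if w + WINDOW <= watermark]


watermark = 10  # hypothetical watermark shared by both operators
emitted = first_operator_output([0, 5, 10], watermark)
dropped = second_operator_late_inputs(emitted, watermark)

print(emitted)  # [0, 5]
print(dropped)  # [0, 5]
```

   Under these assumptions everything the first operator emits is already late 
for the second one, which is the kind of silent loss the new metric is meant to 
make visible.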
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes. End users will be able to retrieve information about late inputs in 
two ways:
   
   1. SQL tab in Spark UI
   2. Streaming Query Listener
   
   ### How was this patch tested?
   
   New UTs were added, and existing UTs were modified to reflect the change.




