[ 
https://issues.apache.org/jira/browse/SPARK-24050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-24050.
-----------------------------------
       Resolution: Fixed
    Fix Version/s: 3.0.0

Issue resolved by pull request 21126
[https://github.com/apache/spark/pull/21126]

> StreamingQuery does not calculate input / processing rates in some cases
> ------------------------------------------------------------------------
>
>                 Key: SPARK-24050
>                 URL: https://issues.apache.org/jira/browse/SPARK-24050
>             Project: Spark
>          Issue Type: Bug
>          Components: Structured Streaming
>    Affects Versions: 2.1.0, 2.1.1, 2.1.2, 2.2.0, 2.2.1, 2.3.0
>            Reporter: Tathagata Das
>            Assignee: Tathagata Das
>            Priority: Major
>             Fix For: 3.0.0
>
>
> In some streaming queries, the input and processing rates are not calculated 
> at all (shows up as zero) because MicroBatchExecution fails to associated 
> metrics from the executed plan of a trigger with the sources in the logical 
> plan of the trigger. The way this executed-plan-leaf-to-logical-source 
> attribution works is as follows. With V1 sources, there was no way to 
> identify which execution plan leaves were generated by a streaming source. So 
> did a best-effort attempt to match logical and execution plan leaves when the 
> number of leaves were same. In cases where the number of leaves is different, 
> we just give up and report zero rates. An example where this may happen is as 
> follows.
> {code}
> val cachedStaticDF = someStaticDF.union(anotherStaticDF).cache()
> val streamingInputDF = ...
> val query = streamingInputDF.join(cachedStaticDF).writeStream....
> {code}
> In this case, the {{cachedStaticDF}} has multiple logical leaves, but in the 
> trigger's execution plan it only has leaf because a cached subplan is 
> represented as a single InMemoryTableScanExec leaf. This leads to a mismatch 
> in the number of leaves causing the input rates to be computed as zero. 
> With DataSourceV2, all inputs are represented in the executed plan using 
> {{DataSourceV2ScanExec}}, each of which has a reference to the associated 
> logical {{DataSource}} and {{DataSourceReader}}. So its easy to associate the 
> metrics to the original streaming sources. So the solution is to take 
> advantage of the presence of DataSourceV2 whenever possible.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to