Tathagata Das created SPARK-24050:
-------------------------------------

             Summary: StreamingQuery does not calculate input / processing 
rates in some cases
                 Key: SPARK-24050
                 URL: https://issues.apache.org/jira/browse/SPARK-24050
             Project: Spark
          Issue Type: Bug
          Components: Structured Streaming
    Affects Versions: 2.3.0, 2.2.1, 2.2.0, 2.1.2, 2.1.1, 2.1.0
            Reporter: Tathagata Das
            Assignee: Tathagata Das


In some streaming queries, the input and processing rates are not calculated at 
all (shows up as zero) because MicroBatchExecution fails to associated metrics 
from the executed plan of a trigger with the sources in the logical plan of the 
trigger. The way this executed-plan-leaf-to-logical-source attribution works is 
as follows. With V1 sources, there was no way to identify which execution plan 
leaves were generated by a streaming source. So did a best-effort attempt to 
match logical and execution plan leaves when the number of leaves were same. In 
cases where the number of leaves is different, we just give up and report zero 
rates. An example where this may happen is as follows.
{code}
val cachedStaticDF = someStaticDF.union(anotherStaticDF).cache()
val streamingInputDF = ...

val query = streamingInputDF.join(cachedStaticDF).writeStream....
{code}

In this case, the {{cachedStaticDF}} has multiple logical leaves, but in the 
trigger's execution plan it only has leaf because a cached subplan is 
represented as a single InMemoryTableScanExec leaf. This leads to a mismatch in 
the number of leaves causing the input rates to be computed as zero. 

With DataSourceV2, all inputs are represented in the executed plan using 
{{DataSourceV2ScanExec}}s, each of which has a reference to the associated 
logical {{DataSource}} and {{DataSourceReader}}. So its easy to associate the 
metrics to the original streaming sources. So the solution is to take advantage 
of the presence of DataSourceV2 whenever possible.





--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to