Hi all,
I would like to ask for help understanding a potential issue with the
Observation API in PySpark. I have searched StackOverflow and the Spark user@
archives but couldn’t find this specific case described.
Environment
* PySpark 3.5.0
* Scala 2.12
* Reproduced on both:
* Personal Databricks cluster
* Standard Databricks compute cluster
Description of the issue
I see inconsistent behavior when using DataFrame.observe() depending on
whether the observed DataFrame becomes empty after applying transformations.
To illustrate, here are three minimal scenarios with simplified sketches below
(the full code and a testing notebook are in this GitHub repo:
https://github.com/MatusL666/-Spark-SQL-Observation-on-empty-after-filter-DF-never-materializes-PySpark-3.5.0-Scala-2.12-).
1. Observing an originally empty DataFrame → works as expected
Calling an action (e.g., count()) correctly triggers materialization of the
observation.
Result:
count = 0
observation = {'num_rows': 0}
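A condensed sketch of this case (simplified from the notebook; the schema, data
values, and variable names are illustrative placeholders, and spark is the
Databricks-provided SparkSession):

    from pyspark.sql import Observation
    from pyspark.sql.functions import count, lit

    # An originally empty DataFrame
    empty_df = spark.createDataFrame([], "id INT")

    obs = Observation("metrics")
    observed = empty_df.observe(obs, count(lit(1)).alias("num_rows"))

    print(observed.count())  # 0 -- the action triggers the observation
    print(obs.get)           # {'num_rows': 0}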
2. Observing a non-empty DF (with intermediate filters) → also works
Both observations fire and return correct counts.
Result:
count = 2
observations = {'num_rows': 3}, {'num_rows': 2}
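Simplified sketch (again with placeholder data; the values are chosen only so
the counts match the results above):

    from pyspark.sql import Observation
    from pyspark.sql.functions import col, count, lit

    df = spark.createDataFrame([(1,), (2,), (3,), (4,)], ["id"])

    obs1 = Observation("after_first_filter")
    obs2 = Observation("after_second_filter")

    step1 = df.filter(col("id") > 1).observe(obs1, count(lit(1)).alias("num_rows"))
    step2 = step1.filter(col("id") > 2).observe(obs2, count(lit(1)).alias("num_rows"))

    print(step2.count())  # 2
    print(obs1.get)       # {'num_rows': 3}
    print(obs2.get)       # {'num_rows': 2}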
3. Observing a DF that becomes empty after additional filters → observation
never materializes
This is the problematic case:
* The DF becomes empty after the second filter.
* The call to ob.get hangs indefinitely.
* No exception is thrown.
* Observations are never materialized.
The behavior is reproducible and consistently blocks execution.
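Simplified sketch of the failing case (same placeholder data as above; the only
change is that the second filter removes every remaining row):

    from pyspark.sql import Observation
    from pyspark.sql.functions import col, count, lit

    df = spark.createDataFrame([(1,), (2,), (3,), (4,)], ["id"])

    obs1 = Observation("after_first_filter")
    obs2 = Observation("after_second_filter")

    step1 = df.filter(col("id") > 1).observe(obs1, count(lit(1)).alias("num_rows"))
    # This filter matches nothing, so the observed DF is empty
    step2 = step1.filter(col("id") > 100).observe(obs2, count(lit(1)).alias("num_rows"))

    print(step2.count())  # the action runs on the empty result
    print(obs2.get)       # hangs here indefinitely; the observation never materializes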
Expected vs. actual behavior
Expected:
The observation on the empty-after-filter DataFrame should behave as in cases
(1) and (2): it should materialize once an action is triggered and return the
observed metrics (here, the row counts after each filter).
Actual:
Execution hangs with no progress and no materialized observation.
Questions
1. Is this a known limitation or bug in the Observation implementation?
2. Is the behavior intentional (e.g., due to how the observation plan is
inserted into the optimized logical plan)?
3. Are there recommended workarounds for collecting metrics on DataFrames
that may become empty after filters?
Any guidance or insights are appreciated.
Thank you!
Best regards,
Matúš Letko
Data Scientist in Data & AI | PwC | Digital Enablement
m: +420 737 380 962 | e: [email protected]
PricewaterhouseCoopers Česká republika, s.r.o.
City Green Court, Hvězdova 1734/2c, 140 00 Praha 4
"Privacy statement / Ochrana osobních údajů (hyperlink:
https://www.pwc.com/cz/cs/o-nas/ochrana-osobnich-udaju.html) The information
transmitted is intended only for the person or entity to which it is addressed
and may contain confidential and/or privileged material. Any review,
retransmission, dissemination or other use of, or taking of any action in
reliance upon, this information by persons or entities other than the intended
recipient is prohibited. If you received this in error, please contact the
sender and delete the material from any computer. Please familiarize yourself
with our privacy policy."