Thanks to all for the quick replies; they helped a lot. To answer a few of
the follow-up questions:
> 1. How did you fix this performance issue, which I gather you did programmatically?
The main problem in my original code was that the logic was not being executed when it should have been. This
'if'
> Do Spark SQL queries depend directly on the RDD lineage even when the
> final results have been cached?
Yes. If one of the nodes holding cached data later fails, Spark would need
to rebuild that state by recomputing it from the lineage.
You could try checkpointing occasionally and see if that helps.
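To illustrate why occasional checkpointing can help, here is a toy model in
plain Java. It is not the Spark API: LineageSketch, Node, and their methods
are names made up for this sketch. Each refresh run adds one step to the
chain needed to rebuild a value, and checkpoint() materialises the value and
drops the chain. In Spark itself the rough equivalent would be calling
Dataset.checkpoint() after setting a directory with
SparkContext.setCheckpointDir(); how often to do so is something to tune.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongUnaryOperator;

// Toy model only: none of these names are Spark classes.
class LineageSketch {

    /** A value plus the chain of steps needed to rebuild it from its source. */
    static final class Node {
        private final long source;
        private final List<LongUnaryOperator> steps;

        Node(long source) {
            this(source, new ArrayList<>());
        }

        private Node(long source, List<LongUnaryOperator> steps) {
            this.source = source;
            this.steps = steps;
        }

        /** Add one transformation; the new node inherits the whole chain. */
        Node map(LongUnaryOperator step) {
            List<LongUnaryOperator> extended = new ArrayList<>(steps);
            extended.add(step);
            return new Node(source, extended);
        }

        /** Rebuild the value from the source, as after losing a cached copy. */
        long recompute() {
            long value = source;
            for (LongUnaryOperator step : steps) {
                value = step.applyAsLong(value);
            }
            return value;
        }

        /** Materialise the current value and drop the chain that produced it. */
        Node checkpoint() {
            return new Node(recompute());
        }

        int lineageLength() {
            return steps.size();
        }
    }

    public static void main(String[] args) {
        Node node = new Node(1L);
        for (int run = 1; run <= 6; run++) {
            node = node.map(v -> v + 1);   // one refresh run adds one step
            if (run % 3 == 0) {
                node = node.checkpoint();  // truncate the chain every third run
            }
            System.out.println("run " + run + ": lineage length " + node.lineageLength());
        }
    }
}
```

Without the checkpoint() call the chain grows by one step on every run, so a
rebuild after a failure gets more expensive over time. With it, the chain
never holds more than two steps in this example.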
On Sat, 22 May 2021,
Hi Fred,
You said you managed to fix the problem somehow and have attributed some
issues to RDD lineage. A few things come to mind:
1. How did you fix this performance issue, which I gather you did programmatically?
2. In your code, have you set spark.conf.set("spark.sql.adaptive.enabled",
"true")?
Hi all,
Working on a complex Spark 3.0.1 application, I noticed some unexpected
Spark behavior recently that I am hoping someone can explain. The
application is Java with many large classes, but I have tried to describe
the essential logic below.
During periodic refresh runs, the application