Re: Spark query performance of cached data affected by RDD lineage

2021-05-24 Thread fwy
Thanks to all for the quick replies; they helped a lot. To answer a few of the follow-up questions ... > 1. How did you fix this performance which I gather programmatically The main problem in my original code was that the logic was not being executed when it should have been. This 'if'

Re: Spark query performance of cached data affected by RDD lineage

2021-05-24 Thread Sebastian Piu
> Do Spark SQL queries depend directly on the RDD lineage even when the final results have been cached? Yes, if one of the nodes holding cached data later fails, Spark would need to rebuild that state somehow. You could try checkpointing occasionally and see if that helps. On Sat, 22 May 2021,

Re: Spark query performance of cached data affected by RDD lineage

2021-05-24 Thread Mich Talebzadeh
Hi Fred, You said you managed to fix the problem somehow and have attributed some issues to RDD lineage. A few things come to mind: 1. How did you fix this performance which I gather programmatically 2. In your code have you set spark.conf.set("spark.sql.adaptive.enabled", "true")

Spark query performance of cached data affected by RDD lineage

2021-05-22 Thread Fred Yeadon
Hi all, Working on a complex Spark 3.0.1 application, I noticed some unexpected Spark behavior recently that I am hoping someone can explain. The application is Java with many large classes, but I have tried to describe the essential logic below. During periodic refresh runs, the application