Sorry, I thought I gave an explanation. The issue you are encountering, incorrect record numbers in the "Shuffle Write Size / Records" column in the Spark DAG UI when data is read from cache/persist, is a known limitation. The discrepancy arises from the way Spark accounts for and reports shuffle data when cached data is involved: the byte size shown is accurate, but the record count can be misreported for stages that read their input from cached blocks.
Mich Talebzadeh,
Technologist | Architect | Data Engineer | Generative AI | FinCrime
London, United Kingdom

View my LinkedIn profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
https://en.everybodywiki.com/Mich_Talebzadeh

*Disclaimer:* The information provided is correct to the best of my knowledge but of course cannot be guaranteed. It is essential to note that, as with any advice, "one test result is worth one-thousand expert opinions" (Wernher von Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>).

On Sun, 26 May 2024 at 21:16, Prem Sahoo <prem.re...@gmail.com> wrote:

> Can anyone please assist me?
>
> On Fri, May 24, 2024 at 12:29 AM Prem Sahoo <prem.re...@gmail.com> wrote:
>
>> Does anyone have a clue?
>>
>> On Thu, May 23, 2024 at 11:40 AM Prem Sahoo <prem.re...@gmail.com> wrote:
>>
>>> Hello Team,
>>> In the Spark DAG UI we have a Stages tab. Once you click on a stage you
>>> can view its tasks.
>>>
>>> Each task has a column "Shuffle Write Size / Records". That column
>>> prints wrong data when the stage gets its input from cache/persist: it
>>> typically shows the wrong record number even though the data size is
>>> correct, e.g. 3.2G / 7400, which is wrong.
>>>
>>> Please advise.