Hi all, I noticed something a bit strange. When working with a cached DataFrame, the SQL query details graph in the UI starts from the point where the cache takes place, and doesn't show the transformations before it. For example, this code
>>> df = sc.parallelize([[1,2,3],[1,4,5]]).toDF(['id','a','b'])
>>> renameCols = [f"`{col}` as `{col}_other`" for col in df.columns]
>>> df_cart = df.crossJoin(df.selectExpr(renameCols))
>>> df = df_cart.groupBy("id").sum("a", "b")
>>> df = df.cache()
>>> df = df.selectExpr("id", "`sum(a)` * 2 as a", "`sum(b)` * 2 as b")
>>> df.show()

produces one query with this physical plan:

== Physical Plan ==
CollectLimit 21
+- *(1) Project [cast(id#0L as string) AS id#58, cast((sum(a)#26L * 2) as string) AS a#59, cast((sum(b)#27L * 2) as string) AS b#60]
   +- *(1) ColumnarToRow
      +- InMemoryTableScan [id#0L, sum(a)#26L, sum(b)#27L]
            +- InMemoryRelation [id#0L, sum(a)#26L, sum(b)#27L], StorageLevel(disk, memory, deserialized, 1 replicas)
                  +- *(4) HashAggregate(keys=[id#0L], functions=[sum(a#1L), sum(b#2L)], output=[id#0L, sum(a)#26L, sum(b)#27L])
                     +- Exchange hashpartitioning(id#0L, 4), true, [id=#30]
                        +- *(3) HashAggregate(keys=[id#0L], functions=[partial_sum(a#1L), partial_sum(b#2L)], output=[id#0L, sum#33L, sum#34L])
                           +- CartesianProduct
                              :- *(1) Scan ExistingRDD[id#0L,a#1L,b#2L]
                              +- *(2) Project
                                 +- *(2) Scan ExistingRDD[id#0L,a#1L,b#2L]

But the visual graph representation is just [image: image.png]

Is this done on purpose? I'd rather see the whole thing. This is on Spark 3.0.1.