Hi all,

I noticed something a bit strange. When working with a cached DataFrame, the SQL
query details graph in the Spark UI starts at the point where the cache is
read, and doesn't show the transformations before it. For example, this code

>>> df = sc.parallelize([[1,2,3],[1,4,5]]).toDF(['id','a','b'])
>>> renameCols = [f"`{col}` as `{col}_other`" for col in df.columns]
>>> df_cart = df.crossJoin(df.selectExpr(renameCols))
>>> df = df_cart.groupBy("id").sum("a", "b")
>>> df = df.cache()
>>> df = df.selectExpr("id", "`sum(a)` * 2 as a", "`sum(b)` * 2 as b")
>>> df.show()

produces one query with this physical plan:

== Physical Plan ==
CollectLimit 21
+- *(1) Project [cast(id#0L as string) AS id#58, cast((sum(a)#26L * 2) as string) AS a#59, cast((sum(b)#27L * 2) as string) AS b#60]
   +- *(1) ColumnarToRow
      +- InMemoryTableScan [id#0L, sum(a)#26L, sum(b)#27L]
            +- InMemoryRelation [id#0L, sum(a)#26L, sum(b)#27L], StorageLevel(disk, memory, deserialized, 1 replicas)
                  +- *(4) HashAggregate(keys=[id#0L], functions=[sum(a#1L), sum(b#2L)], output=[id#0L, sum(a)#26L, sum(b)#27L])
                     +- Exchange hashpartitioning(id#0L, 4), true, [id=#30]
                        +- *(3) HashAggregate(keys=[id#0L], functions=[partial_sum(a#1L), partial_sum(b#2L)], output=[id#0L, sum#33L, sum#34L])
                           +- CartesianProduct
                              :- *(1) Scan ExistingRDD[id#0L,a#1L,b#2L]
                              +- *(2) Project
                                 +- *(2) Scan ExistingRDD[id#0L,a#1L,b#2L]

But the visual graph representation is just:

[image: image.png]
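
For what it's worth, the textual plan above does keep the pre-cache stages
nested under InMemoryRelation, so the lineage information is still there; it's
only the graph that truncates it. As a sanity check (nothing UI-specific, just
the public DataFrame.explain API), printing the plan before cache() is called
shows the full lineage:

>>> # Run before df.cache(): prints the full plan, including the
>>> # CartesianProduct and both HashAggregate stages
>>> df_cart.groupBy("id").sum("a", "b").explain()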
Is this done on purpose? I'd rather see the whole plan in the graph...
This is on Spark 3.0.1.
