Hi all,

I noticed something a bit strange. When working with a cached DataFrame, the SQL
query details graph in the Spark UI starts at the point where the cache is
read, and doesn't show the transformations before it. For example, this code

>>> df = sc.parallelize([[1,2,3],[1,4,5]]).toDF(['id','a','b'])
>>> renameCols = [f"`{col}` as `{col}_other`" for col in df.columns]
>>> df_cart = df.crossJoin(df.selectExpr(renameCols))
>>> df = df_cart.groupBy("id").sum("a", "b")
>>> df = df.cache()
>>> df = df.selectExpr("id", "`sum(a)` * 2 as a", "`sum(b)` * 2 as b")
>>> df.show()

produces one query with this physical plan:

== Physical Plan ==
CollectLimit 21
+- *(1) Project [cast(id#0L as string) AS id#58, cast((sum(a)#26L * 2) as string) AS a#59, cast((sum(b)#27L * 2) as string) AS b#60]
   +- *(1) ColumnarToRow
      +- InMemoryTableScan [id#0L, sum(a)#26L, sum(b)#27L]
            +- InMemoryRelation [id#0L, sum(a)#26L, sum(b)#27L], StorageLevel(disk, memory, deserialized, 1 replicas)
                  +- *(4) HashAggregate(keys=[id#0L], functions=[sum(a#1L), sum(b#2L)], output=[id#0L, sum(a)#26L, sum(b)#27L])
                     +- Exchange hashpartitioning(id#0L, 4), true, [id=#30]
                        +- *(3) HashAggregate(keys=[id#0L], functions=[partial_sum(a#1L), partial_sum(b#2L)], output=[id#0L, sum#33L, sum#34L])
                           +- CartesianProduct
                              :- *(1) Scan ExistingRDD[id#0L,a#1L,b#2L]
                              +- *(2) Project
                                 +- *(2) Scan ExistingRDD[id#0L,a#1L,b#2L]

But the visual graph representation is just:

[image: image.png]
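
For what it's worth, the textual plan above does keep the pre-cache stages
nested under InMemoryRelation, so the lineage information is still there; it's
only the graph that truncates it. As a sanity check (nothing UI-specific, just
the public DataFrame.explain API), printing the plan before cache() is called
shows the full lineage:

>>> # Run before df.cache(): prints the full plan, including the
>>> # CartesianProduct and both HashAggregate stages
>>> df_cart.groupBy("id").sum("a", "b").explain()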
Is this done on purpose? I'd rather see the whole plan in the graph...
This is on Spark 3.0.1.
