Hello all! I'm working with PySpark, trying to reproduce through streaming processes some of the results we currently get from batch jobs, just as a PoC for now. For this, I'm thinking of interpreting the execution plan and eventually writing it back to Python (I'm doing something similar with pandas as well, and I'd like both approaches to be as similar as possible).
Let me clarify with an example. Suppose that, starting from a `geometry.csv` file with `width` and `height` columns, I want to calculate the `area` like this:

    >>> geometry = spark.read.csv('geometry.csv', header=True)
    >>> geometry = geometry.withColumn('area', F.col('width') * F.col('height'))

I would like to extract from the execution plan the fact that `area` is calculated as the product of `width` and `height`. One possibility would be to parse the execution plan:

    >>> geometry.explain(True)
    ...
    == Optimized Logical Plan ==
    Project [width#45, height#46, (cast(width#45 as double) * cast(height#46 as double)) AS area#64]
    +- Relation [width#45,height#46] csv
    ...

From the Project line of the Optimized Logical Plan we can parse the formula "area = width * height" and then write the function back in any language. However, even though I can get the logical plan as a string, there has to be some internal representation that I could leverage to avoid the string parsing. Do you know if/how I can access that internal representation from Python? I've been trying to navigate the Scala source code to find it, but this is definitely beyond my area of expertise, so any pointers would be more than welcome.

Thanks in advance,
Pablo
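
PS: to make the question a bit more concrete, here is a sketch of the kind of access I have in mind, going through the py4j gateway. I'm not sure this is a supported route: `_jdf` is an internal attribute, and the `queryExecution()` / `optimizedPlan()` / `toJSON()` calls are just my guesses from browsing the Scala side, so please treat them as assumptions rather than something I know works:

    >>> import json
    >>> qe = geometry._jdf.queryExecution()      # _jdf: underlying Java Dataset, reached via py4j
    >>> optimized = qe.optimizedPlan()           # Catalyst LogicalPlan (a TreeNode) on the JVM side
    >>> plan = json.loads(optimized.toJSON())    # plan tree rendered as JSON, instead of parsing the string

If something along these lines is workable (or if there is a better-supported entry point), that would be exactly what I'm after.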