Hey everyone,

By "dry run" I mean the ability to validate the execution plan without
actually executing it. I was wondering whether this already exists in Spark;
I couldn't find it anywhere.

If it doesn't exist, I'd like to propose adding such a feature to Spark.

Why is it useful?
1. Faster testing: When using PySpark, or Spark on Scala/Java without the
Dataset API, we are prone to typos in column names and other logical
mistakes. Unfortunately, IDEs can't help much with these, and when dealing
with big data, testing by actually running the code takes a long time. A dry
run would surface such mistakes very quickly.

2. (Continuous) integrity checks: When there are upstream and downstream
pipelines, we can detect breaking changes much faster by running the
downstream pipelines in "dry run" mode (see the sketch after this list).
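To illustrate what point 2 could look like from the user's side, here is a
minimal usage sketch. Nothing here exists in Spark today: the
spark.sql.dryRun.enabled key is only a hypothetical name for the proposed
switch, and downstream_pipeline and the paths are placeholders for a real
downstream job.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("downstream-dry-run").getOrCreate()

# Hypothetical switch proposed in this thread: analyze/optimize the plan,
# but skip execution. This config key does NOT exist in Spark today.
spark.conf.set("spark.sql.dryRun.enabled", "true")

def downstream_pipeline(spark):
    # Placeholder for a real downstream job that reads what upstream wrote.
    df = spark.read.parquet("/warehouse/upstream_output")
    return df.select("user_id", "event_time").groupBy("user_id").count()

# In dry-run mode this would fail fast on schema or column-name mismatches,
# without scanning any data or writing any output.
downstream_pipeline(spark).write.mode("overwrite").parquet("/warehouse/daily_counts")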

I believe it is not too hard to implement, and I volunteer to work on it if
the community approves this feature request.

It can be tackled in different ways; I have two ideas for the implementation:
1. A no-op (NoOp) execution engine
2. On reads, only infer the schema and substitute an empty table with the
same schema (a rough sketch of this idea is below)
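To make idea 2 more concrete, here is a rough user-level sketch of the same
trick done by hand today, assuming a Parquet source and a pipeline that takes
a DataFrame as input; paths and column names are placeholders. Inside Spark
this substitution would instead happen transparently at the read path.

from pyspark.sql import SparkSession, DataFrame

spark = SparkSession.builder.appName("dry-run-sketch").getOrCreate()

def read_for_dry_run(path: str) -> DataFrame:
    # Infer only the schema from the source (cheap for Parquet, which keeps
    # it in file metadata), then build an empty DataFrame with that schema.
    schema = spark.read.parquet(path).schema
    return spark.createDataFrame([], schema)

# Example pipeline: a column-name typo or type mismatch below fails during
# analysis, even though no data is ever scanned.
events = read_for_dry_run("/warehouse/events")
daily = (events
         .select("user_id", "event_time")
         .groupBy("user_id")
         .count())
daily.collect()  # runs against the empty input, so it returns immediately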

Thanks,
Ali
