Hey everyone,
By "dry run" I mean the ability to validate a job's execution plan without actually executing it. I was wondering whether this exists in Spark; I couldn't find it anywhere. If it doesn't exist, I'd like to propose adding such a feature.

Why is it useful?

1. Faster testing: when using PySpark, or Spark on Scala/Java without the Dataset API, we are prone to typos in column names and other logical mistakes. Unfortunately IDEs won't help much here, and when dealing with big data, testing by actually running the code takes a long time. A dry run would surface such typos very quickly.
2. (Continuous) integrity checks: when there are upstream and downstream pipelines, we can catch breaking changes much faster by running the downstream pipelines in "dry run" mode.

I believe it is not so hard to implement, and I volunteer to work on it if the community approves this feature request. It could be tackled in different ways; I have two ideas for the implementation:

1. A no-op execution engine.
2. On reads, just infer the schema and substitute an empty table with the same schema.

Thanks,
Ali