Hey Ramiro,

Thank you for your detailed answer. We have a similar framework that does the same thing, and I have seen very good results with it. However, pipelines written as plain Spark apps have to be changed to adopt such a framework, and that takes a lot of effort. This is why I'm suggesting adding it to Spark core, so it's available to everyone out of the box.
- Ali

On Mon, Oct 4, 2021 at 1:35 PM Ramiro Laso <[email protected]> wrote:

> Hello Ali! I've implemented a dry run in my data pipeline using a schema
> repository. My pipeline takes a "dataset descriptor", which is a JSON
> describing the dataset you want to build, loads some "entities", applies
> some transformations and then writes the final dataset.
> The "dataset descriptor" is where users can make mistakes, as are any
> steps they reimplement inside the pipeline. So, to perform a dry run, we
> first separated the actions from the transformations. Each step inside
> the pipeline has "input", "transform" and "write" methods. When we want
> to "dry run" a pipeline, we obtain the schemas of the entities and build
> "empty RDDs" that we use as input to the pipeline. Finally, we just
> trigger an action to test that all selected columns and queries in the
> "dataset descriptor" are OK.
> This is how you can create an empty dataset:
>
> emp_RDD: RDD = spark.sparkContext.emptyRDD()
> df = spark.createDataFrame(emp_RDD, schema)
>
> Ramiro.
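To make Ramiro's snippet self-contained, here is a minimal sketch of the empty-DataFrame dry run he describes. The schema and the query below are invented for illustration; in his setup the schema would come from the schema repository rather than being hardcoded.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("dry-run-sketch").getOrCreate()

# Stand-in for a schema fetched from a schema repository (hypothetical).
schema = StructType([
    StructField("user_id", IntegerType(), True),
    StructField("country", StringType(), True),
])

# A zero-row DataFrame with the production schema replaces the real input.
empty_df = spark.createDataFrame(spark.sparkContext.emptyRDD(), schema)

# The pipeline's transformations run against the empty input. A typo such
# as select("usr_id") fails here with an AnalysisException, because Spark
# analyzes the plan eagerly when the transformation is defined.
result = empty_df.select("user_id", "country").where("country = 'AR'")

# A cheap action exercises the whole (empty) plan end to end.
result.collect()  # returns []

Since no rows exist, the action returns almost immediately, which is what makes this usable as a fast check in CI.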
> On Thu, Sep 30, 2021 at 11:48 AM Mich Talebzadeh <[email protected]> wrote:
>
>> Ok thanks.
>>
>> What is your experience of VS Code (in terms of capabilities), as it is
>> becoming a standard tool available in cloud workspaces like Amazon
>> WorkSpaces?
>>
>> Mich
>>
>> On Thu, 30 Sept 2021 at 15:43, Ali Behjati <[email protected]> wrote:
>>
>>> Not anything specific in my mind. Any IDE that is open to plugins
>>> (e.g. VS Code and JetBrains) could use it to validate execution plans
>>> in the background and mark syntax errors based on the result.
>>>
>>> On Thu, Sep 30, 2021 at 4:40 PM Mich Talebzadeh <[email protected]> wrote:
>>>
>>>> What IDEs do you have in mind?
>>>>
>>>> On Thu, 30 Sept 2021 at 15:20, Ali Behjati <[email protected]> wrote:
>>>>
>>>>> Yeah, it doesn't remove the need for testing on sample data. It
>>>>> would be more of a syntax check than a test. I have seen that
>>>>> syntax errors occur a lot.
>>>>>
>>>>> Maybe after having a dry run we will be able to build some
>>>>> automation around basic syntax checking for IDEs too.
>>>>>
>>>>> On Thu, Sep 30, 2021 at 4:15 PM Sean Owen <[email protected]> wrote:
>>>>>
>>>>>> If testing, wouldn't you actually want to execute things, even if
>>>>>> at a small scale, on a sample of data?
>>>>>>
>>>>>> On Thu, Sep 30, 2021 at 9:07 AM Ali Behjati <[email protected]> wrote:
>>>>>>
>>>>>>> Hey everyone,
>>>>>>>
>>>>>>> By "dry run" I mean the ability to validate the execution plan
>>>>>>> from code without executing it. I was wondering whether this
>>>>>>> exists in Spark or not; I couldn't find it anywhere.
>>>>>>>
>>>>>>> If it doesn't exist, I want to propose adding such a feature to
>>>>>>> Spark.
>>>>>>>
>>>>>>> Why is it useful?
>>>>>>> 1. Faster testing: When using PySpark, or Spark on Scala/Java
>>>>>>> without Dataset, we are prone to typos and mistakes in column
>>>>>>> names and other logical problems. Unfortunately, IDEs don't help
>>>>>>> much, and when dealing with big data, testing by running the code
>>>>>>> takes a long time. A dry run would surface typos very quickly.
>>>>>>>
>>>>>>> 2. (Continuous) integrity checks: When there are upstream and
>>>>>>> downstream pipelines, we can catch breaking changes much faster
>>>>>>> by running the downstream pipelines in "dry run" mode.
>>>>>>>
>>>>>>> I believe it is not so hard to implement, and I volunteer to work
>>>>>>> on it if the community approves this feature request.
>>>>>>>
>>>>>>> It can be tackled in different ways. I have two ideas for the
>>>>>>> implementation:
>>>>>>> 1. A noop (no-op) executor engine
>>>>>>> 2. On reads, just infer the schema and replace the source with an
>>>>>>> empty table with the same schema
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Ali
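For reference, both ideas can be approximated today with public APIs, which may help scope the proposal. Spark 3.x already ships a built-in "noop" write format that executes the full plan and discards the output, which is close to idea 1 on the write side; idea 2 can be emulated in user code by reading only the schema and substituting an empty relation. A rough sketch, with a placeholder path and column name:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dry-run-ideas").getOrCreate()

# Idea 2: obtain the schema from the source (for Parquet this only needs
# file metadata), then swap in a zero-row DataFrame with the same schema.
schema = spark.read.parquet("/data/events").schema  # placeholder path
empty_events = spark.createDataFrame([], schema)

# Downstream logic is analyzed against the real schema but touches no
# rows; a misspelled column name fails fast with an AnalysisException.
report = empty_events.groupBy("event_type").count()  # placeholder column

# Idea 1 (closest existing facility): the "noop" sink runs the plan and
# throws the results away, so the write side can be dry-run too.
report.write.format("noop").mode("overwrite").save()

Neither workaround skips execution entirely the way a first-class dry-run mode could, but they show the building blocks already exist.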
