Hey Ramiro,

Thank you for your detailed answer. We have a similar framework that does the same thing, and I have seen very good results with it. However, pipelines written as plain Spark apps have to be changed to adopt such a framework, and that takes a lot of effort. This is why I'm suggesting adding it to Spark core, so it's available to everyone out of the box.
- Ali

On Mon, Oct 4, 2021 at 1:35 PM Ramiro Laso <[email protected]> wrote:

> Hello Ali! I've implemented a dry run in my data pipeline using a schema
> repository. My pipeline takes a "dataset descriptor", which is a JSON
> describing the dataset you want to build, loads some "entities", applies
> some transformations and then writes the final dataset.
> The "dataset descriptor" is where users can make mistakes, as are any
> steps they reimplement inside the pipeline. So, to perform a dry run, we
> first separated the actions from the transformations. Each step inside
> the pipeline has "input", "transform" and "write" methods. When we want
> to "dry run" a pipeline, we obtain the schemas of the entities and build
> "empty RDDs" that we use as input to the pipeline. Finally, we just
> trigger an action to test that all selected columns and queries in the
> "dataset descriptor" are OK.
> This is how you can create an empty dataset:
>
> emp_RDD: RDD = spark.sparkContext.emptyRDD()
> df = spark.createDataFrame(emp_RDD, schema)
>
> Ramiro.
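To make Ramiro's snippet self-contained, here is a minimal sketch of the empty-DataFrame dry run he describes. The schema and the query below are invented for illustration; in his setup the schema would come from the schema repository rather than being hardcoded.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("dry-run-sketch").getOrCreate()

# Stand-in for a schema fetched from a schema repository (hypothetical).
schema = StructType([
    StructField("user_id", IntegerType(), True),
    StructField("country", StringType(), True),
])

# A zero-row DataFrame with the production schema replaces the real input.
empty_df = spark.createDataFrame(spark.sparkContext.emptyRDD(), schema)

# The pipeline's transformations run against the empty input. A typo such
# as select("usr_id") fails here with an AnalysisException, because Spark
# analyzes the plan eagerly when the transformation is defined.
result = empty_df.select("user_id", "country").where("country = 'AR'")

# A cheap action exercises the whole (empty) plan end to end.
result.collect()  # returns []

Since no rows exist, the action returns almost immediately, which is what makes this usable as a fast check in CI.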
> On Thu, Sep 30, 2021 at 11:48 AM Mich Talebzadeh <[email protected]> wrote:
>
>> Ok thanks.
>>
>> What is your experience of VS Code (in terms of capabilities), as it is
>> becoming a standard tool available in cloud workspaces like Amazon
>> WorkSpaces?
>>
>> Mich
>>
>> On Thu, 30 Sept 2021 at 15:43, Ali Behjati <[email protected]> wrote:
>>
>>> Not anything specific in my mind. Any IDE that is open to plugins
>>> (e.g. VS Code and JetBrains) could use it to validate execution plans
>>> in the background and mark syntax errors based on the result.
>>>
>>> On Thu, Sep 30, 2021 at 4:40 PM Mich Talebzadeh <[email protected]> wrote:
>>>
>>>> What IDEs do you have in mind?
>>>>
>>>> On Thu, 30 Sept 2021 at 15:20, Ali Behjati <[email protected]> wrote:
>>>>
>>>>> Yeah, it doesn't remove the need for testing on sample data. It
>>>>> would be more of a syntax check than a test. I have seen that
>>>>> syntax errors occur a lot.
>>>>>
>>>>> Maybe after having a dry run we will be able to build some
>>>>> automation around basic syntax checking for IDEs too.
>>>>>
>>>>> On Thu, Sep 30, 2021 at 4:15 PM Sean Owen <[email protected]> wrote:
>>>>>
>>>>>> If testing, wouldn't you actually want to execute things, even if
>>>>>> at a small scale, on a sample of data?
>>>>>>
>>>>>> On Thu, Sep 30, 2021 at 9:07 AM Ali Behjati <[email protected]> wrote:
>>>>>>
>>>>>>> Hey everyone,
>>>>>>>
>>>>>>> By "dry run" I mean the ability to validate the execution plan
>>>>>>> from code without executing it. I was wondering whether this
>>>>>>> exists in Spark or not; I couldn't find it anywhere.
>>>>>>>
>>>>>>> If it doesn't exist, I want to propose adding such a feature to
>>>>>>> Spark.
>>>>>>>
>>>>>>> Why is it useful?
>>>>>>> 1. Faster testing: When using PySpark, or Spark on Scala/Java
>>>>>>> without Dataset, we are prone to typos and mistakes in column
>>>>>>> names and other logical problems. Unfortunately, IDEs don't help
>>>>>>> much, and when dealing with big data, testing by running the code
>>>>>>> takes a long time. A dry run would surface typos very quickly.
>>>>>>>
>>>>>>> 2. (Continuous) integrity checks: When there are upstream and
>>>>>>> downstream pipelines, we can catch breaking changes much faster
>>>>>>> by running the downstream pipelines in "dry run" mode.
>>>>>>>
>>>>>>> I believe it is not so hard to implement, and I volunteer to work
>>>>>>> on it if the community approves this feature request.
>>>>>>>
>>>>>>> It can be tackled in different ways. I have two ideas for the
>>>>>>> implementation:
>>>>>>> 1. A noop (no-op) executor engine
>>>>>>> 2. On reads, just infer the schema and replace the source with an
>>>>>>> empty table with the same schema
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Ali
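For reference, both ideas can be approximated today with public APIs, which may help scope the proposal. Spark 3.x already ships a built-in "noop" write format that executes the full plan and discards the output, which is close to idea 1 on the write side; idea 2 can be emulated in user code by reading only the schema and substituting an empty relation. A rough sketch, with a placeholder path and column name:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dry-run-ideas").getOrCreate()

# Idea 2: obtain the schema from the source (for Parquet this only needs
# file metadata), then swap in a zero-row DataFrame with the same schema.
schema = spark.read.parquet("/data/events").schema  # placeholder path
empty_events = spark.createDataFrame([], schema)

# Downstream logic is analyzed against the real schema but touches no
# rows; a misspelled column name fails fast with an AnalysisException.
report = empty_events.groupBy("event_type").count()  # placeholder column

# Idea 1 (closest existing facility): the "noop" sink runs the plan and
# throws the results away, so the write side can be dry-run too.
report.write.format("noop").mode("overwrite").save()

Neither workaround skips execution entirely the way a first-class dry-run mode could, but they show the building blocks already exist.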
