Re: Modularising Spark/Scala program

2020-05-02 Thread Stephen Boesch
I neglected to include the rationale: the assumption is that this will be a repeatedly needed process, so a reusable method would be helpful. The predicate/input rules that are supported will need to be flexible enough to cover the range of input data domains and use cases. For my workflows the predic

Re: Modularising Spark/Scala program

2020-05-02 Thread Stephen Boesch
Hi Mich! I think you can combine the good/rejected into one method that internally:
- Creates good/rejected DFs given an input DF and input rules/predicates to apply to the DF.
- Creates a third DF containing the good rows and the rejected rows with the bad columns nulled out.
- Ap
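The combined method Stephen describes could be sketched roughly as below. This is only an illustration, not code from the thread: the method name `validateDf`, the `rules` shape (column name mapped to a validity predicate), and the use of `unionByName` to merge the good rows back with the cleaned rejects are all my assumptions; Stephen's third step is truncated in the archive, so it is not shown.

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{col, lit, when}

// rules: column name -> predicate that must hold for that column's value
def validateDf(df: DataFrame, rules: Map[String, Column]): DataFrame = {
  // a row is "good" only when every rule holds
  val allValid = rules.values.reduce(_ && _)
  val good     = df.filter(allValid)
  val rejected = df.filter(!allValid)
  // in rejected rows, null out exactly the columns that failed their rule
  val cleaned = rules.foldLeft(rejected) { case (acc, (c, valid)) =>
    acc.withColumn(c, when(valid, col(c)).otherwise(lit(null)))
  }
  // third DF: good rows plus rejected rows with bad columns nulled out
  good.unionByName(cleaned)
}
```

Keeping the predicates as `Column` expressions means the caller can pass anything Spark SQL can evaluate, which matches the flexibility requirement mentioned earlier in the thread.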

Modularising Spark/Scala program

2020-05-02 Thread Mich Talebzadeh
Hi, I have a Spark Scala program created and compiled with Maven. It works fine. It basically does the following:
1. Reads an XML file from an HDFS location
2. Creates a DF on top of what it reads
3. Creates a new DF with some columns renamed etc.
4. Creates a new DF for rejected rows (i
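The steps Mich lists could look something like the sketch below. Everything specific here is an assumption on my part: the XML read uses the Databricks spark-xml package (which must be on the classpath), and the `rowTag`, HDFS path, column names, and the rejection rule (a not-null check) are placeholders, since the original message does not show them.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// Steps 3-4: rename columns, then split off rejected rows.
// Column names and the rejection rule are illustrative only.
def renameAndReject(df: DataFrame): (DataFrame, DataFrame) = {
  val renamed  = df.withColumnRenamed("old_name", "new_name")
  val rejected = renamed.filter(col("new_name").isNull)
  val good     = renamed.filter(col("new_name").isNotNull)
  (good, rejected)
}

// Steps 1-2: read the XML file from HDFS into a DF.
// "rowTag" must match the element that delimits one record.
def readXml(spark: SparkSession, path: String): DataFrame =
  spark.read
    .format("com.databricks.spark.xml")
    .option("rowTag", "record")   // assumed tag name
    .load(path)                   // e.g. an hdfs:// path
```

Factoring the rename/reject logic into its own function is what makes the later suggestion in this thread (one reusable method driven by predicates) a natural next step.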