Hi all,
I'm writing an ETL process with Spark 1.5, and I was wondering about the best way to do something. Many of the fields I am processing require an algorithm similar to this:

    Join the input dataframe to a lookup table.
    if (that lookup fails (the joined fields are null)) {
        Look up into some other table to join some other fields.
    }

With DataFrames, it seems the only way to do this is something like this:

    Join the input dataframe to a lookup table.
    if (that lookup fails (the joined fields are null)) {
        *SPLIT the dataframe into two DFs via DataFrame.filter()
         (one group with a successful lookup, the other failed).*
        For the failed group: {
            Look up into some other table to grab the other fields.
        }
        *MERGE the dataframe splits back together via DataFrame.unionAll().*
    }

As you might imagine, I'm seeing some really large execution plans in the Spark UI, and the processing time seems way out of proportion with the size of the dataset (~250 GB in 9 hours).

Is this the best approach to implement an algorithm like this? Note also that some fields I am implementing require multiple staged split/merge steps due to cascading lookup joins.

Thanks,

Michael Sesterhenn
msesterh...@cars.com
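
P.S. To make the question concrete, here is roughly what one of those split/merge stages looks like. This is only a minimal sketch against the Spark 1.5 Scala DataFrame API; the names (input, primaryLookup, fallbackLookup) and the "id"/"value" columns are placeholders, not my real schema.

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.col

    // One lookup-with-fallback stage, expressed as split / join / merge.
    // "id" is the join key and "value" is the field being looked up;
    // both names are placeholders.
    def lookupWithFallback(input: DataFrame,
                           primaryLookup: DataFrame,   // columns: id, value
                           fallbackLookup: DataFrame   // columns: id, value
                          ): DataFrame = {

      // Rename the lookup key so the joined columns are unambiguous.
      val primary = primaryLookup.withColumnRenamed("id", "lkp_id")
      val joined  = input.join(primary, input("id") === primary("lkp_id"), "left_outer")

      // SPLIT: rows where the primary lookup succeeded vs. failed.
      val matched   = joined.filter(col("lkp_id").isNotNull).drop("lkp_id")
      val unmatched = joined.filter(col("lkp_id").isNull).drop("lkp_id").drop("value")

      // For the failed rows only, try the fallback table instead.
      val fallback  = fallbackLookup.withColumnRenamed("id", "fb_id")
      val recovered = unmatched
        .join(fallback, unmatched("id") === fallback("fb_id"), "left_outer")
        .drop("fb_id")

      // MERGE: unionAll is positional, and both halves end up with the
      // same column order here (input columns followed by "value").
      matched.unionAll(recovered)
    }

Each field that needs a fallback repeats this pattern, and some fields chain several of these stages, which is why the plans grow so quickly.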