Hi all,

I'm writing an ETL process with Spark 1.5, and I was wondering about the best way
to do something.


A lot of the fields I am processing require an algorithm similar to this:


Join the input DataFrame to a lookup table.

if (that lookup fails (the joined fields are null)) {

    Lookup into some other table to join some other fields.

}
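
In DataFrame terms that first step is a left outer join, and a "failed" lookup
shows up as nulls in the joined columns.  Roughly (the table and column names
below are just placeholders for illustration, not my real schema):

import org.apache.spark.sql.{Column, DataFrame}

// Placeholder names: "key" on the input, "lkpKey" / "value1" on the lookup table.
def firstLookup(input: DataFrame, lookup1: DataFrame): DataFrame =
  input.join(lookup1, input("key") === lookup1("lkpKey"), "left_outer")

// A failed lookup is a row of the joined result whose looked-up column is null.
def lookupFailed(joined: DataFrame): Column = joined("value1").isNull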


With DataFrames, it seems the only way to do this is something like the following:


Join the input DataFrame to a lookup table.

if (that lookup fails (the joined fields are null)) {

   *SPLIT the DataFrame into two DFs via DataFrame.filter()

      (one group with a successful lookup, the other with a failed one).*

   For the failed lookups: {

       Lookup into some other table to grab some other fields.

   }

   *MERGE the DataFrame splits back together via DataFrame.unionAll().*

}
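
Concretely, per field that ends up looking something like the sketch below
(Spark 1.5 DataFrame API; the table names, column names, and join keys are
placeholders, not my real schema):

import org.apache.spark.sql.DataFrame

// Placeholder schema: input("key"), lookup1("lkpKey", "value1"),
// lookup2("altKey", "value2").
def lookupWithFallback(input: DataFrame,
                       lookup1: DataFrame,
                       lookup2: DataFrame): DataFrame = {
  // Primary lookup: left outer join so rows with no match survive with nulls.
  val joined = input.join(lookup1, input("key") === lookup1("lkpKey"), "left_outer")

  // SPLIT via filter(): one DF where the lookup hit, one where it missed.
  val hit  = joined.filter(joined("value1").isNotNull).drop("lkpKey")
  val miss = joined.filter(joined("value1").isNull).drop("lkpKey").drop("value1")

  // Secondary lookup for the failed rows only.
  val recovered = miss.join(lookup2, miss("key") === lookup2("altKey"), "left_outer")
                      .drop("altKey")
                      .withColumnRenamed("value2", "value1")

  // MERGE the two halves back together (the schemas line up positionally).
  hit.unionAll(recovered)
}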


As you might imagine, I'm seeing some really large execution plans in the Spark
UI, and the processing time seems way out of proportion to the size of the
dataset (~250 GB in 9 hours).


Is this the best approach to implement an algorithm like this?  Note also that 
some fields I am implementing require multiple staged split/merge steps due to 
cascading lookup joins.
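
To make that concrete: when the fallback table can itself miss, I end up repeating
the split/merge for the next level, roughly like this (again building on the
placeholder names above):

import org.apache.spark.sql.DataFrame

// Hypothetical third level: lookup3("cKey", "value3") is another placeholder table.
def secondFallback(afterFirst: DataFrame, lookup3: DataFrame): DataFrame = {
  val resolved     = afterFirst.filter(afterFirst("value1").isNotNull)
  val stillMissing = afterFirst.filter(afterFirst("value1").isNull).drop("value1")

  val recovered = stillMissing
    .join(lookup3, stillMissing("key") === lookup3("cKey"), "left_outer")
    .drop("cKey")
    .withColumnRenamed("value3", "value1")

  // Every extra cascade level adds another filter/join/unionAll layer to the plan.
  resolved.unionAll(recovered)
}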


Thanks,


Michael Sesterhenn

msesterh...@cars.com
