Greetings, I am trying to implement a classic star schema ETL pipeline using Spark SQL 1.2.1. I am running into problems with shuffle joins on dimension tables that have skewed keys and are too large for Spark to broadcast. A minimal sketch of the shape of the job is below.
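(Paths, table names, and column names in all the snippets are placeholders, not my real schema.)

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("star-schema-etl"))
val sqlContext = new SQLContext(sc)

// Placeholder tables: an ~80G fact table and an ~2G dimension table,
// both Parquet. The dimension is well over
// spark.sql.autoBroadcastJoinThreshold, so the join below falls back
// to a shuffle join, and one hot customer_id drags ~95% of the fact
// rows onto a single reducer.
sqlContext.parquetFile("hdfs:///warehouse/fact").registerTempTable("fact")
sqlContext.parquetFile("hdfs:///warehouse/dim_customer").registerTempTable("dim_customer")

val joined = sqlContext.sql("""
  SELECT f.order_id, f.amount, d.segment
  FROM fact f
  JOIN dim_customer d ON f.customer_id = d.customer_id
""")
```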
I have a few questions:

1. Can I split my queries so that a single skewed key gets processed by multiple reducer tasks? I have tried this (using a UNION; a sketch of what I tried follows my sign-off), but I am always left with 199 of 200 tasks complete while the last one times out and eventually starts throwing memory errors. That single task is processing roughly 95% of the 80G fact table for the one skewed key.

2. Does 1.3.2 or 1.4 have any enhancements that can help? I tried to use 1.3.1, but SPARK-6967 prevents me from doing so. Now that 1.4 is available, would any of its join enhancements help this situation?

3. Do you have suggestions for memory configuration if I wanted to broadcast 2G dimension tables? Is this even feasible? Do table broadcasts wind up in the executor heap or in dedicated storage space? (A sketch of the knobs I have in mind also follows my sign-off.)

Thanks for your help,
Jon
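Here is roughly the UNION split I tried for question 1 (placeholder names again; 42 stands in for the actual hot key). As far as I can tell, the second branch still hashes every row for the hot key to one reducer, which would explain the stall:

```scala
// Split the hot key out of the main join, then UNION ALL the halves.
// All customer_id = 42 rows still land on a single reducer in the
// second branch, so the skew does not actually go away.
val split = sqlContext.sql("""
  SELECT f.order_id, f.amount, d.segment
  FROM fact f JOIN dim_customer d ON f.customer_id = d.customer_id
  WHERE f.customer_id != 42
  UNION ALL
  SELECT f.order_id, f.amount, d.segment
  FROM fact f JOIN dim_customer d ON f.customer_id = d.customer_id
  WHERE f.customer_id = 42
""")
```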
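And for question 3, these are the knobs I have in mind; the values are guesses, which is exactly what I would like advice on:

```scala
// All values here are guesses. The heap sizes have to go on the
// command line, since the driver JVM is already up by the time
// application code runs, e.g.:
//   spark-submit --driver-memory 8g --executor-memory 16g \
//     --conf spark.storage.memoryFraction=0.5 ...

// Raise the broadcast threshold toward 2G. It seems to be parsed as
// an Int, so I stay just under Int.MaxValue:
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", "2000000000")
```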