I'm not a software engineer by training and I hope that there's an existing best practice for the problem I'm trying to solve. I'm using Spark 2.4.5, Hadoop 2.7, Hive 1.2.
I have a large table (terabytes) from an external source (which is beyond my control) where the data is stored in a key-value format with an array of values:

    | id | val        |
    +----+------------+
    | k1 | <v1,v2,v3> |
    | k2 | <v2>       |
    | k3 | <v3,v5>    |

I want to invert the map so that I have a collection of keys for each value (let's assume I don't care about uniqueness):

    | id | val        |
    +----+------------+
    | v1 | <k1>       |
    | v2 | <k1, k2>   |
    | v3 | <k1, k3>   |
    | v5 | <k3>       |

It seems like a lot of shuffle is required somehow, but I'm not sure what the best approach is. I've written solutions using the DataFrame API (with explode(), groupBy() and collect_set()) and with RDDs, but both are very expensive. Is there a best-practice technique for this kind of operation?

My leading thought so far is to restage the data in a partitioned, bucketed flat table as an intermediary step, but that too is costly in terms of disk space and transform time.

Thanks,
Patrick
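P.S. For concreteness, here is the inversion I mean at toy scale in plain Python. This only pins down the semantics; it mirrors the explode()/groupBy()/collect_set() pipeline shape, not the distributed job, and the identifiers are just the ones from the example above:

    from collections import defaultdict

    def invert(table):
        """Invert a key -> [values] mapping into value -> [keys]."""
        inverted = defaultdict(list)
        for key, values in table.items():
            for v in values:             # corresponds to explode()
                inverted[v].append(key)  # corresponds to groupBy(value) + collecting keys
        return dict(inverted)

    source = {"k1": ["v1", "v2", "v3"], "k2": ["v2"], "k3": ["v3", "v5"]}
    print(invert(source))
    # {'v1': ['k1'], 'v2': ['k1', 'k2'], 'v3': ['k1', 'k3'], 'v5': ['k3']}
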