I'm reading in data from a single file and computing a grouping of the records. 
Later computations in my program operate on one group at a time. (E.g., I might 
do frequent itemset mining of the members within each group.) How do I tell 
Spark that all members of a specific group should live on the same node / be 
processed by the same executor? I'm OK with a single large reshuffle once I've 
determined the grouping, but after that there should be no further data 
movement if possible.
(I know that Spark handles its own partitioning, but I'd like all of the 
partitions that contain a particular group to end up on the same executor.)
Thanks,
Michael