I have a fairly large data set that I need to perform a GroupByKey on.
This is by far the most time-consuming part of my pipeline, and I'm looking
for ways to optimize it.  The data is largely static and only changes
periodically, so it pains me to wait on the GBK every time I run the
pipeline.  Is there any way to cache the result of the operation and load
the data already grouped on each run?
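
To make the idea concrete, here's roughly what I have in mind, as a plain-Python
sketch outside the pipeline framework (the cache filename and helper names are
just made up for illustration): group once, pickle the grouped result to disk,
and on later runs load the pre-grouped data instead of regrouping.

```python
import pickle
from collections import defaultdict
from pathlib import Path

CACHE = Path("grouped_cache.pkl")  # hypothetical cache location

def group_by_key(pairs):
    """Group an iterable of (key, value) pairs into {key: [values]}."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)

def load_or_group(pairs):
    """Return cached grouped data if present; otherwise group and cache it."""
    if CACHE.exists():
        return pickle.loads(CACHE.read_bytes())
    groups = group_by_key(pairs)
    CACHE.write_bytes(pickle.dumps(groups))
    return groups
```

Something equivalent inside the pipeline (write the grouped output to files,
then read it back in later runs) would work just as well for my purposes.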

thanks
--Cory