While investigating performance problems in a Spark Streaming application
that uses updateStateByKey, I found that serialization of state was a
meaningful (though not dominant) portion of our execution time.
In StateDStream.scala, serialized persistence is hard-coded:
    super.persist(StorageLevel.MEMORY_ONLY_SER)
I can see why that might be a good choice for a default. For our
workload, I made a clone that uses StorageLevel.MEMORY_ONLY. I've just
completed some tests and it is indeed faster, with the expected cost of
greater memory usage. For us that would be a good tradeoff.
I'm not taking on any particular extra risk by doing this, am I?
Should this be configurable? Perhaps yet another signature for
PairDStreamFunctions.updateStateByKey?
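
A possible alternative to cloning the class, sketched here under assumptions and not verified against every Spark version: DStream.persist(level) appears to simply overwrite the stream's storage level as long as the StreamingContext has not been started, so calling it on the stream returned by updateStateByKey may be enough to replace the MEMORY_ONLY_SER set in the StateDStream constructor. The object name, update function, socket source, port, and checkpoint path below are all hypothetical and only for illustration.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
// On older Spark 1.x releases the pair-stream implicits may also need:
// import org.apache.spark.streaming.StreamingContext._

object StateStorageSketch {
  // Hypothetical update function: keeps a running count per key.
  def updateCount(newValues: Seq[Int], state: Option[Int]): Option[Int] =
    Some(state.getOrElse(0) + newValues.sum)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("StateStorageSketch")
    val ssc = new StreamingContext(conf, Seconds(1))
    // updateStateByKey requires a checkpoint directory (path is a placeholder).
    ssc.checkpoint("/tmp/state-storage-sketch")

    // Placeholder input: one word per line from a local socket.
    val pairs = ssc.socketTextStream("localhost", 9999).map(word => (word, 1))

    val counts = pairs.updateStateByKey[Int](updateCount _)

    // Assumption: because the context has not started yet, this replaces the
    // MEMORY_ONLY_SER level that StateDStream sets on itself at construction.
    counts.persist(StorageLevel.MEMORY_ONLY)

    counts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```

If that assumption holds, the storage level is already configurable per stream without a new updateStateByKey signature; if it does not hold on a given version, the cloned StateDStream remains the fallback.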
Thanks for sharing any thoughts,
Jef