While investigating performance problems in a Spark Streaming application that uses updateStateByKey, I found that serialization of state was a meaningful (though not dominant) portion of our execution time.

In StateDStream.scala, serialized persistence is hard-coded:

    super.persist(StorageLevel.MEMORY_ONLY_SER)

I can see why that might be a good choice for a default. For our workload, I made a clone of the class that uses StorageLevel.MEMORY_ONLY instead. I've just completed some tests and it is indeed faster, with the expected cost of greater memory usage. For us that would be a good tradeoff.

I'm not taking any particular extra risks by doing this, am I?

Should this be configurable? Perhaps through yet another signature for PairDStreamFunctions.updateStateByKey?

Thanks for sharing any thoughts.

Jef