While investigating performance challenges in a Streaming application using
UpdateStateByKey, I found that serialization of state was a meaningful (not
dominant) portion of our execution time.

In StateDStream.scala, serialized persistence is required:

     super.persist(StorageLevel.MEMORY_ONLY_SER)

I can see why that might be a good choice for a default.    For our
workload, I made a clone that uses StorageLevel.MEMORY_ONLY.   I've just
completed some tests and it is indeed faster, with the expected cost of
greater memory usage.   For us that would be a good tradeoff.

I'm not taking any particular extra risks by doing this, am I?

Should this be configurable?  Perhaps yet another signature for
PairDStreamFunctions.updateStateByKey?

Thanks for sharing any thoughts-

Jef

Reply via email to