Hey,

I'm running a few Structured Streaming jobs (on Spark 3.5.0) that require
near-real-time accuracy, with trigger intervals on the order of 5-10
seconds. I usually run 3-6 streaming queries per job, and each query
includes at least one stateful operation (usually two or more).
My checkpoint location is an S3 bucket, and I use RocksDB as the state
store. Unfortunately, checkpointing costs are quite high: they are the
main cost item of the system, roughly 4-5 times the cost of compute.

To cut these costs, the following things are usually recommended:

   - increase the trigger interval (as mentioned, I don't have much room here)
   - decrease the number of shuffle partitions (I already run with 2x the
   number of workers)
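
For reference, this is roughly how those two knobs look in my setup (the
values are illustrative, not my exact ones):

```
# spark-defaults / cluster config
spark.sql.shuffle.partitions=16      # ~2x worker count, as noted above

# trigger interval is set per query in code, e.g. in PySpark:
#   query = df.writeStream \
#       .trigger(processingTime="10 seconds") \
#       .option("checkpointLocation", "s3://<bucket>/<path>") \
#       .start()
```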

I'm looking for other recommendations to save on checkpointing costs
specifically. Looking at the S3 request breakdown, most of the requests
are LIST requests. Can those be cut down somehow? I'm using Databricks --
if I replace the S3 bucket with DBFS, will that help in any way?

Thank you!
Andrzej
