Thank you very much for your suggestions. Yes, my main concern is checkpointing costs.
I went through your suggestions and here are my comments:

- Caching: Can you please explain what you mean by caching? I know that when you have batch and streaming sources in a streaming query, you can try to cache the batch ones to save on reads. But I'm not sure that's what you mean, and I don't know how to apply your suggestion to streaming data.

- Optimize Checkpointing Frequency: I'm already using changelog checkpointing with RocksDB and have increased the trigger interval to the maximum acceptable value.

- Minimize LIST Requests: That's where I can get the most savings. My LIST requests account for ~70% of checkpointing costs, and from what I see they are ~2.5x the number of PUT requests. Unfortunately, when I changed the checkpoint location to DBFS, it didn't help with minimizing LIST requests; they stayed at roughly the same level. From what I see, the S3 Optimized Committer is EMR-specific, so I can't use it in Databricks. The fact that I don't see a difference between the S3 and DBFS checkpoint locations suggests that both must implement the same or a similar committer.

- Optimizing RocksDB: I still need to do this, but I don't expect it to help much. From what I understand, these settings shouldn't have a significant impact on the number of requests to S3.

Any other ideas on how to limit the number of LIST requests are appreciated.
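For reference, this is roughly how each of my queries is wired up today, with changelog checkpointing and the longer trigger interval already in place. It's a simplified sketch, not my actual job: the Kafka broker, topic, bucket path, aggregation and console sink are placeholders, and it assumes the Kafka connector package is available on the cluster.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("streaming-job")
    # RocksDB state store with changelog checkpointing (available since Spark 3.4)
    .config("spark.sql.streaming.stateStore.providerClass",
            "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider")
    .config("spark.sql.streaming.stateStore.rocksdb.changelogCheckpointing.enabled", "true")
    # Shuffle partitions kept at ~2x the number of workers
    .config("spark.sql.shuffle.partitions", "8")
    .getOrCreate()
)

events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # placeholder broker
    .option("subscribe", "events")                     # placeholder topic (one or two per query)
    .load()
)

# Each query has at least one stateful operation; a plain running count stands in for it here.
counts = events.groupBy("key").count()

query = (
    counts.writeStream
    .outputMode("update")
    .format("console")  # stand-in sink
    .option("checkpointLocation", "s3://my-bucket/checkpoints/query-1")  # placeholder; each query has its own directory
    .trigger(processingTime="10 seconds")  # longest interval acceptable for near-real-time
    .start()
)

Even with this setup, the cost profile is dominated by LIST requests as described above, which is why I'm asking how to cut them down further.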
On Sun, 7 Jan 2024 at 15:38, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> OK I assume that your main concern is checkpointing costs.
>
> - Caching: If your queries read the same data multiple times, caching the data might reduce the amount of data that needs to be checkpointed.
>
> - Optimize Checkpointing Frequency, i.e.:
>
>    - Consider Changelog Checkpointing with RocksDB. This can potentially reduce checkpoint size and duration by only storing state changes, rather than the entire state.
>    - Adjust Trigger Interval (if possible): While not ideal for your near-real-time requirement, even a slight increase in the trigger interval (e.g., to 7-8 seconds) can reduce checkpoint frequency and costs.
>
> - Minimize LIST Requests:
>
>    - Enable the S3 Optimized Committer, or, as you stated, consider DBFS.
>
> You can also optimise RocksDB. Set your state backend to RocksDB, if not already. Here is what I use:
>
> # Add RocksDB configurations here
> spark.conf.set("spark.sql.streaming.stateStore.providerClass",
>     "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider")
> spark.conf.set("spark.sql.streaming.stateStore.rocksdb.changelogCheckpointing.enabled", "true")
> spark.conf.set("spark.sql.streaming.stateStore.rocksdb.writeBufferSizeMB", "64")  # Example configuration
> spark.conf.set("spark.sql.streaming.stateStore.rocksdb.compaction.style", "level")
> spark.conf.set("spark.sql.streaming.stateStore.rocksdb.compaction.level.targetFileSizeBase", "67108864")
>
> These configurations provide a starting point for tuning RocksDB. Depending on your specific use case and requirements, of course, your mileage may vary.
>
> HTH
>
> Mich Talebzadeh,
> Dad | Technologist | Solutions Architect | Engineer
> London
> United Kingdom
>
> view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
> https://en.everybodywiki.com/Mich_Talebzadeh
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>
>
> On Sun, 7 Jan 2024 at 08:07, Andrzej Zera <andrzejz...@gmail.com> wrote:
>
>> Usually one or two topics per query. Each query has its own checkpoint directory. Each topic has a few partitions.
>>
>> Performance-wise, I don't experience any bottlenecks in terms of checkpointing. It's all about the number of requests (including a high number of LIST requests) and the associated cost.
>>
>> On Sat, 6 Jan 2024 at 13:30, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>
>>> How many topics and checkpoint directories are you dealing with?
>>>
>>> Does each topic have its own checkpoint on S3?
>>>
>>> All these checkpoints are sequential writes, so even SSD would not really help.
>>>
>>> HTH
>>>
>>> Mich Talebzadeh,
>>> Dad | Technologist | Solutions Architect | Engineer
>>> London
>>> United Kingdom
>>>
>>> view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>> https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>>>
>>>
>>> On Sat, 6 Jan 2024 at 08:19, Andrzej Zera <andrzejz...@gmail.com> wrote:
>>>
>>>> Hey,
>>>>
>>>> I'm running a few Structured Streaming jobs (with Spark 3.5.0) that require near-real-time accuracy, with trigger intervals on the level of 5-10 seconds. I usually run 3-6 streaming queries as part of the job, and each query includes at least one stateful operation (and usually two or more). My checkpoint location is an S3 bucket and I use RocksDB as a state store. Unfortunately, checkpointing costs are quite high. It's the main cost item of the system, roughly 4-5 times the cost of compute.
>>>>
>>>> To save on checkpointing costs, the following things are usually recommended:
>>>>
>>>> - increase the trigger interval (as mentioned, I don't have much room here)
>>>> - decrease the number of shuffle partitions (I have 2x the number of workers)
>>>>
>>>> I'm looking for some other recommendations that I can use to save on checkpointing costs. I saw that most requests are LIST requests. Can we cut them down somehow? I'm using Databricks. If I replace the S3 bucket with DBFS, will it help in any way?
>>>>
>>>> Thank you!
>>>> Andrzej