[ANNOUNCE] Apache Celeborn(incubating) 0.3.2 available

2024-01-07 Thread Nicholas Jiang
Hi all,

The Apache Celeborn (Incubating) community is glad to announce the
new release of Apache Celeborn (Incubating) 0.3.2.

Celeborn is dedicated to improving the efficiency and elasticity of
different map-reduce engines and provides an elastic, highly efficient
service for intermediate data, including shuffle data, spilled data,
result data, etc.


Download Link: https://celeborn.apache.org/download/

GitHub Release Tag:

- https://github.com/apache/incubator-celeborn/releases/tag/v0.3.2-incubating

Release Notes:

- https://celeborn.apache.org/community/release_notes/release_note_0.3.2


Home Page: https://celeborn.apache.org/

Celeborn Resources:

- Issue Management: https://issues.apache.org/jira/projects/CELEBORN
- Mailing List: d...@celeborn.apache.org

Regards,
Nicholas Jiang
On behalf of the Apache Celeborn (Incubating) community

Re: [Structured Streaming] Keeping checkpointing cost under control

2024-01-07 Thread Mich Talebzadeh
OK, I assume that your main concern is checkpointing costs.

- Caching: If your queries read the same data multiple times, caching it
may cut those repeated reads and part of the request volume behind your
costs; a minimal sketch follows below.
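
The sketch assumes a static lookup table that each micro-batch joins
against; the bucket, broker, topic, and the join column are all
illustrative names, not anything from your setup:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("caching-sketch").getOrCreate()

# Hypothetical static lookup table: read once and cached, instead of
# being re-read from S3 on every trigger
lookup_df = spark.read.parquet("s3://some-bucket/lookup/")  # illustrative path
lookup_df.cache()
lookup_df.count()  # materialize the cache

stream_df = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # illustrative
    .option("subscribe", "events")                     # illustrative topic
    .load()
    .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")
)

# Stream-static join: the cached side is reused across micro-batches
# (assumes lookup_df also carries a 'key' column)
enriched = stream_df.join(lookup_df, "key")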

- Optimize Checkpointing Frequency, i.e.:

   - Consider Changelog Checkpointing with RocksDB. This can
   potentially reduce checkpoint size and duration by storing only state
   changes, rather than the entire state.
   - Adjust Trigger Interval (if possible): while not ideal for your
   near-real-time requirement, even a slight increase in the trigger
   interval (e.g., to 7-8 seconds) reduces checkpoint frequency and costs;
   see the sketch after this item.
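
Continuing the sketch above (the sink format and paths are illustrative),
going from a 5-second to an 8-second trigger directly cuts how often a
checkpoint is committed:

query = (
    enriched.writeStream
    .format("parquet")  # illustrative sink
    .option("checkpointLocation", "s3://some-bucket/checkpoints/q1")  # illustrative
    .trigger(processingTime="8 seconds")  # fewer micro-batches, fewer commits
    .start("s3://some-bucket/output/q1")
)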

- Minimize LIST Requests:

   - Enable the S3 Optimized Committer, or, as you stated, consider DBFS
   for the checkpoint location; a sketch of that change follows below.
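
If you test the DBFS route, the only change on the query side is the
checkpoint path. This is a sketch of the wiring, not a promise about the
resulting request bill, which you would need to measure:

# Same query shape as above, but checkpointing to DBFS rather than raw S3
query = (
    enriched.writeStream
    .format("parquet")  # illustrative sink
    .option("checkpointLocation", "dbfs:/checkpoints/q1")  # illustrative path
    .trigger(processingTime="5 seconds")
    .start("dbfs:/output/q1")
)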

You can also optimise RocksDB itself. Set your state store backend to
RocksDB, if you have not already. Here is what I use:

# RocksDB state store configurations
spark.conf.set(
    "spark.sql.streaming.stateStore.providerClass",
    "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider")

# Changelog checkpointing: persist only the per-batch state changes
# rather than a full snapshot of the state store
spark.conf.set(
    "spark.sql.streaming.stateStore.rocksdb.changelogCheckpointing.enabled",
    "true")

# RocksDB in-memory write buffer (memtable) size, in MB (example value)
spark.conf.set(
    "spark.sql.streaming.stateStore.rocksdb.writeBufferSizeMB", "64")

# Level-style compaction with a 64 MB (67108864-byte) target file size
spark.conf.set(
    "spark.sql.streaming.stateStore.rocksdb.compaction.style", "level")
spark.conf.set(
    "spark.sql.streaming.stateStore.rocksdb.compaction.level.targetFileSizeBase",
    "67108864")

These configurations provide a starting point for tuning RocksDB. Depending
on your specific use case and requirements, of course, your mileage may vary.
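
To check whether any of this moves the needle, the standard streaming
progress API is enough. A sketch, using the usual metric keys; treat the
exact key names as assumptions, since they can vary by Spark version:

# Inspect the last completed micro-batch: per-phase timings and state sizes
progress = query.lastProgress  # a dict in PySpark, or None before batch one
if progress:
    print(progress["durationMs"])  # includes addBatch/commit timings
    for op in progress.get("stateOperators", []):
        print(op.get("operatorName"), op.get("numRowsTotal"),
              op.get("memoryUsedBytes"))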

HTH

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Sun, 7 Jan 2024 at 08:07, Andrzej Zera  wrote:

> Usually one or two topics per query. Each query has its own checkpoint
> directory. Each topic has a few partitions.
>
> Performance-wise I don't experience any bottlenecks in terms of
> checkpointing. It's all about the number of requests (including a high
> number of LIST requests) and the associated cost.
>
> On Sat, 6 Jan 2024 at 13:30, Mich Talebzadeh wrote:
>
>> How many topics and checkpoint directories are you dealing with?
>>
>> Does each topic have its own checkpoint on S3?
>>
>> All these checkpoints are sequential writes, so even SSDs would not
>> really help
>>
>> HTH
>>
>> On Sat, 6 Jan 2024 at 08:19, Andrzej Zera  wrote:
>>
>>> Hey,
>>>
>>> I'm running a few Structured Streaming jobs (with Spark 3.5.0) that
>>> require near-real-time accuracy, with trigger intervals on the order of
>>> 5-10 seconds. I usually run 3-6 streaming queries as part of the job and
>>> each query includes at least one stateful operation (and usually two or
>>> more). My checkpoint location is an S3 bucket and I use RocksDB as the
>>> state store. Unfortunately, checkpointing costs are quite high. It's the
>>> main cost item of the system, roughly 4-5 times the cost of compute.
>>>
>>> To save on compute costs, the following things are usually recommended:
>>>
>>>- increase trigger interval (as mentioned, I don't have much space
>>>here)
>>>- decrease the number of shuffle partitions (I have 2x the number of
>>>workers)
>>>
>>> I'm looking for some other recommendations that I can use to save on
>>> checkpointing costs. I saw that most requests are LIST requests. Can we cut
>>> them down somehow? I'm using Databricks. If I replace S3 bucket with DBFS,
>>> will it help in any way?
>>>
>>> Thank you!
>>> Andrzej
>>>
>>>


Re: [Structured Streaming] Keeping checkpointing cost under control

2024-01-07 Thread Andrzej Zera
Usually one or two topics per query. Each query has its own checkpoint
directory. Each topic has a few partitions.

Performance-wise I don't experience any bottlenecks in terms of
checkpointing. It's all about the number of requests (including a high
number of LIST requests) and the associated cost.
