Hello, Currently we are in the process of productionizing Celeborn, and have been experimenting with setting celeborn configuration. The guides [1][2] have been helpful however we are curious about what configurations have been useful for large production environments.
For context our worker instances have 16 vCPU, 96GB RAM with 15TB SSD which we are using for shuffle file storage. Network bandwidth is 25Gbps. Currently our configuration is set to the following: CELEBORN_WORKER_MEMORY=18g CELEBORN_WORKER_OFFHEAP_MEMORY=64g celeborn.worker.flusher.buffer.size=256k celeborn.rpc.io.clientThreads=8 celeborn.rpc.io.serverThreads=8 celeborn.worker.push.io.threads=8 celeborn.worker.flusher.threads=16 (default) We are seeing high push data latency - O(seconds) with high load - and understand that the thread count probably needs to be higher. Additionally, we could also use some guidance on setting buffer size given the size of our SSDs. 1. https://celeborn.apache.org/docs/latest/configuration/ 2. https://celeborn.apache.org/docs/latest/cluster_planning/ Thanks.