Hello,
We are currently in the process of productionizing Celeborn and have been
experimenting with its configuration. The guides [1][2] have been helpful;
however, we are curious which configurations have proven useful in large
production environments.

For context, our worker instances have 16 vCPUs and 96 GB of RAM, with 15 TB
of SSD that we use for shuffle file storage. Network bandwidth is 25 Gbps.
Our configuration is currently set to the following:
CELEBORN_WORKER_MEMORY=18g
CELEBORN_WORKER_OFFHEAP_MEMORY=64g
celeborn.worker.flusher.buffer.size=256k
celeborn.rpc.io.clientThreads=8
celeborn.rpc.io.serverThreads=8
celeborn.worker.push.io.threads=8
celeborn.worker.flusher.threads=16 (default)
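
As a rough sanity check on the numbers above, here is a minimal Python sketch that adds up the memory and thread settings against the instance size. The function names are hypothetical helpers, not part of Celeborn, and it assumes the "g" and "GB" figures can be compared directly and that the four thread pools listed are the only ones of interest.

```python
# Host figures from the worker description above.
HOST_RAM_GB = 96
HOST_VCPUS = 16

# Values copied from the configuration listed above.
HEAP_GB = 18      # CELEBORN_WORKER_MEMORY
OFFHEAP_GB = 64   # CELEBORN_WORKER_OFFHEAP_MEMORY

THREADS = {
    "celeborn.rpc.io.clientThreads": 8,
    "celeborn.rpc.io.serverThreads": 8,
    "celeborn.worker.push.io.threads": 8,
    "celeborn.worker.flusher.threads": 16,
}


def memory_headroom_gb(total_gb, heap_gb, offheap_gb):
    """RAM left for the OS and page cache after heap and off-heap (hypothetical helper)."""
    return total_gb - (heap_gb + offheap_gb)


def threads_per_vcpu(thread_counts, vcpus):
    """Configured threads across the listed pools, per vCPU (hypothetical helper)."""
    return sum(thread_counts.values()) / vcpus


if __name__ == "__main__":
    print("memory headroom (GB):", memory_headroom_gb(HOST_RAM_GB, HEAP_GB, OFFHEAP_GB))
    print("configured threads per vCPU:", threads_per_vcpu(THREADS, HOST_VCPUS))
```

With the values above, heap plus off-heap account for 82 of the 96 GB, and the four listed pools total 40 threads on 16 vCPUs.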

We are seeing high push data latency, O(seconds) under high load, and
understand that the thread counts probably need to be higher. Additionally,
we would appreciate guidance on setting the buffer size given the size of
our SSDs.

1. https://celeborn.apache.org/docs/latest/configuration/
2. https://celeborn.apache.org/docs/latest/cluster_planning/

Thanks.
