Hi

First, a correction: the memory actually available to RocksDB is not the 13GB you assumed. Starting from Flink 1.10, managed memory [1][2] limits the memory available to RocksDB in each slot to managed memory / number of slots. With your configuration of 10 slots, 20GB of process memory, and a managed fraction of 0.75, the real per-slot managed memory is less than 1.5GB. That means your configured write buffer count, max write buffer size, and so on never really take effect: RocksDB's write buffer manager will mark write buffers as immutable and flush them earlier than your settings would suggest. You should increase managed memory / number of slots (i.e., increase managed memory or reduce the number of slots) to enlarge the memory shared by the multiple RocksDB instances within a single slot, so that the memory you expect RocksDB to use is actually available.
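For reference, a rough back-of-the-envelope calculation (assuming Flink 1.12 defaults for JVM overhead, 10% of process size capped at 1GB, and JVM metaspace, 256MB; your exact numbers may differ slightly):

    taskmanager.memory.process.size: 20g       # your setting
    taskmanager.memory.managed.fraction: 0.75  # your setting
    taskmanager.numberOfTaskSlots: 10          # your setting

    total Flink memory ≈ 20 GB - 1 GB (JVM overhead) - 0.25 GB (metaspace) ≈ 18.75 GB
    managed memory     ≈ 0.75 × 18.75 GB ≈ 14 GB
    per-slot share     ≈ 14 GB / 10 slots ≈ 1.4 GB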
From your stack traces, the job frequently gets stuck on data puts, so I suspect you are hitting write stalls [3]. You can use async-profiler [4] to inspect the RocksDB-internal call stacks.
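A minimal example of how that could look (assuming async-profiler 2.x; the PID, duration, and output path are placeholders):

    ./profiler.sh -d 60 -e wall -f /tmp/taskmanager-flame.html <taskmanager-pid>

Wall-clock profiling (-e wall) is the interesting mode here, since threads blocked inside RocksDB JNI calls consume little CPU and would be underrepresented in a CPU-only profile.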
In addition, you can enable RocksDB's native metrics [5][6] to see whether RocksDB writes are frequently being stalled.
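Concretely, the two metrics from [5] and [6] can be enabled in flink-conf.yaml:

    state.backend.rocksdb.metrics.actual-delayed-write-rate: true
    state.backend.rocksdb.metrics.is-write-stopped: true

A non-zero actual-delayed-write-rate, or is-write-stopped reporting 1, would confirm that writes are being throttled or stopped.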


[1] https://ci.apache.org/projects/flink/flink-docs-release-1.12/deployment/memory/mem_tuning.html#rocksdb-state-backend
[2] https://ci.apache.org/projects/flink/flink-docs-release-1.12/ops/state/large_state_tuning.html#tuning-rocksdb-memory
[3] https://github.com/facebook/rocksdb/wiki/Write-Stalls
[4] https://github.com/jvm-profiling-tools/async-profiler
[5] https://ci.apache.org/projects/flink/flink-docs-release-1.12/deployment/config.html#state-backend-rocksdb-metrics-actual-delayed-write-rate
[6] https://ci.apache.org/projects/flink/flink-docs-release-1.12/deployment/config.html#state-backend-rocksdb-metrics-is-write-stopped

Best,
Yun Tang
________________________________
From: jindy_liu <[email protected]>
Sent: Thursday, December 10, 2020 16:22
To: [email protected] <[email protected]>
Subject: Re: Choosing a state backend for N-stream joins with Flink CDC: FsStateBackend vs. RocksDBStateBackend

One more data point: when I reduce state.backend.rocksdb.writebuffer.count from 48 down to 10,

judging from the jstack output analyzed with https://spotify.github.io/threaddump-analyzer/,

the top methods are almost all in RocksDB I/O, and many threads are waiting:
<http://apache-flink.147419.n8.nabble.com/file/t670/stack.png>

<http://apache-flink.147419.n8.nabble.com/file/t670/sleep.png>





--
Sent from: http://apache-flink.147419.n8.nabble.com/
