Hi,

First, a correction: the memory actually available to RocksDB is not the 13 GB you assumed. Starting from Flink 1.10, the managed memory mechanism [1][2] caps the RocksDB memory available in each slot at managed memory / number of slots. With your 10 slots, 20 GB of process memory, and a managed fraction of 0.75, the real per-slot managed memory is less than 1.5 GB (with default JVM metaspace and overhead, roughly 0.75 x 18.75 GB of Flink memory / 10 slots, i.e. about 1.4 GB). In other words, settings such as your write buffer count and max write buffer size never truly take effect: RocksDB's write buffer manager marks write buffers as immutable and flushes them out early. You should increase managed memory / number of slots to enlarge the memory shared by the multiple RocksDB instances within a single slot, so that the memory you intend for RocksDB is actually available; see the config sketch below.

From your stack traces, many threads are stuck on data puts, so I suspect you are hitting write stalls [3]. You can use async-profiler [4] to inspect the RocksDB-internal call stacks; a sample command follows below.

In addition, you can enable RocksDB's native metrics [5][6] to check whether RocksDB writes are frequently blocked; see the metrics config below.
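For reference, here is a minimal flink-conf.yaml sketch of the knobs involved; the values merely mirror your current setup and are illustrative, not a recommendation:

    # flink-conf.yaml (illustrative values)
    taskmanager.memory.process.size: 20gb        # total TaskManager process memory
    taskmanager.numberOfTaskSlots: 10            # per-slot RocksDB budget = managed memory / slots
    taskmanager.memory.managed.fraction: 0.75    # share of Flink memory reserved as managed memory
    # Alternatively, pin an absolute amount instead of a fraction:
    # taskmanager.memory.managed.size: 15gb

Raising the fraction (or lowering the slot count) is what actually grows the per-slot RocksDB budget.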
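A sketch of capturing those stacks with async-profiler; the TaskManager PID and output path are placeholders. Wall-clock mode is used here because stalled writer threads are waiting rather than burning CPU:

    # Sample all threads (including native/JNI RocksDB frames) for 60s
    # and render a flame graph; look for write-stall waits under put().
    ./profiler.sh -d 60 -e wall -f /tmp/rocksdb-flame.html <taskmanager-pid>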
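The two native metrics mentioned above can be switched on in flink-conf.yaml like this (they are off by default):

    # Expose RocksDB write-stall indicators as Flink metrics
    state.backend.rocksdb.metrics.actual-delayed-write-rate: true
    state.backend.rocksdb.metrics.is-write-stopped: true

A sustained non-zero delayed write rate, or is-write-stopped reporting 1, confirms that RocksDB is throttling or halting writes.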
[1] https://ci.apache.org/projects/flink/flink-docs-release-1.12/deployment/memory/mem_tuning.html#rocksdb-state-backend
[2] https://ci.apache.org/projects/flink/flink-docs-release-1.12/ops/state/large_state_tuning.html#tuning-rocksdb-memory
[3] https://github.com/facebook/rocksdb/wiki/Write-Stalls
[4] https://github.com/jvm-profiling-tools/async-profiler
[5] https://ci.apache.org/projects/flink/flink-docs-release-1.12/deployment/config.html#state-backend-rocksdb-metrics-actual-delayed-write-rate
[6] https://ci.apache.org/projects/flink/flink-docs-release-1.12/deployment/config.html#state-backend-rocksdb-metrics-is-write-stopped

Best,
Yun Tang

________________________________
From: jindy_liu <[email protected]>
Sent: Thursday, December 10, 2020 16:22
To: [email protected] <[email protected]>
Subject: Re: Choosing the state backend for N-stream joins with Flink CDC: FsStateBackend vs. RocksDBStateBackend

One more data point: when I lower state.backend.rocksdb.writebuffer.count from 48 to 10, a jstack dump analyzed with https://spotify.github.io/threaddump-analyzer/ shows that the top methods are almost all in RocksDB I/O, and many threads are waiting.

<http://apache-flink.147419.n8.nabble.com/file/t670/stack.png>
<http://apache-flink.147419.n8.nabble.com/file/t670/sleep.png>

--
Sent from: http://apache-flink.147419.n8.nabble.com/
