Hi,

We are running OpenTSDB 2.2 with HBase 1.1.2 and are having problems with RegionServers that sporadically shut down, allegedly because of GC pauses.
We run 2 OpenTSDB machines and 30 region servers. 8 GB heaps. The region servers are collocated with data nodes and YARN jobs. Each region server receives around 1000 req/s.

Even though the logs say it's a GC pause, our monitoring doesn't show a corresponding pause. The rather suspicious log line says "wal.FSHLog: Slow sync cost: 56257 ms"; it appears just after the GC pause detector issues its warning and the region server aborts. CPU, memory, and network all look fine.

We have had this problem for a long time and have been troubleshooting thoroughly, but we are still clueless. Any advice would be helpful.

Cheers,
-Kristoffer

[1] https://www.dropbox.com/s/m2cuutcdh81itay/hbase.log?dl=0