We are running OpenTSDB 2.2 with HBase 1.1.2 and are having problems
with RegionServers that shut down sporadically, allegedly because of GC
pauses. We run 2 OpenTSDB machines and 30 region servers, with 8 GB
heaps. The region servers are co-located with DataNodes and YARN jobs.
Each region server receives around 1000 req/s.
Even though the logs say it's a GC pause, monitoring doesn't report
an actual pause. A rather suspicious log line says "wal.FSHLog: Slow
sync cost: 56257 ms" just after the GC pause detector warns and aborts
the region server. CPU, memory, and network look fine.
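
To see how often these slow syncs happen and whether they cluster around the aborts, a small script along these lines could pull them out of a region server log. This is only a sketch: the function name and the 1000 ms threshold are hypothetical choices, and the pattern assumes the message always has the "wal.FSHLog: Slow sync cost: N ms" form quoted above.

```python
import re

# Pattern built from the log line quoted above; anything else on the
# line (timestamp, thread name, trailing details) is ignored.
SLOW_SYNC = re.compile(r"wal\.FSHLog: Slow sync cost: (\d+) ms")


def find_slow_syncs(lines, threshold_ms=1000):
    """Return (line_number, cost_ms) for each slow sync at or above threshold_ms.

    threshold_ms is an arbitrary illustrative cutoff, not an HBase setting.
    """
    hits = []
    for line_no, line in enumerate(lines, start=1):
        match = SLOW_SYNC.search(line)
        if match is None:
            continue
        cost_ms = int(match.group(1))
        if cost_ms >= threshold_ms:
            hits.append((line_no, cost_ms))
    return hits
```

Passing an open log file (for example `find_slow_syncs(open(path, errors="replace"))`, with `path` pointing at a region server log) yields the line numbers to compare against the timestamps of the pause warnings.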
We have had this problem for a long time and have been troubleshooting
thoroughly, but we are still clueless.
Any advice would be helpful.