We had a regionserver fall out of our cluster. I assume the process hit a
limit, as the regionserver's .out log file contained only "Killed", which
I've seen before when hitting the open file descriptor limit. After
this, hbck reported inconsistencies in tables:
ERROR: There is a hole in the region chain between
dce998f6f8c63d3515a3207330697ce4-ravi teja and e4. You need to create a
new .regioninfo and region dir in hdfs to plug the hole.
`hdfs fsck` reports a healthy dfs.
I ran `hbase hbck -repairHoles`, which didn't resolve the
inconsistencies.
I then restarted the HBase cluster. The web interface now times out, and
the master log files show many tasks waiting to complete:
master.SplitLogManager: total tasks = 299 unassigned = 285 tasks={ ... }
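To watch whether the backlog is draining, the counters in that line can be pulled out of the master log with a small script. This is only a minimal sketch, assuming the log line keeps the exact "total tasks = N unassigned = M" wording shown above; the function name `parse_split_status` is made up for illustration and is not part of HBase.

```python
import re

# Matches the counters in the master's SplitLogManager status line.
# Assumption: the wording is exactly as in the log excerpt above.
_SPLIT_STATUS = re.compile(r"total tasks = (\d+) unassigned = (\d+)")


def parse_split_status(line):
    """Return (total, unassigned) from a SplitLogManager line, or None."""
    match = _SPLIT_STATUS.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


sample = "master.SplitLogManager: total tasks = 299 unassigned = 285 tasks={ ... }"
total, unassigned = parse_split_status(sample)
print(f"{total - unassigned} of {total} tasks picked up, {unassigned} waiting")
```

Running it over successive log lines would show whether the unassigned count is actually going down.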
From looking at the logs on the regionservers I see messages such as:
"regionserver.SplitLogWorker: Current region server ... has 2 tasks in
progress and can't take more".
How can I speed up working through these tasks? I suspect our nodes can
handle many more than 2 tasks at a time. I'll likely have follow-up
questions once these have been worked through, but I think that's it for now.
Any other information you need?