On Fri, May 27, 2016 at 9:37 AM, Harry Waye <[email protected]> wrote:

> We had a regionserver fall out of our cluster, I assume due to the process
> hitting a limit, as the regionserver's .out log file just contained
> "Killed", which I've experienced when hitting open file descriptor limits.
> After this, hbck reported inconsistencies in tables:
>
>
Or the kernel is killing the process because the machine is out of memory
(no swap left, and all memory occupied by running processes).
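
A quick way to check, if you can still get at the box: the kernel's OOM
killer leaves a trace in the kernel log (exact wording varies by kernel
version), so something like this should show it:

  dmesg | grep -i -E 'killed process|out of memory'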


> ERROR: There is a hole in the region chain between
> dce998f6f8c63d3515a3207330697ce4-ravi teja and e4.  You need to create a
> new .regioninfo and region dir in hdfs to plug the hole.
>
> `hdfs fsck` reports a healthy dfs.
>
> I attempted to run `hbase hbck -repairHoles` which didn't resolve the
> inconsistencies.
>
> I then restarted the HBase cluster and it now appears from looking at the
> master log files that there are many tasks waiting to complete, and the web
> interface results in a timeout:
>
> master.SplitLogManager: total tasks = 299 unassigned = 285 tasks={ ... }
>
>
We are trying to split WAL files before the cluster comes back online, it
seems. Are we stuck on one WAL?
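
You could look at which split tasks are outstanding; they are kept as
znodes in ZooKeeper. For example (the znode name depends on your version,
/hbase/splitWAL on recent releases, /hbase/splitlog on older ones):

  hbase zkcli ls /hbase/splitWAL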



> From looking at the logs on the regionservers I see messages such as:
> "regionserver.SplitLogWorker: Current region server ... has 2 tasks in
> progress and can't take more".
>
>
There is a configuration that controls how many split tasks each
regionserver will take: "hbase.regionserver.wal.max.splitters"




> How can I speed up working through these tasks?  I suspect our nodes can
> handle many more than 2 tasks at a time. I'll likely have follow-up
> questions once these have been worked through, but I think that's it for
> now.
>
>
Did your cluster recover? Or is there a bad WAL in the way? One damaged
somehow by the kill (perhaps other RSs are also getting killed on your
possibly oversubscribed cluster)?
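
If you suspect a particular WAL is damaged, you could try dumping it with
the WAL pretty-printer and see whether it reads cleanly to the end; e.g.
(the command is "hbase hlog" on pre-1.0 releases, and the WAL directory
layout varies by version):

  hbase wal hdfs://<namenode>/hbase/WALs/<server-dir>/<wal-file>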

Yours,
St.


> Any other information you need?
>
