Hi Esteban,

Thanks for pointing that out. I will try to collect all the logs tomorrow,
take a deeper look, and post the specific errors here. Yes, the good news is
that all the logs are preserved.

Thanks a lot,
Dejan

On Mon, Apr 13, 2015 at 8:01 PM Esteban Gutierrez <[email protected]>
wrote:

> Hi Dejan,
>
> Do you have the logs from any of those failed region servers? Usually, in
> case of a critical failure, the RS will shut itself down. Or, if the RS
> "hangs" for a long time, the master will start processing the expiration of
> that RS and reject it with a YouAreDeadException if it tries to reconnect.
> The HBase master and RS logs will tell us for sure.
>
> thanks,
> esteban.
>
>
> --
> Cloudera, Inc.
>
>
> On Mon, Apr 13, 2015 at 1:11 AM, Dejan Menges <[email protected]>
> wrote:
>
> > Hi,
> >
> > We had some issues recently with HDFS: a hardware issue with one of the
> > nodes, nodes died, HDFS recovered, but we figured out that something was
> > wrong with HBase. Checking the HMaster log, we saw that a bunch of our
> > region servers ended up on the famous failed servers list, and it went on
> > and on until we restarted every one of them.
> >
> > Are we doing something wrong? Is it possible to tune this somehow so that,
> > once a server is in this list, it is forgotten about or something?
> >
> > Main question: how does HMaster decide that a server should be in the
> > failed servers list, and what does this mean exactly?
> >
> > I was looking into the HBase book and googling, but besides some generic
> > answers I wasn't able to find anything more internal.
> >
> > Thanks in advance!
> >
>
