Thanks Dejan,

Please keep us posted!

cheers,
esteban.


--
Cloudera, Inc.


On Mon, Apr 13, 2015 at 11:08 AM, Dejan Menges <[email protected]>
wrote:

> Hi Esteban,
>
> Thanks for pointing that out. I will try to collect all the logs tomorrow,
> take a deeper look, and post the specific errors here. Yes, the good news is
> that all logs are preserved.
>
> Thanks a lot,
> Dejan
>
> On Mon, Apr 13, 2015 at 8:01 PM Esteban Gutierrez <[email protected]>
> wrote:
>
> > Hi Dejan,
> >
> > Do you have the logs from any of those failed region servers? Usually, in
> > case of a critical failure, the RS will shut itself down. Alternatively, if
> > the RS "hangs" for a long time, the master will start processing the
> > expiration of that RS and will reject the RS with a YouAreDeadException if
> > it tries to reconnect. The HBase master and RS logs will tell us for sure.
> >
> > thanks,
> > esteban.
> >
> >
> > --
> > Cloudera, Inc.
> >
> >
> > On Mon, Apr 13, 2015 at 1:11 AM, Dejan Menges <[email protected]>
> > wrote:
> >
> > > Hi,
> > >
> > > We recently had some issues with HDFS: a hardware issue with one of the
> > > nodes, nodes died, HDFS recovered, but we figured out that something was
> > > wrong with HBase. Checking the HMaster log, we saw that a bunch of our
> > > region servers ended up on the famous failed servers list, and it went on
> > > and on until we restarted every one of them.
> > >
> > > Are we doing something wrong? Is it possible to tune this somehow, e.g. to
> > > make it forget about a server once it is in this list?
> > >
> > > Main question: how does HMaster decide that a server should be in the
> > > failed server list, and what does this mean exactly?
> > >
> > > I was looking into the HBase book and googling, but besides some generic
> > > answers I wasn't able to find anything more internal.
> > >
> > > Thanks in advance!
> > >
> >
>
