Thanks Dejan, please keep us posted!

cheers,
esteban.

--
Cloudera, Inc.

On Mon, Apr 13, 2015 at 11:08 AM, Dejan Menges <[email protected]> wrote:

> Hi Esteban,
>
> Thanks for pointing to that, will try to collect all logs tomorrow and to
> take deeper look and post here specific errors. Yes, good news are that all
> logs are preserved.
>
> Thanks a lot,
> Dejan
>
> On Mon, Apr 13, 2015 at 8:01 PM Esteban Gutierrez <[email protected]>
> wrote:
>
> > Hi Dejan,
> >
> > Do you have the logs from any of those failed region servers? Usually in
> > case of a critical failure the RS will shutdown itself or if the RS "hangs"
> > for a long time and the master will start processing the expiration of that
> > RS and reject the RS if it tries to reconnect with a YouAreDeadException.
> > The HBase master and RS logs for sure will tell us.
> >
> > thanks,
> > esteban.
> >
> > --
> > Cloudera, Inc.
> >
> > On Mon, Apr 13, 2015 at 1:11 AM, Dejan Menges <[email protected]>
> > wrote:
> >
> > > Hi,
> > >
> > > We had some issues recently with HDFS - hardware issue with one of the
> > > nodes, nodes died, HDFS recovered, but we figured out that something is
> > > wrong with HBase. Checking HMaster log, we saw that bunch of our region
> > > servers got to the famous failed servers list, and it was going on and on
> > > until we restarted every one of them.
> > >
> > > Are we doing something wrong? Is it possible somehow to tune this out, once
> > > the server is in this list to forget about it or something?
> > >
> > > Main question - how HMaster decides at all that server should be in the
> > > failed server list, and what does this means exactly?
> > >
> > > Was looking into HBase book, googling, but beside some generic answers
> > > wasn't able to find anything more internal.
> > >
> > > Thanks in advance!
