@Dejan, I've had staleness configured on my cluster for a while, but
haven't needed it.  Looking more closely at it thanks to this thread, I
noticed though that I was missing two critical parameters.

Considering you just now set this up, I'll guess that you probably didn't
miss this (the docs used to be much less organized).  But it won't work
unless the namenodes have dfs.namenode.avoid.read.stale.datanode
and dfs.namenode.avoid.write.stale.datanode set to true.

Just thought I'd throw it out there.
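
For reference, a minimal hdfs-site.xml sketch of the stale-datanode settings
discussed in this thread (the 30000 ms interval is the default value mentioned
further down; treat these as an example, not a tuned recommendation):

  <!-- hdfs-site.xml (sketch) -->
  <property>
    <!-- namenode prefers non-stale datanodes for reads -->
    <name>dfs.namenode.avoid.read.stale.datanode</name>
    <value>true</value>
  </property>
  <property>
    <!-- namenode avoids stale datanodes for writes -->
    <name>dfs.namenode.avoid.write.stale.datanode</name>
    <value>true</value>
  </property>
  <property>
    <!-- how long without a heartbeat before a datanode is considered stale -->
    <name>dfs.namenode.stale.datanode.interval</name>
    <value>30000</value>
  </property>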

On Mon, Mar 23, 2015 at 11:51 AM, Dejan Menges <dejan.men...@gmail.com>
wrote:

> I'm surprised by this as well. Staleness was configured too, MTTR from the
> HBase book as described, and in this specific case - when the machine really
> dies - even after the NN marked the DataNode as dead, HBase kept trying to
> replay WALs from the dead node until the timeout was reached.
>
> Still reading HDFS-3703, trying to get more context, but it looks to me so
> far that this is one of those things where you have to decide what your own
> acceptable tradeoffs are.
>
> On Mon, Mar 23, 2015 at 4:40 PM Bryan Beaudreault <
> bbeaudrea...@hubspot.com>
> wrote:
>
> > @Nicolas, I see, thanks.  I'll keep the settings at default.  So really, if
> > everything else is configured properly you should never reach the lease
> > recovery timeout in any failure scenario.  It seems that the staleness
> > check would be the thing that prevents this, correct?  I'm surprised it
> > didn't help Dejan.
> >
> > On Mon, Mar 23, 2015 at 11:20 AM, Nicolas Liochon <nkey...@gmail.com>
> > wrote:
> >
> > > @bryan: yes, you can change hbase.lease.recovery.timeout if you changed the
> > > hdfs settings. But this setting is really for desperate cases. The
> > > recoverLease should have succeeded before that. Also, if you depend on
> > > hbase.lease.recovery.timeout, it means that you're wasting recovery time:
> > > the lease should be recovered in a few seconds.
> > >
> > > On Mon, Mar 23, 2015 at 3:59 PM, Dejan Menges <dejan.men...@gmail.com>
> > > wrote:
> > >
> > > > Interesting discussion I also found, gives me some more light on what
> > > > Nicolas is talking about - https://issues.apache.org/jira/browse/HDFS-3703
> > > >
> > > > On Mon, Mar 23, 2015 at 3:53 PM Bryan Beaudreault <
> > > > bbeaudrea...@hubspot.com>
> > > > wrote:
> > > >
> > > > > So it is safe to set hbase.lease.recovery.timeout lower if you also
> > > > > set heartbeat.recheck.interval lower (lowering that 10.5 min dead node
> > > > > timer)?  Or is it recommended not to touch either of those?
> > > > >
> > > > > Reading the above with interest, thanks for digging in here guys.
> > > > >
> > > > > On Mon, Mar 23, 2015 at 10:13 AM, Nicolas Liochon <
> nkey...@gmail.com
> > >
> > > > > wrote:
> > > > >
> > > > > > If the node is actually down it's fine. But the node may not be that
> > > > > > down (CAP theorem here); and then it's looking for trouble.
> > > > > > HDFS, by default, declares a node as dead after 10:30. 15 minutes is an
> > > > > > extra safety margin. It seems your hdfs settings are different (or there
> > > > > > is a bug...). There should be some info in the hdfs logs.
> > > > > >
> > > > > > On Mon, Mar 23, 2015 at 3:05 PM, Dejan Menges <
> > > dejan.men...@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Will take a look.
> > > > > > >
> > > > > > > Actually, if a node is down (someone unplugged the network cable, it just
> > > > > > > died, whatever), what's the chance it's going to come back so the write
> > > > > > > can continue? On the other side, HBase does not start recovery; it keeps
> > > > > > > trying to contact a node which is not there anymore, and which has even
> > > > > > > been declared dead on the Namenode side (another thing I didn't quite
> > > > > > > understand).
> > > > > > >
> > > > > > > So what I was expecting is that as soon as the Namenode decided the node
> > > > > > > is dead, that would be enough for the RegionServer to stop trying to
> > > > > > > recover from the dead node, but it wasn't the case. Also, this whole MTTR
> > > > > > > article in the HBase book doesn't work at all with this parameter set to
> > > > > > > its default value (15 minutes).
> > > > > > >
> > > > > > > So I'm still struggling to figure out what exactly the drawback of
> > > > > > > changing this is.
> > > > > > >
> > > > > > > On Mon, Mar 23, 2015 at 2:50 PM Nicolas Liochon <
> > nkey...@gmail.com
> > > >
> > > > > > wrote:
> > > > > > >
> > > > > > > > Thanks for the explanation. There is an issue if you modify this
> > > > > > > > setting, however.
> > > > > > > > hbase tries to recover the lease (i.e. be sure that nobody is writing).
> > > > > > > > If you change hbase.lease.recovery.timeout, hbase will start the
> > > > > > > > recovery (i.e. start to read) even if it's not sure that nobody's
> > > > > > > > writing. That means there is a dataloss risk.
> > > > > > > > Basically, you must not see this warning: WARN
> > > > > > > > org.apache.hadoop.hbase.util.FSHDFSUtils: Cannot recoverLease after
> > > > > > > > trying for[]
> > > > > > > >
> > > > > > > > The recoverLease must succeed. The fact that it does not after X tries
> > > > > > > > is strange.
> > > > > > > > There may be a mismatch between the hbase parameters and the hdfs ones.
> > > > > > > > You may need to have a look at the comments in FSHDFSUtils.java
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > On Mon, Mar 23, 2015 at 2:15 PM, Dejan Menges <
> > > > > dejan.men...@gmail.com>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > I found the issue and fixed it, and will try to explain here exactly
> > > > > > > > > what happened in our case, in case someone else finds this interesting
> > > > > > > > > too.
> > > > > > > > >
> > > > > > > > > So initially, we had (a couple of times) some network and hardware
> > > > > > > > > issues in our datacenters. When one server would die (literally die, we
> > > > > > > > > had some issues with PSUs) we saw issues with overall cluster
> > > > > > > > > performance on the HBase side. As the cluster is quite big and live, it
> > > > > > > > > was also quite hard to figure out the exact root cause and how to fix
> > > > > > > > > the stuff we wanted to fix.
> > > > > > > > >
> > > > > > > > > So I set up another cluster: four nodes (with DataNode and RegionServer)
> > > > > > > > > and two other nodes with HMaster and Namenode in HA, using the same
> > > > > > > > > stuff we use in production. We pumped some data into it, and I was able
> > > > > > > > > to reproduce the same issue on it last week. What I tried to do is to
> > > > > > > > > cut off one server (shut down its interface) when all is good with the
> > > > > > > > > cluster, when we have load, and see what happens.
> > > > > > > > >
> > > > > > > > > On Friday, after Nicolas mentioned it, I started taking a look at the
> > > > > > > > > HBase logs on the node which was mentioned in the HMaster log as the one
> > > > > > > > > taking over regions for the dead server. Basically what I was able to
> > > > > > > > > observe was a 15 minute gap (+- a couple of seconds, which was also
> > > > > > > > > interesting, and I will get to that later) between the moment HMaster
> > > > > > > > > figures out that one of its RegionServers is dead, and the moment one of
> > > > > > > > > the mentioned nodes starts taking over those regions and they start
> > > > > > > > > appearing in HMaster's Web UI.
> > > > > > > > >
> > > > > > > > > I then set up everything as mentioned here -
> > > > > > > > > http://hbase.apache.org/book.html#mttr - but still had exactly the same
> > > > > > > > > issues. Went over (again and again) all the currently configured stuff,
> > > > > > > > > but still had the same issue.
> > > > > > > > >
> > > > > > > > > Then I started looking into HDFS. Opened the NameNode UI, saw all was
> > > > > > > > > good, took one node down, and was watching the RegionServer logs at the
> > > > > > > > > same time - and I also saw that it took ~15 minutes for the Namenode to
> > > > > > > > > declare the dead node as dead. Somehow at the same moment the regions
> > > > > > > > > started coming back to life. I remembered that some older versions had
> > > > > > > > > dfs timeout checks and check retries. Looked into the defaults for our
> > > > > > > > > Hadoop version -
> > > > > > > > > http://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
> > > > > > > > > - and saw there that there's no recheck parameter anymore. Strange, as
> > > > > > > > > on StackOverflow I found a post from a month ago for a newer version
> > > > > > > > > than we use (we use 2.4.1, the guy was using 2.6 -
> > > > > > > > > dfs.namenode.heartbeat.recheck-interval). I set it to 10 seconds as he
> > > > > > > > > mentioned, keeping heartbeat checks every three seconds (default), and
> > > > > > > > > got the DataNode declared dead in ~45 seconds, as he mentioned. Not sure
> > > > > > > > > why this parameter is not documented, but obviously it works.
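> > > > > > > > >
> > > > > > > > > For illustration, a minimal hdfs-site.xml sketch of the change described
> > > > > > > > > above (values are the ones from this thread, not a recommendation; dead
> > > > > > > > > node detection is roughly 2 x recheck-interval + 10 x heartbeat interval,
> > > > > > > > > which is why 10s + 3s gives the ~45-50 seconds observed):
> > > > > > > > >
> > > > > > > > >   <!-- hdfs-site.xml (sketch) -->
> > > > > > > > >   <property>
> > > > > > > > >     <!-- default is 300000 (5 minutes); 10000 shortens dead-node detection -->
> > > > > > > > >     <name>dfs.namenode.heartbeat.recheck-interval</name>
> > > > > > > > >     <value>10000</value>
> > > > > > > > >   </property>
> > > > > > > > >   <property>
> > > > > > > > >     <!-- datanode heartbeat, left at the default of 3 seconds -->
> > > > > > > > >     <name>dfs.heartbeat.interval</name>
> > > > > > > > >     <value>3</value>
> > > > > > > > >   </property>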
> > > > > > > > >
> > > > > > > > > Then I figured out it still didn't fix our HBase failover issue. I was
> > > > > > > > > looking into the HBase book again and again, and then saw this part:
> > > > > > > > >
> > > > > > > > > "How much time we allow elapse between calls to recover lease. Should be
> > > > > > > > > larger than the dfs timeout."
> > > > > > > > >
> > > > > > > > > This was the description for hbase.lease.recovery.dfs.timeout. From the
> > > > > > > > > comment I wasn't sure which of all the timeouts that can be set in
> > > > > > > > > Hadoop/HBase, and that have something to do with DFS, this was about.
> > > > > > > > > But I picked hbase.lease.recovery.timeout, and that was the catch.
> > > > > > > > >
> > > > > > > > > Initially, by default, hbase.lease.recovery.timeout is set to 15
> > > > > > > > > minutes. Not sure why, and I wasn't able to find out why yet, but by
> > > > > > > > > getting this down to one minute (which is more than OK for us) I was
> > > > > > > > > able to get rid of our issue. Also not sure why this is not mentioned in
> > > > > > > > > the MTTR section of the HBase book, as obviously MTTR doesn't work at
> > > > > > > > > all with this default timeout, at least it doesn't work the way we
> > > > > > > > > expected it to work.
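> > > > > > > > >
> > > > > > > > > As a sketch of that change (60000 is just the value that worked for us
> > > > > > > > > here - see Nicolas's warning above about the dataloss risk of lowering
> > > > > > > > > it; 64000 for the dfs timeout is, as far as I can tell, the shipped
> > > > > > > > > default and is only shown for context):
> > > > > > > > >
> > > > > > > > >   <!-- hbase-site.xml (sketch) -->
> > > > > > > > >   <property>
> > > > > > > > >     <!-- default 900000 (15 min), the value seen in the recoverLease warning -->
> > > > > > > > >     <name>hbase.lease.recovery.timeout</name>
> > > > > > > > >     <value>60000</value>
> > > > > > > > >   </property>
> > > > > > > > >   <property>
> > > > > > > > >     <!-- time allowed between recoverLease calls; should stay above the dfs timeout -->
> > > > > > > > >     <name>hbase.lease.recovery.dfs.timeout</name>
> > > > > > > > >     <value>64000</value>
> > > > > > > > >   </property>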
> > > > > > > > >
> > > > > > > > > So thanks again to everyone who got spammed with this, and special
> > > > > > > > > thanks to Nicolas for pointing me in the right direction.
> > > > > > > > >
> > > > > > > > > On Mon, Mar 23, 2015 at 1:37 PM Nicolas Liochon <
> > > > nkey...@gmail.com
> > > > > >
> > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > the attachments are rejected by the mailing list, can you put them on
> > > > > > > > > > pastebin?
> > > > > > > > > >
> > > > > > > > > > stale is mandatory (so that's good), but the issue here is just before
> > > > > > > > > > that. The region server needs to read the file. In order to be sure
> > > > > > > > > > that there is no data loss, it needs to "recover the lease", which
> > > > > > > > > > means ensuring that nobody is writing the file. The regionserver calls
> > > > > > > > > > the namenode to do this recoverLease. So there should be some info in
> > > > > > > > > > the namenode logs. Do you have HDFS-4721 on your hdfs? The hbase book
> > > > > > > > > > details (more or less...) this recoverLease stuff.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Mon, Mar 23, 2015 at 10:33 AM, Dejan Menges <
> > > > > > > dejan.men...@gmail.com
> > > > > > > > >
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > And also, just checked - dfs.namenode.avoid.read.stale.datanode and
> > > > > > > > > > > dfs.namenode.avoid.write.stale.datanode are both true, and
> > > > > > > > > > > dfs.namenode.stale.datanode.interval is set to the default 30000.
> > > > > > > > > > >
> > > > > > > > > > > On Mon, Mar 23, 2015 at 10:03 AM Dejan Menges <
> > > > > > > > dejan.men...@gmail.com>
> > > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hi Nicolas,
> > > > > > > > > > > >
> > > > > > > > > > > > Please find log attached.
> > > > > > > > > > > >
> > > > > > > > > > > > As I see it now more clearly, it was trying to recover HDFS WALs from
> > > > > > > > > > > > the node that's dead:
> > > > > > > > > > > >
> > > > > > > > > > > > 2015-03-23 08:53:44,381 WARN org.apache.hadoop.hbase.util.FSHDFSUtils:
> > > > > > > > > > > > Cannot recoverLease after trying for 900000ms
> > > > > > > > > > > > (hbase.lease.recovery.timeout); continuing, but may be DATALOSS!!!;
> > > > > > > > > > > > attempt=40 on
> > > > > > > > > > > > file=hdfs://{my_hmaster_node}:8020/hbase/WALs/{node_i_intentionally_get_down_by_getting_network_down},60020,1426862900506-splitting/{node_i_intentionally_get_down_by_getting_network_down}%2C60020%2C1426862900506.1427096924508
> > > > > > > > > > > > after 908210ms
> > > > > > > > > > > >
> > > > > > > > > > > > And as you can see from the log, it tried 40 times, which took it
> > > > > > > > > > > > exactly 15 minutes.
> > > > > > > > > > > >
> > > > > > > > > > > > There's probably some parameter to tune to cut this down from 40
> > > > > > > > > > > > attempts / 15 minutes to something more useful, as for those 15 minutes
> > > > > > > > > > > > we don't have our regions available, even though HDFS has a replication
> > > > > > > > > > > > factor of 3.
> > > > > > > > > > > >
> > > > > > > > > > > > I'm googling; if I figure out what this is I will post it here. Would
> > > > > > > > > > > > also appreciate it if someone knows how to cut this down.
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks,
> > > > > > > > > > > >
> > > > > > > > > > > > Dejan
> > > > > > > > > > > >
> > > > > > > > > > > > On Fri, Mar 20, 2015 at 3:49 PM Nicolas Liochon <
> > > > > > > nkey...@gmail.com
> > > > > > > > >
> > > > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > >> The split is done by the region servers (the master
> > > > > > > coordinates).
> > > > > > > > Is
> > > > > > > > > > > there
> > > > > > > > > > > >> some interesting stuff in their logs?
> > > > > > > > > > > >>
> > > > > > > > > > > >> On Fri, Mar 20, 2015 at 3:38 PM, Dejan Menges <
> > > > > > > > > dejan.men...@gmail.com
> > > > > > > > > > >
> > > > > > > > > > > >> wrote:
> > > > > > > > > > > >>
> > > > > > > > > > > >> > With the client, the issue was that it was retrying connections to
> > > > > > > > > > > >> > the same region servers even when the regions were reassigned.
> > > > > > > > > > > >> > Lowering it helped in this specific use case, but yes, we still want
> > > > > > > > > > > >> > the servers to reallocate quickly.
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > What got me here:
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > http://hbase.apache.org/book.html#mttr
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > I basically set the configuration exactly the same way as it's
> > > > > > > > > > > >> > explained here. *zookeeper.session.timeout* is (and was before) 60000
> > > > > > > > > > > >> > (one minute).
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > So basically what happens is:
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > - I have a test cluster consisting of four nodes, every node being a
> > > > > > > > > > > >> > DataNode and a RegionServer.
> > > > > > > > > > > >> > - I simulate a network partition on one of them (connect to it through
> > > > > > > > > > > >> > the console and take the network interface down - simulating the
> > > > > > > > > > > >> > network issues we had recently).
> > > > > > > > > > > >> > - After a short time I see in HBase that my RegionServer is dead, and
> > > > > > > > > > > >> > as the total number of regions I see the previous total minus the
> > > > > > > > > > > >> > number of regions that were hosted on the node hosting the
> > > > > > > > > > > >> > RegionServer that just 'disappeared'.
> > > > > > > > > > > >> > - At this point I want my cluster to recover as soon as possible, to
> > > > > > > > > > > >> > start serving the missing regions.
> > > > > > > > > > > >> > - The first thing I see in the HMaster logs is:
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > 2015-03-20 14:17:26,015 INFO org.apache.hadoop.hbase.zookeeper.RegionServerTracker:
> > > > > > > > > > > >> > RegionServer ephemeral node deleted, processing expiration
> > > > > > > > > > > >> > [{name_of_node_I_took_down},60020,1426860403261]
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > 2015-03-20 14:17:26,067 INFO org.apache.hadoop.hbase.master.handler.ServerShutdownHandler:
> > > > > > > > > > > >> > Splitting logs for {name_of_node_I_took_down},60020,1426860403261 before assignment.
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > 2015-03-20 14:17:26,105 INFO org.apache.hadoop.hbase.master.SplitLogManager:
> > > > > > > > > > > >> > dead splitlog workers [{name_of_node_I_took_down},60020,1426860403261]
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > 2015-03-20 14:17:26,107 INFO org.apache.hadoop.hbase.master.SplitLogManager:
> > > > > > > > > > > >> > started splitting 1 logs in
> > > > > > > > > > > >> > [hdfs://{fqdn_of_hmaster}:8020/hbase/WALs/{name_of_node_I_took_down},60020,1426860403261-splitting]
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > 2015-03-20 14:17:26,150 INFO org.apache.hadoop.hbase.master.SplitLogManager: task
> > > > > > > > > > > >> > /hbase/splitWAL/WALs%2F{name_of_node_I_took_down}%2C60020%2C1426860403261-splitting%2F{name_of_node_I_took_down}%252C60020%252C1426860403261.1426860404905
> > > > > > > > > > > >> > acquired by {fqdn_of_another_live_hnode},60020,1426859445623
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > 2015-03-20 14:17:26,182 INFO org.apache.hadoop.hbase.master.SplitLogManager: total tasks = 1 unassigned = 0
> > > > > > > > > > > >> > tasks={/hbase/splitWAL/WALs%2F{name_of_node_I_took_down}%2C60020%2C1426860403261-splitting%2F{name_of_node_I_took_down}%252C60020%252C1426860403261.1426860404905=last_update
> > > > > > > > > > > >> > = 1426861046182 last_version = 2 cur_worker_name = {fqdn_of_another_live_node},60020,1426859445623
> > > > > > > > > > > >> > status = in_progress incarnation = 0 resubmits = 0 batch = installed = 1 done = 0 error = 0}
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > 2015-03-20 14:17:31,183 INFO org.apache.hadoop.hbase.master.SplitLogManager: total tasks = 1 unassigned = 0
> > > > > > > > > > > >> > tasks={/hbase/splitWAL/WALs%2F{name_of_node_I_took_down}%2C60020%2C1426860403261-splitting%2F{name_of_node_I_took_down}%252C60020%252C1426860403261.1426860404905=last_update
> > > > > > > > > > > >> > = 1426861046182 last_version = 2 cur_worker_name = {fqdn_of_another_live_node},60020,1426859445623
> > > > > > > > > > > >> > status = in_progress incarnation = 0 resubmits = 0 batch = installed = 1 done = 0 error = 0}
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > 2015-03-20 14:17:36,184 INFO org.apache.hadoop.hbase.master.SplitLogManager: total tasks = 1 unassigned = 0
> > > > > > > > > > > >> > tasks={/hbase/splitWAL/WALs%2F{name_of_node_I_took_down}%2C60020%2C1426860403261-splitting%2F{name_of_node_I_took_down}%252C60020%252C1426860403261.1426860404905=last_update
> > > > > > > > > > > >> > = 1426861046182 last_version = 2 cur_worker_name = {fqdn_of_another_live_node},60020,1426859445623
> > > > > > > > > > > >> > status = in_progress incarnation = 0 resubmits = 0 batch = installed = 1 done = 0 error = 0}
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > 2015-03-20 14:17:42,185 INFO org.apache.hadoop.hbase.master.SplitLogManager: total tasks = 1 unassigned = 0
> > > > > > > > > > > >> > tasks={/hbase/splitWAL/WALs%2F{name_of_node_I_took_down}%2C60020%2C1426860403261-splitting%2F{name_of_node_I_took_down}%252C60020%252C1426860403261.1426860404905=last_update
> > > > > > > > > > > >> > = 1426861046182 last_version = 2 cur_worker_name = {fqdn_of_another_live_node},60020,1426859445623
> > > > > > > > > > > >> > status = in_progress incarnation = 0 resubmits = 0 batch = installed = 1 done = 0 error = 0}
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > 2015-03-20 14:17:48,184 INFO org.apache.hadoop.hbase.master.SplitLogManager: total tasks = 1 unassigned = 0
> > > > > > > > > > > >> > tasks={/hbase/splitWAL/WALs%2F{name_of_node_I_took_down}%2C60020%2C1426860403261-splitting%2F{name_of_node_I_took_down}%252C60020%252C1426860403261.1426860404905=last_update
> > > > > > > > > > > >> > = 1426861046182 last_version = 2 cur_worker_name = {fqdn_of_another_live_node},60020,1426859445623
> > > > > > > > > > > >> > status = in_progress incarnation = 0 resubmits = 0 batch = installed = 1 done = 0 error = 0}
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > In the meantime, in the hbase...out log I got this:
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > ==> hbase-hbase-master-{fqdn_of_my_hmaster_node}.out <==
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > java.io.IOException: Call to
> > > > > > > > > > > >> > {name_of_node_I_took_down}/{ip_of_local_interface_I_took_down}:60020
> > > > > > > > > > > >> > failed on local exception:
> > > > > > > > > > > >> > org.apache.hadoop.hbase.ipc.RpcClient$CallTimeoutException: Call
> > > > > > > > > > > >> > id=93152, waitTime=60044, rpcTimeout=60000
> > > > > > > > > > > >> > at org.apache.hadoop.hbase.ipc.RpcClient.wrapException(RpcClient.java:1532)
> > > > > > > > > > > >> > at org.apache.hadoop.hbase.ipc.RpcClient.call(RpcClient.java:1502)
> > > > > > > > > > > >> > at org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(RpcClient.java:1684)
> > > > > > > > > > > >> > at org.apache.hadoop.hbase.ipc.RpcClient$BlockingRpcChannelImplementation.callBlockingMethod(RpcClient.java:1737)
> > > > > > > > > > > >> > at org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$BlockingStub.getRegionInfo(AdminProtos.java:20806)
> > > > > > > > > > > >> > at org.apache.hadoop.hbase.client.HBaseAdmin.getCompactionState(HBaseAdmin.java:2524)
> > > > > > > > > > > >> > at org.apache.hadoop.hbase.generated.master.table_jsp._jspService(table_jsp.java:167)
> > > > > > > > > > > >> > at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:98)
> > > > > > > > > > > >> > at javax.servlet.http.HttpServlet.service(HttpServlet.java:770)
> > > > > > > > > > > >> > at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
> > > > > > > > > > > >> > at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
> > > > > > > > > > > >> > at org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:109)
> > > > > > > > > > > >> > at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
> > > > > > > > > > > >> > at org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:1081)
> > > > > > > > > > > >> > at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
> > > > > > > > > > > >> > at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
> > > > > > > > > > > >> > at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
> > > > > > > > > > > >> > at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
> > > > > > > > > > > >> > at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
> > > > > > > > > > > >> > at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
> > > > > > > > > > > >> > at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
> > > > > > > > > > > >> > at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
> > > > > > > > > > > >> > at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
> > > > > > > > > > > >> > at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
> > > > > > > > > > > >> > at org.mortbay.jetty.Server.handle(Server.java:326)
> > > > > > > > > > > >> > at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
> > > > > > > > > > > >> > at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
> > > > > > > > > > > >> > at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
> > > > > > > > > > > >> > at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
> > > > > > > > > > > >> > at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
> > > > > > > > > > > >> > at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410)
> > > > > > > > > > > >> > at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > Caused by: org.apache.hadoop.hbase.ipc.RpcClient$CallTimeoutException:
> > > > > > > > > > > >> > Call id=93152, waitTime=60044, rpcTimeout=60000
> > > > > > > > > > > >> > at org.apache.hadoop.hbase.ipc.RpcClient$Connection.cleanupCalls(RpcClient.java:1234)
> > > > > > > > > > > >> > at org.apache.hadoop.hbase.ipc.RpcClient$Connection.readResponse(RpcClient.java:1171)
> > > > > > > > > > > >> > at org.apache.hadoop.hbase.ipc.RpcClient$Connection.run(RpcClient.java:751)
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > Besides this same issue, please note that the first message was at
> > > > > > > > > > > >> > 2015-03-20 14:17:26,015. And then (we got to the point when the
> > > > > > > > > > > >> > transition started):
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > 2015-03-20 14:32:35,059 INFO org.apache.hadoop.hbase.master.SplitLogManager: task
> > > > > > > > > > > >> > /hbase/splitWAL/WALs%2F{name_of_node_I_took_down}%2C60020%2C1426860403261-splitting%2F{name_of_node_I_took_down}%252C60020%252C1426860403261.1426860404905
> > > > > > > > > > > >> > entered state: DONE {fqdn_of_new_live_node},60020,1426859445623
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > 2015-03-20 14:32:35,109 INFO org.apache.hadoop.hbase.master.SplitLogManager: Done splitting
> > > > > > > > > > > >> > /hbase/splitWAL/WALs%2F{name_of_node_I_took_down}%2C60020%2C1426860403261-splitting%2F{name_of_node_I_took_down}%252C60020%252C1426860403261.1426860404905
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > 2015-03-20 14:32:35,190 INFO org.apache.hadoop.hbase.master.SplitLogManager: finished splitting
> > > > > > > > > > > >> > (more than or equal to) 9 bytes in 1 log files in
> > > > > > > > > > > >> > [hdfs://{fqdn_of_my_hmaster_node}:8020/hbase/WALs/{name_of_node_I_took_down},60020,1426860403261-splitting]
> > > > > > > > > > > >> > in 909083ms
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > 2015-03-20 14:32:35,191 INFO org.apache.hadoop.hbase.master.RegionStates: Transitioned
> > > > > > > > > > > >> > {0e7cc87a4ef6c47a779186f5bf79a01c state=OPEN, ts=1426860639088,
> > > > > > > > > > > >> > server={name_of_node_I_took_down},60020,1426860403261} to
> > > > > > > > > > > >> > {0e7cc87a4ef6c47a779186f5bf79a01c state=OFFLINE, ts=1426861955191,
> > > > > > > > > > > >> > server={name_of_node_I_took_down},60020,1426860403261}
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > 2015-03-20 14:32:35,191 INFO org.apache.hadoop.hbase.master.RegionStates: Offlined
> > > > > > > > > > > >> > 0e7cc87a4ef6c47a779186f5bf79a01c from {name_of_node_I_took_down},60020,1426860403261
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > 2015-03-20 14:32:35,191 INFO org.apache.hadoop.hbase.master.RegionStates: Transitioned
> > > > > > > > > > > >> > {25ab6e7b42e36ddaa723d71be5954543 state=OPEN, ts=1426860641783,
> > > > > > > > > > > >> > server={name_of_node_I_took_down},60020,1426860403261} to
> > > > > > > > > > > >> > {25ab6e7b42e36ddaa723d71be5954543 state=OFFLINE, ts=1426861955191,
> > > > > > > > > > > >> > server={name_of_node_I_took_down},60020,1426860403261}
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > 2015-03-20 14:32:35,191 INFO org.apache.hadoop.hbase.master.RegionStates: Offlined
> > > > > > > > > > > >> > 25ab6e7b42e36ddaa723d71be5954543 from {name_of_node_I_took_down},60020,1426860403261
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > At this point, note that it finished the SplitLogManager task at
> > > > > > > > > > > >> > 14:32:35 and started transitioning just after that. So this is the 15
> > > > > > > > > > > >> > minutes that I'm talking about.
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > What am I missing?
> > > > > > > > > > > >> >
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > On Fri, Mar 20, 2015 at 2:37 PM Nicolas Liochon <
> > > > > > > > > nkey...@gmail.com>
> > > > > > > > > > > >> wrote:
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > > You've changed the value of hbase.zookeeper.timeout to 15 minutes? A
> > > > > > > > > > > >> > > very reasonable target is 1 minute before relocating the regions.
> > > > > > > > > > > >> > > That's the default iirc. You can push it to 20s, but then
> > > > > > > > > > > >> > > gc-stopping-the-world becomes more of an issue. 15 minutes is really
> > > > > > > > > > > >> > > a lot. The hdfs stale mode must always be used, with a lower timeout
> > > > > > > > > > > >> > > than the hbase one.
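> > > > > > > > > > > >> > >
> > > > > > > > > > > >> > > For illustration only, a sketch of that relationship using the values
> > > > > > > > > > > >> > > mentioned in this thread (one-minute ZK session timeout on the hbase
> > > > > > > > > > > >> > > side, stale interval on the hdfs side kept below it):
> > > > > > > > > > > >> > >
> > > > > > > > > > > >> > >   <!-- hbase-site.xml (sketch): region server declared dead after ~1 minute -->
> > > > > > > > > > > >> > >   <property>
> > > > > > > > > > > >> > >     <name>zookeeper.session.timeout</name>
> > > > > > > > > > > >> > >     <value>60000</value>
> > > > > > > > > > > >> > >   </property>
> > > > > > > > > > > >> > >
> > > > > > > > > > > >> > >   <!-- hdfs-site.xml (sketch): datanode marked stale well before that -->
> > > > > > > > > > > >> > >   <property>
> > > > > > > > > > > >> > >     <name>dfs.namenode.stale.datanode.interval</name>
> > > > > > > > > > > >> > >     <value>30000</value>
> > > > > > > > > > > >> > >   </property>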
> > > > > > > > > > > >> > >
> > > > > > > > > > > >> > > Client side there should be nothing to do (except for advanced
> > > > > > > > > > > >> > > stuff); at each retry the client checks the location of the regions.
> > > > > > > > > > > >> > > If you lower the number of retries the client will fail sooner, but
> > > > > > > > > > > >> > > usually you don't want the client to fail, you want the servers to
> > > > > > > > > > > >> > > reallocate quickly.
> > > > > > > > > > > >> > >
> > > > > > > > > > > >> > > On Fri, Mar 20, 2015 at 1:36 PM, Dejan Menges <
> > > > > > > > > > > dejan.men...@gmail.com
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > > wrote:
> > > > > > > > > > > >> > >
> > > > > > > > > > > >> > > > Hi,
> > > > > > > > > > > >> > > >
> > > > > > > > > > > >> > > > Sorry for the slightly late update, but I managed to narrow it down
> > > > > > > > > > > >> > > > a little bit.
> > > > > > > > > > > >> > > >
> > > > > > > > > > > >> > > > We didn't update yet, as we are using the Hortonworks distribution
> > > > > > > > > > > >> > > > right now, and even if we update we will get 0.98.4. However, it
> > > > > > > > > > > >> > > > looks like the issue here was in our use case and configuration
> > > > > > > > > > > >> > > > (still looking into it).
> > > > > > > > > > > >> > > >
> > > > > > > > > > > >> > > > Basically, initially I saw that when one server goes down we start
> > > > > > > > > > > >> > > > having performance issues in general, but that turned out to be on
> > > > > > > > > > > >> > > > our client side, due to caching: clients were trying to reconnect to
> > > > > > > > > > > >> > > > nodes that were offline and later trying to get regions from those
> > > > > > > > > > > >> > > > nodes too. This is basically why, on the server side, I didn't manage
> > > > > > > > > > > >> > > > to see anything in the logs that would be at least a little bit
> > > > > > > > > > > >> > > > interesting and point me in the desired direction.
> > > > > > > > > > > >> > > >
> > > > > > > > > > > >> > > > Another question that popped up for me is - in case a server is down
> > > > > > > > > > > >> > > > (and with it the DataNode and HRegionServer it was hosting) - what's
> > > > > > > > > > > >> > > > the optimal time for HMaster to consider the server dead and reassign
> > > > > > > > > > > >> > > > its regions somewhere else, as this is another performance bottleneck
> > > > > > > > > > > >> > > > we hit while being unable to access regions? In our case it's
> > > > > > > > > > > >> > > > configured to 15 minutes, and simple logic tells me that if you want
> > > > > > > > > > > >> > > > it earlier then you configure a lower number of retries, but the
> > > > > > > > > > > >> > > > issue is, as always, in the details, so I'm not sure if anyone knows
> > > > > > > > > > > >> > > > some better math for this?
> > > > > > > > > > > >> > > >
> > > > > > > > > > > >> > > > And the last question - is it possible to manually force HBase to
> > > > > > > > > > > >> > > > reassign regions? In this case, while HMaster is retrying to contact
> > > > > > > > > > > >> > > > the node that's dead, it's impossible to force it using the
> > > > > > > > > > > >> > > > 'balancer' command.
> > > > > > > > > > > >> > > >
> > > > > > > > > > > >> > > > Thanks a lot!
> > > > > > > > > > > >> > > >
> > > > > > > > > > > >> > > > Dejan
> > > > > > > > > > > >> > > >
> > > > > > > > > > > >> > > > On Tue, Mar 17, 2015 at 9:37 AM Dejan Menges <
> > > > > > > > > > > >> dejan.men...@gmail.com>
> > > > > > > > > > > >> > > > wrote:
> > > > > > > > > > > >> > > >
> > > > > > > > > > > >> > > > > Hi,
> > > > > > > > > > > >> > > > >
> > > > > > > > > > > >> > > > > To be very honest - there's no particular reason why we stick to
> > > > > > > > > > > >> > > > > this one, beside just the current lack of time to go through the
> > > > > > > > > > > >> > > > > upgrade process, but it looks to me like that's going to be the next
> > > > > > > > > > > >> > > > > step.
> > > > > > > > > > > >> > > > >
> > > > > > > > > > > >> > > > > Had a crazy day, didn't have time to go through all the logs again,
> > > > > > > > > > > >> > > > > plus one of the machines (the last one where we had this issue) was
> > > > > > > > > > > >> > > > > fully reprovisioned yesterday so I don't have logs from there
> > > > > > > > > > > >> > > > > anymore.
> > > > > > > > > > > >> > > > >
> > > > > > > > > > > >> > > > > Besides upgrading, which I will talk about today, can you just point
> > > > > > > > > > > >> > > > > me to the specific RPC issue in 0.98.0? The thing is that we have
> > > > > > > > > > > >> > > > > some strange moments with RPC in this case, and I just want to see if
> > > > > > > > > > > >> > > > > that's the same thing (we were even suspecting RPC).
> > > > > > > > > > > >> > > > >
> > > > > > > > > > > >> > > > > Thanks a lot!
> > > > > > > > > > > >> > > > > Dejan
> > > > > > > > > > > >> > > > >
> > > > > > > > > > > >> > > > > On Mon, Mar 16, 2015 at 9:32 PM, Andrew
> > Purtell
> > > <
> > > > > > > > > > > >> apurt...@apache.org
> > > > > > > > > > > >> > >
> > > > > > > > > > > >> > > > > wrote:
> > > > > > > > > > > >> > > > >
> > > > > > > > > > > >> > > > >> Is there a particular reason why you are using HBase 0.98.0? The
> > > > > > > > > > > >> > > > >> latest 0.98 release is 0.98.11. There's a known performance issue
> > > > > > > > > > > >> > > > >> with 0.98.0 pertaining to RPC that was fixed in later releases, you
> > > > > > > > > > > >> > > > >> should move up from 0.98.0. In addition hundreds of improvements and
> > > > > > > > > > > >> > > > >> bug fixes have gone into the ten releases since 0.98.0.
> > > > > > > > > > > >> > > > >>
> > > > > > > > > > > >> > > > >> On Mon, Mar 16, 2015 at 6:40 AM, Dejan
> > Menges <
> > > > > > > > > > > >> > dejan.men...@gmail.com
> > > > > > > > > > > >> > > >
> > > > > > > > > > > >> > > > >> wrote:
> > > > > > > > > > > >> > > > >>
> > > > > > > > > > > >> > > > >> > Hi All,
> > > > > > > > > > > >> > > > >> >
> > > > > > > > > > > >> > > > >> > We have a strange issue with HBase performance (overall cluster
> > > > > > > > > > > >> > > > >> > performance) in case one of the datanodes in the cluster
> > > > > > > > > > > >> > > > >> > unexpectedly goes down.
> > > > > > > > > > > >> > > > >> >
> > > > > > > > > > > >> > > > >> > So the scenario is as follows:
> > > > > > > > > > > >> > > > >> > - Cluster works fine, it's stable.
> > > > > > > > > > > >> > > > >> > - One DataNode unexpectedly goes down (PSU issue, network issue,
> > > > > > > > > > > >> > > > >> > anything).
> > > > > > > > > > > >> > > > >> > - The whole HBase cluster goes down (performance becomes so bad
> > > > > > > > > > > >> > > > >> > that we have to restart all RegionServers to get it back to life).
> > > > > > > > > > > >> > > > >> >
> > > > > > > > > > > >> > > > >> > The funniest and most recent issue that happened was that we added
> > > > > > > > > > > >> > > > >> > a new node to the cluster (having 8 x 4T SATA disks) and we left
> > > > > > > > > > > >> > > > >> > just the DataNode running on it, to give it a couple of days to get
> > > > > > > > > > > >> > > > >> > some data. At some point in time, due to a hardware issue, the
> > > > > > > > > > > >> > > > >> > server rebooted (twice during three hours) at a moment when it had
> > > > > > > > > > > >> > > > >> > maybe 5% of the data it would have had in a couple of days. Nothing
> > > > > > > > > > > >> > > > >> > else besides the DataNode was running on it, and once it went down
> > > > > > > > > > > >> > > > >> > it affected literally everything, and restarting the RegionServers
> > > > > > > > > > > >> > > > >> > in the end fixed it.
> > > > > > > > > > > >> > > > >> >
> > > > > > > > > > > >> > > > >> > We are using HBase 0.98.0 with Hadoop 2.4.0
> > > > > > > > > > > >> > > > >> >
> > > > > > > > > > > >> > > > >>
> > > > > > > > > > > >> > > > >>
> > > > > > > > > > > >> > > > >>
> > > > > > > > > > > >> > > > >> --
> > > > > > > > > > > >> > > > >> Best regards,
> > > > > > > > > > > >> > > > >>
> > > > > > > > > > > >> > > > >>    - Andy
> > > > > > > > > > > >> > > > >>
> > > > > > > > > > > >> > > > >> Problems worthy of attack prove their worth by hitting back. - Piet Hein
> > > > > > > > > > > >> > > > >> (via Tom White)
> > > > > > > > > > > >> > > > >>
> > > > > > > > > > > >> > > > >
> > > > > > > > > > > >> > > > >
> > > > > > > > > > > >> > > >
> > > > > > > > > > > >> > >
> > > > > > > > > > > >> >
> > > > > > > > > > > >>
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
