...and I also made sure that it's applied, with hdfs getconf -confKey...

On Mon, Mar 23, 2015 at 5:31 PM Dejan Menges <[email protected]> wrote:
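For anyone wanting to run the same check, the three stale-datanode keys discussed further down this thread can be verified in one go. A small sketch (run the printed commands on the NameNode host, since `hdfs getconf -confKey` shows the value the local configuration resolves to):

```shell
# The stale-datanode keys discussed in this thread.
keys="dfs.namenode.avoid.read.stale.datanode
dfs.namenode.avoid.write.stale.datanode
dfs.namenode.stale.datanode.interval"

# Print one verification command per key; run these where the NameNode runs,
# so you see the configuration the daemon itself would load.
for key in $keys; do
  echo "hdfs getconf -confKey $key"
done
```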
It was true all the time, together with dfs.namenode.avoid.read.stale.datanode.

On Mon, Mar 23, 2015 at 5:29 PM Nicolas Liochon <[email protected]> wrote:

Actually, double-checking the final patch in HDFS-4721, the stale mode is taken into account. Bryan is right, it's worth checking the namenode's config. Especially, dfs.namenode.avoid.write.stale.datanode must be set to true on the namenode.

On Mon, Mar 23, 2015 at 5:08 PM, Nicolas Liochon <[email protected]> wrote:

stale should not help for recoverLease: it helps for reads, but that's the step after lease recovery. It's not needed in recoverLease because recoverLease in hdfs just sorts the datanodes by heartbeat time, so the stale datanode will usually be the last one in the list.

On Mon, Mar 23, 2015 at 4:38 PM, Bryan Beaudreault <[email protected]> wrote:

@Nicolas, I see, thanks. I'll keep the settings at default. So really, if everything else is configured properly, you should never reach the lease recovery timeout in any failure scenario. It seems that the staleness check would be the thing that prevents this, correct? I'm surprised it didn't help Dejan.

On Mon, Mar 23, 2015 at 11:20 AM, Nicolas Liochon <[email protected]> wrote:

@bryan: yes, you can change hbase.lease.recovery.timeout if you changed the hdfs settings. But this setting is really for desperate cases: the recoverLease should have succeeded before. As well, if you depend on hbase.lease.recovery.timeout, it means you're wasting recovery time: the lease should be recovered in a few seconds.
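For reference, the NameNode-side settings mentioned above live in hdfs-site.xml. A sketch with the values discussed in this thread (property names as quoted above):

```xml
<!-- hdfs-site.xml on the NameNode; values as discussed in this thread -->
<property>
  <name>dfs.namenode.avoid.write.stale.datanode</name>
  <value>true</value>
</property>
<property>
  <name>dfs.namenode.avoid.read.stale.datanode</name>
  <value>true</value>
</property>
<property>
  <!-- a datanode is considered "stale" after missing heartbeats this long (ms) -->
  <name>dfs.namenode.stale.datanode.interval</name>
  <value>30000</value>
</property>
```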
On Mon, Mar 23, 2015 at 3:59 PM, Dejan Menges <[email protected]> wrote:

Interesting discussion I also found, which gives me some more light on what Nicolas is talking about - https://issues.apache.org/jira/browse/HDFS-3703

On Mon, Mar 23, 2015 at 3:53 PM Bryan Beaudreault <[email protected]> wrote:

So is it safe to set hbase.lease.recovery.timeout lower if you also set heartbeat.recheck.interval lower (lowering that 10.5 min dead node timer)? Or is it recommended not to touch either of those?

Reading the above with interest, thanks for digging in here guys.

On Mon, Mar 23, 2015 at 10:13 AM, Nicolas Liochon <[email protected]> wrote:

If the node is actually down it's fine. But the node may not be that down (CAP theorem here), and then it's looking for trouble. HDFS, by default, declares a node as dead after 10:30. 15 minutes is an extra safety margin. It seems your hdfs settings are different (or there is a bug...). There should be some info in the hdfs logs.

On Mon, Mar 23, 2015 at 3:05 PM, Dejan Menges <[email protected]> wrote:

Will take a look.

Actually, if a node is down (someone unplugged the network cable, it just died, whatever), what's the chance it's going to come back to life so the write can continue?
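The 10:30 figure quoted above is not arbitrary: the NameNode derives its dead-node timeout from two heartbeat settings. A quick sketch of that arithmetic with the stock defaults (the formula is the standard HDFS one, 2 * recheck interval + 10 * heartbeat interval):

```shell
# HDFS declares a datanode dead after:
#   2 * dfs.namenode.heartbeat.recheck-interval + 10 * dfs.heartbeat.interval
recheck_ms=300000   # dfs.namenode.heartbeat.recheck-interval default (5 min)
heartbeat_ms=3000   # dfs.heartbeat.interval default (3 s)

timeout_ms=$(( 2 * recheck_ms + 10 * heartbeat_ms ))
echo "${timeout_ms} ms"                                                   # 630000 ms
echo "$(( timeout_ms / 60000 )) min $(( timeout_ms % 60000 / 1000 )) s"   # 10 min 30 s
```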
On the other side, HBase is starting recovery by trying to contact a node which is not there anymore, one that was even elected as dead on the Namenode side (another thing I didn't understand quite well).

So what I was expecting is that as soon as the Namenode decided the node was dead, that would be enough for the RegionServer to stop trying to recover from the dead node, but it wasn't the case. Also, this whole MTTR article in the HBase book doesn't work at all with this parameter set to its default value (15 minutes).

So I'm still struggling to figure out what exactly the drawback of this is?

On Mon, Mar 23, 2015 at 2:50 PM Nicolas Liochon <[email protected]> wrote:

Thanks for the explanation. There is an issue if you modify this setting, however. hbase tries to recover the lease (i.e. to be sure that nobody is writing). If you change hbase.lease.recovery.timeout, hbase will start the recovery (i.e. start to read) even if it's not sure that nobody's writing. That means there is a dataloss risk. Basically, you must not see this warning: WARN org.apache.hadoop.hbase.util.FSHDFSUtils: Cannot recoverLease after trying for[]

The recoverLease must succeed. The fact that it does not after X tries is strange.
There may be a mismatch between the hbase parameters and the hdfs ones. You may need to have a look at the comments in FSHDFSUtils.java.

On Mon, Mar 23, 2015 at 2:15 PM, Dejan Menges <[email protected]> wrote:

I found the issue and fixed it, and will try to explain here what exactly happened in our case, in case someone else finds this interesting too.

So initially, we had (a couple of times) some network and hardware issues in our datacenters. When one server died (literally died; we had some issues with PSUs) we saw issues with overall cluster performance on the HBase side. As the cluster is quite big and live, it was also quite hard to figure out the exact root cause and how to fix the stuff we wanted to fix.

So I set up another cluster: four nodes (with DataNode and RegionServer) and two other nodes with HMaster and Namenode in HA, using the same stuff we use in production. We pumped some data into it, and I was able to reproduce the same issue on it last week. What I tried to do is to cut off one server (shut down its interface) when all is good with the cluster, while we have load, and see what's going to happen.
On Friday, after what Nicolas mentioned, I started taking a look at the HBase logs on the node which was mentioned in the HMaster log as the one taking over regions for the dead server. Basically, what I was able to observe was a 15-minute gap (plus or minus a couple of seconds, which was also interesting; I will get to that later) between the moment HMaster figured out that one of its RegionServers was dead, and the moment one of the mentioned nodes started taking over those regions and they started appearing in HMaster's Web UI.

I then set up everything as mentioned here - http://hbase.apache.org/book.html#mttr - but still had exactly the same issues. Went over (again and again) all the currently configured stuff, but still had the same issue.

Then I started looking into HDFS. Opened the NameNode UI, saw all was good, took one node down while also watching the RegionServer logs at the same time, and saw that it also took ~15 minutes for the Namenode to elect the dead node as dead. Somehow at the same moment, regions started coming back to life. I remembered that some older versions had dfs timeout checks and check retries.
Looked into the defaults for our Hadoop version - http://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml - and saw there that there's no recheck parameter anymore. Strange, as on StackOverflow I found a post from a month ago for a newer version than we use (we use 2.4.1, the guy was using 2.6 - dfs.namenode.heartbeat.recheck-interval). I set it to 10 seconds as he mentioned, keeping checks at every three seconds (the default), and got the DataNode elected as dead in ~45 seconds, as he mentioned. Not sure why this parameter is not documented, but it obviously works.

Then I figured out it still didn't fix our HBase failover issue. I was looking into the HBase book again and again, and then saw this part:

"How much time we allow elapse between calls to recover lease. Should be larger than the dfs timeout."

This was the description for hbase.lease.recovery.dfs.timeout. I wasn't sure from the comment which of all the timeouts that can be set in Hadoop/HBase, and that have something to do with DFS, this was about. But I picked hbase.lease.recovery.timeout, and that was the catch.
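The ~45 seconds reported above also matches the standard HDFS dead-node formula once the recheck interval is lowered to 10 seconds; a quick sanity check of the arithmetic:

```shell
# Same dead-node formula as the stock 10:30 figure, but with
# dfs.namenode.heartbeat.recheck-interval lowered to 10 s:
recheck_ms=10000
heartbeat_ms=3000   # dfs.heartbeat.interval left at its 3 s default

echo "$(( 2 * recheck_ms + 10 * heartbeat_ms )) ms"   # 50000 ms, close to the ~45 s observed
```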
Initially, by default, hbase.lease.recovery.timeout is set to 15 minutes. Not sure why, and I haven't been able to find out yet, but getting this down to one minute (which is more than OK for us) I was able to get rid of our issue. Also not sure why this is not mentioned in the MTTR section of the HBase book, as obviously MTTR doesn't work at all with this default timeout; at least it doesn't work the way we expected it to work.

So thanks again to everyone being spammed with this, and special thanks to Nicolas for pointing me in the right direction.

On Mon, Mar 23, 2015 at 1:37 PM Nicolas Liochon <[email protected]> wrote:

the attachments are rejected by the mailing list, can you put them on pastebin?

stale is mandatory (so it's good), but the issue here is just before. The region server needs to read the file. In order to be sure that there is no data loss, it needs to "recover the lease", which means ensuring that nobody is writing the file. The regionserver calls the namenode to do this recoverLease. So there should be some info in the namenode logs.
You have HDFS-4721 on your hdfs? The hbase book details (more or less...) this recoverLease stuff.

On Mon, Mar 23, 2015 at 10:33 AM, Dejan Menges <[email protected]> wrote:

And also, just checked - dfs.namenode.avoid.read.stale.datanode and dfs.namenode.avoid.write.stale.datanode are both true, and dfs.namenode.stale.datanode.interval is set to the default 30000.

On Mon, Mar 23, 2015 at 10:03 AM Dejan Menges <[email protected]> wrote:

Hi Nicolas,

Please find the log attached.
As I see it now more clearly, it was trying to recover HDFS WALs from the node that's dead:

2015-03-23 08:53:44,381 WARN org.apache.hadoop.hbase.util.FSHDFSUtils: Cannot recoverLease after trying for 900000ms (hbase.lease.recovery.timeout); continuing, but may be DATALOSS!!!; attempt=40 on file=hdfs://{my_hmaster_node}:8020/hbase/WALs/{node_i_intentionally_get_down_by_getting_network_down},60020,1426862900506-splitting/{node_i_intentionally_get_down_by_getting_network_down}%2C60020%2C1426862900506.1427096924508 after 908210ms

And as you can see from the log, it tried 40 times, which took exactly 15 minutes.

There's probably some parameter to tune to cut it down from 40 times / 15 minutes to something more useful, as for 15 minutes we don't have our regions available, even though HDFS has replication factor 3.

Googling; if I figure out what this is I will post it here. Will also appreciate it if someone knows how to cut this down.
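As it turns out further up this thread, the parameter that cuts this down is hbase.lease.recovery.timeout. A sketch of the hbase-site.xml change, using the one-minute value Dejan eventually settled on (mind Nicolas's data-loss warning above before copying this):

```xml
<!-- hbase-site.xml; 60000 ms = the one-minute value arrived at in this thread.
     Lowering this trades safety for MTTR: HBase may start reading a WAL before
     the lease is safely recovered, which risks data loss if the old writer is
     still alive. -->
<property>
  <name>hbase.lease.recovery.timeout</name>
  <value>60000</value>
</property>
```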
Thanks,

Dejan

On Fri, Mar 20, 2015 at 3:49 PM Nicolas Liochon <[email protected]> wrote:

The split is done by the region servers (the master coordinates). Is there some interesting stuff in their logs?

On Fri, Mar 20, 2015 at 3:38 PM, Dejan Menges <[email protected]> wrote:

With the client, the issue was that it was retrying connections to the same region servers even when the regions were reassigned. Lowering it down helped in this specific use case, but yes, we still want servers to reallocate quickly.

What got me here:

http://hbase.apache.org/book.html#mttr

I basically set the configuration exactly the same way as it's explained here. *zookeeper.session.timeout* is (and was before) 60000 (one minute).

So basically what happens is:

- I have a test cluster consisting of four nodes, every node being a DataNode and a RegionServer.
- I simulate a network partition on one (connect to it through the console and take the network interface down - simulating network issues we had recently).
- After a short time I see in HBase that my RegionServer is dead, and as the total number of regions I see the previous total minus the number of regions that were hosted on the node hosting the RegionServer that just 'disappeared'.
- At this point I want my cluster to recover as soon as possible, to start serving the missing regions.
- The first thing I see in the HMaster logs are:

2015-03-20 14:17:26,015 INFO org.apache.hadoop.hbase.zookeeper.RegionServerTracker: RegionServer ephemeral node deleted, processing expiration [{name_of_node_I_took_down},60020,1426860403261]

2015-03-20 14:17:26,067 INFO org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Splitting logs for {name_of_node_I_took_down},60020,1426860403261 before assignment.
2015-03-20 14:17:26,105 INFO org.apache.hadoop.hbase.master.SplitLogManager: dead splitlog workers [{name_of_node_I_took_down},60020,1426860403261]

2015-03-20 14:17:26,107 INFO org.apache.hadoop.hbase.master.SplitLogManager: started splitting 1 logs in [hdfs://{fqdn_of_hmaster}:8020/hbase/WALs/{name_of_node_I_took_down},60020,1426860403261-splitting]

2015-03-20 14:17:26,150 INFO org.apache.hadoop.hbase.master.SplitLogManager: task /hbase/splitWAL/WALs%2F{name_of_node_I_took_down}%2C60020%2C1426860403261-splitting%2F{name_of_node_I_took_down}%252C60020%252C1426860403261.1426860404905 acquired by {fqdn_of_another_live_node},60020,1426859445623

2015-03-20 14:17:26,182 INFO org.apache.hadoop.hbase.master.SplitLogManager: total tasks = 1 unassigned = 0 tasks={/hbase/splitWAL/WALs%2F{name_of_node_I_took_down}%2C60020%2C1426860403261-splitting%2F{name_of_node_I_took_down}%252C60020%252C1426860403261.1426860404905=last_update = 1426861046182 last_version = 2 cur_worker_name = {fqdn_of_another_live_node},60020,1426859445623 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 1 done = 0 error = 0}

The same "total tasks = 1 unassigned = 0 ... status = in_progress" line then repeats unchanged at 14:17:31,183, 14:17:36,184, 14:17:42,185 and 14:17:48,184.

In the meantime, in the hbase...out log I got this:

==> hbase-hbase-master-{fqdn_of_my_hmaster_node}.out <==

java.io.IOException: Call to {name_of_node_I_took_down}/{ip_of_local_interface_I_took_down}:60020 failed on local exception: org.apache.hadoop.hbase.ipc.RpcClient$CallTimeoutException: Call id=93152, waitTime=60044, rpcTimeout=60000
	at org.apache.hadoop.hbase.ipc.RpcClient.wrapException(RpcClient.java:1532)
	at org.apache.hadoop.hbase.ipc.RpcClient.call(RpcClient.java:1502)
	at org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(RpcClient.java:1684)
	at org.apache.hadoop.hbase.ipc.RpcClient$BlockingRpcChannelImplementation.callBlockingMethod(RpcClient.java:1737)
	at org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$BlockingStub.getRegionInfo(AdminProtos.java:20806)
	at org.apache.hadoop.hbase.client.HBaseAdmin.getCompactionState(HBaseAdmin.java:2524)
	at org.apache.hadoop.hbase.generated.master.table_jsp._jspService(table_jsp.java:167)
	at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:98)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:770)
	at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
	at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
	at org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:109)
	at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
	at org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:1081)
	at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
	at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
	at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
	at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
	at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
	at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
	at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
	at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
	at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
	at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
	at org.mortbay.jetty.Server.handle(Server.java:326)
	at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
	at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
	at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
	at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
	at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
	at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410)
	at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
Caused by: org.apache.hadoop.hbase.ipc.RpcClient$CallTimeoutException: Call id=93152, waitTime=60044, rpcTimeout=60000
	at org.apache.hadoop.hbase.ipc.RpcClient$Connection.cleanupCalls(RpcClient.java:1234)
	at org.apache.hadoop.hbase.ipc.RpcClient$Connection.readResponse(RpcClient.java:1171)
	at org.apache.hadoop.hbase.ipc.RpcClient$Connection.run(RpcClient.java:751)

Beside this same issue, please note that the first message was at 2015-03-20 14:17:26,015. And then (we got to the point when it started the transition):

2015-03-20 14:32:35,059 INFO org.apache.hadoop.hbase.master.SplitLogManager: task /hbase/splitWAL/WALs%2F{name_of_node_I_took_down}%2C60020%2C1426860403261-splitting%2F{name_of_node_I_took_down}%252C60020%252C1426860403261.1426860404905 entered state: DONE {fqdn_of_new_live_node},60020,1426859445623

2015-03-20 14:32:35,109 INFO org.apache.hadoop.hbase.master.SplitLogManager: Done splitting /hbase/splitWAL/WALs%2F{name_of_node_I_took_down}%2C60020%2C1426860403261-splitting%2F{name_of_node_I_took_down}%252C60020%252C1426860403261.1426860404905

2015-03-20 14:32:35,190 INFO org.apache.hadoop.hbase.master.SplitLogManager: finished splitting (more than or equal to) 9 bytes in 1 log files in
[hdfs://{fqdn_of_my_hmaster_node}:8020/hbase/WALs/{name_ >> >> > > > of_ >> >> > > > > > > > > > >> node_I_took_down},60020,1426860403261-splitting] >> >> > > > > > > > > > >> > in 909083ms >> >> > > > > > > > > > >> > >> >> > > > > > > > > > >> > 2015-03-20 14:32:35,191 INFO >> >> > > > org.apache.hadoop.hbase.master >> >> > > > > > > > > > >> .RegionStates: >> >> > > > > > > > > > >> > Transitioned {0e7cc87a4ef6c47a779186f5bf79a01c >> >> > > > state=OPEN, >> >> > > > > > > > > > >> > ts=1426860639088, >> >> > > > > > > > > > server={name_of_node_I_took_do >> wn},60020,1426860403261} >> >> > > > > > > > > > >> to >> >> > > > > > > > > > >> > {0e7cc87a4ef6c47a779186f5bf79a01c >> state=OFFLINE, >> >> > > > > > > ts=1426861955191, >> >> > > > > > > > > > >> server= >> >> > > > > > > > > > >> > {name_of_node_I_took_down},60020,1426860403261} >> >> > > > > > > > > > >> > >> >> > > > > > > > > > >> > 2015-03-20 14:32:35,191 INFO >> >> > > > org.apache.hadoop.hbase.master >> >> > > > > > > > > > >> .RegionStates: >> >> > > > > > > > > > >> > Offlined 0e7cc87a4ef6c47a779186f5bf79a01c from >> >> > > > > > > > > > >> {name_of_node_I_took_down} >> >> > > > > > > > > > >> > ,60020,1426860403261 >> >> > > > > > > > > > >> > >> >> > > > > > > > > > >> > 2015-03-20 14:32:35,191 INFO >> >> > > > org.apache.hadoop.hbase.master >> >> > > > > > > > > > >> .RegionStates: >> >> > > > > > > > > > >> > Transitioned {25ab6e7b42e36ddaa723d71be5954543 >> >> > > > state=OPEN, >> >> > > > > > > > > > >> > ts=1426860641783, >> >> > > > > > > > > > server={name_of_node_I_took_do >> wn},60020,1426860403261} >> >> > > > > > > > > > >> to >> >> > > > > > > > > > >> > {25ab6e7b42e36ddaa723d71be5954543 >> state=OFFLINE, >> >> > > > > > > ts=1426861955191, >> >> > > > > > > > > > >> server= >> >> > > > > > > > > > >> > {name_of_node_I_took_down},60020,1426860403261} >> >> > > > > > > > > > >> > >> >> > > > > > > > > > >> > 2015-03-20 14:32:35,191 INFO >> >> > > > org.apache.hadoop.hbase.master >> >> 
> > > > > > > > > >> .RegionStates: >> >> > > > > > > > > > >> > Offlined 25ab6e7b42e36ddaa723d71be5954543 from >> >> > > > > > > > > > >> {name_of_node_I_took_down} >> >> > > > > > > > > > >> > ,60020,1426860403261 >> >> > > > > > > > > > >> > At this point, note that it finished >> >> SplitLogManager >> >> > > task >> >> > > > at >> >> > > > > > > > > 14:32:35 >> >> > > > > > > > > > >> and >> >> > > > > > > > > > >> > started transitioning just after that. So this >> is >> >> 15 >> >> > > > minutes >> >> > > > > > > that >> >> > > > > > > > > I'm >> >> > > > > > > > > > >> > talking about. >> >> > > > > > > > > > >> > >> >> > > > > > > > > > >> > What am I missing? >> >> > > > > > > > > > >> > >> >> > > > > > > > > > >> > >> >> > > > > > > > > > >> > On Fri, Mar 20, 2015 at 2:37 PM Nicolas Liochon >> < >> >> > > > > > > > [email protected]> >> >> > > > > > > > > > >> wrote: >> >> > > > > > > > > > >> > >> >> > > > > > > > > > >> > > You've changed the value of >> >> hbase.zookeeper.timeout >> >> > to >> >> > > > 15 >> >> > > > > > > > > minutes? A >> >> > > > > > > > > > >> very >> >> > > > > > > > > > >> > > reasonable target is 1 minute before >> relocating >> >> the >> >> > > > > regions. >> >> > > > > > > > > That's >> >> > > > > > > > > > >> the >> >> > > > > > > > > > >> > > default iirc. You can push it to 20s, but then >> >> > > > > > > > > gc-stopping-the-world >> >> > > > > > > > > > >> > > becomes more of an issue. 15 minutes is >> really a >> >> > lot. >> >> > > > The >> >> > > > > > hdfs >> >> > > > > > > > > stale >> >> > > > > > > > > > >> mode >> >> > > > > > > > > > >> > > must always be used, with a lower timeout than >> >> the >> >> > > hbase >> >> > > > > > one. 
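Nicolas's stale-mode advice above translates into NameNode-side HDFS settings plus an HBase session timeout that is longer than the stale interval. A minimal sketch of the relevant properties; the property names are the standard HDFS/HBase ones, but the values shown are illustrative defaults, not recommendations from this thread:

```xml
<!-- hdfs-site.xml (must be set on the NameNode): mark a DataNode stale
     well before it is declared dead, and avoid it for reads and writes. -->
<property>
  <name>dfs.namenode.avoid.read.stale.datanode</name>
  <value>true</value>
</property>
<property>
  <name>dfs.namenode.avoid.write.stale.datanode</name>
  <value>true</value>
</property>
<property>
  <!-- 30s is the usual default stale interval -->
  <name>dfs.namenode.stale.datanode.interval</name>
  <value>30000</value>
</property>

<!-- hbase-site.xml: keep the ZooKeeper session timeout above the HDFS
     stale interval; ~1 minute is the target Nicolas mentions. -->
<property>
  <name>zookeeper.session.timeout</name>
  <value>60000</value>
</property>
```

With this ordering, HDFS stops routing reads/writes to the flaky node within ~30 s, before HBase's one-minute session expiry kicks off region reassignment.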
> Client side there should be nothing to do (except for advanced stuff); at
> each retry the client checks the location of the regions. If you lower the
> number of retries the client will fail sooner, but usually you don't want
> the client to fail, you want the servers to reallocate quickly.
>
> On Fri, Mar 20, 2015 at 1:36 PM, Dejan Menges <[email protected]> wrote:
>
>> Hi,
>>
>> Sorry for the slightly late update, but I managed to narrow it down a bit.
>>
>> We didn't upgrade yet, as we are using the Hortonworks distribution right
>> now, and even if we upgrade we will get 0.98.4. However, it looks like the
>> issue here was in our use case and configuration (still looking into it).
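Nicolas's client-retry point can be made concrete: the total time an HBase client spends retrying is roughly hbase.client.pause multiplied by a fixed backoff table. A small sketch; the multiplier table below is my recollection of the 0.98-era HConstants.RETRY_BACKOFF and should be verified against your HBase version, and 100 ms is the default hbase.client.pause:

```python
# Rough cumulative time an HBase client keeps retrying before giving up.
# RETRY_BACKOFF mirrors the 0.98-era HConstants.RETRY_BACKOFF table
# (an assumption to verify against your version); `pause` is
# hbase.client.pause in seconds.
RETRY_BACKOFF = [1, 2, 3, 5, 10, 20, 40, 100, 100, 100, 100, 200, 200]

def total_retry_time(retries, pause=0.1):
    """Sum of sleeps over `retries` attempts; past the end of the
    table the last multiplier is reused."""
    total = 0.0
    for attempt in range(retries):
        idx = min(attempt, len(RETRY_BACKOFF) - 1)
        total += pause * RETRY_BACKOFF[idx]
    return total

# With the default 100 ms pause, 10 retries already take ~38 seconds,
# and 31 retries (an old default) take ~7.5 minutes, which is why
# lowering hbase.client.retries.number makes clients fail much sooner.
print(round(total_retry_time(10), 1))
print(round(total_retry_time(31), 1))
```

This also illustrates Nicolas's point that lowering retries only changes when the *client* gives up; it does nothing to make the *servers* reassign regions faster.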
>> Basically, I initially saw that when one server goes down we start having
>> performance issues in general, but that turned out to be on our client
>> side, due to caching: clients were trying to reconnect to nodes that were
>> offline, and later trying to get regions from those nodes too. This is
>> basically why, on the server side, I didn't manage to see anything in the
>> logs that was even a little bit interesting or pointed me in the desired
>> direction.
>> Another question that popped up for me: in case a server is down (and
>> with it the DataNode and the HRegionServer it was hosting), what's the
>> optimal time to set for HMaster to consider the server dead and reassign
>> its regions somewhere else? This is another performance bottleneck we hit
>> while being unable to access regions. In our case it's configured to 15
>> minutes, and simple logic tells me that if you want it sooner you
>> configure a lower number of retries, but the issue is, as always, in the
>> details, so I'm not sure if anyone knows some better math for this?
>>
>> And a last question: is it possible to manually force HBase to reassign
>> regions? In this case, while HMaster is retrying to contact the node
>> that's dead, it's impossible to force it using the 'balancer' command.
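On the "better math" question: the HDFS side of dead-node detection is a fixed formula, 2 * dfs.namenode.heartbeat.recheck-interval + 10 * dfs.heartbeat.interval, which with the defaults (300 s recheck, 3 s heartbeat) gives the 10.5 minutes Nicolas mentions. A quick sketch of the arithmetic (standard HDFS formula; shrinking the recheck interval is the usual knob):

```python
def dead_node_interval(recheck_interval_s=300, heartbeat_s=3):
    """HDFS dead-node detection time in seconds:
    2 * dfs.namenode.heartbeat.recheck-interval + 10 * dfs.heartbeat.interval."""
    return 2 * recheck_interval_s + 10 * heartbeat_s

# Defaults: 2*300 + 10*3 = 630 s = 10.5 minutes.
print(dead_node_interval() / 60)

# Lowering the recheck interval to 60 s brings detection down to
# 2*60 + 30 = 150 s (2.5 minutes).
print(dead_node_interval(recheck_interval_s=60) / 60)
```

Note this only covers HDFS's view of the node; HBase decides to reassign regions based on the ZooKeeper session timeout, which is a separate setting.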
>> Thanks a lot!
>>
>> Dejan
>>
>> On Tue, Mar 17, 2015 at 9:37 AM Dejan Menges <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> To be very honest, there's no particular reason why we stick to this
>>> one, besides just a lack of time currently to go through the upgrade
>>> process, but it looks to me like that's going to be the next step.
>>>
>>> I had a crazy day and didn't have time to go through all the logs again;
>>> plus, one of the machines (the last one where we had this issue) was
>>> fully reprovisioned yesterday, so I don't have logs from there anymore.
>>>
>>> Besides upgrading, which I will talk about today, can you just point me
>>> to the specific RPC issue in 0.98.0?
>>> The thing is that we have some strange moments with RPC in this case,
>>> and I just want to see if it's the same thing (we were even suspecting
>>> RPC).
>>>
>>> Thanks a lot!
>>> Dejan
>>>
>>> On Mon, Mar 16, 2015 at 9:32 PM, Andrew Purtell <[email protected]> wrote:
>>>
>>>> Is there a particular reason why you are using HBase 0.98.0? The latest
>>>> 0.98 release is 0.98.11. There's a known performance issue with 0.98.0
>>>> pertaining to RPC that was fixed in later releases; you should move up
>>>> from 0.98.0. In addition, hundreds of improvements and bug fixes have
>>>> gone into the ten releases since 0.98.0.
>>>> On Mon, Mar 16, 2015 at 6:40 AM, Dejan Menges <[email protected]> wrote:
>>>>
>>>>> Hi All,
>>>>>
>>>>> We have a strange issue with HBase performance (overall cluster
>>>>> performance) in case one of the datanodes in the cluster unexpectedly
>>>>> goes down.
>>>>>
>>>>> The scenario is as follows:
>>>>> - The cluster works fine; it's stable.
>>>>> - One DataNode unexpectedly goes down (PSU issue, network issue,
>>>>>   anything).
>>>>> - The whole HBase cluster goes down (performance becomes so bad that
>>>>>   we have to restart all RegionServers to get it back to life).
>>>>> The funniest and latest issue that happened was that we added a new
>>>>> node to the cluster (having 8 x 4T SATA disks) and left just the
>>>>> DataNode running on it, to give it a couple of days to get some data.
>>>>> At some point, due to a hardware issue, the server rebooted (twice
>>>>> during three hours) at a moment when it had maybe 5% of the data it
>>>>> would have in a couple of days. Nothing else besides the DataNode was
>>>>> running, and once it went down it affected literally everything;
>>>>> restarting the RegionServers in the end fixed it.
>> >> > > > > > > > > > >> > > > >> > >> >> > > > > > > > > > >> > > > >> > We are using HBase 0.98.0 with Hadoop >> >> 2.4.0 >> >> > > > > > > > > > >> > > > >> > >> >> > > > > > > > > > >> > > > >> >> >> > > > > > > > > > >> > > > >> >> >> > > > > > > > > > >> > > > >> >> >> > > > > > > > > > >> > > > >> -- >> >> > > > > > > > > > >> > > > >> Best regards, >> >> > > > > > > > > > >> > > > >> >> >> > > > > > > > > > >> > > > >> - Andy >> >> > > > > > > > > > >> > > > >> >> >> > > > > > > > > > >> > > > >> Problems worthy of attack prove their >> worth >> >> by >> >> > > > > hitting >> >> > > > > > > > back. >> >> > > > > > > > > - >> >> > > > > > > > > > >> Piet >> >> > > > > > > > > > >> > > Hein >> >> > > > > > > > > > >> > > > >> (via Tom White) >> >> > > > > > > > > > >> > > > >> >> >> > > > > > > > > > >> > > > > >> >> > > > > > > > > > >> > > > > >> >> > > > > > > > > > >> > > > >> >> > > > > > > > > > >> > > >> >> > > > > > > > > > >> > >> >> > > > > > > > > > >> >> >> > > > > > > > > > > >> >> > > > > > > > > > >> >> > > > > > > > > >> >> > > > > > > > >> >> > > > > > > >> >> > > > > > >> >> > > > > >> >> > > > >> >> > > >> >> > >> >> >> > >> > >> >
