Hi Asaf,

Thanks for the info. I tried this, but it didn't work for me (the region servers never shut down). Any idea how long it should take to pick it up? I let it sit several minutes, and all I saw in the RS logs was:

2013-08-08 13:41:55,303 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Attempting connect to Master server at hmaster.sea01.staging.tdb.com,60000,1375899835329
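In case it matters, the stop/watch sequence I'm describing is roughly the following (just a sketch; the install path and log path are placeholders, not necessarily what we actually have):

    # on the master node: stop only the HMaster; the RSs are supposed to notice and shut themselves down
    $HBASE_HOME/bin/hbase-daemon.sh stop master

    # on one of the region servers: watch its log for the shutdown (log location is a placeholder)
    tail -f /var/log/hbase/hbase-*-regionserver-*.log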
Do you know what flag it sets in ZK? I do have a node at /hbase/shutdown, but its ctime/mtime are both Oct 23, 2012. Is it possible that's not getting updated, but should be? (The sort of check I mean is in the P.S. below the quoted thread.)

Thanks,
Patrick

On Wed, Aug 7, 2013 at 12:36 AM, Asaf Mesika <[email protected]> wrote:
> Yep. That's a confusing one.
> When running /hbase stop master, it sets the shutdown flag in ZK. The RSs listen
> in on this flag, and once they see it set, they shut themselves down. Once
> they are all down, the master goes down as well.
>
> On Saturday, August 3, 2013, Jean-Daniel Cryans wrote:
> >
> > Ah, then doing "bin/hbase-daemon.sh stop master" on the master node is
> > the equivalent, but don't stop the region servers themselves, as the
> > master will take care of it. Doing a stop on the master and the region
> > servers will screw things up.
> >
> > J-D
> >
> > On Fri, Aug 2, 2013 at 3:28 PM, Patrick Schless <[email protected]> wrote:
> > > Doesn't stop-hbase.sh (and its ilk) require the server to be able to manage
> > > the clients (using unpassworded SSH keys, for instance)? I don't have that
> > > set up (for security reasons). I use Capistrano for all these sorts of
> > > coordination tasks.
> > >
> > > On Fri, Aug 2, 2013 at 12:07 PM, Jean-Daniel Cryans <[email protected]> wrote:
> > > > Doing a bin/stop-hbase.sh is the way to go, then on the Hadoop side
> > > > you do stop-all.sh. I think your ordering is correct, but I'm not sure
> > > > you are using the right commands.
> > > >
> > > > J-D
> > > >
> > > > On Fri, Aug 2, 2013 at 8:27 AM, Patrick Schless <[email protected]> wrote:
> > > > > Ah, I bet the issue is that I'm stopping the HMaster, but not the Region
> > > > > Servers, then restarting HDFS. What's the correct order of operations for
> > > > > bouncing everything?
> > > > >
> > > > > On Thu, Aug 1, 2013 at 5:21 PM, Jean-Daniel Cryans <[email protected]> wrote:
> > > > > > Can you follow the life of one of those blocks through the Namenode and
> > > > > > datanode logs? I'd suggest you start by doing an fsck on one of those
> > > > > > files with the option that gives the block locations first.
> > > > > >
> > > > > > By the way, why do you have split logs? Are region servers dying every
> > > > > > time you try out something?
> > > > > >
> > > > > > On Thu, Aug 1, 2013 at 3:16 PM, Patrick Schless <[email protected]> wrote:
> > > > > > > Yup, 14 datanodes, all check back in. However, all of the corrupt files
> > > > > > > seem to be splitlogs from data05. This is true even though I've done
> > > > > > > several restarts (each restart adding a few missing blocks). There's
> > > > > > > nothing special about data05, and it seems to be in the cluster, the
> > > > > > > same as anyone else.
> > > > > > >
> > > > > > > On Thu, Aug 1, 2013 at 5:04 PM, Jean-Daniel Cryans <[email protected]> wrote:
> > > > > > > > I can't think of a way your missing blocks would be related to
> > > > > > > > HBase replication; there's something else going on. Are all the
> > > > > > > > datanodes checking back in?
> > > > > > > >
> > > > > > > > J-D
> > > > > > > >
> > > > > > > > On Thu, Aug 1, 2013 at 2:17 PM, Patrick Schless <[email protected]> wrote:
> > > > > > > > > I'm running:
> > > > > > > > > CDH4.1.2
> > > > > > > > > HBase 0.92.1
> > > > > > > > > Hadoop 2.0.0
> > > > > > > > >
> > > > > > > > > Is there an issue with restarting a standby cluster with replication
> > > > > > > > > running? I am doing the following on the standby cluster:
> > > > > > > > >
> > > > > > > > > - stop hmaster
> > > > > > > > > - stop name_node
> > > > > > > > > - start name_node
> > > > > > > > > - start hmaster
> > > > > > > > >
> > > > > > > > > When the name node comes back up, it's reliably missing blocks. I
> > > > > > > > > started with 0 missing blocks, and have run through this scenario a
> > > > > > > > > few times, and am up to 46 missing blocks, all from the table that is
> > > > > > > > > the standby for our production table (in a different datacenter). The
> > > > > > > > > missing blocks are all from the same table, and look like:
> > > > > > > > >
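P.S. For anyone following along, the checks I have in mind are roughly these (a sketch; the ZK host/port and the HDFS path below are placeholders, not our real ones):

    # check the cluster-up/shutdown znode and its mtime from the ZooKeeper CLI
    zkCli.sh -server zk01.example.com:2181 stat /hbase/shutdown

    # fsck one of the affected files with block locations, per J-D's suggestion
    # (run as a user that can read the path, e.g. the hdfs superuser on CDH)
    sudo -u hdfs hdfs fsck /hbase/path/to/affected/file -files -blocks -locations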
