Hi Asaf,

Thanks for the info. I tried this, but it didn't work for me (the region servers never shut down). Any idea how long it should take to pick it up? I let it sit several minutes, and all I saw in the RS logs was:

2013-08-08 13:41:55,303 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Attempting connect to Master server at hmaster.sea01.staging.tdb.com,60000,1375899835329
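In case it matters, the stop/watch sequence I'm describing is roughly the following (just a sketch; the install path and log path are placeholders, not necessarily what we actually have):

    # on the master node: stop only the HMaster; the RSs are supposed to notice and shut themselves down
    $HBASE_HOME/bin/hbase-daemon.sh stop master

    # on one of the region servers: watch its log for the shutdown (log location is a placeholder)
    tail -f /var/log/hbase/hbase-*-regionserver-*.log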
Do you know what flag it sets in ZK? I do have a node at /hbase/shutdown, but its ctime/mtime are both Oct 23, 2012. Is it possible that's not getting updated, but should be? (The sort of check I mean is in the P.S. below the quoted thread.)

Thanks,
Patrick

On Wed, Aug 7, 2013 at 12:36 AM, Asaf Mesika <[email protected]> wrote:
> Yep. That's a confusing one.
> When running /hbase stop master, it sets the shutdown flag in ZK. The RSs listen
> in on this flag, and once they see it set, they shut themselves down. Once
> they are all down, the master goes down as well.
>
> On Saturday, August 3, 2013, Jean-Daniel Cryans wrote:
> >
> > Ah, then doing "bin/hbase-daemon.sh stop master" on the master node is
> > the equivalent, but don't stop the region servers themselves, as the
> > master will take care of it. Doing a stop on the master and the region
> > servers will screw things up.
> >
> > J-D
> >
> > On Fri, Aug 2, 2013 at 3:28 PM, Patrick Schless <[email protected]> wrote:
> > > Doesn't stop-hbase.sh (and its ilk) require the server to be able to manage
> > > the clients (using unpassworded SSH keys, for instance)? I don't have that
> > > set up (for security reasons). I use Capistrano for all these sorts of
> > > coordination tasks.
> > >
> > > On Fri, Aug 2, 2013 at 12:07 PM, Jean-Daniel Cryans <[email protected]> wrote:
> > > > Doing a bin/stop-hbase.sh is the way to go, then on the Hadoop side
> > > > you do stop-all.sh. I think your ordering is correct, but I'm not sure
> > > > you are using the right commands.
> > > >
> > > > J-D
> > > >
> > > > On Fri, Aug 2, 2013 at 8:27 AM, Patrick Schless <[email protected]> wrote:
> > > > > Ah, I bet the issue is that I'm stopping the HMaster, but not the Region
> > > > > Servers, then restarting HDFS. What's the correct order of operations for
> > > > > bouncing everything?
> > > > >
> > > > > On Thu, Aug 1, 2013 at 5:21 PM, Jean-Daniel Cryans <[email protected]> wrote:
> > > > > > Can you follow the life of one of those blocks through the Namenode and
> > > > > > datanode logs? I'd suggest you start by doing an fsck on one of those
> > > > > > files with the option that gives the block locations first.
> > > > > >
> > > > > > By the way, why do you have split logs? Are region servers dying every
> > > > > > time you try out something?
> > > > > >
> > > > > > On Thu, Aug 1, 2013 at 3:16 PM, Patrick Schless <[email protected]> wrote:
> > > > > > > Yup, 14 datanodes, all check back in. However, all of the corrupt files
> > > > > > > seem to be splitlogs from data05. This is true even though I've done
> > > > > > > several restarts (each restart adding a few missing blocks). There's
> > > > > > > nothing special about data05, and it seems to be in the cluster, the
> > > > > > > same as anyone else.
> > > > > > >
> > > > > > > On Thu, Aug 1, 2013 at 5:04 PM, Jean-Daniel Cryans <[email protected]> wrote:
> > > > > > > > I can't think of a way your missing blocks would be related to
> > > > > > > > HBase replication; there's something else going on. Are all the
> > > > > > > > datanodes checking back in?
> > > > > > > >
> > > > > > > > J-D
> > > > > > > >
> > > > > > > > On Thu, Aug 1, 2013 at 2:17 PM, Patrick Schless <[email protected]> wrote:
> > > > > > > > > I'm running:
> > > > > > > > > CDH4.1.2
> > > > > > > > > HBase 0.92.1
> > > > > > > > > Hadoop 2.0.0
> > > > > > > > >
> > > > > > > > > Is there an issue with restarting a standby cluster with replication
> > > > > > > > > running? I am doing the following on the standby cluster:
> > > > > > > > >
> > > > > > > > > - stop hmaster
> > > > > > > > > - stop name_node
> > > > > > > > > - start name_node
> > > > > > > > > - start hmaster
> > > > > > > > >
> > > > > > > > > When the name node comes back up, it's reliably missing blocks. I
> > > > > > > > > started with 0 missing blocks, and have run through this scenario a
> > > > > > > > > few times, and am up to 46 missing blocks, all from the table that is
> > > > > > > > > the standby for our production table (in a different datacenter). The
> > > > > > > > > missing blocks are all from the same table, and look like:
> > > > > > > > >
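P.S. For anyone following along, the checks I have in mind are roughly these (a sketch; the ZK host/port and the HDFS path below are placeholders, not our real ones):

    # check the cluster-up/shutdown znode and its mtime from the ZooKeeper CLI
    zkCli.sh -server zk01.example.com:2181 stat /hbase/shutdown

    # fsck one of the affected files with block locations, per J-D's suggestion
    # (run as a user that can read the path, e.g. the hdfs superuser on CDH)
    sudo -u hdfs hdfs fsck /hbase/path/to/affected/file -files -blocks -locations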
