Doesn't stop-hbase.sh (and its ilk) require the server to be able to manage the clients (using passwordless SSH keys, for instance)? I don't have that set up (for security reasons). I use capistrano for all these sorts of coordination tasks.
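For what it's worth, one way to avoid the master-to-node SSH requirement is to run the per-daemon control script on each host yourself (e.g. from capistrano). This is a sketch, not a tested recipe; it assumes a standard HBase layout where `hbase-daemon.sh` lives under `$HBASE_HOME/bin` (adjust for CDH packaging):

```shell
# On each region server host (run via capistrano or similar),
# stop the local region server daemon -- no SSH from the master needed:
$HBASE_HOME/bin/hbase-daemon.sh stop regionserver

# Then, on the master host, stop the master daemon:
$HBASE_HOME/bin/hbase-daemon.sh stop master
```

These commands need a live cluster, so there is no runnable test here; the point is just that stop-hbase.sh is a convenience wrapper around per-host daemon stops, not the only way to get a clean shutdown.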
On Fri, Aug 2, 2013 at 12:07 PM, Jean-Daniel Cryans <[email protected]> wrote:

> Doing a bin/stop-hbase.sh is the way to go, then on the Hadoop side
> you do stop-all.sh. I think your ordering is correct but I'm not sure
> you are using the right commands.
>
> J-D
>
> On Fri, Aug 2, 2013 at 8:27 AM, Patrick Schless <[email protected]> wrote:
> > Ah, I bet the issue is that I've stopped the HMaster, but not the Region
> > Servers, then restarted HDFS. What's the correct order of operations for
> > bouncing everything?
> >
> > On Thu, Aug 1, 2013 at 5:21 PM, Jean-Daniel Cryans <[email protected]> wrote:
> >> Can you follow the life of one of those blocks through the Namenode and
> >> datanode logs? I'd suggest you start by doing an fsck on one of those
> >> files with the option that gives the block locations first.
> >>
> >> By the way, why do you have split logs? Are region servers dying every
> >> time you try out something?
> >>
> >> On Thu, Aug 1, 2013 at 3:16 PM, Patrick Schless <[email protected]> wrote:
> >> > Yup, 14 datanodes, all check back in. However, all of the corrupt files
> >> > seem to be split logs from data05. This is true even though I've done
> >> > several restarts (each restart adding a few missing blocks). There's
> >> > nothing special about data05, and it seems to be in the cluster, the
> >> > same as anyone else.
> >> >
> >> > On Thu, Aug 1, 2013 at 5:04 PM, Jean-Daniel Cryans <[email protected]> wrote:
> >> >> I can't think of a way your missing blocks would be related to
> >> >> HBase replication; there's something else going on. Are all the
> >> >> datanodes checking back in?
> >> >>
> >> >> J-D
> >> >>
> >> >> On Thu, Aug 1, 2013 at 2:17 PM, Patrick Schless <[email protected]> wrote:
> >> >> > I'm running:
> >> >> > CDH4.1.2
> >> >> > HBase 0.92.1
> >> >> > Hadoop 2.0.0
> >> >> >
> >> >> > Is there an issue with restarting a standby cluster with replication
> >> >> > running?
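J-D's suggested ordering amounts to something like the following. This is a sketch assuming a typical tarball-style install with `$HBASE_HOME` and `$HADOOP_HOME` set; CDH packaging may put these scripts elsewhere (e.g. under service wrappers):

```shell
# Shut down HBase first: stop-hbase.sh stops the master, which in turn
# coordinates a clean shutdown of the region servers (so their WALs are
# closed properly and no split logs are left behind).
$HBASE_HOME/bin/stop-hbase.sh

# Only then take down Hadoop. On Hadoop of this era, stop-all.sh stops
# HDFS (namenode + datanodes) and MapReduce together.
$HADOOP_HOME/bin/stop-all.sh

# Bring everything back in the reverse order: HDFS before HBase.
$HADOOP_HOME/bin/start-all.sh
$HBASE_HOME/bin/start-hbase.sh
```

The key point is the ordering: stopping HDFS out from under live region servers (as in the scenario above, where only the HMaster was stopped) can leave their WALs unclosed, which is exactly what produces split-log recovery files on restart.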
> >> >> > I am doing the following on the standby cluster:
> >> >> >
> >> >> > - stop hmaster
> >> >> > - stop name_node
> >> >> > - start name_node
> >> >> > - start hmaster
> >> >> >
> >> >> > When the name node comes back up, it's reliably missing blocks. I
> >> >> > started with 0 missing blocks, and have run through this scenario a
> >> >> > few times, and am up to 46 missing blocks, all from the table that is
> >> >> > the standby for our production table (in a different datacenter). The
> >> >> > missing blocks all are from the same table, and look like:
> >> >> >
> >> >> > blk_-2036986832155369224 /hbase/splitlog/data01.sea01.staging.tdb.com,60020,1372703317824_hdfs%3A%2F%2Fname-node.sea01.staging.tdb.com%3A8020%2Fhbase%2F.logs%2Fdata05.sea01.staging.tdb.com%2C60020%2C1373557074890-splitting%2Fdata05.sea01.staging.tdb.com%252C60020%252C1373557074890.1374960698485/tempodb-data/c9cdd64af0bfed70da154c219c69d62d/recovered.edits/0000000001366319450.temp
> >> >> >
> >> >> > Do I have to stop replication before restarting the standby?
> >> >> >
> >> >> > Thanks,
> >> >> > Patrick
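The fsck J-D suggests (to trace block locations for the corrupt files) would look something like this on CDH4 / Hadoop 2.x; on older installs the command is `hadoop fsck` instead of `hdfs fsck`, and the path here is just an illustrative target, not one taken from the thread:

```shell
# Show per-file block IDs and which datanodes hold each replica,
# for everything under the split-log directory:
hdfs fsck /hbase/splitlog -files -blocks -locations

# Or just enumerate the files with corrupt/missing blocks under /hbase:
hdfs fsck /hbase -list-corruptfileblocks
```

Following one of the reported block IDs (e.g. blk_-2036986832155369224) through the namenode and datanode logs from there should show whether the block was ever fully written, or whether it was lost when the datanodes went down mid-write.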
