The first thing I'd check is whether your configuration has dfs.support.append set. You can confirm this by looking at your region server logs. When a RS starts, it creates an HLog and prints "Using syncFs -- HDFS-200" if it's configured; otherwise you'll see "syncFs -- HDFS-200 -- not available, dfs.support.append=false". The master web UI (on port 60010) will also show an error message if it isn't set.
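If you'd rather not eyeball the logs on every node, something like the following rough sketch could do the check for you. It only looks for the two messages quoted above; the function name and the idea of passing the log path on the command line are my own assumptions, not anything HBase ships.

```python
import sys
from typing import Iterable, Optional

# The two messages a region server prints when it creates its HLog
# (strings taken from the paragraph above).
APPEND_ON = "Using syncFs -- HDFS-200"
APPEND_OFF = "syncFs -- HDFS-200 -- not available, dfs.support.append=false"


def append_enabled(log_lines: Iterable[str]) -> Optional[bool]:
    """Hypothetical helper: report what the most recent syncFs message says.

    Returns True if the last match says syncFs is in use, False if it says
    it is not available, and None if neither message appears at all.
    """
    result = None
    for line in log_lines:
        if APPEND_OFF in line:
            result = False
        elif APPEND_ON in line:
            result = True
    return result


if __name__ == "__main__" and len(sys.argv) > 1:
    # sys.argv[1] is assumed to be the path to one region server log file.
    with open(sys.argv[1], errors="replace") as log_file:
        state = append_enabled(log_file)
    labels = {
        True: "append enabled",
        False: "append NOT enabled",
        None: "no syncFs message found",
    }
    print(labels[state])
```

Run it once per region server log. The last matching line wins, so a restart after a config change is reflected.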
If it's all ok, then you should take a look at the master log when it does the log splitting and see if it contains any obvious errors.

J-D

On Thu, Oct 28, 2010 at 12:58 AM, Erdem Agaoglu <[email protected]> wrote:
> Hi all,
>
> We have a testing cluster of 6 nodes which we try to run an HBase/MapReduce
> application on. In order to simulate a power failure we kill -9ed all things
> hadoop related on one of the slave nodes (DataNode, RegionServer,
> TaskTracker, ZK quorum peer and i think SecondaryNameNode was on this node
> too). We were expecting a smooth transition on all services but were unable
> to get on HBase end. While our regions seemed intact (not confirmed), we
> lost table definitions that pointed some kind of META region fail. So our
> application failed with several TableNotFoundExceptions. Simulation was
> conducted with no-load and extremely small data (like 10 rows in 3 tables).
>
> On our setup, HBase is 0.89.20100924, r1001068 while Hadoop
> runs 0.20.3-append-r964955-1240, r960957. Most of the configuration
> parameters are in default.
>
> If we did something wrong up to this point, please ignore the rest of the
> message as i'll try to explain what we did to reproduce it and might be
> irrelevant.
>
> Say the machines are named A, B, C, D, E, F; where A is master-like node,
> others are slaves and power fail is on F. Since we have little data, we have
> one ROOT and only one META region. I'll try to sum up the whole scenario.
>
> A: NN, DN, JT, TT, HM, RS
> B: DN, TT, RS, ZK
> C: DN, TT, RS, ZK
> D: DN, TT, RS, ZK
> E: DN, TT, RS, ZK
> F: SNN, DN, TT, RS, ZK
>
> 0. Initial state -> ROOT: F, META: A
> 1. Power fail on F -> ROOT: C, META: E -> lost tables, waited for about half
> an hour to get nothing BTW
> 2. Put F back online -> No effect
> 3. Create a table 'testtable' to see if we lose it
> 4. Kill -9ed DataNode on F -> No effect -> Start it again
> 5. Kill -9ed RegionServer on F -> No effect -> Start it again
> 6. Kill -9ed RegionServer on E -> ROOT: C, META: A -> We lost 'testtable'
> but get our tables from before the simulation. It seemed like because A had
> META before the simulation, the table definitions were revived.
> 7. Restarted the whole cluster -> ROOT: A, META: F -> We lost 2 out of our
> original 6 tables, 'testtable' revived. That small data seems corrupted too
> as our Scans don't finish.
> 8. Run to mailing-list.
>
> First of all thanks for reading up to this point. From what we are now, we
> are not even sure if this is the expected behavior, like if ROOT or META
> region dies we lose data and must do sth like hbck, or if we are missing a
> configuration, or if this is a bug. No need to mention that we are
> relatively new to HBase so the last possibility is that if we didn't
> understand it at all.
>
> Thanks in advance for any ideas.
>
> --
> erdem agaoglu
>
