Thanks for the answer. I'm pretty sure we have dfs.support.append enabled; I remember it in both the conf file and the logs, and I don't recall seeing any errors on 60010. I crawled through the logs all day yesterday and don't remember anything indicating a specific error either, but I'm not sure about that. Let me check and get back here on Monday.
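For the record, this is roughly what I plan to run on Monday. The conf and log paths below are guesses based on a typical 0.20-style layout; ours may differ:

    # Confirm the property is actually set in the HDFS config
    grep -B1 -A2 'dfs.support.append' $HADOOP_HOME/conf/hdfs-site.xml

    # Look for the syncFs line J-D mentioned in the region server startup logs
    grep 'syncFs' $HBASE_HOME/logs/hbase-*-regionserver-*.log

    # And scan the master log around the log splitting for anything suspicious
    grep -i -E 'split|error|exception' $HBASE_HOME/logs/hbase-*-master-*.log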
On Thu, Oct 28, 2010 at 7:30 PM, Jean-Daniel Cryans <[email protected]> wrote:

> First thing I'd check is if your configuration has dfs.support.append;
> you can confirm this by looking at your region server logs. When a RS
> starts, it creates an HLog and will print out: "Using syncFs --
> HDFS-200" if it's configured, else you'll see "syncFs -- HDFS-200 --
> not available, dfs.support.append=false". Also, the master web UI (on
> port 60010) will print an error message regarding that.
>
> If that's all OK, then you should take a look at the master log when it
> does the log splitting and see if it contains any obvious errors.
>
> J-D
>
> On Thu, Oct 28, 2010 at 12:58 AM, Erdem Agaoglu <[email protected]> wrote:
> > Hi all,
> >
> > We have a testing cluster of 6 nodes on which we try to run an
> > HBase/MapReduce application. In order to simulate a power failure, we
> > kill -9ed everything Hadoop-related on one of the slave nodes (DataNode,
> > RegionServer, TaskTracker, ZK quorum peer, and I think SecondaryNameNode
> > was on this node too). We were expecting a smooth transition on all
> > services but were unable to get one on the HBase end. While our regions
> > seemed intact (not confirmed), we lost table definitions, which pointed
> > to some kind of META region failure. So our application failed with
> > several TableNotFoundExceptions. The simulation was conducted with no
> > load and extremely small data (like 10 rows in 3 tables).
> >
> > On our setup, HBase is 0.89.20100924, r1001068, while Hadoop runs
> > 0.20.3-append-r964955-1240, r960957. Most of the configuration
> > parameters are at their defaults.
> >
> > If we did something wrong up to this point, please ignore the rest of
> > the message, as I'll try to explain what we did to reproduce it, and it
> > might be irrelevant.
> >
> > Say the machines are named A, B, C, D, E, F; where A is the master-like
> > node, the others are slaves, and the power failure is on F. Since we
> > have little data, we have one ROOT and only one META region. I'll try
> > to sum up the whole scenario.
> >
> > A: NN, DN, JT, TT, HM, RS
> > B: DN, TT, RS, ZK
> > C: DN, TT, RS, ZK
> > D: DN, TT, RS, ZK
> > E: DN, TT, RS, ZK
> > F: SNN, DN, TT, RS, ZK
> >
> > 0. Initial state -> ROOT: F, META: A
> > 1. Power fail on F -> ROOT: C, META: E -> lost tables; we waited for
> > about half an hour to no effect, BTW
> > 2. Put F back online -> No effect
> > 3. Created a table 'testtable' to see if we lose it
> > 4. Kill -9ed DataNode on F -> No effect -> Started it again
> > 5. Kill -9ed RegionServer on F -> No effect -> Started it again
> > 6. Kill -9ed RegionServer on E -> ROOT: C, META: A -> We lost
> > 'testtable' but got our tables back from before the simulation. It
> > seemed like because A had META before the simulation, the table
> > definitions were revived.
> > 7. Restarted the whole cluster -> ROOT: A, META: F -> We lost 2 out of
> > our original 6 tables; 'testtable' revived. That small data seems
> > corrupted too, as our Scans don't finish.
> > 8. Ran to the mailing list.
> >
> > First of all, thanks for reading up to this point. From where we stand
> > now, we are not even sure if this is the expected behavior, i.e. if the
> > ROOT or META region dies we lose data and must do something like hbck,
> > or if we are missing a configuration option, or if this is a bug.
> > Needless to say, we are relatively new to HBase, so the last
> > possibility is that we didn't understand it at all.
> >
> > Thanks in advance for any ideas.
> >
> > --
> > erdem agaoglu

--
erdem agaoglu
