Hi again. I have re-checked our configuration and confirmed that dfs.support.append is enabled; the RegionServer logs do show "Using syncFs -- HDFS-200".
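For completeness, the flag can also be read programmatically. The sketch below is only illustrative: the hdfs-site.xml path is specific to our installation, and it reports what a client launched with that configuration would see rather than what the running daemons actually loaded.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

public class CheckAppendSupport {
    public static void main(String[] args) {
        // Picks up core-site.xml etc. from the classpath; hdfs-site.xml is
        // added explicitly here (the path is specific to our installation).
        Configuration conf = new Configuration();
        conf.addResource(new Path("/etc/hadoop/conf/hdfs-site.xml"));

        // false is only the fallback used when the property is absent.
        boolean appendSupported = conf.getBoolean("dfs.support.append", false);
        System.out.println("dfs.support.append = " + appendSupported);
    }
}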
I also went through the logs around the log splits, though I'm not sure that is what we are looking for. In the first step of the scenario I mentioned before (where we kill -9ed everything on the node that hosts the ROOT region), HLog says the following (hdfs:// prefixes and hostnames stripped for clarity):

# Splitting 7 hlog(s) in .logs/F,60020,1287491528908

Then it goes over every single one:

# Splitting hlog 1 of 7
# Splitting hlog 2 of 7
# ...
# Splitting hlog 7 of 7

On the 7th hlog it WARNs with two lines:

# File .logs/F,60020,1287491528908/10.1.10.229%3A60020.1288021443546 might be still open, length is 0
# Could not open .logs/F,60020,1287491528908/10.1.10.229%3A60020.1288021443546 for reading. File is empty java.io.EOFException

and completes with:

# log file splitting completed in 80372 millis for .logs/F,60020,1287491528908

This might be it, but in the sixth step (where we kill -9ed the RegionServer that hosts the only META region), it splits 2 hlogs without any empty-file problems and without logging anything above INFO, yet as I said before, our testtable still got lost. I'll try to reproduce the problem in a cleaner way, but in the meantime any pointers to problems we might have are greatly appreciated.
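In case it helps to be concrete, something like the sketch below can list which hlogs under the dead server's directory are zero-length at split time. It is only a rough illustration: the /hbase root is the default we assume, the server directory name is taken from the split above, and the filesystem comes from whatever configuration is on the classpath. Also, the length reported for a file that was still open for write may be stale, which is presumably why the split saw length 0 on the last hlog.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListHlogLengths {
    public static void main(String[] args) throws Exception {
        // Uses the default filesystem from the configuration on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Write-ahead log directory of the dead region server; the server
        // name is the one from the split above, the /hbase root is assumed.
        Path logDir = new Path("/hbase/.logs/F,60020,1287491528908");

        for (FileStatus stat : fs.listStatus(logDir)) {
            // A zero-length hlog here is likely one that was still open
            // (never closed or synced) when the process was killed.
            System.out.println(stat.getPath().getName() + " : " + stat.getLen() + " bytes");
        }
    }
}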
On Fri, Oct 29, 2010 at 9:25 AM, Erdem Agaoglu <[email protected]> wrote:

> Thanks for the answer.
>
> I am pretty sure we have dfs.support.append enabled. I remember both the
> conf file and the logs, and don't recall seeing any errors on 60010. I
> crawled through logs all yesterday but don't remember anything indicating a
> specific error too. But i'm not sure about that. Let me check that and get
> back here on monday.
>
> On Thu, Oct 28, 2010 at 7:30 PM, Jean-Daniel Cryans <[email protected]> wrote:
>
>> First thing I'd check is if your configuration has dfs.support.append,
>> you can confirm this by looking at your region server logs. When a RS
>> starts, it creates an HLog and will print out: "Using syncFs -- HDFS-200"
>> if it's configured, else you'll see "syncFs -- HDFS-200 -- not available,
>> dfs.support.append=false". Also the master web ui (on port 60010) will
>> print an error message regarding that.
>>
>> If it's all ok, then you should take a look at the master log when it
>> does the log splitting and see if it contains any obvious errors.
>>
>> J-D
>>
>> On Thu, Oct 28, 2010 at 12:58 AM, Erdem Agaoglu <[email protected]> wrote:
>> > Hi all,
>> >
>> > We have a testing cluster of 6 nodes which we try to run an HBase/MapReduce
>> > application on. In order to simulate a power failure we kill -9ed all things
>> > hadoop related on one of the slave nodes (DataNode, RegionServer,
>> > TaskTracker, ZK quorum peer and i think SecondaryNameNode was on this node
>> > too). We were expecting a smooth transition on all services but were unable
>> > to get on HBase end. While our regions seemed intact (not confirmed), we
>> > lost table definitions that pointed some kind of META region fail. So our
>> > application failed with several TableNotFoundExceptions. Simulation was
>> > conducted with no-load and extremely small data (like 10 rows in 3 tables).
>> >
>> > On our setup, HBase is 0.89.20100924, r1001068 while Hadoop
>> > runs 0.20.3-append-r964955-1240, r960957. Most of the configuration
>> > parameters are in default.
>> >
>> > If we did something wrong up to this point, please ignore the rest of the
>> > message as i'll try to explain what we did to reproduce it and might be
>> > irrelevant.
>> >
>> > Say the machines are named A, B, C, D, E, F; where A is master-like node,
>> > others are slaves and power fail is on F. Since we have little data, we have
>> > one ROOT and only one META region. I'll try to sum up the whole scenario.
>> >
>> > A: NN, DN, JT, TT, HM, RS
>> > B: DN, TT, RS, ZK
>> > C: DN, TT, RS, ZK
>> > D: DN, TT, RS, ZK
>> > E: DN, TT, RS, ZK
>> > F: SNN, DN, TT, RS, ZK
>> >
>> > 0. Initial state -> ROOT: F, META: A
>> > 1. Power fail on F -> ROOT: C, META: E -> lost tables, waited for about half
>> > an hour to get nothing BTW
>> > 2. Put F back online -> No effect
>> > 3. Create a table 'testtable' to see if we lose it
>> > 4. Kill -9ed DataNode on F -> No effect -> Start it again
>> > 5. Kill -9ed RegionServer on F -> No effect -> Start it again
>> > 6. Kill -9ed RegionServer on E -> ROOT: C, META: A -> We lost 'testtable'
>> > but get our tables from before the simulation. It seemed like because A had
>> > META before the simulation, the table definitions were revived.
>> > 7. Restarted the whole cluster -> ROOT: A, META: F -> We lost 2 out of our
>> > original 6 tables, 'testtable' revived. That small data seems corrupted too
>> > as our Scans don't finish.
>> > 8. Run to mailing-list.
>> >
>> > First of all thanks for reading up to this point. From what we are now, we
>> > are not even sure if this is the expected behavior, like if ROOT or META
>> > region dies we lose data and must do sth like hbck, or if we are missing a
>> > configuration, or if this is a bug. No need to mention that we are
>> > relatively new to HBase so the last possibility is that if we didn't
>> > understand it at all.
>> >
>> > Thanks in advance for any ideas.
>> >
>> > --
>> > erdem agaoglu
>> >
>>
>
> --
> erdem agaoglu
>

--
erdem agaoglu
