Re: Node failure causes weird META data?

Jean-Daniel Cryans Thu, 28 Oct 2010 09:31:25 -0700

The first thing I'd check is whether your configuration has
dfs.support.append enabled. You can confirm this by looking at your
region server logs: when a RS starts, it creates an HLog and prints
"Using syncFs -- HDFS-200" if append is configured; otherwise you'll see
"syncFs -- HDFS-200 -- not available, dfs.support.append=false". The
master web UI (on port 60010) will also show an error message about it.
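
The log check can be scripted. This is only a minimal sketch in Python: the
function name append_enabled is made up for illustration, the two marker
strings are copied from the messages quoted above, and it assumes you feed it
the lines of a region server log (the log location depends on your install).

```python
from typing import Iterable, Optional

# Substrings copied from the two HLog startup messages quoted above.
APPEND_ON = "Using syncFs -- HDFS-200"
APPEND_OFF = "syncFs -- HDFS-200 -- not available"


def append_enabled(lines: Iterable[str]) -> Optional[bool]:
    """Report what the most recent syncFs message in the log says.

    Returns True if the last matching line says syncFs is in use,
    False if it says syncFs is not available, and None if neither
    message appears (e.g. wrong file or the log was rotated).
    """
    result = None
    for line in lines:
        # Check the "not available" form first; it is the more specific one.
        if APPEND_OFF in line:
            result = False
        elif APPEND_ON in line:
            result = True
    return result
```

To use it, open the region server log, pass the file object to
append_enabled(), and treat a None result as "look at the log by hand".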


If that's all OK, then take a look at the master log from when it did
the log splitting and see whether it contains any obvious errors.

J-D

On Thu, Oct 28, 2010 at 12:58 AM, Erdem Agaoglu <[email protected]> wrote:
> Hi all,
>
> We have a testing cluster of 6 nodes on which we are trying to run an
> HBase/MapReduce application. To simulate a power failure, we ran kill -9 on
> everything Hadoop-related on one of the slave nodes (DataNode, RegionServer,
> TaskTracker, ZK quorum peer, and I think the SecondaryNameNode was on this
> node too). We were expecting a smooth transition on all services but did not
> get one on the HBase end. While our regions seemed intact (not confirmed), we
> lost table definitions, which points to some kind of META region failure. So
> our application failed with several TableNotFoundExceptions. The simulation
> was conducted with no load and extremely small data (about 10 rows in 3
> tables).
>
> On our setup, HBase is 0.89.20100924, r1001068 while Hadoop
> runs 0.20.3-append-r964955-1240, r960957. Most of the configuration
> parameters are at their defaults.
>
> If we did something wrong up to this point, please ignore the rest of the
> message, as I'll try to explain what we did to reproduce it, which might be
> irrelevant.
>
> Say the machines are named A, B, C, D, E, F, where A is the master-like
> node, the others are slaves, and the power failure is on F. Since we have
> little data, we have one ROOT and only one META region. I'll try to sum up
> the whole scenario.
>
> A: NN, DN, JT, TT, HM, RS
> B: DN, TT, RS, ZK
> C: DN, TT, RS, ZK
> D: DN, TT, RS, ZK
> E: DN, TT, RS, ZK
> F: SNN, DN, TT, RS, ZK
>
> 0. Initial state -> ROOT: F, META: A
> 1. Power fail on F -> ROOT: C, META: E -> lost tables; waited about half
> an hour and nothing changed
> 2. Put F back online -> No effect
> 3. Create a table 'testtable' to see if we lose it
> 4. Kill -9ed DataNode on F -> No effect -> Start it again
> 5. Kill -9ed RegionServer on F -> No effect -> Start it again
> 6. Kill -9ed RegionServer on E -> ROOT: C, META: A -> We lost 'testtable'
> but got back our tables from before the simulation. It seemed that because
> A had META before the simulation, the table definitions were revived.
> 7. Restarted the whole cluster -> ROOT: A, META: F -> We lost 2 out of our
> original 6 tables, 'testtable' revived. That small data seems corrupted too
> as our Scans don't finish.
> 8. Run to mailing-list.
>
> First of all, thanks for reading up to this point. From where we are now,
> we are not even sure whether this is the expected behavior (i.e. if the ROOT
> or META region dies we lose data and must do something like hbck), whether
> we are missing a configuration, or whether this is a bug. Needless to say,
> we are relatively new to HBase, so the last possibility is that we didn't
> understand it at all.
>
> Thanks in advance for any ideas.
>
> --
> erdem agaoglu
>
