Thanks for the answer. I'm pretty sure we have dfs.support.append enabled; I remember it in both the conf file and the logs, and I don't recall seeing any errors on 60010. I crawled through the logs all day yesterday and don't remember anything indicating a specific error either, but I'm not sure about that. Let me check and get back here on Monday.
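For the record, this is roughly what I plan to run on Monday. The conf and log paths below are guesses based on a typical 0.20-style layout; ours may differ:

    # Confirm the property is actually set in the HDFS config
    grep -B1 -A2 'dfs.support.append' $HADOOP_HOME/conf/hdfs-site.xml

    # Look for the syncFs line J-D mentioned in the region server startup logs
    grep 'syncFs' $HBASE_HOME/logs/hbase-*-regionserver-*.log

    # And scan the master log around the log splitting for anything suspicious
    grep -i -E 'split|error|exception' $HBASE_HOME/logs/hbase-*-master-*.log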
On Thu, Oct 28, 2010 at 7:30 PM, Jean-Daniel Cryans <[email protected]> wrote:

> First thing I'd check is if your configuration has dfs.support.append;
> you can confirm this by looking at your region server logs. When a RS
> starts, it creates an HLog and will print out: "Using syncFs --
> HDFS-200" if it's configured, else you'll see "syncFs -- HDFS-200 --
> not available, dfs.support.append=false". Also, the master web UI (on
> port 60010) will print an error message regarding that.
>
> If that's all OK, then you should take a look at the master log when it
> does the log splitting and see if it contains any obvious errors.
>
> J-D
>
> On Thu, Oct 28, 2010 at 12:58 AM, Erdem Agaoglu <[email protected]> wrote:
> > Hi all,
> >
> > We have a testing cluster of 6 nodes on which we try to run an
> > HBase/MapReduce application. In order to simulate a power failure, we
> > kill -9ed everything Hadoop-related on one of the slave nodes (DataNode,
> > RegionServer, TaskTracker, ZK quorum peer, and I think SecondaryNameNode
> > was on this node too). We were expecting a smooth transition on all
> > services but were unable to get one on the HBase end. While our regions
> > seemed intact (not confirmed), we lost table definitions, which pointed
> > to some kind of META region failure. So our application failed with
> > several TableNotFoundExceptions. The simulation was conducted with no
> > load and extremely small data (like 10 rows in 3 tables).
> >
> > On our setup, HBase is 0.89.20100924, r1001068, while Hadoop runs
> > 0.20.3-append-r964955-1240, r960957. Most of the configuration
> > parameters are at their defaults.
> >
> > If we did something wrong up to this point, please ignore the rest of
> > the message, as I'll try to explain what we did to reproduce it, and it
> > might be irrelevant.
> >
> > Say the machines are named A, B, C, D, E, F; where A is the master-like
> > node, the others are slaves, and the power failure is on F. Since we
> > have little data, we have one ROOT and only one META region. I'll try
> > to sum up the whole scenario.
> >
> > A: NN, DN, JT, TT, HM, RS
> > B: DN, TT, RS, ZK
> > C: DN, TT, RS, ZK
> > D: DN, TT, RS, ZK
> > E: DN, TT, RS, ZK
> > F: SNN, DN, TT, RS, ZK
> >
> > 0. Initial state -> ROOT: F, META: A
> > 1. Power fail on F -> ROOT: C, META: E -> lost tables; we waited for
> > about half an hour to no effect, BTW
> > 2. Put F back online -> No effect
> > 3. Created a table 'testtable' to see if we lose it
> > 4. Kill -9ed DataNode on F -> No effect -> Started it again
> > 5. Kill -9ed RegionServer on F -> No effect -> Started it again
> > 6. Kill -9ed RegionServer on E -> ROOT: C, META: A -> We lost
> > 'testtable' but got our tables back from before the simulation. It
> > seemed like because A had META before the simulation, the table
> > definitions were revived.
> > 7. Restarted the whole cluster -> ROOT: A, META: F -> We lost 2 out of
> > our original 6 tables; 'testtable' revived. That small data seems
> > corrupted too, as our Scans don't finish.
> > 8. Ran to the mailing list.
> >
> > First of all, thanks for reading up to this point. From where we stand
> > now, we are not even sure if this is the expected behavior, i.e. if the
> > ROOT or META region dies we lose data and must do something like hbck,
> > or if we are missing a configuration option, or if this is a bug.
> > Needless to say, we are relatively new to HBase, so the last
> > possibility is that we didn't understand it at all.
> >
> > Thanks in advance for any ideas.
> >
> > --
> > erdem agaoglu

--
erdem agaoglu
