Hi again. I have re-checked our configuration and confirmed that dfs.support.append is enabled; the RegionServer logs do show "Using syncFs -- HDFS-200".
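For completeness, the flag can also be read programmatically. The sketch below is only illustrative: the hdfs-site.xml path is specific to our installation, and it reports what a client launched with that configuration would see rather than what the running daemons actually loaded.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

public class CheckAppendSupport {
    public static void main(String[] args) {
        // Picks up core-site.xml etc. from the classpath; hdfs-site.xml is
        // added explicitly here (the path is specific to our installation).
        Configuration conf = new Configuration();
        conf.addResource(new Path("/etc/hadoop/conf/hdfs-site.xml"));

        // false is only the fallback used when the property is absent.
        boolean appendSupported = conf.getBoolean("dfs.support.append", false);
        System.out.println("dfs.support.append = " + appendSupported);
    }
}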
I also went through the logs around the log splits, though I'm not sure that is what we are looking for. In the first step of the scenario I mentioned before (where we kill -9ed everything on the node that hosts the ROOT region), HLog says the following (hdfs:// prefixes and hostnames stripped for clarity):

# Splitting 7 hlog(s) in .logs/F,60020,1287491528908

Then it goes over every single one:

# Splitting hlog 1 of 7
# Splitting hlog 2 of 7
# ...
# Splitting hlog 7 of 7

On the 7th hlog it WARNs with two lines:

# File .logs/F,60020,1287491528908/10.1.10.229%3A60020.1288021443546 might be still open, length is 0
# Could not open .logs/F,60020,1287491528908/10.1.10.229%3A60020.1288021443546 for reading. File is empty java.io.EOFException

and completes with:

# log file splitting completed in 80372 millis for .logs/F,60020,1287491528908

This might be it, but in the sixth step (where we kill -9ed the RegionServer that hosts the only META region), it splits 2 hlogs without any empty-file problems and without logging anything above INFO, yet as I said before, our testtable still got lost. I'll try to reproduce the problem in a cleaner way, but in the meantime any pointers to problems we might have are greatly appreciated.
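In case it helps to be concrete, something like the sketch below can list which hlogs under the dead server's directory are zero-length at split time. It is only a rough illustration: the /hbase root is the default we assume, the server directory name is taken from the split above, and the filesystem comes from whatever configuration is on the classpath. Also, the length reported for a file that was still open for write may be stale, which is presumably why the split saw length 0 on the last hlog.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListHlogLengths {
    public static void main(String[] args) throws Exception {
        // Uses the default filesystem from the configuration on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Write-ahead log directory of the dead region server; the server
        // name is the one from the split above, the /hbase root is assumed.
        Path logDir = new Path("/hbase/.logs/F,60020,1287491528908");

        for (FileStatus stat : fs.listStatus(logDir)) {
            // A zero-length hlog here is likely one that was still open
            // (never closed or synced) when the process was killed.
            System.out.println(stat.getPath().getName() + " : " + stat.getLen() + " bytes");
        }
    }
}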
On Fri, Oct 29, 2010 at 9:25 AM, Erdem Agaoglu <[email protected]> wrote:

> Thanks for the answer.
>
> I am pretty sure we have dfs.support.append enabled. I remember both the
> conf file and the logs, and don't recall seeing any errors on 60010. I
> crawled through logs all yesterday but don't remember anything indicating a
> specific error too. But i'm not sure about that. Let me check that and get
> back here on monday.
>
> On Thu, Oct 28, 2010 at 7:30 PM, Jean-Daniel Cryans <[email protected]> wrote:
>
>> First thing I'd check is if your configuration has dfs.support.append,
>> you can confirm this by looking at your region server logs. When a RS
>> starts, it creates an HLog and will print out: "Using syncFs -- HDFS-200"
>> if it's configured, else you'll see "syncFs -- HDFS-200 -- not available,
>> dfs.support.append=false". Also the master web ui (on port 60010) will
>> print an error message regarding that.
>>
>> If it's all ok, then you should take a look at the master log when it
>> does the log splitting and see if it contains any obvious errors.
>>
>> J-D
>>
>> On Thu, Oct 28, 2010 at 12:58 AM, Erdem Agaoglu <[email protected]> wrote:
>> > Hi all,
>> >
>> > We have a testing cluster of 6 nodes which we try to run an HBase/MapReduce
>> > application on. In order to simulate a power failure we kill -9ed all things
>> > hadoop related on one of the slave nodes (DataNode, RegionServer,
>> > TaskTracker, ZK quorum peer and i think SecondaryNameNode was on this node
>> > too). We were expecting a smooth transition on all services but were unable
>> > to get on HBase end. While our regions seemed intact (not confirmed), we
>> > lost table definitions that pointed some kind of META region fail. So our
>> > application failed with several TableNotFoundExceptions. Simulation was
>> > conducted with no-load and extremely small data (like 10 rows in 3 tables).
>> >
>> > On our setup, HBase is 0.89.20100924, r1001068 while Hadoop
>> > runs 0.20.3-append-r964955-1240, r960957. Most of the configuration
>> > parameters are in default.
>> >
>> > If we did something wrong up to this point, please ignore the rest of the
>> > message as i'll try to explain what we did to reproduce it and might be
>> > irrelevant.
>> >
>> > Say the machines are named A, B, C, D, E, F; where A is master-like node,
>> > others are slaves and power fail is on F. Since we have little data, we have
>> > one ROOT and only one META region. I'll try to sum up the whole scenario.
>> >
>> > A: NN, DN, JT, TT, HM, RS
>> > B: DN, TT, RS, ZK
>> > C: DN, TT, RS, ZK
>> > D: DN, TT, RS, ZK
>> > E: DN, TT, RS, ZK
>> > F: SNN, DN, TT, RS, ZK
>> >
>> > 0. Initial state -> ROOT: F, META: A
>> > 1. Power fail on F -> ROOT: C, META: E -> lost tables, waited for about half
>> > an hour to get nothing BTW
>> > 2. Put F back online -> No effect
>> > 3. Create a table 'testtable' to see if we lose it
>> > 4. Kill -9ed DataNode on F -> No effect -> Start it again
>> > 5. Kill -9ed RegionServer on F -> No effect -> Start it again
>> > 6. Kill -9ed RegionServer on E -> ROOT: C, META: A -> We lost 'testtable'
>> > but get our tables from before the simulation. It seemed like because A had
>> > META before the simulation, the table definitions were revived.
>> > 7. Restarted the whole cluster -> ROOT: A, META: F -> We lost 2 out of our
>> > original 6 tables, 'testtable' revived. That small data seems corrupted too
>> > as our Scans don't finish.
>> > 8. Run to mailing-list.
>> >
>> > First of all thanks for reading up to this point. From what we are now, we
>> > are not even sure if this is the expected behavior, like if ROOT or META
>> > region dies we lose data and must do sth like hbck, or if we are missing a
>> > configuration, or if this is a bug. No need to mention that we are
>> > relatively new to HBase so the last possibility is that if we didn't
>> > understand it at all.
>> >
>> > Thanks in advance for any ideas.
>> >
>> > --
>> > erdem agaoglu
>> >
>>
>
> --
> erdem agaoglu
>

--
erdem agaoglu
