Very interesting finding, Erdem! I tried it on a local empty instance with the roll period set to 6 seconds.
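In case anyone wants to reproduce: I shortened the roll period in hbase-site.xml (hbase.regionserver.logroll.period, if I recall the property name right; double-check it against your build), restarted the region server, and then watched the RS log and the WAL directories. Roughly, with the paths from my local test:

  # after setting the roll period to 6000 ms and restarting the RS:
  bin/hadoop fs -lsr /hbase-89-su/.logs     # the current (new, empty) hlog under the RS directory
  bin/hadoop fs -ls /hbase-89-su/.oldlogs   # the rolled hlog that still held the -ROOT- edit

Here's what I saw: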
org.apache.hadoop.hbase.regionserver.LogRoller: Hlog roll period 6000ms elapsed
org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter: Using syncFs -- HDFS-200
org.apache.hadoop.hbase.regionserver.wal.HLog: Roll /hbase-89-su/.logs/hbasedev,60020,1288977933643/10.10.1.177%3A60020.1288977933829, entries=1, filesize=295. New hlog /hbase-89-su/.logs/hbasedev,60020,1288977933643/10.10.1.177%3A60020.1288977943913
org.apache.hadoop.hbase.regionserver.wal.HLog: Found 1 hlogs to remove out of total 1; oldest outstanding sequenceid is 270055 from region -ROOT-,,0
org.apache.hadoop.hbase.regionserver.wal.HLog: moving old hlog file /hbase-89-su/.logs/hbasedev,60020,1288977933643/10.10.1.177%3A60020.1288977933829 whose highest sequenceid is 270054 to /hbase-89-su/.oldlogs/10.10.1.177%3A60020.1288977933829

This looks like a very bad bug; I opened https://issues.apache.org/jira/browse/HBASE-3198. Can you correlate this with the previous tests you did? Hopefully it's the same problem. Thanks a lot for your patience and investigative work!

J-D

On Fri, Nov 5, 2010 at 9:28 AM, Erdem Agaoglu <[email protected]> wrote:
> We have found the problem, again. But this time we dug deeper and extracted more info.
>
> First, the reason we were unable to reproduce it was that we tried to kill the RS that holds -ROOT- too soon. That way,
> 1. log split goes fine,
> 2. -ROOT- gets assigned somewhere else,
> 3. log replaying for that RS goes smoothly,
> 4. .META. does not change,
> 5. everything is normal.
>
> But if we try after the cluster has been running for a while,
> 1. log split fails with java.io.EOFException for the only hlog file,
> 2. -ROOT- gets assigned somewhere else,
> 3. no log is replayed since none was split,
> 4. .META. is invalid, a new .META. is designated,
> 5. data is corrupted.
>
> Before killing the RS that holds -ROOT-, we checked its only hlog file and saw that it contained the updates for the location of .META. in the first scenario. But in the second, failing scenario, the file was empty. We found that the log rolling process, which happens once an hour by default, somehow thinks that the hlog is old and cleans it up.
>
> # LogRoller: Hlog roll period 3600000ms elapsed
> # SequenceFileLogWriter: Using syncFs -- HDFS-200
> # HLog: Roll /hbase/.logs/A/blahfile0316, entries=1, filesize=312. New hlog /hbase/.logs/A/blahfile0476
> # HLog: Found 1 hlogs to remove out of total 1; oldest outstanding sequenceid is 4 from region -ROOT-,,0
> # HLog: moving old hlog file /hbase/.logs/A/blahfile0316 whose highest sequenceid is 3 to /hbase/.oldlogs/blahfile0316
>
> In that instance, blahfile0316 contains some rows while blahfile0476 is empty. It very much seems like the problem is here, as it thinks the data in the memstore has been persisted while it has not. This seemed like a bug to us, but it might just as well be a misconfiguration. Any idea is greatly appreciated.
>
>
> On Wed, Nov 3, 2010 at 4:36 PM, Erdem Agaoglu <[email protected]> wrote:
>
>> We suspect some misconfiguration too but are unsure what that might be. Almost all of the configuration parameters are at their defaults, along with dfs.support.append, which is true in the append branch. We checked the RS logs and couldn't find any Exceptions; everything seems normal, so it is unlikely to be a bug. The only WARN was about the last hlog file being empty, but I guess this is normal.
>>
>> To confirm it is an append-configuration-related problem, we killed the RS that holds -ROOT- right after bringing up the cluster and checked /hbase/.oldlogs/.
>> There was a file containing a line for -ROOT-, so I think append works correctly. The file got removed after a while.
>>
>> BTW, we tried to reproduce it with smaller data (to be able to check manually: 2 rows in 1 table, newly put) but didn't have any problems. I am starting to think the problem just disappeared. I'll increase the rows to hundreds and check again.
>>
>> Until I get the problem again, thanks for your time.
>>
>> On Tue, Nov 2, 2010 at 7:28 PM, Jean-Daniel Cryans <[email protected]> wrote:
>>
>>> This confirms my suspicion that -ROOT-'s latest data was lost and that .META. got double assigned, which easily explains all the weirdness you're seeing after the log replay.
>>>
>>> The most probable reason would be misconfiguration, since the append feature is very well tested and a simple kill -9 wouldn't cause all sorts of issues. Else, it could be caused by an HBase bug, in which case exceptions would probably come out either during log splitting or during log replay (which happens when the region is reassigned to another region server). In your case you could take a look at the RS log where -ROOT- ends up and see if there's anything weird.
>>>
>>> To test whether append works correctly, in order to rule that out, just bring up your cluster and kill -9 the RS that's holding ROOT right away. Once the master has done the log splitting, do a lookup directly in hadoop (bin/hadoop fs -cat) on that log file, which should be under /hbase/.oldlogs/serverstring/somelog, and look for that -ROOT- entry. If it's empty or missing, it's probably a misconfiguration.
>>>
>>> J-D
>>>
>>> On Tue, Nov 2, 2010 at 9:02 AM, Erdem Agaoglu <[email protected]> wrote:
>>> > We reproduced the problem by kill -9ing the region server that hosts -ROOT- again. The results are the same as what I explained before. After log splitting, metaScanner complained that .META. is not valid and reassigned it somewhere else. As a result, the number of rows in .META. dropped from 24 to 16, meaning we lost access to some regions and some table definitions. Supporting what you said about the data in the memstore, we seem to have lost only recent regions.
>>> >
>>> > status 'detailed' showed no regionsInTransition, and showed a total of 39 regions assigned to region servers, but as I've said before, we have access to only 16 of them. No need to mention our queries fail.
>>> >
>>> > Any idea we could try is greatly appreciated. Thanks in advance.
>>> >
>>> > PS. Something probably irrelevant happened this time: the web interface on 60010 lists all our tables while the 'list' command in the shell lists only a few. We get TableNotFoundExceptions on tables shown in the web interface but missing from the 'list' command output. It seems the web interface is able to list inaccessible tables.
>>> >
>>> > On Tue, Nov 2, 2010 at 9:56 AM, Erdem Agaoglu <[email protected]> wrote:
>>> >
>>> >> By "lost tables" I mean no user table definitions were listed in the 60010 interface and my queries got me TableNotFoundExceptions. The routine BaseScanner logged something like
>>> >>
>>> >> # RegionManager.metaScanner scan of 0 row(s) of meta region ... complete
>>> >>
>>> >> so I guess my .META. was empty. But unfortunately I don't know if any regions were stuck in transition; I'll be sure to check this out while I try to reproduce.
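(For what it's worth, the regions-in-transition check is just this, run from the HBase install directory on a stock setup:

  echo "status 'detailed'" | bin/hbase shell

which prints regionsInTransition plus the per-region-server region lists.)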
>>> >>
>>> >> I rechecked the logs, specifically after the splitting completed, and they contain lines like "Current assignment is not valid...", so I guess this is something unexpected. But in hopes of some configuration error you may spot, the whole log goes like this (stripped to be more readable):
>>> >>
>>> >> # hlog file splitting completed in 80372 millis
>>> >> # Log split complete, meta reassignment and scanning:
>>> >> # ProcessServerShutdown reassigning ROOT region
>>> >> # -ROOT- region unset (but not set to be reassigned)
>>> >> # ROOT inserted into regionsInTransition
>>> >> # Assigning for serverName=C... total nregions to assign=1, regions to give other servers than this=0, isMetaAssign=true
>>> >> # Assigning serverName=C... 1 regions
>>> >> # Assigning region -ROOT-... to C...
>>> >> # Processing MSG_REPORT_OPEN: -ROOT-... from C...; 1 of 1
>>> >> # RegionManager.rootScanner scanning meta region {server: C ...
>>> >> # Current assignment of .META.... is not valid; serverAddress=A, startCode=blah unknown; checking once more!
>>> >> # Current assignment of .META.... is not valid; serverAddress=A, startCode=blah unknown.
>>> >> # RegionManager.rootScanner scan of 1 row(s) of meta region {server: C ... complete
>>> >> # RegionManager.rootScanner scanning meta region {server: C ...
>>> >> # RegionManager.rootScanner scan of 1 row(s) of meta region {server: C ... complete
>>> >> # Assigning for serverName=E... total nregions to assign=-1, regions to give other servers than this=2, isMetaAssign=true
>>> >> # Assigning serverName=E... -1 regions
>>> >> # Assigning region .META.... to E...
>>> >> # Processing MSG_REPORT_OPEN: .META.... from E...; 1 of 1
>>> >> # Processing todo: PendingOpenOperation from E...
>>> >> # .META.... open on E
>>> >> # Updated row .META.... in region -ROOT- ...
>>> >> # Adding to onlineMetaRegions: {server: E ...
>>> >> # RegionManager.metaScanner scanning meta region {server: E ...
>>> >> # RegionManager.metaScanner scan of 0 row(s) of meta region {server: E ... complete
>>> >> # All 1 .META. region(s) scanned
>>> >> (10 secs later)
>>> >> # Processing todo: ProcessServerShutdown of F
>>> >> # Process shutdown of server F...: logSplit: true, rootRescanned: false, numberOfMetaRegions: 1, onlineMetaRegions.size(): 1
>>> >> # Log split complete, meta reassignment and scanning
>>> >> # Process server shutdown scanning root region on C
>>> >> # Process server shutdown scanning root region on C finished master
>>> >> # process server shutdown scanning .META.,,1 on E
>>> >> # process server shutdown finished scanning .META.,,1 on E
>>> >> # Removed F... from deadservers Map
>>> >>
>>> >> Thanks again.
>>> >>
>>> >> On Mon, Nov 1, 2010 at 6:58 PM, Jean-Daniel Cryans <[email protected]> wrote:
>>> >>
>>> >>> The fact that the tables are "revived" is a clue here IMO, but let's go back to more basic questions...
>>> >>>
>>> >>> So when you say that during step 1 you lost tables, what do you mean by "lost"? Were the rows of those tables still in .META.? Were the regions stuck in transition (in the shell, do: status 'detailed')? Or when you tried to query them did you just get a TableNotFoundException?
>>> >>>
>>> >>> Also, the fact that only -ROOT- and not .META. was on this region server means that if any data was lost, it would be .META.'s location, and .META. would have been assigned somewhere else (E) but would still appear assigned to A. Since the data is in the memstore, recent data couldn't be read by this second assignment of .META., but... it could also have been reassigned for a "normal" reason like rebalancing. The best way to confirm that is, when the -ROOT- region gets reassigned at the end of step 1 (so this is after the message that goes like "...file splitting completed in 80372..."), do you see something like this in the master's log: "Current assignment of .META.,,some_timestamp is not valid; serverAddress=blah, startcode=blah unknown."? If so, then it seems that data was lost and this is really unexpected.
>>> >>>
>>> >>> J-D
>>> >>>
>>> >>> On Mon, Nov 1, 2010 at 1:36 AM, Erdem Agaoglu <[email protected]> wrote:
>>> >>> > Hi again,
>>> >>> >
>>> >>> > I have re-checked our configuration to confirm that we have dfs.support.append enabled, and saw "Using syncFs -- HDFS-200" in the logs. I inspected the logs around the log splits to find something, but I'm not sure that's what we are looking for. In the first step of the scenario I mentioned before (where we kill -9ed everything on the node that hosts the ROOT region), HLog says (stripping hdfs:// prefixes and hostnames for clarity)
>>> >>> >
>>> >>> > # Splitting 7 hlog(s) in .logs/F,60020,1287491528908
>>> >>> >
>>> >>> > Then it goes over every single one like
>>> >>> >
>>> >>> > # Splitting hlog 1 of 7
>>> >>> > # Splitting hlog 2 of 7
>>> >>> > # ...
>>> >>> > # Splitting hlog 7 of 7
>>> >>> >
>>> >>> > On the 7th hlog it WARNs with two lines
>>> >>> >
>>> >>> > # File .logs/F,60020,1287491528908/10.1.10.229%3A60020.1288021443546 might be still open, length is 0
>>> >>> > # Could not open .logs/F,60020,1287491528908/10.1.10.229%3A60020.1288021443546 for reading. File is emptyjava.io.EOFException
>>> >>> >
>>> >>> > And completes with
>>> >>> >
>>> >>> > # log file splitting completed in 80372 millis for .logs/F,60020,1287491528908
>>> >>> >
>>> >>> > This might be it, but on the sixth step (where we kill -9ed the RegionServer that hosts the only META region), it splits 2 hlogs without any empty-file problems or any logs above INFO, but as I said before, our testtable still got lost.
>>> >>> >
>>> >>> > I'll try to reproduce the problem in a cleaner way, but in the meantime, any pointers to problems we might have are greatly appreciated.
>>> >>> >
>>> >>> >
>>> >>> > On Fri, Oct 29, 2010 at 9:25 AM, Erdem Agaoglu <[email protected]> wrote:
>>> >>> >
>>> >>> >> Thanks for the answer.
>>> >>> >>
>>> >>> >> I am pretty sure we have dfs.support.append enabled. I remember both the conf file and the logs, and don't recall seeing any errors on 60010. I crawled through the logs all yesterday but don't remember anything indicating a specific error either. But I'm not sure about that. Let me check that and get back here on Monday.
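(The quick way to check that, by the way, is to grep the region server log for the syncFs line mentioned in the message below; the exact log file name depends on your install, so this assumes the default log directory and naming, run from the HBase install directory:

  grep "syncFs -- HDFS-200" logs/hbase-*-regionserver-*.log

If append is configured you will see the "Using syncFs -- HDFS-200" line, otherwise the "not available, dfs.support.append=false" one.)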
>>> >>> >>
>>> >>> >> On Thu, Oct 28, 2010 at 7:30 PM, Jean-Daniel Cryans <[email protected]> wrote:
>>> >>> >>
>>> >>> >>> The first thing I'd check is whether your configuration has dfs.support.append; you can confirm this by looking at your region server logs. When an RS starts, it creates an HLog and will print out "Using syncFs -- HDFS-200" if it's configured, else you'll see "syncFs -- HDFS-200 -- not available, dfs.support.append=false". Also, the master web UI (on port 60010) will print an error message regarding that.
>>> >>> >>>
>>> >>> >>> If it's all ok, then you should take a look at the master log when it does the log splitting and see if it contains any obvious errors.
>>> >>> >>>
>>> >>> >>> J-D
>>> >>> >>>
>>> >>> >>> On Thu, Oct 28, 2010 at 12:58 AM, Erdem Agaoglu <[email protected]> wrote:
>>> >>> >>> > Hi all,
>>> >>> >>> >
>>> >>> >>> > We have a testing cluster of 6 nodes which we try to run an HBase/MapReduce application on. In order to simulate a power failure, we kill -9ed everything Hadoop-related on one of the slave nodes (DataNode, RegionServer, TaskTracker, ZK quorum peer, and I think the SecondaryNameNode was on this node too). We were expecting a smooth transition for all services but didn't get one on the HBase end. While our regions seemed intact (not confirmed), we lost table definitions, which pointed to some kind of META region failure. So our application failed with several TableNotFoundExceptions. The simulation was conducted with no load and extremely small data (like 10 rows in 3 tables).
>>> >>> >>> >
>>> >>> >>> > On our setup, HBase is 0.89.20100924, r1001068, while Hadoop runs 0.20.3-append-r964955-1240, r960957. Most of the configuration parameters are at their defaults.
>>> >>> >>> >
>>> >>> >>> > If we did something wrong up to this point, please ignore the rest of the message, as I'll try to explain what we did to reproduce it and it might be irrelevant.
>>> >>> >>> >
>>> >>> >>> > Say the machines are named A, B, C, D, E, F; where A is the master-like node, the others are slaves, and the power failure is on F. Since we have little data, we have one ROOT and only one META region. I'll try to sum up the whole scenario.
>>> >>> >>> >
>>> >>> >>> > A: NN, DN, JT, TT, HM, RS
>>> >>> >>> > B: DN, TT, RS, ZK
>>> >>> >>> > C: DN, TT, RS, ZK
>>> >>> >>> > D: DN, TT, RS, ZK
>>> >>> >>> > E: DN, TT, RS, ZK
>>> >>> >>> > F: SNN, DN, TT, RS, ZK
>>> >>> >>> >
>>> >>> >>> > 0. Initial state -> ROOT: F, META: A
>>> >>> >>> > 1. Power fail on F -> ROOT: C, META: E -> lost tables (waited for about half an hour to no effect, BTW)
>>> >>> >>> > 2. Put F back online -> No effect
>>> >>> >>> > 3. Create a table 'testtable' to see if we lose it
>>> >>> >>> > 4. Kill -9ed DataNode on F -> No effect -> Start it again
>>> >>> >>> > 5. Kill -9ed RegionServer on F -> No effect -> Start it again
>>> >>> >>> > 6. Kill -9ed RegionServer on E -> ROOT: C, META: A -> We lost 'testtable' but got our tables from before the simulation back.
>>> >>> >>> > It seemed like, because A had META before the simulation, the table definitions were revived.
>>> >>> >>> > 7. Restarted the whole cluster -> ROOT: A, META: F -> We lost 2 out of our original 6 tables, 'testtable' revived. That small amount of data seems corrupted too, as our Scans don't finish.
>>> >>> >>> > 8. Run to the mailing list.
>>> >>> >>> >
>>> >>> >>> > First of all, thanks for reading up to this point. From where we are now, we are not even sure if this is the expected behavior (i.e. if the ROOT or META region dies, we lose data and must do something like hbck), or if we are missing a configuration, or if this is a bug. No need to mention that we are relatively new to HBase, so the last possibility is that we didn't understand it at all.
>>> >>> >>> >
>>> >>> >>> > Thanks in advance for any ideas.
>>> >>> >>> >
>>> >>> >>> > --
>>> >>> >>> > erdem agaoglu
