We've applied the patch and it works great. Our hlogs seem intact after rolls,
and we were unable to crash the cluster by kill -9ing any RS.

Thanks for your interest and rapid response.
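
For anyone wanting to run a similar check, here is a rough sketch of the kind
of test we ran (process selection, paths and running from the install
directories are all illustrative, adjust for your own cluster):

  # on the node carrying the region server we want to kill
  kill -9 $(jps | awk '/HRegionServer/ {print $1}')

  # back on the master: once the split is reported done, the dead server's
  # directory under .logs should drain and its hlogs end up archived
  bin/hadoop fs -ls /hbase/.logs
  bin/hadoop fs -ls /hbase/.oldlogs

  # finally make sure .META. still lists every region
  echo "scan '.META.'" | bin/hbase shell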

On Sat, Nov 6, 2010 at 12:51 AM, Jean-Daniel Cryans <[email protected]> wrote:

> I resolved the problem in trunk, a patch for 0.89 can also be found
> attached in the jira. It should apply cleanly. Let me know if it works
> better for you!
>
> Thx,
>
> J-D
>
> On Fri, Nov 5, 2010 at 11:45 AM, Erdem Agaoglu <[email protected]>
> wrote:
> > Previous problems probably correlate. When log split and replay fails
> > [because of the empty log], master thinks the cluster is in a clean state
> > and assigns a new .META. on some regionserver. It does not know about the
> > previous assignment of .META. thus does not split any logs for it. When
> > regionserver opens new .META., it looks to hdfs and finds some persistent
> > file for it, and since there is no recovered.edits it starts to serve.
> > Depending on the state of those files, two things happen:
> >
> > 1. If the cluster is really new (like an hour new), all the .META. data
> > will be in the previous RS's memstore. So there will be no accessible
> > regions = lost tables.
> > 2. If the cluster has been running for a while or was gracefully restarted
> > before, the .META. data from that previous run would be persisted. So the
> > regions from the previous run will be accessible, but not the newly
> > generated or re-assigned regions = never-finishing scans or corrupt data.
> > *. No matter what, since the master thinks the .META. is new, it will not
> > know about the original regions and will not unassign them = double
> > assignment.
> >
> > So if we are right about those assumptions, it seems it's the same problem.
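
A quick way to see the situation described above (a region being opened with
nothing to replay) is to look for the split output before the region comes
back online. A rough sketch, assuming the layout on these versions where the
splitter leaves a recovered.edits directory under each region; the user table
name is just an example:

  # any recovered edits pending for the .META. region?
  bin/hadoop fs -ls '/hbase/.META./*/recovered.edits'

  # same check for an ordinary user region
  bin/hadoop fs -ls '/hbase/some_table/*/recovered.edits'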
> >
> > We've learned a lot about HBase this week, and we owe it to your pointers,
> > thanks.
> > I am grateful to have the chance to contribute to the community in any
> > way,... at last :).
> >
> > On Fri, Nov 5, 2010 at 7:43 PM, Jean-Daniel Cryans <[email protected]> wrote:
> >
> >> Very interesting finding, Erdem! I tried it on a local empty instance
> >> and I set the roll period to 6 secs, here's what I saw:
> >>
> >> org.apache.hadoop.hbase.regionserver.LogRoller: Hlog roll period 6000ms elapsed
> >> org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter: Using syncFs -- HDFS-200
> >> org.apache.hadoop.hbase.regionserver.wal.HLog: Roll /hbase-89-su/.logs/hbasedev,60020,1288977933643/10.10.1.177%3A60020.1288977933829, entries=1, filesize=295. New hlog /hbase-89-su/.logs/hbasedev,60020,1288977933643/10.10.1.177%3A60020.1288977943913
> >> org.apache.hadoop.hbase.regionserver.wal.HLog: Found 1 hlogs to remove out of total 1; oldest outstanding sequenceid is 270055 from region -ROOT-,,0
> >> org.apache.hadoop.hbase.regionserver.wal.HLog: moving old hlog file /hbase-89-su/.logs/hbasedev,60020,1288977933643/10.10.1.177%3A60020.1288977933829 whose highest sequenceid is 270054 to /hbase-89-su/.oldlogs/10.10.1.177%3A60020.1288977933829
> >>
> >> This looks like a very bad bug; I opened
> >> https://issues.apache.org/jira/browse/HBASE-3198
> >>
> >> Can you correlate this with the previous tests you did? Hopefully it's the
> >> same problem.
> >>
> >> Thanks a lot for your patience and investigative work!
> >>
> >> J-D
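
To reproduce this locally the way described above, shrinking the roll period
on a throwaway instance should be enough; the 6-second roll corresponds to
setting hbase.regionserver.logroll.period (milliseconds, 3600000 by default)
in hbase-site.xml. Then a single put and a couple of rolls show whether the
only edit survives into .oldlogs; the table and family names below are just
examples:

  # with hbase.regionserver.logroll.period set to 6000 in hbase-site.xml
  echo "put 't1', 'r1', 'f1:c1', 'v1'" | bin/hbase shell   # t1/f1 assumed to exist
  sleep 15                                                 # let a roll or two happen
  bin/hadoop fs -ls /hbase/.oldlogs                        # did the hlog get archived?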
> >>
> >> On Fri, Nov 5, 2010 at 9:28 AM, Erdem Agaoglu <[email protected]>
> >> wrote:
> >> > We have found the problem again, but this time we dug deeper and
> >> > extracted more info.
> >> >
> >> > First, the reason we were unable to reproduce it earlier was that we
> >> > were killing the RS that holds -ROOT- too soon. That way,
> >> > 1. log split goes fine,
> >> > 2. -ROOT- gets assigned somewhere else,
> >> > 3. log replay for that RS goes smoothly,
> >> > 4. .META. does not change,
> >> > 5. everything is normal.
> >> >
> >> > But if we try after the cluster has been running for a while,
> >> > 1. log split fails with java.io.EOFException for the only hlog file,
> >> > 2. -ROOT- gets assigned somewhere else,
> >> > 3. no log is replayed since none was split,
> >> > 4. .META. is invalid, so a new .META. is designated,
> >> > 5. data is corrupted.
> >> >
> >> > Before killing the RS that holds -ROOT-, we checked its only hlog file
> >> > and saw that it contains the updates for the location of .META. in the
> >> > first scenario. But in the second, failing scenario, the file was empty.
> >> > We found that the log rolling process, which happens once an hour by
> >> > default, somehow thinks that the hlog is old and cleans it up.
> >> >
> >> > # LogRoller: Hlog roll period 3600000ms elapsed
> >> > # SequenceFileLogWriter: Using syncFs -- HDFS-200
> >> > # HLog: Roll /hbase/.logs/A/blahfile0316, entries=1, filesize=312. New hlog /hbase/.logs/A/blahfile0476
> >> > # HLog: Found 1 hlogs to remove out of total 1; oldest outstanding sequenceid is 4 from region -ROOT-,,0
> >> > # HLog: moving old hlog file /hbase/.logs/A/blahfile0316 whose highest sequenceid is 3 to /hbase/.oldlogs/blahfile0316
> >> >
> >> > In that instance, blahfile0316 contains some rows while blahfile0476 is
> >> > empty. It very much seems like the problem is here, as the roller assumes
> >> > the data in the memstore has been persisted while it has not. This seemed
> >> > like a bug to us, but it might as well be a misconfiguration. Any idea is
> >> > greatly appreciated.
> >> >
> >> >
> >> > On Wed, Nov 3, 2010 at 4:36 PM, Erdem Agaoglu <[email protected]> wrote:
> >> >
> >> >> We suspect some misconfiguration too but are unsure what that might be.
> >> >> Almost all of the configuration parameters are at their defaults,
> >> >> including dfs.support.append, which is true in the append branch. We
> >> >> checked the RS logs and couldn't find any exceptions; everything seems
> >> >> normal, so it is unlikely to be a bug. The only WARN was about the last
> >> >> hlog file being empty, but I guess this is normal.
> >> >>
> >> >> To confirm whether it is an append-configuration-related problem, we
> >> >> killed the RS that holds -ROOT- right after starting the cluster and
> >> >> checked /hbase/.oldlogs/. There was a file containing a line for -ROOT-,
> >> >> so I think append works correctly. The file got removed after a while.
> >> >>
> >> >> BTW, we tried to reproduce it with smaller data (to control manually,
> >> >> like 2 rows in 1 table, newly put) but didn't have any problems. I am
> >> >> starting to think the problem just disappeared. I'll increase the rows
> >> >> to hundreds and check again.
> >> >>
> >> >> Until I hit the problem again, thanks for your time.
> >> >>
> >> >> On Tue, Nov 2, 2010 at 7:28 PM, Jean-Daniel Cryans <[email protected]> wrote:
> >> >>
> >> >>> This confirms my suspicion that -ROOT-'s latest data was lost and that
> >> >>> .META. got double assigned, which easily explains all the weirdness
> >> >>> you're seeing after the log replay.
> >> >>>
> >> >>> The most probable reason would be misconfiguration, since the append
> >> >>> feature is very well tested and a simple kill -9 wouldn't cause all
> >> >>> sorts of issues. Else, it could be caused by an HBase bug, in which
> >> >>> case exceptions would probably come out either during log splitting or
> >> >>> during log replay (which happens when the region is reassigned to
> >> >>> another region server). In your case you could take a look at the RS
> >> >>> log where -ROOT- ends up and see if there's anything weird.
> >> >>>
> >> >>> To test if append works correctly, in order to rule that out, just
> >> >>> bring up your cluster and kill -9 right away the RS that's holding
> >> >>> ROOT. Once the master did the log splitting, do a lookup directly in
> >> >>> hadoop (bin/hadoop fs -cat) on that log file which should be under
> >> >>> /hbase/.oldlogs/serverstring/somelog and look for that -ROOT- entry.
> >> >>> If it's empty or missing, it's probably a misconfiguration.
> >> >>>
> >> >>> J-D
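
Spelled out as commands, that check might look roughly like the following; the
names under .oldlogs are placeholders, and the exact layout there (flat files
vs. per-server directories) varies between versions, so list the directory
first:

  # on the node holding -ROOT-, right after cluster startup
  kill -9 $(jps | awk '/HRegionServer/ {print $1}')

  # once the master reports the split as done, look at the archived log
  bin/hadoop fs -ls /hbase/.oldlogs
  bin/hadoop fs -cat /hbase/.oldlogs/serverstring/somelog | strings | grep ROOT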
> >> >>>
> >> >>> On Tue, Nov 2, 2010 at 9:02 AM, Erdem Agaoglu <[email protected]> wrote:
> >> >>> > We reproduced the problem by kill -9ing the region server that hosts
> >> >>> > -ROOT- again. Results are the same as what I explained before. After
> >> >>> > log splitting, metaScanner complained that .META. is not valid and
> >> >>> > reassigned it somewhere else. As a result, the number of rows in
> >> >>> > .META. dropped from 24 to 16, meaning we lost access to some regions
> >> >>> > and some table definitions. Supporting what you said about the data
> >> >>> > in the memstore, we seem to have lost only the recent regions.
> >> >>> >
> >> >>> > status 'detailed' showed no regionsInTransition, and showed a total
> >> >>> > of 39 regions assigned to region servers, but as I've said before, we
> >> >>> > have access to only 16 of them. Needless to say, our queries fail.
> >> >>> >
> >> >>> > Any idea we could try is greatly appreciated. Thanks in advance.
> >> >>> >
> >> >>> > PS. Something probably irrelevant happened this time: the 60010
> >> >>> > interface lists all our tables while the 'list' command in the shell
> >> >>> > lists only a few. We get TableNotFoundExceptions on tables shown in
> >> >>> > the web interface but missing from the 'list' output. It seems the
> >> >>> > web interface is able to list inaccessible tables.
> >> >>> >
> >> >>> > On Tue, Nov 2, 2010 at 9:56 AM, Erdem Agaoglu <[email protected]> wrote:
> >> >>> >
> >> >>> >> By "lost tables" I mean no user table definitions were listed in
> >> >>> >> the 60010 interface and my queries got me TableNotFoundExceptions.
> >> >>> >> The routine BaseScanner logged lines like
> >> >>> >>
> >> >>> >> # RegionManager.metaScanner scan of 0 row(s) of meta region ... complete
> >> >>> >>
> >> >>> >> so I guess my .META. was empty. But unfortunately I don't know if
> >> >>> >> any regions were stuck in transition; I'll be sure to check this
> >> >>> >> while I try to reproduce.
> >> >>> >>
> >> >>> >> I rechecked the logs, specifically after the splitting completed,
> >> >>> >> and they contain lines like "Current assignment is not valid...", so
> >> >>> >> I guess this is something unexpected. But in hopes of some
> >> >>> >> configuration error you may spot, the whole log goes like this
> >> >>> >> (stripped to be more readable):
> >> >>> >>
> >> >>> >> # hlog file splitting completed in 80372 millis
> >> >>> >> # Log split complete, meta reassignment and scanning:
> >> >>> >> # ProcessServerShutdown reassigning ROOT region
> >> >>> >> # -ROOT- region unset (but not set to be reassigned)
> >> >>> >> # ROOT inserted into regionsInTransition
> >> >>> >> # Assigning for serverName=C... total nregions to assign=1, regions to give other servers than this=0, isMetaAssign=true
> >> >>> >> # Assigning serverName=C... 1 regions
> >> >>> >> # Assigning region -ROOT-... to C...
> >> >>> >> # Processing MSG_REPORT_OPEN: -ROOT-... from C...; 1 of 1
> >> >>> >> # RegionManager.rootScanner scanning meta region {server: C ...
> >> >>> >> # Current assignment of .META.... is not valid; serverAddress=A, startCode=blah unknown; checking once more!
> >> >>> >> # Current assignment of .META.... is not valid; serverAddress=A, startCode=blah unknown.
> >> >>> >> # RegionManager.rootScanner scan of 1 row(s) of meta region {server: C ... complete
> >> >>> >> # RegionManager.rootScanner scanning meta region {server: C ...
> >> >>> >> # RegionManager.rootScanner scan of 1 row(s) of meta region {server: C ... complete
> >> >>> >> # Assigning for serverName=E... total nregions to assign=-1, regions to give other servers than this=2, isMetaAssign=true
> >> >>> >> # Assigning serverName=E... -1 regions
> >> >>> >> # Assigning region .META.... to E...
> >> >>> >> # Processing MSG_REPORT_OPEN: .META.... from E...; 1 of 1
> >> >>> >> # Processing todo: PendingOpenOperation from E...
> >> >>> >> # .META.... open on E
> >> >>> >> # Updated row .META.... in region -ROOT- ...
> >> >>> >> # Adding to onlineMetaRegions: {server: E ...
> >> >>> >> # RegionManager.metaScanner scanning meta region {server: E ...
> >> >>> >> # RegionManager.metaScanner scan of 0 row(s) of meta region {server: E ...
> >> >>> >> # All 1 .META. region(s) scanned
> >> >>> >> 10 secs later
> >> >>> >> # Processing todo: ProcessServerShutdown of F
> >> >>> >> # Process shutdown of server F...: logSplit: true, rootRescanned: false, numberOfMetaRegions: 1, onlineMetaRegions.size(): 1
> >> >>> >> # Log split complete, meta reassignment and scanning
> >> >>> >> # Process server shutdown scanning root region on C
> >> >>> >> # Process server shutdown scanning root region on C finished master
> >> >>> >> # process server shutdown scanning .META.,,1 on E
> >> >>> >> # process server shutdown finished scanning .META.,,1 on E
> >> >>> >> # Removed F... from deadservers Map
> >> >>> >>
> >> >>> >> Thanks again.
> >> >>> >>
> >> >>> >> On Mon, Nov 1, 2010 at 6:58 PM, Jean-Daniel Cryans <[email protected]> wrote:
> >> >>> >>
> >> >>> >>> The fact that the tables are "revived" is a clue here IMO, but
> >> >>> >>> let's go back to more basic questions...
> >> >>> >>>
> >> >>> >>> So when you say that during step 1 you lost tables, what do you
> >> >>> >>> mean by "lost"? Were the rows of those tables still in .META.? Were
> >> >>> >>> the regions stuck in transition (in the shell, do: status
> >> >>> >>> 'detailed')? Or when you tried to query them did you just get a
> >> >>> >>> TableNotFoundException?
> >> >>> >>>
> >> >>> >>> Also, the fact that only -ROOT- and not .META. was on this region
> >> >>> >>> server means that if there was any data lost, it would be .META.'s
> >> >>> >>> location, and it would have been assigned somewhere else (E) but
> >> >>> >>> still stayed assigned on A. Since the data is in the memstore,
> >> >>> >>> recent data couldn't be read by this second assignment of .META.,
> >> >>> >>> but... it could also be reassigned for a "normal" reason like
> >> >>> >>> rebalancing. The best way to confirm that is when the -ROOT- region
> >> >>> >>> gets reassigned at the end of step 1 (so this is after the message
> >> >>> >>> that goes like "...file splitting completed in 80372..."), do you
> >> >>> >>> see something like this in the master's log: "Current assignment
> >> >>> >>> of .META.,,some_timestamp is not valid; serverAddress=blah,
> >> >>> >>> startcode=blah unknown."? If so, then it seems that data was lost
> >> >>> >>> and this is really unexpected.
> >> >>> >>>
> >> >>> >>> J-D
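
A quick way to answer that last question is to grep the master log around the
split; the path below assumes the default logs/ directory under the HBase
install and is only an assumption about this setup:

  grep -E "splitting completed|Current assignment of .META" \
      $HBASE_HOME/logs/hbase-*-master-*.log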
> >> >>> >>>
> >> >>> >>> On Mon, Nov 1, 2010 at 1:36 AM, Erdem Agaoglu <[email protected]> wrote:
> >> >>> >>> > Hi again,
> >> >>> >>> >
> >> >>> >>> > I have re-checked our configuration to confirm that we have
> >> >>> >>> > dfs.support.append enabled, and I saw "Using syncFs -- HDFS-200"
> >> >>> >>> > in the logs. I inspected the logs around the log splits to find
> >> >>> >>> > something, but I'm not sure it's what we are looking for. In the
> >> >>> >>> > first step of the scenario I mentioned before (where we kill -9ed
> >> >>> >>> > everything on the node that hosts the ROOT region), HLog says
> >> >>> >>> > (stripping hdfs:// prefixes and hostnames for clarity)
> >> >>> >>> >
> >> >>> >>> > # Splitting 7 hlog(s) in .logs/F,60020,1287491528908
> >> >>> >>> >
> >> >>> >>> > Then it goes over every single one like
> >> >>> >>> >
> >> >>> >>> > # Splitting hlog 1 of 7
> >> >>> >>> > # Splitting hlog 2 of 7
> >> >>> >>> > # ...
> >> >>> >>> > # Splitting hlog 7 of 7
> >> >>> >>> >
> >> >>> >>> > On the 7th hlog it WARNs with two lines
> >> >>> >>> >
> >> >>> >>> > # File .logs/F,60020,1287491528908/10.1.10.229%3A60020.1288021443546 might be still open, length is 0
> >> >>> >>> > # Could not open .logs/F,60020,1287491528908/10.1.10.229%3A60020.1288021443546 for reading. File is empty: java.io.EOFException
> >> >>> >>> >
> >> >>> >>> > And completes with
> >> >>> >>> >
> >> >>> >>> > # log file splitting completed in 80372 millis for .logs/F,60020,1287491528908
> >> >>> >>> >
> >> >>> >>> > This might be it, but on the sixth step (where we kill -9ed the
> >> >>> >>> > RegionServer that hosts the only META region), it splits 2 hlogs
> >> >>> >>> > without any empty-file problems or any log above INFO, yet as I
> >> >>> >>> > said before, our testtable still got lost.
> >> >>> >>> >
> >> >>> >>> > I'll try to reproduce the problem in a cleaner way, but in the
> >> >>> >>> > meantime, any kind of pointers to any problems we might have is
> >> >>> >>> > greatly appreciated.
> >> >>> >>> >
> >> >>> >>> >
> >> >>> >>> > On Fri, Oct 29, 2010 at 9:25 AM, Erdem Agaoglu <[email protected]> wrote:
> >> >>> >>> >
> >> >>> >>> >> Thanks for the answer.
> >> >>> >>> >>
> >> >>> >>> >> I am pretty sure we have dfs.support.append enabled. I remember
> >> >>> >>> >> both the conf file and the logs, and I don't recall seeing any
> >> >>> >>> >> errors on 60010. I crawled through the logs all yesterday but
> >> >>> >>> >> don't remember anything indicating a specific error either. But
> >> >>> >>> >> I'm not sure about that. Let me check and get back here on
> >> >>> >>> >> Monday.
> >> >>> >>> >>
> >> >>> >>> >> On Thu, Oct 28, 2010 at 7:30 PM, Jean-Daniel Cryans <[email protected]> wrote:
> >> >>> >>> >>
> >> >>> >>> >>> The first thing I'd check is whether your configuration has
> >> >>> >>> >>> dfs.support.append; you can confirm this by looking at your
> >> >>> >>> >>> region server logs. When a RS starts, it creates an HLog and
> >> >>> >>> >>> will print out "Using syncFs -- HDFS-200" if it's configured,
> >> >>> >>> >>> else you'll see "syncFs -- HDFS-200 -- not available,
> >> >>> >>> >>> dfs.support.append=false". Also the master web ui (on port
> >> >>> >>> >>> 60010) will print an error message regarding that.
> >> >>> >>> >>>
> >> >>> >>> >>> If it's all ok, then you should take a look at the master log
> >> >>> >>> >>> when it does the log splitting and see if it contains any
> >> >>> >>> >>> obvious errors.
> >> >>> >>> >>>
> >> >>> >>> >>> J-D
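
For a quick version of both checks, something along these lines should do; the
log and conf locations assume a default tarball layout with HBASE_HOME and
HADOOP_HOME set, so adjust as needed:

  # does the region server think append is on?
  grep -e "Using syncFs -- HDFS-200" -e "dfs.support.append=false" \
      $HBASE_HOME/logs/hbase-*-regionserver-*.log

  # and is the flag actually set in the configs both sides read?
  grep -A1 dfs.support.append $HADOOP_HOME/conf/hdfs-site.xml \
      $HBASE_HOME/conf/hbase-site.xml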
> >> >>> >>> >>>
> >> >>> >>> >>> On Thu, Oct 28, 2010 at 12:58 AM, Erdem Agaoglu <[email protected]> wrote:
> >> >>> >>> >>> > Hi all,
> >> >>> >>> >>> >
> >> >>> >>> >>> > We have a testing cluster of 6 nodes which we try to run an
> >> >>> >>> >>> > HBase/MapReduce application on. In order to simulate a power
> >> >>> >>> >>> > failure, we kill -9ed everything Hadoop-related on one of the
> >> >>> >>> >>> > slave nodes (DataNode, RegionServer, TaskTracker, ZK quorum
> >> >>> >>> >>> > peer, and I think SecondaryNameNode was on this node too). We
> >> >>> >>> >>> > were expecting a smooth transition on all services but were
> >> >>> >>> >>> > unable to get one on the HBase end. While our regions seemed
> >> >>> >>> >>> > intact (not confirmed), we lost table definitions, which
> >> >>> >>> >>> > points to some kind of META region failure. So our
> >> >>> >>> >>> > application failed with several TableNotFoundExceptions. The
> >> >>> >>> >>> > simulation was conducted with no load and extremely small
> >> >>> >>> >>> > data (like 10 rows in 3 tables).
> >> >>> >>> >>> >
> >> >>> >>> >>> > On our setup, HBase is 0.89.20100924, r1001068, while Hadoop
> >> >>> >>> >>> > runs 0.20.3-append-r964955-1240, r960957. Most of the
> >> >>> >>> >>> > configuration parameters are at their defaults.
> >> >>> >>> >>> >
> >> >>> >>> >>> > If we did something wrong up to this point, please ignore
> >> >>> >>> >>> > the rest of the message, as I'll try to explain what we did
> >> >>> >>> >>> > to reproduce it and it might be irrelevant.
> >> >>> >>> >>> >
> >> >>> >>> >>> > Say the machines are named A, B, C, D, E, F; where A is the
> >> >>> >>> >>> > master-like node, the others are slaves, and the power
> >> >>> >>> >>> > failure is on F. Since we have little data, we have one ROOT
> >> >>> >>> >>> > and only one META region. I'll try to sum up the whole
> >> >>> >>> >>> > scenario.
> >> >>> >>> >>> >
> >> >>> >>> >>> > A: NN, DN, JT, TT, HM, RS
> >> >>> >>> >>> > B: DN, TT, RS, ZK
> >> >>> >>> >>> > C: DN, TT, RS, ZK
> >> >>> >>> >>> > D: DN, TT, RS, ZK
> >> >>> >>> >>> > E: DN, TT, RS, ZK
> >> >>> >>> >>> > F: SNN, DN, TT, RS, ZK
> >> >>> >>> >>> >
> >> >>> >>> >>> > 0. Initial state -> ROOT: F, META: A
> >> >>> >>> >>> > 1. Power fail on F -> ROOT: C, META: E -> lost tables; we
> >> >>> >>> >>> > waited for about half an hour to get nothing, BTW
> >> >>> >>> >>> > 2. Put F back online -> No effect
> >> >>> >>> >>> > 3. Create a table 'testtable' to see if we lose it
> >> >>> >>> >>> > 4. Kill -9ed DataNode on F -> No effect -> Start it again
> >> >>> >>> >>> > 5. Kill -9ed RegionServer on F -> No effect -> Start it again
> >> >>> >>> >>> > 6. Kill -9ed RegionServer on E -> ROOT: C, META: A -> We lost
> >> >>> >>> >>> > 'testtable' but got our tables from before the simulation
> >> >>> >>> >>> > back. It seemed like the table definitions were revived
> >> >>> >>> >>> > because A had META before the simulation.
> >> >>> >>> >>> > 7. Restarted the whole cluster -> ROOT: A, META: F -> We lost
> >> >>> >>> >>> > 2 of our original 6 tables, and 'testtable' revived. That
> >> >>> >>> >>> > small amount of data seems corrupted too, as our Scans don't
> >> >>> >>> >>> > finish.
> >> >>> >>> >>> > 8. Run to mailing-list.
> >> >>> >>> >>> >
> >> >>> >>> >>> > First of all, thanks for reading up to this point. From
> >> >>> >>> >>> > where we are now, we are not even sure if this is the
> >> >>> >>> >>> > expected behavior (i.e. if the ROOT or META region dies, we
> >> >>> >>> >>> > lose data and must do something like hbck), or if we are
> >> >>> >>> >>> > missing a configuration, or if this is a bug. Needless to
> >> >>> >>> >>> > say, we are relatively new to HBase, so the last possibility
> >> >>> >>> >>> > is that we didn't understand it at all.
> >> >>> >>> >>> >
> >> >>> >>> >>> > Thanks in advance for any ideas.
> >> >>> >>> >>> >
> >> >>> >>> >>> > --
> >> >>> >>> >>> > erdem agaoglu
> >> >>> >>> >>> >
> >> >>> >>> >>>
> >> >>> >>> >>
> >> >>> >>> >>
> >> >>> >>> >>
> >> >>> >>> >> --
> >> >>> >>> >> erdem agaoglu
> >> >>> >>> >>
> >> >>> >>> >
> >> >>> >>> >
> >> >>> >>> >
> >> >>> >>> > --
> >> >>> >>> > erdem agaoglu
> >> >>> >>> >
> >> >>> >>>
> >> >>> >>
> >> >>> >>
> >> >>> >>
> >> >>> >> --
> >> >>> >> erdem agaoglu
> >> >>> >>
> >> >>> >
> >> >>> >
> >> >>> >
> >> >>> > --
> >> >>> > erdem agaoglu
> >> >>> >
> >> >>>
> >> >>
> >> >>
> >> >>
> >> >> --
> >> >> erdem agaoglu
> >> >>
> >> >
> >> >
> >> >
> >> > --
> >> > erdem agaoglu
> >> >
> >>
> >
> >
> >
> > --
> > erdem agaoglu
> >
>



-- 
erdem agaoglu
