When you say replication, what exactly do you mean? In normal HDFS, as you write, the data is sent to 3 nodes, yes, but with the flaw I outlined it doesn't matter, because the datanodes and namenode will pretend a data block just didn't exist if it wasn't closed properly.
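
To make that concrete, here is a rough sketch of what you would see if you checked the last, still-open WAL of a crashed regionserver on stock 0.20. This just uses the plain Hadoop FileSystem API; the WAL path below is made up:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CheckOpenWal {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());

    // Hypothetical path to the WAL that was still open when the
    // regionserver died.
    Path wal = new Path("/hbase/.logs/rs4,60020,1284900000000/hlog.dat.1284999999999");

    // On stock 0.20 the namenode reports the length of a never-closed
    // file as 0, so the edits in that WAL look empty even though the
    // datanodes are still holding the bytes.
    FileStatus status = fs.getFileStatus(wal);
    System.out.println(wal + " length = " + status.getLen());
  }
}

With the append patches (CDH3 / the 0.20-append work) that hole is closed: the file can be recovered and closed, its real length shows up, and the log can be split and replayed.
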
So even with the most careful white-glove handling of hbase, you will eventually have a crash and you will lose data without 0.89/CDH3 et al. You can circumvent this by storing the data elsewhere and spooling it into hbase, or perhaps by just not minding if you lose data (yes, those applications exist).

Looking at those JIRAs in question, the first is already on trunk, which is 0.89. The second isn't, alas. At this point the transactional hbase just isn't being actively maintained by any committer and we are reliant on kind people's contributions. So I can't promise when it will hit 0.89/0.90. If you do end up going stock, there is a sketch of paging with plain scans at the bottom of this mail.

-ryan

On Mon, Sep 20, 2010 at 1:21 PM, George P. Stathis <[email protected]> wrote:
> Thanks for the response Ryan. I have no doubt that 0.89 can be used in production and that it has strong support. I just wanted to avoid moving to it now because we have limited resources and it would put a dent in our roadmap if we were to fast track the migration now. Specifically, we are using HBASE-2438 and HBASE-2426 to support pagination across indexes. So we either have to migrate those to 0.89 or somehow go stock and be able to support pagination across region servers.
>
> Of course, if the choice is between migrating or losing more data, data safety comes first. But if we can buy two or three more months of time and avoid region server crashes (like you did for a year), maybe we can go that route for now. What do we need to do to achieve that?
>
> -GS
>
> PS: Out of curiosity, I understand the WAL log append issue for a single regionserver when it comes to losing the data on a single node. But if that data is also being replicated on another region server, why wouldn't it be available there? Or is the WAL log shared across multiple region servers (maybe that's what I'm missing)?
>
> On Mon, Sep 20, 2010 at 3:52 PM, Ryan Rawson <[email protected]> wrote:
>> Hey,
>>
>> The problem is that the stock 0.20 hadoop won't let you read from a non-closed file. It will report that length as 0. So if a regionserver crashes, that last WAL log that is still open becomes 0 length and the data within it is unreadable. That specifically is the problem of data loss. You could always make it so your regionservers rarely crash - this is possible btw and I did it for over a year.
>>
>> But you will want to run CDH3 or the append-branch releases to get the series of patches that fix this hole. It also happens that only 0.89 runs on it. I would like to avoid the hadoop "everyone uses 0.20 forever" problem and talk about what we could do to help you get on 0.89. Over here at SU we've made a commitment to the future of 0.89 and are running it in production. Let us know what else you'd need.
>>
>> -ryan
>>
>> On Mon, Sep 20, 2010 at 12:39 PM, George P. Stathis <[email protected]> wrote:
>> > Thanks Todd. We are not quite ready to move to 0.89 yet. We have made custom modifications to the transactional contrib sources which are now taken out of 0.89. We are planning on moving to 0.90 when it comes out and at that point, either migrate our customizations, or move back to the out-of-the-box features (which will require a re-write of our code).
>> >
>> > We are well aware of the CDH distros but at the time we started with hbase, there was none that included HBase. I think CDH3 is the first one to include HBase, correct? And is 0.89 the only one supported?
>> >
>> > Moreover, are we saying that there is no way to prevent stock hbase 0.20.6 and hadoop 0.20.2 from losing data when a single node goes down? It does not matter if the data is replicated, it will still get lost?
>> >
>> > -GS
>> >
>> > On Sun, Sep 19, 2010 at 5:58 PM, Todd Lipcon <[email protected]> wrote:
>> >> Hi George,
>> >>
>> >> The data loss problems you mentioned below are known issues when running on stock Apache 0.20.x hadoop.
>> >>
>> >> You should consider upgrading to CDH3b2, which includes a number of HDFS patches that allow HBase to durably store data. You'll also have to upgrade to HBase 0.89 - we ship a version as part of CDH that will work well.
>> >>
>> >> Thanks
>> >> -Todd
>> >>
>> >> On Sun, Sep 19, 2010 at 6:57 AM, George P. Stathis <[email protected]> wrote:
>> >> > Hi folks. I'd like to run the following data loss scenario by you to see if we are doing something obviously wrong with our setup here.
>> >> >
>> >> > Setup:
>> >> >
>> >> >    - Hadoop 0.20.1
>> >> >    - HBase 0.20.3
>> >> >    - 1 Master Node running Nameserver, SecondaryNameserver, JobTracker, HMaster and 1 Zookeeper (no zookeeper quorum right now)
>> >> >    - 4 child nodes running a Datanode, TaskTracker and RegionServer each
>> >> >    - dfs.replication is set to 2
>> >> >    - Host: Amazon EC2
>> >> >
>> >> > Up until yesterday, we were frequently experiencing HBASE-2077 <https://issues.apache.org/jira/browse/HBASE-2077>, which kept bringing our RegionServers down. What we realized though is that we were losing data (a few hours' worth) with just one out of four regionservers going down. This is problematic since we are supposed to replicate at x2 out of 4 nodes, so at least one other node should theoretically be able to serve the data that the downed regionserver can't.
>> >> >
>> >> > Questions:
>> >> >
>> >> >    - When a regionserver goes down unexpectedly, the only data that theoretically gets lost was whatever didn't make it to the WAL, right? Or wrong? E.g. http://www.larsgeorge.com/2010/01/hbase-architecture-101-write-ahead-log.html
>> >> >    - We ran a hadoop fsck on our cluster and verified the replication factor, as well as that there were no under-replicated blocks. So why was our data not available from another node?
>> >> >    - If the log gets rolled every 60 minutes by default (we haven't touched the defaults), how can we lose data from up to 24 hours ago?
>> >> >    - When the downed regionserver comes back up, shouldn't that data be available again? Ours wasn't.
>> >> >    - In such scenarios, is there a recommended approach for restoring the regionserver that goes down? We just brought them back up by logging on the node itself and manually restarting them first. Now we have automated crons that listen for their ports and restart them if they go down within two minutes.
>> >> >    - Are there ways to recover such lost data?
>> >> >    - Are versions 0.89 / 0.90 addressing any of these issues?
>> >> >    - Curiosity question: when a regionserver goes down, does the master try to replicate that node's data on another node to satisfy the dfs.replication ratio?
>> >> >
>> >> > For now, we have upgraded our HBase to 0.20.6, which is supposed to contain the HBASE-2077 <https://issues.apache.org/jira/browse/HBASE-2077> fix (but no one has verified it yet). Lars' blog also suggests that Hadoop 0.21.0 is the way to go to avoid the file append issues, but it's not production ready yet. Should we stick to 0.20.1? Upgrade to 0.20.2?
>> >> >
>> >> > Any tips here are definitely appreciated. I'll be happy to provide more information as well.
>> >> >
>> >> > -GS
>> >>
>> >> --
>> >> Todd Lipcon
>> >> Software Engineer, Cloudera
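
On the pagination question quoted above: if you do go stock instead of the transactional/indexed contrib, one workable approach is a plain Scan plus a PageFilter, remembering the last row key you handed out and starting the next page's scan just past it. That works across region servers because each page is simply a fresh scan from a start key. A rough sketch against the stock 0.20/0.89 client API follows - the table name, page size and row-key handling are made up, so adapt them to however your index rows are keyed:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.PageFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class StockPagination {

  // Returns the row key to use as the start of the next page, or null
  // when the scan is exhausted.
  public static byte[] fetchPage(HTable table, byte[] startRow, int pageSize)
      throws Exception {
    Scan scan = new Scan();
    if (startRow != null) {
      scan.setStartRow(startRow);
    }
    // PageFilter only caps rows per region, so count on the client too.
    scan.setFilter(new PageFilter(pageSize));

    byte[] lastRow = null;
    int rows = 0;
    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result result : scanner) {
        // ... hand the row back to the caller here ...
        lastRow = result.getRow();
        if (++rows >= pageSize) {
          break;
        }
      }
    } finally {
      scanner.close();
    }
    if (rows < pageSize) {
      return null;  // no more pages
    }
    // Append a zero byte so the next page starts just after the last row.
    return Bytes.add(lastRow, new byte[] {0});
  }

  public static void main(String[] args) throws Exception {
    // "events" is a placeholder table name.
    HTable table = new HTable(new HBaseConfiguration(), "events");
    byte[] next = fetchPage(table, null, 25);
    while (next != null) {
      next = fetchPage(table, next, 25);
    }
  }
}

Because PageFilter is evaluated per region, the client-side break is what actually enforces the page size; the zero-byte trick just makes the next start row exclusive.
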
