hi, sorry, I don't. I think the person currently maintaining the transactional/indexed contrib is working on bringing it up to 0.89; perhaps they would enjoy your help in testing or porting the code?

I'll poke a few people into replying.

-ryan
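PS: for the indexing question further down - one stock fallback for the indexed contrib, if you can live without its atomicity, is to maintain a secondary index table by hand: write each index row with a key of indexed-value + data-row-key, then paginate with a prefix scan. A minimal sketch against the plain 0.20-era client API; the table, family, and separator names here are made up for illustration:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class SecondaryIndexWrite {
        public static void main(String[] args) throws Exception {
            HBaseConfiguration conf = new HBaseConfiguration();
            HTable data  = new HTable(conf, "items");         // hypothetical data table
            HTable index = new HTable(conf, "items_by_user"); // hypothetical index table

            byte[] rowKey = Bytes.toBytes("item-42");
            byte[] userId = Bytes.toBytes("user-7");

            // 1. Write the data row.
            Put p = new Put(rowKey);
            p.add(Bytes.toBytes("d"), Bytes.toBytes("owner"), userId);
            data.put(p);

            // 2. Write the index row. Key = indexed value + separator + data
            //    row key, so a prefix Scan on "user-7|" walks that user's
            //    items in key order, regardless of which region servers
            //    hold them.
            byte[] indexKey = Bytes.add(Bytes.add(userId, Bytes.toBytes("|")), rowKey);
            Put idx = new Put(indexKey);
            idx.add(Bytes.toBytes("d"), Bytes.toBytes("ref"), rowKey);
            index.put(idx);

            // Caveat: unlike the transactional contrib, these two puts are
            // NOT atomic - a crash between them leaves the index stale.
        }
    }

For the next page, scan the index table with startRow set to the last index key seen plus a trailing zero byte.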
On Mon, Sep 20, 2010 at 5:19 PM, George P. Stathis <[email protected]> wrote:

> On Mon, Sep 20, 2010 at 4:55 PM, Ryan Rawson <[email protected]> wrote:
>
>> When you say replication, what exactly do you mean? In normal HDFS, as
>> you write, the data is sent to 3 nodes, yes, but with the flaw I
>> outlined it doesn't matter, because the datanodes and namenode will
>> pretend a data block just didn't exist if it wasn't closed properly.
>
> That's the part I was not understanding. I do now. Thanks.
>
>> So even with the most careful white-glove handling of HBase, you will
>> eventually have a crash and you will lose data without 0.89/CDH3 et al.
>> You can circumvent this by storing the data elsewhere and spooling it
>> into HBase, or perhaps by just not minding if you lose data (yes, those
>> applications exist).
>>
>> Looking at the JIRAs in question, the first is already on trunk, which
>> is 0.89. The second isn't, alas. At this point the transactional HBase
>> just isn't being actively maintained by any committer and we are
>> reliant on kind people's contributions. So I can't promise when it
>> will hit 0.89/0.90.
>
> Are you aware of any indexing alternatives in 0.89?
>
>> -ryan
>>
>> On Mon, Sep 20, 2010 at 1:21 PM, George P. Stathis <[email protected]> wrote:
>> > Thanks for the response, Ryan. I have no doubt that 0.89 can be used
>> > in production and that it has strong support. I just wanted to avoid
>> > moving to it now because we have limited resources, and it would put
>> > a dent in our roadmap if we were to fast-track the migration now.
>> > Specifically, we are using HBASE-2438 and HBASE-2426 to support
>> > pagination across indexes. So we either have to migrate those to
>> > 0.89 or somehow go stock and be able to support pagination across
>> > region servers.
>> >
>> > Of course, if the choice is between migrating or losing more data,
>> > data safety comes first. But if we can buy two or three more months
>> > of time and avoid region server crashes (like you did for a year),
>> > maybe we can go that route for now. What do we need to do to achieve
>> > that?
>> >
>> > -GS
>> >
>> > PS: Out of curiosity, I understand the WAL append issue for a single
>> > regionserver when it comes to losing the data on a single node. But
>> > if that data is also being replicated on another region server, why
>> > wouldn't it be available there? Or is the WAL shared across multiple
>> > region servers (maybe that's what I'm missing)?
>> >
>> > On Mon, Sep 20, 2010 at 3:52 PM, Ryan Rawson <[email protected]> wrote:
>> >
>> >> Hey,
>> >>
>> >> The problem is that stock 0.20 Hadoop won't let you read from a
>> >> non-closed file. It will report the length as 0. So if a
>> >> regionserver crashes, the last WAL that is still open becomes 0
>> >> length and the data within it is unreadable. That, specifically, is
>> >> the data loss problem. You could always make it so your
>> >> regionservers rarely crash - this is possible, btw, and I did it
>> >> for over a year.
>> >>
>> >> But you will want to run CDH3 or the append-branch releases to get
>> >> the series of patches that fix this hole. It also happens that only
>> >> 0.89 runs on it. I would like to avoid the Hadoop "everyone uses
>> >> 0.20 forever" problem and talk about what we could do to help you
>> >> get on 0.89. Over here at SU we've made a commitment to the future
>> >> of 0.89 and are running it in production. Let us know what else
>> >> you'd need.
>> >>
>> >> -ryan
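To make Ryan's point above concrete: on a pre-append 0.20 HDFS, the namenode never learns the length of a block that was not finalized, so a file that is still open when its writer dies reports length 0 to every reader - including the master trying to split the crashed regionserver's WAL. A minimal sketch, assuming a stock 0.20.x client on the classpath; the path is made up:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class UnclosedWalDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path wal = new Path("/demo/fake-wal"); // hypothetical path

            // Write some "edits" but never call close() - the same state a
            // regionserver's last WAL is in when the process crashes.
            FSDataOutputStream out = fs.create(wal);
            out.write("edit-1\nedit-2\n".getBytes("UTF-8"));
            out.sync(); // 0.20-era flush: bytes go out to the datanodes, but
                        // the block is never finalized at the namenode

            // The namenode's metadata still says 0 bytes, so any reader
            // (e.g., log splitting after the crash) sees an empty file.
            long len = fs.getFileStatus(wal).getLen();
            System.out.println("namenode-reported length: " + len); // 0 on stock 0.20.x
        }
    }

The CDH3 / append-branch patches Ryan mentions are what make those unfinalized bytes recoverable after a crash.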
>> >> On Mon, Sep 20, 2010 at 12:39 PM, George P. Stathis <[email protected]> wrote:
>> >> > Thanks Todd. We are not quite ready to move to 0.89 yet. We have
>> >> > made custom modifications to the transactional contrib sources,
>> >> > which are now taken out of 0.89. We are planning on moving to 0.90
>> >> > when it comes out and, at that point, either migrating our
>> >> > customizations or moving back to the out-of-the-box features
>> >> > (which will require a rewrite of our code).
>> >> >
>> >> > We are well aware of the CDH distros, but at the time we started
>> >> > with HBase, there was none that included HBase. I think CDH3 is
>> >> > the first one to include HBase, correct? And is 0.89 the only
>> >> > version supported?
>> >> >
>> >> > Moreover, are we saying that there is no way to prevent stock
>> >> > HBase 0.20.6 and Hadoop 0.20.2 from losing data when a single
>> >> > node goes down? It does not matter if the data is replicated; it
>> >> > will still get lost?
>> >> >
>> >> > -GS
>> >> >
>> >> > On Sun, Sep 19, 2010 at 5:58 PM, Todd Lipcon <[email protected]> wrote:
>> >> >
>> >> >> Hi George,
>> >> >>
>> >> >> The data loss problems you mentioned below are known issues when
>> >> >> running on stock Apache 0.20.x Hadoop.
>> >> >>
>> >> >> You should consider upgrading to CDH3b2, which includes a number
>> >> >> of HDFS patches that allow HBase to durably store data. You'll
>> >> >> also have to upgrade to HBase 0.89 - we ship a version as part
>> >> >> of CDH that will work well.
>> >> >>
>> >> >> Thanks
>> >> >> -Todd
>> >> >>
>> >> >> On Sun, Sep 19, 2010 at 6:57 AM, George P. Stathis <[email protected]> wrote:
>> >> >>
>> >> >> > Hi folks. I'd like to run the following data loss scenario by
>> >> >> > you to see if we are doing something obviously wrong with our
>> >> >> > setup here.
>> >> >> >
>> >> >> > Setup:
>> >> >> >
>> >> >> > - Hadoop 0.20.1
>> >> >> > - HBase 0.20.3
>> >> >> > - 1 master node running the NameNode, SecondaryNameNode,
>> >> >> >   JobTracker, HMaster, and 1 ZooKeeper (no ZooKeeper quorum
>> >> >> >   right now)
>> >> >> > - 4 child nodes, each running a DataNode, TaskTracker, and
>> >> >> >   RegionServer
>> >> >> > - dfs.replication is set to 2
>> >> >> > - Host: Amazon EC2
>> >> >> >
>> >> >> > Up until yesterday, we were frequently experiencing HBASE-2077
>> >> >> > <https://issues.apache.org/jira/browse/HBASE-2077>, which kept
>> >> >> > bringing our RegionServers down. What we realized, though, is
>> >> >> > that we were losing data (a few hours' worth) with just one
>> >> >> > out of four regionservers going down. This is problematic
>> >> >> > since we are supposed to replicate at x2 out of 4 nodes, so at
>> >> >> > least one other node should theoretically be able to serve the
>> >> >> > data that the downed regionserver can't.
>> >> >> >
>> >> >> > Questions:
>> >> >> >
>> >> >> > - When a regionserver goes down unexpectedly, the only data
>> >> >> >   that theoretically gets lost is whatever didn't make it to
>> >> >> >   the WAL, right? Or wrong? E.g.
>> >> >> >   http://www.larsgeorge.com/2010/01/hbase-architecture-101-write-ahead-log.html
>> >> >> > - We ran a hadoop fsck on our cluster and verified the
>> >> >> >   replication factor, as well as that there were no
>> >> >> >   under-replicated blocks. So why was our data not available
>> >> >> >   from another node?
>> >> >> > - If the log gets rolled every 60 minutes by default (we
>> >> >> >   haven't touched the defaults), how can we lose data from up
>> >> >> >   to 24 hours ago?
>> >> >> > - When the downed regionserver comes back up, shouldn't that
>> >> >> >   data be available again? Ours wasn't.
>> >> >> > - In such scenarios, is there a recommended approach for
>> >> >> >   restoring the regionserver that goes down? We just brought
>> >> >> >   them back up by logging on to the node itself and manually
>> >> >> >   restarting them first. Now we have automated crons that
>> >> >> >   listen for their ports and restart them within two minutes
>> >> >> >   if they go down (see the watchdog sketch at the end of this
>> >> >> >   thread).
>> >> >> > - Is there a way to recover such lost data?
>> >> >> > - Are versions 0.89 / 0.90 addressing any of these issues?
>> >> >> > - Curiosity question: when a regionserver goes down, does the
>> >> >> >   master try to replicate that node's data on another node to
>> >> >> >   satisfy the dfs.replication ratio?
>> >> >> >
>> >> >> > For now, we have upgraded our HBase to 0.20.6, which is
>> >> >> > supposed to contain the HBASE-2077
>> >> >> > <https://issues.apache.org/jira/browse/HBASE-2077> fix (but no
>> >> >> > one has verified that yet). Lars' blog also suggests that
>> >> >> > Hadoop 0.21.0 is the way to go to avoid the file append
>> >> >> > issues, but it's not production-ready yet. Should we stick to
>> >> >> > 0.20.1? Upgrade to 0.20.2?
>> >> >> >
>> >> >> > Any tips here are definitely appreciated. I'll be happy to
>> >> >> > provide more information as well.
>> >> >> >
>> >> >> > -GS
>> >> >>
>> >> >> --
>> >> >> Todd Lipcon
>> >> >> Software Engineer, Cloudera
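PS: re the restart crons mentioned above - a minimal sketch of such a port watchdog, for the curious. 60020 is the default regionserver port, but the hbase-daemon.sh path below is an assumption; point it at your own install:

    import java.io.IOException;
    import java.net.InetSocketAddress;
    import java.net.Socket;

    public class RegionServerWatchdog {
        public static void main(String[] args) throws Exception {
            final String host = "localhost";
            final int port = 60020; // default regionserver port
            while (true) {
                if (!isUp(host, port)) {
                    System.err.println("regionserver port closed; restarting...");
                    // Script location is an assumption - adjust for your install.
                    new ProcessBuilder("/usr/lib/hbase/bin/hbase-daemon.sh",
                            "start", "regionserver").start().waitFor();
                }
                Thread.sleep(120 * 1000L); // probe every two minutes, like the cron
            }
        }

        static boolean isUp(String host, int port) {
            Socket s = new Socket();
            try {
                s.connect(new InetSocketAddress(host, port), 5000);
                return true;
            } catch (IOException e) {
                return false;
            } finally {
                try { s.close(); } catch (IOException ignored) {}
            }
        }
    }

Note that the caveat from earlier in the thread still applies: restarting quickly limits the damage window, but on stock 0.20 the edits in the crashed server's open WAL are already unreadable.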
