Thanks Todd. We are not quite ready to move to 0.89 yet. We have made custom modifications to the transactional contrib sources, which have now been taken out of 0.89. We are planning on moving to 0.90 when it comes out, and at that point we will either migrate our customizations or move back to the out-of-the-box features (which will require a rewrite of our code).
We are well aware of the CDH distros, but at the time we started with HBase, there was none that included HBase. I think CDH3 was the first one to include HBase, correct? And is 0.89 the only version supported?

Moreover, are we saying that there is no way to prevent stock HBase 0.20.6 and Hadoop 0.20.2 from losing data when a single node goes down? It does not matter if the data is replicated, it will still get lost?

-GS

On Sun, Sep 19, 2010 at 5:58 PM, Todd Lipcon <[email protected]> wrote:

> Hi George,
>
> The data loss problems you mentioned below are known issues when running on
> stock Apache 0.20.x Hadoop.
>
> You should consider upgrading to CDH3b2, which includes a number of HDFS
> patches that allow HBase to durably store data. You'll also have to upgrade
> to HBase 0.89 - we ship a version as part of CDH that will work well.
>
> Thanks
> -Todd
>
> On Sun, Sep 19, 2010 at 6:57 AM, George P. Stathis <[email protected]>
> wrote:
>
> > Hi folks. I'd like to run the following data loss scenario by you to see
> > if we are doing something obviously wrong with our setup here.
> >
> > Setup:
> >
> > - Hadoop 0.20.1
> > - HBase 0.20.3
> > - 1 master node running Namenode, SecondaryNamenode, JobTracker,
> >   HMaster and 1 Zookeeper (no Zookeeper quorum right now)
> > - 4 child nodes running a Datanode, TaskTracker and RegionServer each
> > - dfs.replication is set to 2
> > - Host: Amazon EC2
> >
> > Up until yesterday, we were frequently experiencing
> > HBASE-2077 <https://issues.apache.org/jira/browse/HBASE-2077>,
> > which kept bringing our RegionServers down. What we realized, though, is
> > that we were losing data (a few hours' worth) with just one out of four
> > regionservers going down. This is problematic since we are supposed to
> > replicate at x2 across 4 nodes, so at least one other node should
> > theoretically be able to serve the data that the downed regionserver
> > can't.
> > Questions:
> >
> > - When a regionserver goes down unexpectedly, the only data that
> >   theoretically gets lost is whatever didn't make it to the WAL, right?
> >   Or wrong? E.g.
> >   http://www.larsgeorge.com/2010/01/hbase-architecture-101-write-ahead-log.html
> > - We ran a hadoop fsck on our cluster and verified the replication
> >   factor, as well as that there were no under-replicated blocks. So why
> >   was our data not available from another node?
> > - If the log gets rolled every 60 minutes by default (we haven't touched
> >   the defaults), how can we lose data from up to 24 hours ago?
> > - When the downed regionserver comes back up, shouldn't that data be
> >   available again? Ours wasn't.
> > - In such scenarios, is there a recommended approach for restoring the
> >   regionserver that goes down? We just brought them back up by logging
> >   on to the node itself and manually restarting them first. Now we have
> >   automated crons that listen for their ports and restart them within
> >   two minutes if they go down.
> > - Are there ways to recover such lost data?
> > - Are versions 0.89 / 0.90 addressing any of these issues?
> > - Curiosity question: when a regionserver goes down, does the master try
> >   to replicate that node's data on another node to satisfy the
> >   dfs.replication ratio?
> >
> > For now, we have upgraded our HBase to 0.20.6, which is supposed to
> > contain the HBASE-2077 <https://issues.apache.org/jira/browse/HBASE-2077>
> > fix (but no one has verified yet). Lars' blog also suggests that Hadoop
> > 0.21.0 is the way to go to avoid the file append issues, but it's not
> > production ready yet. Should we stick to 0.20.1? Upgrade to 0.20.2?
> >
> > Any tips here are definitely appreciated. I'll be happy to provide more
> > information as well.
> >
> > -GS
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
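For anyone else debugging a similar setup, here is a rough sketch of the CLI checks we've been using to narrow down where the edits went. This assumes the default HBase root directory on HDFS (`/hbase`) and the Hadoop 0.20-era tooling; adjust paths for your own config. It's a diagnostic sketch, not a fix:

```shell
# Check block health and replication across the HBase root directory;
# fsck will flag under-replicated, corrupt, or missing blocks.
hadoop fsck /hbase -files -blocks -locations

# List each regionserver's write-ahead logs. In 0.20.x, WALs live under
# /hbase/.logs/<regionserver>. A log that shows a size of 0 here often
# means its last block was never synced to the datanodes -- on stock
# Apache 0.20.x HDFS, sync/append is not durable, so those edits are
# lost if the regionserver dies before the log is rolled. Replication
# doesn't help because the unsynced data never reached any datanode.
hadoop fs -lsr /hbase/.logs
```

That last point seems to be the crux of Todd's answer: dfs.replication only protects blocks that were actually written out, which is why the CDH3b2 append/sync patches matter.
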
