On Tue, Dec 14, 2010 at 5:56 AM, 陈加俊 <[email protected]> wrote:
>  Where should I download branch-0.20-append?  I can't get a compiled
> jar from the following url:
> http://svn.apache.org/viewvc/hadoop/common/branches/branch-0.20-append
>

The link points to the svn repository.  The cited doc. says you need
to build it yourself, at least for now (pardon me, the doc. should
include pointers on how to build and be more explicit about what this
means -- let me try to fix that).  See 'Build Requirements' on this
page for build instructions: http://wiki.apache.org/hadoop/HowToRelease.
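
Roughly, presuming you have a JDK and ant installed, the build is
something like the below (the repos/asf URL is just the checkout form
of the viewvc link above; exact target names are illustrative):

  # checkout of branch-0.20-append from the svn repository
  svn checkout \
    http://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20-append \
    hadoop-0.20-append
  cd hadoop-0.20-append
  # 'ant jar' should leave the built hadoop core jar under build/
  ant jar

Then copy the resulting jar into your hbase lib/ directory, replacing
the hadoop jar that ships with hbase.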

St.Ack


>
>
> On Tue, Dec 14, 2010 at 1:44 AM, Stack <[email protected]> wrote:
>
>> Some comments inline in the below.
>>
>> On Mon, Dec 13, 2010 at 8:45 AM, baggio liu <[email protected]> wrote:
>> > Hi  Anze,
>> >   Our production cluster uses HBase 0.20.6 and HDFS (CDH3b2), and we have
>> > been working on stability for about a month. Some issues we have met may
>> > be helpful to you.
>> >
>>
>> Thanks for writing back to the list with your experiences.
>>
>> > HDFS:
>> >    1.  HBase files have a shorter life cycle than map-reduce files;
>> > sometimes there are many blocks that should be deleted, so the speed of
>> > HDFS block invalidation should be tuned.
>>
>>
>> This can be true.  Yes.  What are you suggesting here?  What should we
>> tune?
>>
>>
>> >    2. The hadoop 0.20 branch cannot deal with disk failure; HDFS-630 will
>> > be helpful.
>>
>>
>> hdfs-630 has been applied to the branch-0.20-append branch (it's also
>> in CDH, IIRC).
>>
>>
>> >    3. The region server does not handle IOExceptions correctly. When the
>> > DFSClient meets a network error it throws an IOException, which may not be
>> > fatal for the region server, so these IOExceptions MUST be reviewed.
>>
>>
>> Usually if the RegionServer has issues getting to HDFS, it'll shut itself
>> down.  This is 'normal', perhaps overly-defensive, behavior.  The story
>> should be better in 0.90, but we would be interested in any list you might
>> have of places where you think we should be able to catch and continue.
>>
>>
>> >    4. In a large-scale scan, there are many concurrent readers in a short
>> > time.
>>
>>
>> Just FYI, HBase opens all files and keeps them open on startup.
>> There'll be pressure on file handles and threads in the datanodes as
>> soon as you start up an HBase instance.  Scans use the already-opened
>> files, so whether there is 1 or N ongoing Scans, the pressure on HDFS
>> is the same.
>>
>> > We must set the datanode dataxceiver count to a large number, and the
>> > file handle limit should be tuned. In addition, connection reuse between
>> > the DFSClient and the datanode should be implemented.
>> >
>>
>> Yes.  This is in our requirements for HBase.  Here is the latest from
>> the 0.90.0RC HBase 'book':
>>
>> http://people.apache.org/~stack/hbase-0.90.0-candidate-1/docs/notsoquick.html#ulimit
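>>
>> For example, something like the following in hdfs-site.xml on the
>> datanodes (values below are only illustrative starting points, not
>> prescriptions):
>>
>>   <!-- illustrative value; size to your cluster -->
>>   <property>
>>     <name>dfs.datanode.max.xcievers</name>
>>     <value>4096</value>
>>   </property>
>>
>> ...plus a raised open-file limit for the user running hadoop/hbase,
>> e.g. in /etc/security/limits.conf:
>>
>>   # illustrative; user name and value depend on your install
>>   hadoop  -  nofile  32768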
>>
>> What do you mean by connection reuse?
>>
>>
>> > HBase
>> >    1. Single-threaded compaction limits the speed of compaction; it should
>> > be made multi-threaded (and during multi-threaded compaction we should
>> > limit the network bandwidth used by compaction).
>>
>> True, but also in 0.90 the compaction algorithm is smarter; there is less to do.
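>>
>> Meantime, if compactions are falling behind, the store file thresholds
>> in hbase-site.xml are the knobs I'd look at first; something like the
>> below (illustrative values only, the defaults may suit you fine):
>>
>>   <!-- illustrative values -->
>>   <property>
>>     <name>hbase.hstore.compactionThreshold</name>
>>     <value>3</value>
>>   </property>
>>   <property>
>>     <name>hbase.hstore.blockingStoreFiles</name>
>>     <value>15</value>
>>   </property>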
>>
>>
>> >    2. Single-threaded HLog splitting (reading the HLog) will make HBase
>> > down time longer; making it multi-threaded can limit HBase down time.
>>
>>
>> True in 0.20, but in 0.90 splits are much faster; the daughter regions come
>> up immediately on the regionserver that hosted the parent that split,
>> rather than going back to the master for it to assign out the new
>> daughter regions.
>>
>> >    3.  Additionally, some tools should be built, such as a meta region
>> > checker, a fixer, and so on.
>>
>>
>> Yes.  In 0.90, we have the hbck tool to run checks and report on
>> inconsistencies.
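>>
>> To run it, assuming HBASE_HOME is where you unpacked the 0.90 release:
>>
>>   # path assumes the 0.90 tarball layout
>>   $ ${HBASE_HOME}/bin/hbase hbck
>>
>> It walks .META. and the region assignments and reports any
>> inconsistencies it finds.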
>>
>>
>> >    4.  The zookeeper session timeout should be tuned according to the load
>> > on your HBase cluster.
>>
>> Yes.  The ZooKeeper ping is the regionserver's lifeline to the cluster.  If
>> it goes amiss, then the regionserver is considered lost and the master will
>> take restorative action.
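>>
>> The knob (in 0.90 at least) is zookeeper.session.timeout in
>> hbase-site.xml, in milliseconds; for example, to give a regionserver a
>> full minute before it is given up on (illustrative value only):
>>
>>   <!-- illustrative value -->
>>   <property>
>>     <name>zookeeper.session.timeout</name>
>>     <value>60000</value>
>>   </property>
>>
>> A longer timeout rides out GC pauses better but means slower detection
>> of a regionserver that really is gone.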
>>
>>
>> >    5.  The GC strategy should be tuned on your region servers/HMaster.
>> >
>>
>>
>> Yes.  Any suggestions from your experience?
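>>
>> A common starting point (only a sketch; flags and sizes depend on your
>> heap and load) is CMS with a GC log in hbase-env.sh, e.g.:
>>
>>   # illustrative flags; tune for your heap and workload
>>   export HBASE_OPTS="-XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70"
>>   export HBASE_OPTS="$HBASE_OPTS -verbose:gc -Xloggc:/tmp/hbase-gc.log"
>>
>> The GC log is handy for correlating pauses with zookeeper session
>> expirations.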
>>
>>
>> >    Besides the above, in a production cluster the data loss issue should
>> > be fixed as well (currently the hadoop 0.20-append branch and CDH3b2
>> > hadoop can be used).
>>
>>
>> Yes.  Here is the 0.90 doc. on hadoop versions:
>>
>> http://people.apache.org/~stack/hbase-0.90.0-candidate-1/docs/notsoquick.html#hadoop
>>
>>
>> >    Because HDFS makes many optimizations for throughput, for an
>> > application like HBase (many random reads/writes), much tuning and many
>> > changes to HDFS should be done.
>>
>> Do you have suggestions?  A list?
>>
>> Thanks for writing the list Baggio,
>> St.Ack
>>
>>
>> >    Hope this experience can be helpful to you.
>> >
>> >
>> > Thanks & Best regard
>> > Baggio
>> >
>> >
>> > 2010/12/14 Todd Lipcon <[email protected]>
>> >
>> >> HI Anze,
>> >>
>> >> In a word, yes - 0.20.4 is not that stable in my experience, and
>> >> upgrading to the latest CDH3 beta (which includes HBase 0.89.20100924)
>> >> should give you a huge improvement in stability.
>> >>
>> >> You'll still need to do a bit of tuning of settings, but once it's
>> >> well tuned it should be able to hold up under load without crashing.
>> >>
>> >> -Todd
>> >>
>> >> On Mon, Dec 13, 2010 at 2:41 AM, Anze <[email protected]> wrote:
>> >> > Hi all!
>> >> >
>> >> > We have been using HBase 0.20.4 (cdh3b1) in production on 2 nodes for a
>> >> > few months now and we are having constant issues with it. We fell into
>> >> > all the standard traps (like "Too many open files", network
>> >> > configuration problems, ...). All in all, we had about one crash every
>> >> > week or so.
>> >> > Fortunately we are still using it just for background processing, so our
>> >> > service didn't suffer directly, but we have lost huge amounts of time
>> >> > just fixing the data errors that resulted from data not being written to
>> >> > permanent storage. Not to mention fixing the issues themselves.
>> >> > As you can probably understand, we are very frustrated with this and are
>> >> > seriously considering moving to another bigtable.
>> >> >
>> >> > Right now, HBase crashes whenever we run a very intensive rebuild of a
>> >> > secondary index (a normal table, but we use it as a secondary index)
>> >> > against a huge table. I have found this:
>> >> > http://wiki.apache.org/hadoop/Hbase/Troubleshooting
>> >> > (see problem 9)
>> >> > One of the lines reads:
>> >> > "Make sure you give plenty of RAM (in hbase-env.sh), the default of 1GB
>> >> > won't be able to sustain long running imports."
>> >> >
>> >> > So, if I understand correctly, no matter how HBase is set up, if I run
>> >> > an intensive enough application, it will choke? I would expect it to be
>> >> > slower when under (too much) pressure, but not to crash.
>> >> >
>> >> > Of course, we will somehow solve this issue (working on it), but... :(
>> >> >
>> >> > What are your experiences with HBase? Is it stable? Is it just us and
>> >> > the way we set it up?
>> >> >
>> >> > Also, would upgrading to 0.89 (cdh3b3) help?
>> >> >
>> >> > Thanks,
>> >> >
>> >> > Anze
>> >> >
>> >> >
>> >>
>> >>
>> >>
>> >> --
>> >> Todd Lipcon
>> >> Software Engineer, Cloudera
>> >>
>> >
>>
>
>
>
> --
> best wishes
> jiajun
>
