Please see comments inline. :D

2010/12/14 Stack <[email protected]>
> Some comments inline in the below.
>
> On Mon, Dec 13, 2010 at 8:45 AM, baggio liu <[email protected]> wrote:
> > Hi Anze,
> > Our production cluster uses HBase 0.20.6 and HDFS (CDH3b2), and we have
> > been working on stability for about a month. We have met some issues that
> > may be helpful to you.
>
> Thanks for writing back to the list with your experiences.
>
> > HDFS:
> > 1. HBase files have a shorter life cycle than MapReduce files; at times
> > there are many blocks to delete, so the speed of HDFS block invalidation
> > should be tuned.
>
> This can be true. Yes. What are you suggesting here? What should we
> tune?

In fact, we found that the slow invalidation is due to the per-heartbeat
limit on invalidated blocks at the datanode. Many invalid blocks stay in the
namenode and cannot be dispatched to datanodes. We simply increased the
number of blocks a datanode fetches per heartbeat.

> > 2. The hadoop 0.20 branch cannot deal with disk failure; HDFS-630 will be
> > helpful.
>
> hdfs-630 has been applied to the branch-0.20-append branch (It's also
> in CDH IIRC).

Yes, HDFS-630 is necessary, but it is not enough. When a disk failure is
found, it excludes the whole datanode. We can instead simply kick the failed
disk out and send a block report to the namenode.

> > 3. The region server cannot deal with IOException correctly. When the
> > DFSClient meets a network error it throws an IOException, which may not
> > be fatal for the region server, so these IOExceptions must be reviewed.
>
> Usually if RegionServer has issues getting to HDFS, it'll shut itself
> down. This is 'normal' perhaps overly-defensive behavior. The story
> should be better in 0.90 but would be interested in any list you might
> have where you think we should be able to catch and continue.

Yes, it is absolutely overly-defensive behavior, and if the region server
fails to perform an HDFS operation, fail-fast may be a good recovery
mechanism. But some IOExceptions are not fatal; in our branch we added a
retry mechanism to common FileSystem operations such as exists(). (A rough
sketch of what I mean is at the end of this HDFS section.)

> > 4. In a large-scale scan, there are many concurrent readers in a short
> > time.
>
> Just FYI, HBase opens all files and keeps them open on startup.
> There'll be pressure on file handles, threads in data nodes, as soon
> as you start up an HBase instance. Scans use the already opened files
> so whether 1 or N ongoing Scans, the pressure on HDFS is the same.

Sure, that was my mistake. My intention was this: whenever the system starts
or scans, the region server (as a DFSClient) creates too many connections to
datanodes. The number of connections grows with the number of store files,
and once the store file count reaches a large value the number of
connections gets out of control. In most scenarios scans have locality; in
our cluster more than 95% of connections are not alive (the connection is
established, but no data is being read). In our branch we added a timeout to
close idle connections. In the long term, connections between the DFSClient
and datanodes could be reused (maybe this kind of reuse could be fulfilled
by an RPC framework).

> > We must set the datanode dataxceiver count to a large number, and the
> > file handle limit should be tuned. In addition, connection reuse between
> > the DFSClient and datanodes should be done.
>
> Yes. This is in our requirements for HBase. Here is the latest from
> the 0.90.0RC HBase 'book':
> http://people.apache.org/~stack/hbase-0.90.0-candidate-1/docs/notsoquick.html#ulimit
>
> What do you mean by connection reuse?
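To make the retry point from item 3 concrete, here is a rough sketch of the
kind of wrapper we put around common FileSystem calls (the class and method
names are illustrative, not our actual patch):

import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Illustrative only: retry a non-fatal FileSystem call a few times before
// giving up, instead of letting a transient IOException bubble up and shut
// the region server down.
public class RetryingFs {
  private static final int MAX_RETRIES = 3;
  private static final long RETRY_SLEEP_MS = 1000;

  public static boolean exists(FileSystem fs, Path path) throws IOException {
    IOException last = null;
    for (int attempt = 0; attempt < MAX_RETRIES; attempt++) {
      try {
        return fs.exists(path);
      } catch (IOException ioe) {
        last = ioe;                      // remember the failure and retry
        try {
          Thread.sleep(RETRY_SLEEP_MS);
        } catch (InterruptedException ie) {
          Thread.currentThread().interrupt();
          break;                         // give up early if interrupted
        }
      }
    }
    throw last;                          // still failing after all retries
  }
}

Truly fatal errors still surface after the retries; only transient network
hiccups are absorbed.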
> > HBase:
> > 1. Single-threaded compaction limits the speed of compaction; it should
> > be made multi-threaded. (During multi-threaded compaction we should limit
> > the network bandwidth used by compaction.)
>
> True but also in 0.90 compaction algorithm is smarter; there is less to do.
>
> > 2. Single-threaded HLog splitting (reading the HLog) makes HBase downtime
> > longer; making it multi-threaded can limit HBase downtime.
>
> True in 0.20 but in 0.90, splits are much faster; splits come up
> immediately on the regionserver that hosted the parent that split
> rather than go back to the master for the master to assign out the new
> daughter regions.
>
> > 3. Additionally, some tools should be built, such as a meta region
> > checker, fixer and so on.
>
> Yes. In 0.90, we have hbck tool to run checks and report on
> inconsistencies.

Many of the fixes we made for stability on top of 0.20.6 have an equivalent
in 0.90. 0.90 is a great milestone; we look forward to it.

> > 4. The ZooKeeper session timeout should be tuned according to the load on
> > your HBase cluster.
>
> Yes. ZooKeeper ping is the regionservers lifeline to the cluster. If
> it goes amiss, then regionserver is considered lost and master will
> take restorative action.
>
> > 5. The GC strategy should be tuned on your region servers/HMaster.
>
> Yes. Any suggestions from your experience?

-XX:GCTimeRatio=10 -XX:+UseConcMarkSweepGC -XX:+UseParNewGC
-XX:+UseCMSCompactAtFullCollection -XX:CMSFullGCsBeforeCompaction=0
-XX:+CMSClassUnloadingEnabled -XX:-CMSParallelRemarkEnabled
-XX:CMSInitiatingOccupancyFraction=70 -XX:SoftRefLRUPolicyMSPerMB=0
-XX:MaxTenuringThreshold=7

We have made some attempts at GC tuning. To keep application pauses short,
we use the parallel collector in the young generation and CMS in the old
generation. The CMSInitiatingOccupancyFraction threshold is the same as in
our Hadoop cluster config; we have no idea why it is 70 and not 71...
May I ask what GC strategy you use in your cluster?

> > Besides the above, in a production cluster the data loss issue should be
> > fixed as well. (Currently the hadoop 0.20-append branch and CDH3b2 hadoop
> > can be used.)
>
> Yes. Here is the 0.90 doc. on hadoop versions:
> http://people.apache.org/~stack/hbase-0.90.0-candidate-1/docs/notsoquick.html#hadoop
>
> > Because HDFS makes many optimizations for throughput, many tunings and
> > changes to HDFS should be done for an application like HBase (many random
> > reads/writes).
>
> Do you have suggestions? A list?

1. Currently the datanode will send more data than the DFSClient requested
(mostly a whole block). That helps throughput, but it can hurt latency. I
imagine we could add an additional RPC read/write interface between the
DFSClient and the datanode to reduce the overhead of HDFS reads and writes.
(A small illustration of the read pattern this is meant to serve follows
after item 2.)
2. On the datanode side, the meta file and the block file are opened and
closed again for every block operation. To reduce latency we could reuse
these file handles. We could even redesign the store mechanism in the
datanode.
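For context, here is a small client-side sketch of the two read patterns
(the file path and sizes are made up); it is the small positioned reads,
typical of HBase gets, that suffer most from per-request overhead:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Illustrative only: contrast a streaming read (throughput-friendly) with a
// positioned read at a random offset (latency-sensitive, typical of a get).
public class ReadPatterns {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path(args[0]);       // any existing HDFS file
    byte[] buf = new byte[64 * 1024];    // one HBase-block-sized read

    FSDataInputStream in = fs.open(path);
    try {
      // Streaming read from the start: scans and MapReduce look like this.
      in.seek(0);
      in.read(buf, 0, buf.length);

      // Positioned read (pread) at a random offset: random gets look like
      // this, and per-call connection/readahead overhead dominates latency.
      in.read(1L * 1024 * 1024, buf, 0, buf.length);
    } finally {
      in.close();
    }
  }
}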
> Thanks for writing the list Baggio,
> St.Ack
>
> > Hope this experience can be helpful to you.
> >
> > Thanks & Best regards
> > Baggio
> >
> > 2010/12/14 Todd Lipcon <[email protected]>
> >
> >> Hi Anze,
> >>
> >> In a word, yes - 0.20.4 is not that stable in my experience, and
> >> upgrading to the latest CDH3 beta (which includes HBase 0.89.20100924)
> >> should give you a huge improvement in stability.
> >>
> >> You'll still need to do a bit of tuning of settings, but once it's
> >> well tuned it should be able to hold up under load without crashing.
> >>
> >> -Todd
> >>
> >> On Mon, Dec 13, 2010 at 2:41 AM, Anze <[email protected]> wrote:
> >> > Hi all!
> >> >
> >> > We have been using HBase 0.20.4 (cdh3b1) in production on 2 nodes for
> >> > a few months now and we are having constant issues with it. We fell
> >> > over all the standard traps (like "Too many open files", network
> >> > configuration problems, ...). All in all, we had about one crash every
> >> > week or so. Fortunately we are still using it just for background
> >> > processing so our service didn't suffer directly, but we have lost
> >> > huge amounts of time just fixing the data errors that resulted from
> >> > data not being written to permanent storage. Not to mention fixing
> >> > the issues.
> >> > As you can probably understand, we are very frustrated with this and
> >> > are seriously considering moving to another bigtable.
> >> >
> >> > Right now, HBase crashes whenever we run a very intensive rebuild of a
> >> > secondary index (a normal table, but we use it as a secondary index)
> >> > on a huge table. I have found this:
> >> > http://wiki.apache.org/hadoop/Hbase/Troubleshooting
> >> > (see problem 9)
> >> > One of the lines reads:
> >> > "Make sure you give plenty of RAM (in hbase-env.sh), the default of
> >> > 1GB won't be able to sustain long running imports."
> >> >
> >> > So, if I understand correctly, no matter how HBase is set up, if I run
> >> > an intensive enough application, it will choke? I would expect it to
> >> > be slower when under (too much) pressure, but not to crash.
> >> >
> >> > Of course, we will somehow solve this issue (working on it), but... :(
> >> >
> >> > What are your experiences with HBase? Is it stable? Is it just us and
> >> > the way we set it up?
> >> >
> >> > Also, would upgrading to 0.89 (cdh3b3) help?
> >> >
> >> > Thanks,
> >> >
> >> > Anze
> >>
> >> --
> >> Todd Lipcon
> >> Software Engineer, Cloudera
