On Tue, Dec 14, 2010 at 6:47 AM, baggio liu <[email protected]> wrote:
>> This can be true. Yes. What are you suggesting here? What should we
>> tune?
>>
> In fact, we found the low invalidation speed is because of the datanode
> invalidation limit per heartbeat. Many invalid blocks stay in the
> namenode and cannot be dispatched to datanodes. We simply increased the
> number of blocks a datanode fetches per heartbeat.
>

Interesting. So you changed this hardcoding?

  public static final int BLOCK_INVALIDATE_CHUNK = 100;
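If that is the change, it might be worth carrying it as a configuration
knob rather than a patched constant, so it can be tuned per cluster. A
minimal sketch of what I'd picture (the property name below is made up
for illustration, not an existing key):

  import org.apache.hadoop.conf.Configuration;

  public class InvalidateLimit {
    // Default mirrors the old hardcoded BLOCK_INVALIDATE_CHUNK.
    static final int DEFAULT_BLOCK_INVALIDATE_CHUNK = 100;

    // How many blocks a datanode is asked to delete per heartbeat.
    static int blockInvalidateLimit(Configuration conf) {
      return conf.getInt("dfs.block.invalidate.limit.per.heartbeat",
          DEFAULT_BLOCK_INVALIDATE_CHUNK);
    }
  }

The namenode would then read the limit from its conf at startup instead
of relying on the hardcoded value.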
>> hdfs-630 has been applied to the branch-0.20-append branch (It's also
>> in CDH IIRC).
>>
> Yes, HDFS-630 is necessary, but it's not enough. When a disk failure is
> found, it excludes the whole datanode. We can simply kick the failed
> disk out and send a block report to the namenode.
>

Is this a code change you made, Baggio?

>> Usually if RegionServer has issues getting to HDFS, it'll shut itself
>> down. This is 'normal', perhaps overly-defensive, behavior. The story
>> should be better in 0.90, but I would be interested in any list you
>> might have where you think we should be able to catch and continue.
>>
> Yes, it's absolutely overly-defensive behavior, and if the region
> server fails an HDFS operation, failing fast may be a good recovery
> mechanism. But some IOExceptions are not fatal; in our branch, we added
> a retry mechanism to common fs operations, such as exists().
>

Excellent. Any chance of your contributing back your internal branch
fixes? They'd be welcome.
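Even ahead of a patch, it would help to compare notes. For exists(), the
kind of wrapper I'd picture is roughly the below. This is a sketch only,
not your code; the attempt count and backoff are made up for
illustration:

  import java.io.IOException;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class RetryingFs {
    // Retry a (presumed idempotent) exists() call a few times before
    // giving up, instead of letting one transient IOException take the
    // regionserver down with it.
    public static boolean existsWithRetries(FileSystem fs, Path p)
        throws IOException {
      final int maxAttempts = 3;          // illustrative only
      IOException last = null;
      for (int attempt = 1; attempt <= maxAttempts; attempt++) {
        try {
          return fs.exists(p);
        } catch (IOException ioe) {
          last = ioe;                     // remember the failure
          if (attempt < maxAttempts) {
            try {
              Thread.sleep(1000L * attempt);  // simple linear backoff
            } catch (InterruptedException ie) {
              Thread.currentThread().interrupt();
              throw new IOException("interrupted retrying exists()");
            }
          }
        }
      }
      throw last;                         // retries exhausted
    }
  }

Whether the non-idempotent operations (create, delete, rename) are safe
to retry is a separate question, of course.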
> My point is that whenever the system starts or scans, the region
> server (as a DFSClient) creates too many connections to datanodes. The
> number of connections grows with the number of store files, and when
> the store file count reaches a large value, the number of connections
> gets out of control.

Yes.

> In most scenarios scans have locality; in our cluster, more than 95%
> of connections are not active (the connection is established, but no
> data is being read). In our branch, we added a timeout to close idle
> connections. In the long term, we could reuse connections between the
> DFSClient and the datanode (maybe this kind of reuse could be
> fulfilled by an RPC framework).
>

The above sounds great. So, the connection is reestablished
automatically by DFSClient when a read comes in (I suppose HADOOP-3831
does this for you)? Is the timeout in DFSClient or in HBase?

>> Yes. Any suggestions from your experience?
>>
> -XX:GCTimeRatio=10 -XX:+UseConcMarkSweepGC -XX:+UseParNewGC
> -XX:+UseCMSCompactAtFullCollection -XX:CMSFullGCsBeforeCompaction=0
> -XX:+CMSClassUnloadingEnabled -XX:-CMSParallelRemarkEnabled
> -XX:CMSInitiatingOccupancyFraction=70 -XX:SoftRefLRUPolicyMSPerMB=0
> -XX:MaxTenuringThreshold=7
>
> We made some attempts at GC tuning. To keep application stops short,
> we use the parallel collector in the young generation and CMS in the
> old generation. The threshold CMSInitiatingOccupancyFraction is the
> same as our Hadoop cluster config; we have no idea why it's 70 and not
> 71... May I ask what GC strategy you use in your cluster?
>

I just took a look at one of our production servers. Here is our
config:

export SERVER_GC_OPTS="-XX:+DoEscapeAnalysis -XX:+AggressiveOpts
-XX:+UseConcMarkSweepGC -XX:NewSize=64m -XX:MaxNewSize=64m
-XX:CMSInitiatingOccupancyFraction=88 -verbose:gc -XX:+PrintGCDetails
-XX:+PrintGCTimeStamps"

This is what we are running:

java version "1.6.0_14-ea"
Java(TM) SE Runtime Environment (build 1.6.0_14-ea-b04)
Java HotSpot(TM) 64-Bit Server VM (build 14.0-b13, mixed mode)

(I say what we are running because I believe DoEscapeAnalysis is
disabled in later versions of the JVM... I think it's the same for
AggressiveOpts.)

I think NewSize should probably be changed -- the argument for such a
small NewSize was that without it, the young-generation pause times
grew to become substantial. Regarding the CMSInitiatingOccupancyFraction
of 88%, I wonder how much of an effect it is having? That said, the
above seems to be working for us.

Regarding your settings, you set:

  -XX:+UseCMSCompactAtFullCollection -XX:CMSFullGCsBeforeCompaction=0

I haven't looked at the source, but going by this message,
http://mail.openjdk.java.net/pipermail/hotspot-gc-use/2008-October/000226.html,
the above just seems to be setting defaults. Is that your
understanding? Do you monitor your GC activity?

> 1. Currently, the datanode will send more data than the DFSClient
> requests (mostly a whole block). That helps throughput, but it may
> hurt latency. I imagine we could add an additional RPC read/write
> interface between the DFSClient and the datanode to reduce overhead in
> HDFS reads/writes.

When you say block above, you mean hfile block? That's what HBase is
requesting, though? Pardon me if I'm not understanding what you are
suggesting.

> 2. On the datanode side, the meta file and block file are opened and
> closed repeatedly for every block operation. To reduce latency, we
> could reuse these file handles. We could even redesign the storage
> mechanism in the datanode.
>

Yes. Hopefully something can be done about this pretty soon.
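Off the top of my head, a first cut at reusing handles could be a small
bounded LRU cache that closes the least-recently-used descriptor on
eviction. This is just a sketch (no synchronization, no invalidation
when a block is deleted or moved), not a proposal for how the datanode
should actually do it:

  import java.io.File;
  import java.io.IOException;
  import java.io.RandomAccessFile;
  import java.util.LinkedHashMap;
  import java.util.Map;

  // Toy LRU cache of open block/meta file handles, keyed by path.
  public class BlockFileHandleCache
      extends LinkedHashMap<String, RandomAccessFile> {
    private final int capacity;

    public BlockFileHandleCache(int capacity) {
      super(16, 0.75f, true /* access order */);
      this.capacity = capacity;
    }

    // Return a cached read-only handle, opening one if needed.
    public RandomAccessFile getHandle(File f) throws IOException {
      RandomAccessFile raf = get(f.getPath());
      if (raf == null) {
        raf = new RandomAccessFile(f, "r");
        put(f.getPath(), raf);
      }
      return raf;
    }

    @Override
    protected boolean removeEldestEntry(
        Map.Entry<String, RandomAccessFile> eldest) {
      if (size() > capacity) {
        try {
          eldest.getValue().close();  // close the evicted handle
        } catch (IOException ignored) {
          // best effort on close
        }
        return true;
      }
      return false;
    }
  }

The hard parts would be invalidation and concurrent readers, but it
gives an idea of the shape of the change.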
Thanks for the above,
St.Ack