Hi Baggio,

Sounds like you have some good experience with HDFS. Some comments inline below:
On Tue, Dec 14, 2010 at 6:47 AM, baggio liu <[email protected]> wrote:

> In fact, we found the low invalidation speed is caused by the datanode's
> invalidation limit per heartbeat. Many invalid blocks stay in the namenode
> and cannot be dispatched to datanodes. We simply increased the number of
> blocks a datanode fetches per heartbeat.

See HDFS-611 - this should really help block invalidation rate. It's
included in CDH3.

> > > 2. The hadoop 0.20 branch cannot deal with disk failure; HDFS-630
> > > will be helpful.
> >
> > HDFS-630 has been applied to the branch-0.20-append branch (it's also
> > in CDH IIRC).
>
> Yes, HDFS-630 is necessary, but it's not enough. When a disk failure is
> found, it excludes the whole datanode. We could simply kick the failed
> disk out and send a block report to the namenode.

See HDFS-457 (in CDH3, configurable by HDFS-1161).

> > > 3. The region server cannot deal with IOException correctly. When the
> > > DFSClient hits a network error, it throws an IOException, which may
> > > not be fatal for the region server, so these IOExceptions MUST be
> > > reviewed.
> >
> > Usually if the RegionServer has issues getting to HDFS, it'll shut
> > itself down. This is 'normal', perhaps overly-defensive, behavior. The
> > story should be better in 0.90, but I would be interested in any list
> > you might have where you think we should be able to catch and continue.
>
> Yes, absolutely it's overly-defensive behavior, and if the region server
> fails an HDFS operation, fail-fast may be a good recovery mechanism. But
> some IOExceptions are not fatal; in our branch we added a retry mechanism
> to common fs operations, such as exists().

In my experience this hasn't been a problem - most operations that fail
would not have succeeded with a retry. But a patch would be interesting.
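To make the idea concrete, here is a minimal sketch of the kind of wrapper
I would imagine for an idempotent call like exists(). The class and method
names are made up for illustration, and the fixed-sleep retry policy is just
a placeholder - this is not code from any existing branch:

// Rough sketch only: a retry wrapper around an idempotent FileSystem call.
// RetryingFsOps/existsWithRetry are invented names, not an existing API.
import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RetryingFsOps {

  // Retry fs.exists() up to maxAttempts times, sleeping sleepMs between tries.
  public static boolean existsWithRetry(FileSystem fs, Path path,
      int maxAttempts, long sleepMs) throws IOException {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be >= 1");
    }
    IOException lastFailure = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return fs.exists(path);
      } catch (IOException ioe) {
        lastFailure = ioe;
        if (attempt < maxAttempts) {
          try {
            Thread.sleep(sleepMs);  // simple fixed backoff between attempts
          } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            throw new IOException("Interrupted while retrying exists()", ie);
          }
        }
      }
    }
    throw lastFailure;  // every attempt failed; surface the last error
  }
}

The important part is deciding which calls are safe to retry - exists() is
idempotent, but anything that mutates namespace state would need more care.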
> > > 4. In large scale scans, there are many concurrent readers in a short
> > > time.
> >
> > Just FYI, HBase opens all files and keeps them open on startup. There'll
> > be pressure on file handles and threads in datanodes as soon as you
> > start up an HBase instance. Scans use the already opened files, so
> > whether 1 or N ongoing Scans, the pressure on HDFS is the same.
>
> Sure, it's my mistake. My intention is that whenever the system starts or
> scans, the region server (as a DFSClient) will create too many connections
> to datanodes. And the number of connections grows with the number of store
> files; when the store file count reaches a large value, the number of
> connections gets out of control.
> In most scenarios scans have locality; in our cluster, more than 95% of
> connections are not alive (the connection is established, but no data is
> being read). In our branch, we added a timeout to close idle connections.
> And in the long term, we could re-use connections between the DFSClient
> and the datanode (maybe this kind of re-use can be fulfilled by an RPC
> framework).

I recall you opened a JIRA for this, but I can't locate it. Can you please
send the link?

> > What do you mean by connection reuse?

HDFS-941 will help here - hopefully we can do that this year.

> -XX:GCTimeRatio=10 -XX:+UseConcMarkSweepGC -XX:+UseParNewGC
> -XX:+UseCMSCompactAtFullCollection -XX:CMSFullGCsBeforeCompaction=0
> -XX:+CMSClassUnloadingEnabled -XX:-CMSParallelRemarkEnabled
> -XX:CMSInitiatingOccupancyFraction=70 -XX:SoftRefLRUPolicyMSPerMB=0
> -XX:MaxTenuringThreshold=7
>
> We have made some tries at GC tuning, focused on reducing application
> stops. We use parallel GC in the young gen and CMS GC in the old gen; the
> threshold CMSInitiatingOccupancyFraction is the same as our hadoop cluster
> config, and we have no idea why it's 70, not 71 ...
> May I ask what GC strategy you use in your clusters?

Very similar to that. I don't usually tune MaxTenuringThreshold, GCTimeRatio
or soft reference LRU. Class unloading isn't particularly necessary in
HBase. The CMS settings look about right - I generally recommend between
70-80%.

> 1. Currently, the datanode will send more data than the DFSClient requests
> (mostly a whole block). That helps throughput, but it may hurt latency. I
> imagine we could add an additional RPC read/write interface between the
> DFSClient and the datanode to reduce overhead in HDFS reads and writes.

I disagree - the DN does not send more data than requested, from the HDFS
perspective. Perhaps in some cases the HFile reader is requesting full
blocks unnecessarily - I'd like to see the logs that show this.

> 2. On the datanode side, the meta file and block file are opened and
> closed repeatedly for every block operation. To reduce latency, we could
> re-use these file handles. We could even re-design the store mechanism in
> the datanode.

Jay Booth did some experiments here and saw some improvements of 10-20% -
see HDFS-918. Combined with HDFS-941 it might be improved a bit more. We
haven't focused much on incremental performance improvements this past
year - this will probably become higher priority in 2011 as more people
move to larger production clusters.

-Todd

-- 
Todd Lipcon
Software Engineer, Cloudera
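P.S. To put the GC comments above in concrete terms, a trimmed-down flag set
along those lines would look something like the following - treat it only as
an illustration, since the right occupancy fraction and heap sizing depend on
your workload:

-XX:+UseParNewGC -XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=70

i.e. keep ParNew in the young gen and CMS in the old gen, start CMS somewhere
in the 70-80% range, and drop the MaxTenuringThreshold, GCTimeRatio,
soft-reference and class-unloading flags unless you have a specific reason
for them.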
