Hi Baggio,
Sounds like you have some good experience with HDFS. Some comments inline below:

On Tue, Dec 14, 2010 at 6:47 AM, baggio liu <[email protected]> wrote:
>
> In fact, we found the low invalidation speed is because of the datanode
> invalidation limit per heartbeat. Many invalidated blocks stay in the
> namenode and cannot be dispatched to datanodes. We simply increased the
> number of blocks a datanode fetches per heartbeat.

See HDFS-611 - this should really help block invalidation rate. It's
included in CDH3.
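
In the meantime, bumping the per-heartbeat limit like you did is a
reasonable workaround. If I remember right, the knob on 0.20 is
dfs.block.invalidate.limit (worth double-checking the key name and the
default against your branch), e.g. in the NN's hdfs-site.xml:

  <property>
    <name>dfs.block.invalidate.limit</name>
    <!-- max blocks the NN hands a DN to delete per heartbeat -->
    <value>1000</value>
  </property>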

>
> > >    2. The hadoop 0.20 branch cannot deal with disk failure; HDFS-630
> > > will be helpful.
> >
> >
> > hdfs-630 has been applied to the branch-0.20-append branch (It's also
> > in CDH IIRC).
> >
>
> Yes, HDFS-630 is necessary, but it's not enough. When a disk failure is
> found, it'll exclude the whole datanode. We can simply kick the failed
> disk out and send a block report to the namenode.
>

See HDFS-457 (in CDH3, configurable by HDFS-1161)
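
With those patches the datanode can survive a bad volume instead of
shutting down. The knob added by HDFS-1161 is, if I recall correctly,
dfs.datanode.failed.volumes.tolerated (check the key name against your
build), e.g. in the DN's hdfs-site.xml:

  <property>
    <name>dfs.datanode.failed.volumes.tolerated</name>
    <!-- volumes allowed to fail before the DN shuts itself down -->
    <value>1</value>
  </property>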
>
> >
> >
> > >    3. The region server cannot deal with IOExceptions correctly. When
> > > the DFSClient hits a network error, it'll throw an IOException, and it
> > > may not be fatal for the region server, so these IOExceptions MUST be
> > > reviewed.
> >
> >
> > Usually if the RegionServer has issues getting to HDFS, it'll shut itself
> > down. This is 'normal', perhaps overly-defensive, behavior. The story
> > should be better in 0.90, but we'd be interested in any list you might
> > have of cases where you think we should be able to catch and continue.
> >
> Yes, absolutely, it's overly-defensive behavior, and if the region server
> fails an HDFS operation, failing fast may be a good recovery mechanism. But
> some IOExceptions are not fatal; in our branch, we added a retry mechanism
> to common fs operations, such as exists().
>

In my experience this hasn't been a problem - most operations that
fail would not have succeeded with a retry. But a patch would be
interesting.
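
For reference, the kind of wrapper I'd imagine such a patch adding is
something like the sketch below (untested; the class name and retry
constants are made up, not anything in HBase today):

  import java.io.IOException;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  // Untested sketch of a bounded-retry wrapper around FileSystem.exists().
  public class RetryingFs {
    private static final int RETRIES = 3;        // made-up policy
    private static final long SLEEP_MS = 1000;   // made-up backoff

    public static boolean exists(FileSystem fs, Path p) throws IOException {
      IOException last = null;
      for (int i = 0; i < RETRIES; i++) {
        try {
          return fs.exists(p);
        } catch (IOException ioe) {
          last = ioe;                  // possibly transient; back off, retry
          try {
            Thread.sleep(SLEEP_MS);
          } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            throw last;
          }
        }
      }
      throw last;  // still failing after retries: let the caller fail fast
    }
  }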

>
> >
> > >    4. In a large-scale scan, there are many concurrent readers in a
> > > short time.
> >
> >
> > Just FYI, HBase opens all files and keeps them open on startup.
> > There'll be pressure on file handles and threads in the datanodes as
> > soon as you start up an HBase instance. Scans use the already-opened
> > files, so whether there are 1 or N ongoing Scans, the pressure on HDFS
> > is the same.
> >
>
>  Sure, it's my mistake. What I meant is that whenever the system starts or
> scans, the region server (as a DFSClient) will create too many connections
> to datanodes. The number of connections grows with the number of store
> files, and when the store file count reaches a large value, the number of
> connections gets out of control.
>  In most cases scans have locality; in our cluster, more than 95% of
> connections are not live (the connection is established, but no data is
> being read). In our branch, we added a timeout to close idle connections.
> In the long term, we could reuse connections between the DFSClient and the
> datanode (maybe this kind of reuse can be fulfilled by the RPC framework).
>
I recall you opened a JIRA for this, but I can't locate it. Can you
please send the link?

>
> >
> > What do you mean by connection reuse?
> >

HDFS-941 will help here - hopefully we can do that this year.
>
> >
> -XX:GCTimeRatio=10 -XX:+UseConcMarkSweepGC -XX:+UseParNewGC
> -XX:+UseCMSCompactAtFullCollection -XX:CMSFullGCsBeforeCompaction=0
> -XX:+CMSClassUnloadingEnabled -XX:-CMSParallelRemarkEnabled
> -XX:CMSInitiatingOccupancyFraction=70 -XX:SoftRefLRUPolicyMSPerMB=0
> -XX:MaxTenuringThreshold=7
>
> We made some attempts at GC tuning. Focusing on shorter application pauses,
> we use the parallel collector in the young generation and CMS in the old
> generation. The threshold CMSInitiatingOccupancyFraction is the same as in
> our Hadoop cluster config; we have no idea why it's 70 and not 71 ...
> May I ask what GC strategy you use in your cluster?

Very similar to that. I don't usually tune MaxTenuringThreshold,
GCTimeRatio or soft reference LRU. Class unloading isn't particularly
necessary in HBase. The CMS settings look about right - I generally
recommend between 70-80%.
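
Concretely, trimming your flag list down per the comments above leaves
roughly the following (heap sizing and GC logging omitted; treat this as
a starting point rather than a recommendation for your workload):

  -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
  -XX:+UseCMSCompactAtFullCollection -XX:CMSFullGCsBeforeCompaction=0
  -XX:-CMSParallelRemarkEnabled -XX:CMSInitiatingOccupancyFraction=70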

>
>    1. Currently, the datanode will send more data than the DFSClient
> requests (mostly a whole block). This helps throughput, but it may hurt
> latency. I imagine we could add an additional RPC read/write interface
> between the DFSClient and the datanode to reduce overhead in HDFS
> reads/writes.

I disagree - the DN does not send more data than requested, from the
HDFS perspective. Perhaps in some cases the HFile reader is requesting
full blocks unnecessarily - I'd like to see the logs that show this.
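
To illustrate the distinction on the client side: a positional read only
asks the DN for the range you pass it, whereas IIRC the streaming
seek-plus-read path requests from the seek offset through the end of the
HDFS block. Minimal sketch (the file path below is hypothetical):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataInputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class ReadPaths {
    public static void main(String[] args) throws Exception {
      FileSystem fs = FileSystem.get(new Configuration());
      Path p = new Path("/hbase/some-store-file");   // hypothetical path
      byte[] buf = new byte[64 * 1024];

      FSDataInputStream in = fs.open(p);

      // Positional read: the request to the DN covers exactly this range.
      int n1 = in.read(1024L * 1024L, buf, 0, buf.length);

      // Streaming read: seek + read sets up a stream that (IIRC) asks the
      // DN for everything from the seek offset to the end of the block.
      in.seek(1024L * 1024L);
      int n2 = in.read(buf, 0, buf.length);

      System.out.println("pread=" + n1 + " stream read=" + n2);
      in.close();
    }
  }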

>
>    2. On the datanode side, the meta file and the block file are opened and
> closed again for every block operation. To reduce latency, we could reuse
> these file handles, or even redesign the storage mechanism in the datanode.
>

Jay Booth did some experiments here and saw some improvements of
10-20% - see HDFS-918. Combined with HDFS-941 it might be improved a
bit more. We haven't focused much on incremental performance
improvements this past year - this will probably become higher
priority in 2011 as more people move to larger production clusters.
-Todd
--
Todd Lipcon
Software Engineer, Cloudera
