Hi Todd,
Thank you very much for your reply :D
1. HDFS-611 can improve the block invalidation rate, but it cannot solve our
problem. We found the bottleneck is the number of invalid blocks a datanode
fetches in a heartbeat. As Stack said, we just made BLOCK_INVALIDATE_CHUNK
configurable, and increasing this number fixes the issue. :D
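For reference, a minimal sketch of what such a change could look like on a
0.20-era namenode, assuming a config key such as dfs.block.invalidate.limit
(the key name and class below are illustrative, not necessarily what our
patch uses):

    import org.apache.hadoop.conf.Configuration;

    /** Sketch only: make the per-heartbeat invalidation limit configurable
     *  instead of relying on the hardcoded BLOCK_INVALIDATE_CHUNK (100). */
    public class InvalidateLimit {
      // Default mirrors the 0.20-era hardcoded constant.
      private static final int DEFAULT_BLOCK_INVALIDATE_CHUNK = 100;

      /** The config key name here is an assumption for illustration. */
      static int blockInvalidateLimit(Configuration conf, long heartbeatIntervalMs) {
        int limit = conf.getInt("dfs.block.invalidate.limit",
                                DEFAULT_BLOCK_INVALIDATE_CHUNK);
        // Optionally keep a floor tied to the heartbeat interval.
        return Math.max(limit, 20 * (int) (heartbeatIntervalMs / 1000));
      }
    }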
2. HDFS-457 works when the datanode starts up, and HDFS-1161 works when a new
block is received. But while a block is being written or read, a disk error
cannot be detected by these patches.
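What we would like, roughly, is for an I/O error on an in-flight read or
write to also trigger a disk check, along these lines (purely an illustrative
sketch, not the datanode's actual code or API):

    import java.io.File;
    import java.io.IOException;
    import java.util.List;
    import java.util.concurrent.CopyOnWriteArrayList;

    /** Illustrative sketch only: when a block read/write hits an IOException,
     *  probe the volume it lives on and take it out of service if it looks bad. */
    public class VolumeErrorHandler {
      private final List<File> activeVolumes = new CopyOnWriteArrayList<File>();

      public VolumeErrorHandler(List<File> volumes) {
        activeVolumes.addAll(volumes);
      }

      /** Call from the read/write path when an IOException is caught. */
      public void onBlockIoError(File volume, IOException cause) {
        if (!looksHealthy(volume)) {
          activeVolumes.remove(volume);  // stop using the failed disk
          // ...then report the lost replicas to the namenode in a block report.
        }
      }

      /** Cheap health probe: volume root must exist and be readable/writable. */
      private boolean looksHealthy(File volume) {
        return volume.exists() && volume.isDirectory()
            && volume.canRead() && volume.canWrite();
      }

      public List<File> getActiveVolumes() {
        return activeVolumes;
      }
    }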
3.
>> In my experience this hasn't been a problem - most operations that
fail would not have succeeded with a retry. But a patch would be
interesting.
In our cluster, this issue is not rare. A region server may crash because
closing the HLog fails, or while checking whether some file exists. To
tolerate transient network failures (or a busy network), a retry mechanism
may be helpful.
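As a rough illustration of what we mean (not the exact code in our branch),
a thin retry wrapper around an idempotent call such as exists() could look
like this:

    import java.io.IOException;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    /** Sketch of a retry wrapper for idempotent fs calls; not our actual patch. */
    public class FsRetry {
      /** Retry exists() a few times before giving up on a transient error. */
      static boolean existsWithRetry(FileSystem fs, Path path,
                                     int maxRetries, long sleepMs) throws IOException {
        IOException last = null;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
          try {
            return fs.exists(path);
          } catch (IOException e) {
            last = e;                      // likely transient (network busy, timeout)
            if (attempt < maxRetries) {
              try {
                Thread.sleep(sleepMs);
              } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                throw last;
              }
            }
          }
        }
        throw last;                        // still failing: let the caller decide
      }
    }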
4.
>> I recall you opened a JIRA for this, but I can't locate it. Can you
please send the link?
It's just an idea that came up in a discussion with a colleague; it has not
been filed in JIRA yet.
5.
>> HDFS-941 will help here - hopefully we can do that this year.
This patch will be a great help for connection management.
6.
>> Very similar to that. I don't usually tune MaxTenuringThreshold,
GCTimeRatio or soft reference LRU. Class unloading isn't particularly
necessary in HBase. The CMS settings look about right - I generally
recommend between 70-80%.
:D The most important tuning for us is the CMS occupancy threshold together
with -XX:+UseConcMarkSweepGC -XX:+UseParNewGC. In general, object lifetimes
in HBase tend to be much shorter than in MapReduce; a larger
MaxTenuringThreshold delays promoting data from the young generation to the
old generation, so a larger MaxTenuringThreshold may be helpful for HBase.
7.
>> I disagree - the DN does not send more data than requested, from the
HDFS perspective. Perhaps in some cases the HFile reader is requesting
full blocks unnecessarily - I'd like to see the logs that show this.
Currently, from the HDFS perspective, if we read the first byte of a file,
the datanode will send nearly the whole (HDFS) block, won't it?
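For clarity, this is the client-side difference we have in mind, as we
understand the 0.20 read path (a minimal sketch using only the public
FileSystem API; the comments describe our understanding, which may be
exactly the point you are disputing):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    /** Minimal sketch of the two client-side read paths we are comparing. */
    public class ReadPaths {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path p = new Path(args[0]);
        byte[] buf = new byte[1];

        FSDataInputStream in = fs.open(p);

        // Streaming read: seek() + read(). Our understanding is that on this
        // path the block reader requests from the current position toward the
        // end of the block, so the datanode keeps streaming data.
        in.seek(0);
        in.read(buf, 0, buf.length);

        // Positioned read (pread): only the requested byte range is fetched.
        in.read(0L, buf, 0, buf.length);

        in.close();
        fs.close();
      }
    }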
8.
>> Jay Booth did some experiments here and saw some improvements of
10-20% - see HDFS-918. Combined with HDFS-941 it might be improved a
bit more. We haven't focused much on incremental performance
improvements this past year - this will probably become higher
priority in 2011 as more people move to larger production clusters
May I ask about your cluster scale and data volume in HBase?
To Stack:
1.
>> Interesting. So you changed this hardcoding?
Yes :D
2.
>> Is this a code change you made Baggio?
Yes, the implementation is a little different from CDH3's.
3.
>> The above sounds great. So, the connection is reestablished
automatically by DFSClient when a read comes in (I suppose HADOOP-3831
does this for you)? Is the timeout in DFSClient or in HBase?
The timeout is implemented in DFSClient. Our goal is to keep connection
management inside HDFS and make connections transparent to HBase; HBase uses
HDFS at the file level.
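To make the idea concrete, here is a very rough sketch of the shape of such
an idle-connection reaper (names and structure are illustrative only; our
actual change lives inside DFSClient):

    import java.io.Closeable;
    import java.io.IOException;
    import java.util.Iterator;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    /** Illustrative sketch only: close cached datanode connections that have
     *  been idle longer than a timeout, so HBase never manages them itself. */
    public class IdleConnectionReaper implements Runnable {
      private static class Entry {
        final Closeable conn;        // e.g. a cached socket / block reader
        volatile long lastUsedMs = System.currentTimeMillis();
        Entry(Closeable conn) { this.conn = conn; }
      }

      private final Map<String, Entry> connections =
          new ConcurrentHashMap<String, Entry>();
      private final long idleTimeoutMs;

      public IdleConnectionReaper(long idleTimeoutMs) {
        this.idleTimeoutMs = idleTimeoutMs;
      }

      /** Record that a connection was just used. */
      public void touch(String key, Closeable conn) {
        Entry e = connections.get(key);
        if (e == null) {
          connections.put(key, new Entry(conn));
        } else {
          e.lastUsedMs = System.currentTimeMillis();
        }
      }

      /** Run periodically: close and drop anything idle past the timeout. */
      public void run() {
        long now = System.currentTimeMillis();
        Iterator<Map.Entry<String, Entry>> it = connections.entrySet().iterator();
        while (it.hasNext()) {
          Entry e = it.next().getValue();
          if (now - e.lastUsedMs > idleTimeoutMs) {
            try { e.conn.close(); } catch (IOException ignored) { }
            it.remove();
          }
        }
      }
    }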
4.
>> Do you monitor your GC activity?
Yes, we monitor JVM activity with jstat periodically. Currently we mainly pay
attention to application stops during full GC. GC logs would tell us more
about GC activity, but we have not enabled them on the production cluster yet.
5.
>> When you say block above, you mean hfile block? Thats what hbase is
requesting though? Pardon me if I'm not understanding what you are
suggesting.
It's the HDFS block.
Sorry for the long mail. We'll recheck our branch and release it if our boss
approves.
Thanks & Best regards
Baggio
2010/12/15 Todd Lipcon <[email protected]>
> Hi Baggio,
> Sounds like you have some good experience with HDFS. Some comments inline
> below:
>
> On Tue, Dec 14, 2010 at 6:47 AM, baggio liu <[email protected]> wrote:
> >
> > In fact, we found the low invalidation speed is because of the datanode
> > invalidation limit per heartbeat. Many invalid blocks stay in the namenode
> > and cannot be dispatched to datanodes. We simply increase the number of
> > blocks a datanode fetches per heartbeat.
>
> See HDFS-611 - this should really help block invalidation rate. It's
> included in CDH3.
>
> >
> > > > 2. The hadoop 0.20 branch cannot deal with disk failure; HDFS-630
> > > > will be helpful.
> > >
> > >
> > > hdfs-630 has been applied to the branch-0.20-append branch (Its also
> > > in CDH IIRC).
> > >
> >
> > Yes, HDFS-630 is necessary, but it's not enough. When a disk failure is
> > found, it will exclude the whole datanode.
> > We can simply kick the failed disk out and send a block report to the namenode.
> >
>
> See HDFS-457 (in CDH3, configurable by HDFS-1161)
> >
> > >
> > >
> > > > 3. The region server cannot deal with IOException correctly. When
> > > > DFSClient meets a network error, it throws an IOException, which may
> > > > not be fatal for the region server, so these IOExceptions MUST be
> > > > reviewed.
> > >
> > >
> > > Usually if RegionServer has issues getting to HDFS, it'll shut itself
> > > down. This is 'normal' perhaps overly-defensive behavior. The story
> > > should be better in 0.90 but would be interested in any list you might
> > > have where you think we should be able to catch and continue.
> > >
> > Yes, absolutely it's overly-defensive behavior, and if the region server
> > fails to perform an HDFS operation, failing fast may be a good recovery
> > mechanism. But some IOExceptions are not fatal; in our branch, we added a
> > retry mechanism to common fs operations, such as exists().
> >
>
> In my experience this hasn't been a problem - most operations that
> fail would not have succeeded with a retry. But a patch would be
> interesting.
>
> >
> > >
> > > > 4. In a large-scale scan, there are many concurrent readers in a
> > > > short time.
> > >
> > >
> > > Just FYI, HBase opens all files and keeps them open on startup.
> > > There'll be pressure on file handles, threads in data nodes, as soon
> > > as you start up an HBase instance. Scans use the already opened files
> > > so whether 1 or N ongoing Scans, the pressure on HDFS is the same.
> > >
> >
> > Sure, it's my mistake. My intention is that whenever the system starts or
> > scans, the region server (as a DFSClient) will create too many connections
> > to datanodes. The number of connections grows with the number of store
> > files; when the store file count reaches a large value, the number of
> > connections gets out of control.
> > In most scenarios scans have locality; in our cluster, more than 95% of
> > connections are not alive (the connection is established, but no data is
> > being read). In our branch, we added a timeout to close idle connections.
> > In the long term, we could reuse connections between DFSClient and
> > datanodes (maybe this kind of reuse can be fulfilled by the RPC framework).
> >
> I recall you opened a JIRA for this, but I can't locate it. Can you
> please send the link?
>
> >
> > >
> > > What do you mean by connection reuse?
> > >
>
> HDFS-941 will help here - hopefully we can do that this year.
> >
> > >
> > -XX:GCTimeRatio=10 -XX:+UseConcMarkSweepGC -XX:+UseParNewGC
> > -XX:+UseCMSCompactAtFullCollection -XX:CMSFullGCsBeforeCompaction=0
> > -XX:+CMSClassUnloadingEnabled -XX:-CMSParallelRemarkEnabled
> > -XX:CMSInitiatingOccupancyFraction=70 -XX:SoftRefLRUPolicyMSPerMB=0
> > -XX:MaxTenuringThreshold=7
> >
> > We made some attempts at GC tuning. To reduce application stops, we use
> > the parallel collector in the young gen and CMS in the old gen; the
> > threshold CMSInitiatingOccupancyFraction is the same as our hadoop cluster
> > config. We have no idea why it's 70, not 71...
> > May I ask what GC strategy you use in your cluster?
>
> Very similar to that. I don't usually tune MaxTenuringThreshold,
> GCTimeRatio or soft reference LRU. Class unloading isn't particularly
> necessary in HBase. The CMS settings look about right - I generally
> recommend between 70-80%.
>
> >
> > 1. Currently, the datanode will send more data than the DFSClient
> > requests (mostly a whole block). That helps throughput, but it may hurt
> > latency. I imagine we could add an additional RPC read/write interface
> > between DFSClient and the datanode to reduce overhead in HDFS reads and
> > writes.
>
> I disagree - the DN does not send more data than requested, from the
> HDFS perspective. Perhaps in some cases the HFile reader is requesting
> full blocks unnecessarily - I'd like to see the logs that show this.
>
> >
> > 2. On the datanode side, the meta file and block file are opened and
> > closed repeatedly for every block operation. To reduce latency, we can
> > reuse these file handles. We could even redesign the storage mechanism in
> > the datanode.
> >
>
> Jay Booth did some experiments here and saw some improvements of
> 10-20% - see HDFS-918. Combined with HDFS-941 it might be improved a
> bit more. We haven't focused much on incremental performance
> improvements this past year - this will probably become higher
> priority in 2011 as more people move to larger production clusters.
> -Todd
> --
> Todd Lipcon
> Software Engineer, Cloudera
>