Thanks for the response, N.  I could be wrong here, but since this problem is 
in the HDFS client code, couldn't I set dfs.socket.timeout in my 
hbase-site.xml so that it would only affect HBase's connections to HDFS?  I.e. we 
wouldn't have to worry about affecting connections between datanodes, etc. 
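Something like this is what I have in mind (assuming the HDFS client inside the region server actually picks the value up from hbase-site.xml, which is the question; 10000 ms is just an example value):

```xml
<!-- hbase-site.xml: hypothetical override, lowering the HDFS client
     socket timeout from the 60 s default to 10 s for HBase only -->
<property>
  <name>dfs.socket.timeout</name>
  <value>10000</value>
</property>
```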

-- 
Bryan Beaudreault


On Wednesday, July 18, 2012 at 4:38 AM, N Keywal wrote:

> Hi Bryan,
> 
> It's a difficult question, because dfs.socket.timeout is used all over
> the place in hdfs. I'm currently documenting this.
> In particular:
> - it's used for connections between datanodes, not only for
> connections between hdfs clients & hdfs datanodes.
> - it's also used for both types of datanode connection (ports
> being 50010 & 50020 by default).
> - it's used as a connect timeout, but also as a read timeout
> (the socket is connected, but the peer does not write for a while).
> - it's used with various extensions, so when you're seeing values like
> 69000 or 66000, it's often the same setting: timeout + 3s (hardcoded) *
> #replicas
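[A sketch of the arithmetic N describes above, assuming a hardcoded 3 s extension per replica; the function name is illustrative, not an HDFS API:]

```python
# Effective HDFS read timeout as described: the configured
# dfs.socket.timeout plus a hardcoded 3 s extension per replica.
HARDCODED_EXTENSION_MS = 3000  # the "3s (hardcoded)" above

def effective_timeout_ms(dfs_socket_timeout_ms, num_replicas):
    return dfs_socket_timeout_ms + HARDCODED_EXTENSION_MS * num_replicas

# With the default 60 s timeout, this reproduces the values
# commonly seen in logs:
print(effective_timeout_ms(60000, 2))  # 66000
print(effective_timeout_ms(60000, 3))  # 69000
```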
> 
> For a single-datanode issue, with everything else going well, it will
> make the cluster much more reactive: hbase will go to another node
> immediately instead of waiting. But it will also make it much more
> sensitive to gc and network issues. If you have a major hardware
> issue, something like 10% of your cluster going down, this setting
> will multiply the number of retries and add a lot of workload to
> your already damaged cluster, which could make things worse.
> 
> That said, I think we will need to make it shorter sooner or later, so
> if you try it on your cluster, it will be helpful...
> 
> N.
> 
> On Tue, Jul 17, 2012 at 7:11 PM, Bryan Beaudreault
> <[email protected]> wrote:
> > Today I needed to restart one of my region servers, and did so without 
> > gracefully shutting down the datanode. For the next 1-2 minutes we had a 
> > bunch of failed queries from various other region servers trying to access 
> > that datanode. Looking at the logs, I saw that they were all socket 
> > timeouts after 60000 milliseconds.
> > 
> > We use HBase mostly as an online datastore, with various APIs powering 
> > various web apps and external consumers. Writes come from both the API in 
> > some cases, but we have continuous hadoop jobs feeding data in as well.
> > 
> > Since we have web app consumers, this 60 second timeout seems unreasonably 
> > long. If a datanode goes down, ideally the impact would be much smaller 
> > than that. I want to lower the dfs.socket.timeout to something like 5-10 
> > seconds, but do not know the implications of this.
> > 
> > In googling I did not find much precedent for this, but I did find some 
> > people talking about upping the timeout to much longer than 60 seconds. Is 
> > it generally safe to lower this timeout dramatically if you want faster 
> > failures? Are there any downsides to this?
> > 
> > Thanks
> > 
> > --
> > Bryan Beaudreault
> > 
> 
> 
> 
