Hi Bryan,

It's a difficult question, because dfs.socket.timeout is used all over
the place in HDFS. I'm currently documenting this.
Especially:
- it's used for connections between datanodes, and not only for
connections between HDFS clients & HDFS datanodes.
- It's also used for the two types of datanode connections (ports
being 50010 & 50020 by default).
- It's used as a connect timeout, but also as a read timeout (the
socket is connected, but the other end does not write for a while).
- It's used with various extensions, so when you're seeing values like
69000 or 66000 it's often the same setting: timeout + 3s (hardcoded) *
#replicas
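To make that arithmetic concrete, here is a small sketch using only the numbers mentioned in this thread (the 3 s extension and the replica counts are taken from the discussion above, not from any authoritative source, and may differ between versions):

```python
# Effective timeout as described above: the base dfs.socket.timeout
# plus a hardcoded 3 s extension per replica involved.
# These constants are the ones quoted in this thread, not an official API.

def effective_timeout_ms(base_timeout_ms, replicas, extension_ms=3000):
    """Sketch of the 'timeout + 3s * #replicas' rule mentioned above."""
    return base_timeout_ms + extension_ms * replicas

# With the default 60 s base timeout:
print(effective_timeout_ms(60000, 3))  # 69000
print(effective_timeout_ms(60000, 2))  # 66000
```

That is why 69000 and 66000 show up in the logs even though nobody ever configured those values directly.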

For a single datanode issue, with everything else going well, lowering
it will make the cluster much more reactive: HBase will go to another
node immediately instead of waiting. But it will also make the cluster
much more sensitive to GC pauses and network issues. If you have a
major hardware issue, something like 10% of your cluster going down,
this setting will multiply the number of retries and add a lot of
workload to your already damaged cluster, which could make things
worse.

That said, I think we will need to make it shorter sooner or later, so
if you try it on your cluster, the feedback will be helpful...
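For reference, lowering it would just be an override along these lines in hdfs-site.xml (and on the HBase side so the HDFS client picks it up). The 10000 ms value here is an arbitrary illustration of the 5-10 s range you mentioned, not a recommendation:

```xml
<!-- Example only: value is in milliseconds; default is 60000.
     10000 ms is an arbitrary illustration, not a tested recommendation. -->
<property>
  <name>dfs.socket.timeout</name>
  <value>10000</value>
</property>
```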

N.

On Tue, Jul 17, 2012 at 7:11 PM, Bryan Beaudreault
<[email protected]> wrote:
> Today I needed to restart one of my region servers, and did so without 
> gracefully shutting down the datanode.  For the next 1-2 minutes we had a 
> bunch of failed queries from various other region servers trying to access 
> that datanode.  Looking at the logs, I saw that they were all socket timeouts 
> after 60000 milliseconds.
>
> We use HBase mostly as an online datastore, with various APIs powering 
> various web apps and external consumers.  Writes come from both the API in 
> some cases, but we have continuous hadoop jobs feeding data in as well.
>
> Since we have web app consumers, this 60 second timeout seems unreasonably 
> long.  If a datanode goes down, ideally the impact would be much smaller than 
> that.  I want to lower the dfs.socket.timeout to something like 5-10 seconds, 
> but do not know the implications of this.
>
> In googling I did not find much precedent for this, but I did find some 
> people talking about upping the timeout to much longer than 60 seconds.  Is 
> it generally safe to lower this timeout dramatically if you want faster 
> failures? Are there any downsides to this?
>
> Thanks
>
> --
> Bryan Beaudreault
>
