Hi Bryan,

It's a difficult question, because dfs.socket.timeout is used all over the place in HDFS. I'm currently documenting this. In particular:

- It's used for connections between datanodes, and not only for connections between HDFS clients & HDFS datanodes.
- It's also used for both types of datanode connections (ports being 50010 & 50020 by default).
- It's used as a connect timeout, but also as a read timeout (the socket is connected, but the peer does not write for a while).
- It's used with various extensions, so when you're seeing values like 69000 or 66000, it's often the same setting: timeout + 3s (hardcoded) * #replicas.
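To make that last point concrete, here is a minimal sketch of the arithmetic (the method name is mine, not an actual HDFS identifier; the real logic lives in the DFS client):

  // Sketch only: how the observed values derive from the base setting.
  static int effectivePipelineTimeout(int dfsSocketTimeoutMs, int numReplicas) {
    final int EXTENSION_MS = 3 * 1000; // the hardcoded 3s per replica
    return dfsSocketTimeoutMs + EXTENSION_MS * numReplicas;
  }

  // effectivePipelineTimeout(60000, 3) == 69000
  // effectivePipelineTimeout(60000, 2) == 66000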
For a single-datanode issue, with everything else going well, lowering it will make the cluster much more reactive: HBase will go to another node immediately instead of waiting. But it will also make the cluster much more sensitive to GC and network issues. And if you have a major hardware issue, say 10% of your cluster going down, this setting will multiply the number of retries and add a lot of workload to your already damaged cluster, which could make things worse.

That said, I think we will need to make it shorter sooner or later, so if you do it on your cluster, it will be helpful... (see the config sketch after the quoted message below)

N.

On Tue, Jul 17, 2012 at 7:11 PM, Bryan Beaudreault <[email protected]> wrote:
> Today I needed to restart one of my region servers, and did so without
> gracefully shutting down the datanode. For the next 1-2 minutes we had a
> bunch of failed queries from various other region servers trying to access
> that datanode. Looking at the logs, I saw that they were all socket timeouts
> after 60000 milliseconds.
>
> We use HBase mostly as an online datastore, with various APIs powering
> various web apps and external consumers. Writes come from both the API in
> some cases, but we have continuous hadoop jobs feeding data in as well.
>
> Since we have web app consumers, this 60 second timeout seems unreasonably
> long. If a datanode goes down, ideally the impact would be much smaller than
> that. I want to lower the dfs.socket.timeout to something like 5-10 seconds,
> but do not know the implications of this.
>
> In googling I did not find much precedent for this, but I did find some
> people talking about upping the timeout to much longer than 60 seconds. Is
> it generally safe to lower this timeout dramatically if you want faster
> failures? Are there any downsides to this?
>
> Thanks
>
> --
> Bryan Beaudreault
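PS: if you do experiment with lowering it, the HDFS client picks the value up from the configuration HBase hands it, so something like the following (or the equivalent <property> entry in hbase-site.xml) should work; the 10s value is only an illustration, not a recommendation:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;

  // Example only: lower the HDFS client socket timeout to 10 seconds.
  // Equivalent to setting dfs.socket.timeout in hbase-site.xml.
  Configuration conf = HBaseConfiguration.create();
  conf.setInt("dfs.socket.timeout", 10 * 1000); // milliseconds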
