Thanks for the response, N. I could be wrong here, but since this problem is in the HDFS client code, couldn't I set dfs.socket.timeout in my hbase-site.xml so that it only affects HBase's connections to HDFS? I.e., we wouldn't have to worry about affecting connections between datanodes, etc.
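
Concretely, something like this in hbase-site.xml is what I had in mind (untested sketch; I'm assuming the HBase client's DFSClient picks up the same dfs.socket.timeout property name, and 10000 ms is just an example value):

  <property>
    <name>dfs.socket.timeout</name>
    <!-- HDFS default is 60000 ms; trying 10s here for faster failover (example value only) -->
    <value>10000</value>
  </property>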
--
Bryan Beaudreault

On Wednesday, July 18, 2012 at 4:38 AM, N Keywal wrote:

> Hi Bryan,
>
> It's a difficult question, because dfs.socket.timeout is used all over
> the place in hdfs. I'm currently documenting this. Especially:
> - it's used for connections between datanodes, and not only for
>   connections between hdfs clients & hdfs datanodes.
> - it's also used for the two types of datanode connections (ports
>   being 50010 & 50020 by default).
> - it's used as a connect timeout, but also as a read timeout (the
>   socket is connected, but the application does not write for a while).
> - it's used with various extensions, so when you're seeing values like
>   69000 or 66000, it's often the same setting: timeout + 3s (hardcoded) *
>   #replica.
>
> For a single datanode issue, with everything going well, it will make
> the cluster much more reactive: hbase will go to another node
> immediately instead of waiting. But it will also make it much more
> sensitive to gc and network issues. If you have a major hardware
> issue, something like 10% of your cluster going down, this setting
> will multiply the number of retries and add a lot of workload to
> your already damaged cluster, and this could make things worse.
>
> This said, I think we will need to make it shorter sooner or later, so
> if you do it on your cluster, it will be helpful...
>
> N.
>
> On Tue, Jul 17, 2012 at 7:11 PM, Bryan Beaudreault
> <[email protected]> wrote:
> > Today I needed to restart one of my region servers, and did so without
> > gracefully shutting down the datanode. For the next 1-2 minutes we had a
> > bunch of failed queries from various other region servers trying to access
> > that datanode. Looking at the logs, I saw that they were all socket
> > timeouts after 60000 milliseconds.
> >
> > We use HBase mostly as an online datastore, with various APIs powering
> > various web apps and external consumers. Writes come from both the API in
> > some cases, but we have continuous hadoop jobs feeding data in as well.
> >
> > Since we have web app consumers, this 60 second timeout seems unreasonably
> > long. If a datanode goes down, ideally the impact would be much smaller
> > than that. I want to lower dfs.socket.timeout to something like 5-10
> > seconds, but do not know the implications of this.
> >
> > In googling I did not find much precedent for this, but I did find some
> > people talking about upping the timeout to much longer than 60 seconds. Is
> > it generally safe to lower this timeout dramatically if you want faster
> > failures? Are there any downsides to this?
> >
> > Thanks
> >
> > --
> > Bryan Beaudreault
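
P.S. If I'm reading the extension rule right, the 69000/66000 values N mentions above would be the base timeout plus the hardcoded 3s per replica: 60000 + 3 * 3000 = 69000 ms for a 3-replica pipeline, and 60000 + 2 * 3000 = 66000 ms for 2. Lowering dfs.socket.timeout would shrink the base part, while the 3s-per-replica extension would stay the same.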
