On Tue, Sep 13, 2011 at 8:20 AM, Geoff Hendrey <[email protected]> wrote:
> ...but we don't have a slow region server.
I'm asking if regionservers are bottlenecking on a single network
resource; a particular datanode, dns?

> Things hum along just fine.
> Suddenly, at roughly the same time, all the region servers begin giving
> ScannerTimeoutException and ClosedChannel exception.

So, odd that it's running fine and then the cluster slows.

From the stacktraces you showed me -- and you might want to check again
and do a few stack traces to see that we are stuck trying to get data
from hdfs -- the regionserver is going slow getting data out of hdfs.
What's iowait like at the time of slowness? Has it changed from when all
was running nicely?

> All the region servers are loaded in a pretty identical way by this MR
> job I am running. And they all begin showing the same error, at the
> same time, after performing perfectly for ~40% of the MR job.

You talk to hbase in the reducer? Reducers don't start writing hbase
until the job is 66% complete IIRC. Perhaps it's slowing as soon as it
starts writing hbase? Is that so?

> We have an ops team that monitors all these systems with Nagios.
> They've reviewed dmesg, and many other low-level details which are over
> my head. In the past they've adjusted MTUs and unbounded the network
> cards (we saw some network stack lockups in the past, etc.). I'm going
> to meet with them again and see if we can set up some more specific
> monitoring around this job, which we can basically view as a test
> harness.

OK. Hopefully these lads can help.

> That said, is there any condition that should cause HBase to get a
> ClosedChannelException, and *not* tell the zookeeper that it is
> effectively dead?

Well, it sounds like the regionserver is not dead. It's just crawling,
so it's still 'alive'.

St.Ack
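[For reference, a sketch of the checks suggested in the reply above --
a few spaced-out thread dumps to see whether handlers are stuck in HDFS
reads, and an iowait comparison. The pid-file path and grep pattern are
assumptions; adjust them for your install.]

```shell
#!/bin/sh
# Hypothetical pid-file location; point this at your regionserver's pid.
RS_PID=$(cat /var/run/hbase/hbase-hbase-regionserver.pid)

# Take a few stack traces, spaced apart, so a thread stuck reading from
# hdfs shows up in the same place in successive dumps.
for i in 1 2 3; do
  jstack "$RS_PID" > "rs-stack-$i.txt"
  sleep 10
done

# Count frames that suggest the regionserver is blocked in HDFS reads.
grep -c 'DFSInputStream\|DFSClient' rs-stack-*.txt

# Compare iowait during the slowness vs. when all was running nicely:
iostat -x 5 3   # %iowait and per-device utilization
vmstat 5 3      # the 'wa' column is iowait
```

If the same handler threads sit in DFS read frames across all three
dumps and %iowait has jumped, that points at the hdfs/disk side rather
than a single slow regionserver.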
