On Tue, Sep 13, 2011 at 8:20 AM, Geoff Hendrey <[email protected]> wrote:
> ...but we don't have a slow region server.

I'm asking if regionservers are bottlenecking on a single network
resource; a particular datanode, dns?

> Things hum along just fine.
> Suddenly, at roughly the same time, all the region servers begin giving
> ScannerTimeoutException and ClosedChannel exception.

So, odd that it's running fine and then the cluster slows.

From the stacktraces you showed me -- and you might want to check
again and do a few stack traces to confirm that we are stuck trying to
get data from hdfs -- the regionserver is going slow getting data out
of hdfs.  What's iowait like at the time of slowness?  Has it changed
from when all was running nicely?
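
(FYI, by iowait I mean the fraction of time the CPUs sit parked waiting
on disk; you'd normally just read it off top or iostat on the
regionserver box while the job runs.  If it helps, a rough sketch of
where that number comes from on linux -- just sampling /proc/stat
twice, nothing hbase-specific:

import java.io.BufferedReader;
import java.io.FileReader;

// Sample the aggregate "cpu" line in /proc/stat twice and report what
// fraction of the interval the CPUs spent in iowait.  Same figure that
// top/iostat report.
public class IowaitSample {

  // Returns {iowait, total} jiffies from the first line of /proc/stat.
  // Fields after "cpu" are: user nice system idle iowait irq softirq ...
  static long[] readCpuJiffies() throws Exception {
    BufferedReader in = new BufferedReader(new FileReader("/proc/stat"));
    try {
      String[] f = in.readLine().trim().split("\\s+");
      long total = 0;
      for (int i = 1; i < f.length; i++) {
        total += Long.parseLong(f[i]);
      }
      return new long[] { Long.parseLong(f[5]), total };
    } finally {
      in.close();
    }
  }

  public static void main(String[] args) throws Exception {
    long[] a = readCpuJiffies();
    Thread.sleep(5000);
    long[] b = readCpuJiffies();
    double pct = 100.0 * (b[0] - a[0]) / Math.max(1, b[1] - a[1]);
    System.out.printf("iowait over last 5s: %.1f%%%n", pct);
  }
}

If iowait is pegged during the slow period but not before, that points
at the disks/hdfs rather than the network.)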


> All the region
> servers are loaded in a pretty identical way by this MR job I am
> running. And they all begin showing the same error, at the same time,
> after performing perfectly for ~40% of the MR job.
>

Do you talk to hbase in the reducer?  Reducers don't start writing to
hbase until the job is about 66% complete IIRC.  Perhaps it's slowing
as soon as it starts writing to hbase?  Is that so?
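
I don't know what your reducer looks like, but for reference here is a
bare-bones sketch of one that writes to hbase via TableOutputFormat
(table wiring via TableMapReduceUtil.initTableReducerJob is assumed,
and the row key, family 'cf' and qualifier 'count' are placeholders,
not anything from your setup):

import java.io.IOException;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

public class ExampleTableReducer
    extends TableReducer<Text, LongWritable, ImmutableBytesWritable> {

  @Override
  protected void reduce(Text key, Iterable<LongWritable> values,
      Context context) throws IOException, InterruptedException {
    long sum = 0;
    for (LongWritable v : values) {
      sum += v.get();
    }
    // Each context.write() becomes a Put against the target table; none
    // of these hit the regionservers until the reduce phase actually runs.
    Put put = new Put(Bytes.toBytes(key.toString()));
    put.add(Bytes.toBytes("cf"), Bytes.toBytes("count"), Bytes.toBytes(sum));
    context.write(new ImmutableBytesWritable(put.getRow()), put);
  }
}

If your writes look anything like the above, it is worth lining up the
timing of the slowdown against the start of the reduce phase.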

> We have an ops team that monitors all these systems with Nagios. They've
> reviewed dmesg, and many other low level details which are over my head.
> In the past they've adjusted MTUs, and unbonded the network cards (we
> saw some network stack lockups in the past, etc.). I'm going to meet
> with them again, and see if we can set up some more specific
> monitoring around this job, which we can basically view as a test
> harness.
>

OK.  Hopefully these lads can help.


> That said, is there any condition that should cause HBase to get a
> ClosedChannelException, and *not* tell the zookeeper that it is
> effectively dead?


Well, sounds like the regionserver is not dead.  It's just crawling, so
it's still 'alive'.
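
For the scanner timeouts themselves, the usual client-side knob is how
much work each next() call carries.  Rough sketch (table and family
names are placeholders and the caching value is arbitrary); smaller
caching keeps the client checking back in with the regionserver inside
the lease period (hbase.regionserver.lease.period), or you can raise
the lease itself in hbase-site.xml:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanCachingSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "mytable");    // placeholder table name
    Scan scan = new Scan();
    scan.addFamily(Bytes.toBytes("cf"));           // placeholder family
    // Fewer rows per next() RPC means the client goes back to the
    // regionserver more often, so a crawling server is less likely to
    // expire the scanner lease between calls.
    scan.setCaching(10);
    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result r : scanner) {
        // process r
      }
    } finally {
      scanner.close();
      table.close();
    }
  }
}

None of that fixes a slow regionserver of course; it just narrows the
window in which the client trips over it.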

St.Ack
