"Yeah. On the slow RS, check who it's talking to... take a look at a few of the nodes referenced."
...but we don't have a slow region server. Things hum along just fine. Suddenly, at roughly the same time, all the region servers begin throwing ScannerTimeoutException and ClosedChannelException. All the region servers are loaded in a nearly identical way by the MR job I am running, and they all begin showing the same error, at the same time, after performing perfectly for ~40% of the MR job.

We have an ops team that monitors all these systems with Nagios. They've reviewed dmesg and many other low-level details which are over my head. In the past they've adjusted MTUs and unbonded the network cards (we saw some network stack lockups in the past, etc.). I'm going to meet with them again and see if we can set up some more specific monitoring around this job, which we can basically view as a test harness.

That said, is there any condition that should cause HBase to get a ClosedChannelException and *not* tell ZooKeeper that it is effectively dead? It seems that regardless of the cause of the ClosedChannelException, the region servers enter a state in which they are useless, but the master never shuts them down. Consequently, the client will start to see "NotServingRegionException" (sorry, I forgot the exact exception name).

If I restart HBase while the MR job is in progress, the problem goes away... until it recurs in exactly the same way as before. It seems to me that if there were a bad disk or a network issue, the problem would be present from the get-go, rather than suddenly appearing on all RS after some long period of loading.

Anyway, I am speculating. I will try to gather more facts. Thanks for your ongoing help.

-geoff

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Stack
Sent: Monday, September 12, 2011 11:46 PM
To: [email protected]
Cc: Tony Wang; Rohit Nigam; Parmod Mehta; James Ladd
Subject: Re: scanner deadlock?

On Mon, Sep 12, 2011 at 11:27 PM, Geoff Hendrey <[email protected]> wrote:
> Do you have any advice on what to look for (or how to sort it) when I do
> lsof or netstat? A glance at it doesn't show any "standouts" but then
> I'm not entirely sure what to look for. I see lots of connections to
> various nodes in the cluster, from any given node, but I suppose that's
> quite normal.

Yeah. On the slow RS, check who it's talking to... take a look at a few of the nodes referenced. Check dmesg across your cluster to see if any nodes are complaining.

> Ganglia offers no clues either. It's pretty uniform for
> all graphs across all servers.
>

No anomalies around datanodes? Spikes or troughs?

Welcome to the joys of distributed computing. Once you figure out what's going on, you'll be able to enable an alert for the future, but in the meantime it's no fun.

St.Ack
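
On the question of how to sort netstat output: one possible approach is to group connections by remote peer and TCP state, and to flag sockets with a non-empty send or receive queue, since a growing Send-Q usually means the peer is not acknowledging data. The following is only a minimal sketch, not something from this thread. It assumes the Linux `netstat -tn` column layout (Proto, Recv-Q, Send-Q, Local Address, Foreign Address, State); the function names and the sample addresses are made up for illustration.

```python
from collections import Counter


def _tcp_rows(netstat_text):
    """Yield (recv_q, send_q, remote_host, state) for each TCP row.

    Assumes `netstat -tn` layout on Linux:
    Proto Recv-Q Send-Q Local-Address Foreign-Address State
    """
    for line in netstat_text.splitlines():
        fields = line.split()
        # Skip banner and header lines and anything that is not a TCP row.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        remote_host = fields[4].rsplit(":", 1)[0]  # drop the port
        yield int(fields[1]), int(fields[2]), remote_host, fields[5]


def summarize_connections(netstat_text):
    """Count connections per (remote host, TCP state)."""
    counts = Counter()
    for _recv_q, _send_q, host, state in _tcp_rows(netstat_text):
        counts[(host, state)] += 1
    return counts


def queued_connections(netstat_text, threshold=0):
    """List rows whose Recv-Q or Send-Q exceeds `threshold` bytes."""
    return [
        (host, state, recv_q, send_q)
        for recv_q, send_q, host, state in _tcp_rows(netstat_text)
        if recv_q > threshold or send_q > threshold
    ]


# Made-up sample output; the addresses and ports are illustrative only.
SAMPLE = """\
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 10.0.0.5:60020          10.0.0.7:41234          ESTABLISHED
tcp        0  65160 10.0.0.5:60020          10.0.0.8:51872          ESTABLISHED
tcp        0      0 10.0.0.5:43210          10.0.0.8:50010          CLOSE_WAIT
"""

if __name__ == "__main__":
    for (host, state), n in summarize_connections(SAMPLE).most_common():
        print(host, state, n)
    for row in queued_connections(SAMPLE):
        print("queued:", row)
```

Running this on each node and comparing the summaries would show whether one peer accumulates CLOSE_WAIT sockets or backed-up queues while the others do not.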
