"Yeah.  On the slow RS, check who it's talking to... take a look at a
few of the nodes referenced."

...but we don't have a slow region server. Things hum along just fine.
Suddenly, at roughly the same time, all the region servers begin giving
ScannerTimeoutException and ClosedChannelException. All the region
servers are loaded almost identically by this MR job I am running. And
they all begin showing the same error, at the same time, after
performing perfectly for ~40% of the MR job.
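
To confirm that the errors really do start at the same moment on every
server, one could pull the first timestamp at which each exception
appears in each log file. This is a minimal sketch, assuming log4j-style
lines that begin with `YYYY-MM-DD HH:MM:SS,mmm`; the helper name
`first_occurrences` is hypothetical, and the log layout on a given
cluster may differ.

```python
import re
from datetime import datetime

# Assumption: each log entry starts with "YYYY-MM-DD HH:MM:SS,mmm".
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3}")
EXCEPTIONS = ("ScannerTimeoutException", "ClosedChannelException")


def first_occurrences(log_lines):
    """Return {exception name: datetime of the first entry mentioning it}."""
    first = {}
    last_seen = None  # timestamp of the most recent timestamped line
    for line in log_lines:
        match = TIMESTAMP.match(line)
        if match:
            last_seen = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        # Stack-trace lines carry no timestamp, so fall back to last_seen.
        for name in EXCEPTIONS:
            if name in line and name not in first and last_seen is not None:
                first[name] = last_seen
    return first
```

Running this over each server's log (for example with
`first_occurrences(open(path))`) and comparing the results side by side
would show whether one server fails a few seconds before the rest.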

We have an ops team that monitors all these systems with Nagios. They've
reviewed dmesg, and many other low-level details which are over my head.
In the past they've adjusted MTUs, and unbonded the network cards (we
saw some network stack lockups in the past, etc.). I'm going to meet
with them again, and see if we can set up some more specific
monitoring around this job, which we can basically view as a test
harness.

That said, is there any condition that should cause HBase to get a
ClosedChannelException, and *not* tell ZooKeeper that it is
effectively dead? It seems that regardless of the cause of the
ClosedChannelException, the region servers enter a state in which they
are useless, but the master never shuts them down. Consequently, the
client will start to see "NotServingRegionException" (sorry, I forgot the
exact exception name). If I restart HBase while the MR job is in
progress, the problem goes away...until it recurs in exactly the same
way as before. It seems to me that if there were a bad disk or a network
issue, the problem would be present from the get-go, rather than
suddenly appearing on all RS after some long period of loading. Anyway,
I am speculating. I will try to gather more facts.

Thanks for your ongoing help.

-geoff




-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of
Stack
Sent: Monday, September 12, 2011 11:46 PM
To: [email protected]
Cc: Tony Wang; Rohit Nigam; Parmod Mehta; James Ladd
Subject: Re: scanner deadlock?

On Mon, Sep 12, 2011 at 11:27 PM, Geoff Hendrey <[email protected]>
wrote:
> Do you have any advice on what to look for (or how to sort it) when I
> do lsof or netstat? A glance at it doesn't show any "standouts" but then
> I'm not entirely sure what to look for. I see lots of connections to
> various nodes in the cluster, from any given node, but I suppose
> that's quite normal.

Yeah.  On the slow RS, check who it's talking to... take a look at a
few of the nodes referenced.
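
One way to act on this is to tally connections per remote host and TCP
state, so that a peer with an unusual pile of CLOSE_WAIT or SYN_SENT
sockets stands out. The sketch below is only an illustration and not
something posted in this thread: it assumes the Linux `netstat -tn`
column layout, and the function name `summarize_connections` and the
sample addresses are made up.

```python
from collections import Counter

# Invented sample in the shape of Linux `netstat -tn` output.
SAMPLE = """\
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 10.0.0.5:60020          10.0.0.7:50010          ESTABLISHED
tcp        0      0 10.0.0.5:41234          10.0.0.7:50010          CLOSE_WAIT
tcp        0      0 10.0.0.5:41235          10.0.0.7:50010          CLOSE_WAIT
tcp        0      0 10.0.0.5:41236          10.0.0.8:2181           ESTABLISHED
"""


def summarize_connections(netstat_output):
    """Count TCP connections per (remote host, state)."""
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Data rows: Proto Recv-Q Send-Q Local Foreign State
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue  # skip headers and non-TCP rows
        remote_host = fields[4].rsplit(":", 1)[0]  # drop the port
        counts[(remote_host, fields[5])] += 1
    return counts


# Print the busiest (host, state) pairs first.
for (host, state), n in summarize_connections(SAMPLE).most_common():
    print(f"{n:4d}  {host:20s} {state}")
```

On a real node the input would be the captured output of `netstat -tn`
rather than the sample string; hosts that dominate the non-ESTABLISHED
states are the ones worth looking at first.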

Check dmesg across your cluster and see if anything is complaining.

>  Ganglia offers no clues either. It's pretty uniform for
> all graphs across all servers.
>

No anomalies around datanodes?  Spikes or troughs?

Welcome to the joys of distributed computing.  Once you figure out what's
going on, you'll be able to enable an alert for the future, but in the
meantime it's no fun.

St.Ack
