Yeah, like Stack said, the ClosedChannelException is how we figure out that the client is gone. Since you have a 60s timeout on the RPC call, the client _will_ go away (and possibly come right back in through another handler) when a call takes longer than that. One of my theories was that, in your case, if a region server slowed down it would start piling up calls in the queues; some of them would time out, so the client would come right back with the same request, making the whole situation worse.
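
To make that theory concrete, here is a rough toy model of the feedback loop. It is not HBase's RPC code. The function name, the tick-based clock, and the assumption that the server only notices an abandoned call when it finally dequeues it (the point where a ClosedChannelException would show up) are all simplifications made up for illustration.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Call:
    enqueued_at: int         # tick at which the call entered the queue
    abandoned: bool = False  # True once the (hypothetical) client timed out


def simulate(arrivals_per_tick, capacity_per_tick, timeout_ticks, ticks):
    """Toy model: clients retry timed-out calls, stale calls still cost work.

    Returns (completed, closed_channel, queue_length_at_end).
    """
    queue = deque()
    completed = 0
    closed_channel = 0  # calls whose client was already gone when served

    for now in range(ticks):
        # Fresh requests arrive from clients.
        for _ in range(arrivals_per_tick):
            queue.append(Call(now))

        # Clients whose calls have waited too long give up and resend.
        # The stale call stays queued: the server does not know yet.
        retries = []
        for call in queue:
            if not call.abandoned and now - call.enqueued_at >= timeout_ticks:
                call.abandoned = True
                retries.append(Call(now))
        queue.extend(retries)

        # The server works through the queue in FIFO order.
        for _ in range(min(capacity_per_tick, len(queue))):
            call = queue.popleft()
            if call.abandoned:
                closed_channel += 1  # wasted work, client already left
            else:
                completed += 1

    return completed, closed_channel, len(queue)
```

Running it once with capacity above the arrival rate and once with capacity below it (for example `simulate(5, 10, 60, 200)` versus `simulate(10, 5, 10, 200)`) shows the difference: in the first case nothing times out, in the second the server spends part of its capacity on calls nobody is waiting for while the queue keeps growing.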

When you did set the config that I told you about, you still got the CCEs? Meaning that the calls were still slow? If so, then the issue is elsewhere and this proves it.

Like Stack, I think we'll have to be fed more data about your system :)

J-D

On Wed, Sep 14, 2011 at 9:32 AM, Geoff Hendrey <[email protected]> wrote:
> I've already been able to replicate the problem using just two reducers,
> on a completely fresh table. So it seemed to me when I did that the
> problem was independent of the number of reducers...
>
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of
> Stack
> Sent: Wednesday, September 14, 2011 8:47 AM
> To: [email protected]
> Subject: Re: scanner deadlock?
>
> On Wed, Sep 14, 2011 at 8:42 AM, Geoff Hendrey <[email protected]>
> wrote:
>> 17 MR nodes, 8 reducers per machine = 138 concurrent reducers.
>> (machines are 12-core, and I've found 8 reducers with 1GB allocated
>> heap to be a happy medium that doesn't freeze out the data nodes or the
>> region servers - or so I think :-).
>>
>
> Are you swapping at all?
>
> What if you restored your config. to something sane -- 100 handlers
> with queue size of 10, default timeout -- with 1/4 of the reducers?
>
> What does this MR job do?
>
> St.Ack
>
