Sorry for the flood of emails. I found this post, which seems to describe a very similar issue to mine, with ClosedChannelException:
http://www.mail-archive.com/[email protected]/msg10609.html

I get exactly the same stack trace:

2011-09-11 17:30:27,977 WARN [IPC Server handler 2 on 60020] ipc.HBaseServer$Handler(1100): IPC Server handler 2 on 60020 caught: java.nio.channels.ClosedChannelException
        at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:144)
        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:342)
        at org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1339)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:727)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:792)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1083)

He mentions that his MR job does Puts. Ours does a scan and then a put in the reducer (a rough sketch of that shape of reducer is at the bottom of this message). A little different, but the symptom is identical.

He sees the same problem while a major compaction is in progress. As I mentioned in previous emails, I initiated a major_compact from the shell several days ago, but I still see a lot of compaction activity in the regionserver logs, such as the following (and many of the compactions take several seconds):

2011-09-11 18:09:23,563 INFO [regionserver60020.compactor] regionserver.Store(728): Started compaction of 3 file(s) in cf=V1 into hdfs://<REDACTED>:54310/hbase/NAM_CLUSTERKEYS3/99d858c926c1e6c05feb638b64269602/.tmp, seqid=125878278, totalSize=58.1m
2011-09-11 18:09:24,009 INFO [regionserver60020.cacheFlusher] regionserver.Store(494): Renaming flushed file at hdfs://<REDACTED>:54310/hbase/NAM_CLUSTERKEYS3/adfd4049834dc7492cbb7bb7b564759f/.tmp/7031901971078401778 to hdfs://<REDACTED>:54310/hbase/NAM_CLUSTERKEYS3/adfd4049834dc7492cbb7bb7b564759f/V1/4520835400045954408
2011-09-11 18:09:24,016 INFO [regionserver60020.cacheFlusher] regionserver.Store(504): Added hdfs://<REDACTED>:54310/hbase/NAM_CLUSTERKEYS3/adfd4049834dc7492cbb7bb7b564759f/V1/4520835400045954408, entries=544, sequenceid=125878282, memsize=25.0m, filesize=24.9m
2011-09-11 18:09:24,022 INFO [regionserver60020.compactor] regionserver.Store(737): Completed compaction of 3 file(s), new file=hdfs://<REDACTED>:54310/hbase/NAM_CLUSTERKEYS3/99d858c926c1e6c05feb638b64269602/V1/6694303434913089134, size=14.5m; total size for store is 218.7m
2011-09-11 18:09:24,022 INFO [regionserver60020.compactor] regionserver.HRegion(781): completed compaction on region <REDACTED>:3,1315072186065.99d858c926c1e6c05feb638b64269602. after 0sec

However, he observed OOMs in his regionserver logs; I grepped my logs and there is no OOM. I also ran "lsof | wc -l" and the result is 14000; we are nowhere near any limits ("ulimit -n" is 100000), so I ruled that out.

-geoff

From: Geoff Hendrey
Sent: Sunday, September 11, 2011 5:52 PM
To: '[email protected]'
Cc: James Ladd; Rohit Nigam; Tony Wang; Parmod Mehta
Subject: summary of issue/status

OK. Here is the summary of what I know: a region server, after some amount of scanning, can begin to get ClosedChannelException when it tries to respond to the client. Unfortunately, this only affects the response to the client. The region server apparently continues to tell ZooKeeper "I'm alive and OK," so the regionserver is never shut down. This causes the client to keep attempting to access regions on the effectively-dead server, but each request eventually times out on the client side, since all the client sees is "I sent a request and never received any response on the socket" (the client-side settings that bound that wait are sketched at the very end of this message).
However, the client has no way to inform the master of the problem. If I manually shut down the region server where the problem exists, its regions get redistributed to other region servers automatically; the client then learns the new locations of those regions and can begin functioning again. However, the problem soon reappears on a different region server.

-geoff
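
For concreteness, a reducer of the shape described above (scan the current data for a key, then write a Put back) looks roughly like the sketch below. This is illustrative only, not our actual job: the class name, the column qualifier, and the placeholder logic are made up, and it assumes the 0.90-era HBase client/mapreduce API (TableReducer, HTable). The table name and column family are the ones that appear in the logs above.

import java.io.IOException;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;

public class ScanThenPutReducer
    extends TableReducer<Text, Text, ImmutableBytesWritable> {

  private HTable table;

  @Override
  protected void setup(Context context) throws IOException {
    // Read-side handle to the table; the writes go out through
    // TableOutputFormat via context.write().
    table = new HTable(HBaseConfiguration.create(context.getConfiguration()),
        "NAM_CLUSTERKEYS3");
  }

  @Override
  protected void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    byte[] row = Bytes.toBytes(key.toString());

    // The "scan" half: read back the current state starting at this row.
    // The real job would bound the scan and use what it finds; this sketch
    // just reads the first result.
    Scan scan = new Scan(row);
    scan.addFamily(Bytes.toBytes("V1"));
    ResultScanner scanner = table.getScanner(scan);
    Result current;
    try {
      current = scanner.next(); // placeholder: real job combines this with new data
    } finally {
      scanner.close();
    }

    // The "put" half: write the updated value for this row.
    Put put = new Put(row);
    put.add(Bytes.toBytes("V1"), Bytes.toBytes("q"),
        Bytes.toBytes(values.iterator().next().toString()));
    context.write(new ImmutableBytesWritable(row), put);
  }

  @Override
  protected void cleanup(Context context) throws IOException {
    table.close();
  }
}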

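And for reference on the client-side timeouts mentioned in the summary: the wait on an unresponsive regionserver is governed by the client's RPC timeout and retry settings. A minimal sketch of setting them is below, assuming these config keys are honored by the client version we run; the values are purely illustrative, not a recommendation.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;

public class ClientTimeoutSettings {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // Upper bound on how long a single RPC may block waiting for a reply.
    conf.setInt("hbase.rpc.timeout", 60000);
    // How many times the client retries a failed operation before giving up.
    conf.setInt("hbase.client.retries.number", 5);
    // Base pause (ms) between retries.
    conf.setInt("hbase.client.pause", 1000);

    // The table from the logs above; operations through this handle give up
    // after the configured timeout and retries.
    HTable table = new HTable(conf, "NAM_CLUSTERKEYS3");
    try {
      // ... scans/puts go here ...
    } finally {
      table.close();
    }
  }
}

Tightening these only makes the client fail faster; it does not do anything about the effectively-dead regionserver itself.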