Sorry for the flood of emails. I found this post, which seems to describe a very similar issue to mine, with ClosedChannelException:
http://www.mail-archive.com/[email protected]/msg10609.html

I get exactly the same stack trace:

2011-09-11 17:30:27,977 WARN [IPC Server handler 2 on 60020] ipc.HBaseServer$Handler(1100): IPC Server handler 2 on 60020 caught: java.nio.channels.ClosedChannelException
        at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:144)
        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:342)
        at org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1339)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:727)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:792)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1083)

He mentions that his MR job does Puts. Ours does a scan and then a put in the reducer (a rough sketch of that shape of reducer is at the bottom of this message). A little different, but the symptom is identical.

He sees the same problem while a major compaction is in progress. As I mentioned in previous emails, I initiated a major_compact from the shell several days ago, but I still see a lot of compaction activity in the regionserver logs, such as the following (and many of the compactions take several seconds):

2011-09-11 18:09:23,563 INFO [regionserver60020.compactor] regionserver.Store(728): Started compaction of 3 file(s) in cf=V1 into hdfs://<REDACTED>:54310/hbase/NAM_CLUSTERKEYS3/99d858c926c1e6c05feb638b64269602/.tmp, seqid=125878278, totalSize=58.1m
2011-09-11 18:09:24,009 INFO [regionserver60020.cacheFlusher] regionserver.Store(494): Renaming flushed file at hdfs://<REDACTED>:54310/hbase/NAM_CLUSTERKEYS3/adfd4049834dc7492cbb7bb7b564759f/.tmp/7031901971078401778 to hdfs://<REDACTED>:54310/hbase/NAM_CLUSTERKEYS3/adfd4049834dc7492cbb7bb7b564759f/V1/4520835400045954408
2011-09-11 18:09:24,016 INFO [regionserver60020.cacheFlusher] regionserver.Store(504): Added hdfs://<REDACTED>:54310/hbase/NAM_CLUSTERKEYS3/adfd4049834dc7492cbb7bb7b564759f/V1/4520835400045954408, entries=544, sequenceid=125878282, memsize=25.0m, filesize=24.9m
2011-09-11 18:09:24,022 INFO [regionserver60020.compactor] regionserver.Store(737): Completed compaction of 3 file(s), new file=hdfs://<REDACTED>:54310/hbase/NAM_CLUSTERKEYS3/99d858c926c1e6c05feb638b64269602/V1/6694303434913089134, size=14.5m; total size for store is 218.7m
2011-09-11 18:09:24,022 INFO [regionserver60020.compactor] regionserver.HRegion(781): completed compaction on region <REDACTED>:3,1315072186065.99d858c926c1e6c05feb638b64269602. after 0sec

However, he observed OOMs in his regionserver logs; I grepped my logs and there is no OOM. I also ran "lsof | wc -l" and the result is 14000; we are nowhere near any limits ("ulimit -n" is 100000), so I ruled that out.

-geoff

From: Geoff Hendrey
Sent: Sunday, September 11, 2011 5:52 PM
To: '[email protected]'
Cc: James Ladd; Rohit Nigam; Tony Wang; Parmod Mehta
Subject: summary of issue/status

OK. Here is the summary of what I know: a region server, after some amount of scanning, can begin to get ClosedChannelException when it tries to respond to the client. Unfortunately, this only affects the response to the client. The region server apparently continues to tell ZooKeeper "I'm alive and OK," so the regionserver is never shut down. This causes the client to keep attempting to access regions on the effectively-dead server, but each request eventually times out on the client side, since all the client sees is "I sent a request and never received any response on the socket" (the client-side settings that bound that wait are sketched at the very end of this message).
However, the client has no way to inform the master of the problem. If I manually shut down the region server where the problem exists, its regions get redistributed to other region servers automatically; the client then learns the new locations of those regions and can begin functioning again. However, the problem soon reappears on a different region server.

-geoff
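
For concreteness, a reducer of the shape described above (scan the current data for a key, then write a Put back) looks roughly like the sketch below. This is illustrative only, not our actual job: the class name, the column qualifier, and the placeholder logic are made up, and it assumes the 0.90-era HBase client/mapreduce API (TableReducer, HTable). The table name and column family are the ones that appear in the logs above.

import java.io.IOException;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;

public class ScanThenPutReducer
    extends TableReducer<Text, Text, ImmutableBytesWritable> {

  private HTable table;

  @Override
  protected void setup(Context context) throws IOException {
    // Read-side handle to the table; the writes go out through
    // TableOutputFormat via context.write().
    table = new HTable(HBaseConfiguration.create(context.getConfiguration()),
        "NAM_CLUSTERKEYS3");
  }

  @Override
  protected void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    byte[] row = Bytes.toBytes(key.toString());

    // The "scan" half: read back the current state starting at this row.
    // The real job would bound the scan and use what it finds; this sketch
    // just reads the first result.
    Scan scan = new Scan(row);
    scan.addFamily(Bytes.toBytes("V1"));
    ResultScanner scanner = table.getScanner(scan);
    Result current;
    try {
      current = scanner.next(); // placeholder: real job combines this with new data
    } finally {
      scanner.close();
    }

    // The "put" half: write the updated value for this row.
    Put put = new Put(row);
    put.add(Bytes.toBytes("V1"), Bytes.toBytes("q"),
        Bytes.toBytes(values.iterator().next().toString()));
    context.write(new ImmutableBytesWritable(row), put);
  }

  @Override
  protected void cleanup(Context context) throws IOException {
    table.close();
  }
}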

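And for reference on the client-side timeouts mentioned in the summary: the wait on an unresponsive regionserver is governed by the client's RPC timeout and retry settings. A minimal sketch of setting them is below, assuming these config keys are honored by the client version we run; the values are purely illustrative, not a recommendation.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;

public class ClientTimeoutSettings {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // Upper bound on how long a single RPC may block waiting for a reply.
    conf.setInt("hbase.rpc.timeout", 60000);
    // How many times the client retries a failed operation before giving up.
    conf.setInt("hbase.client.retries.number", 5);
    // Base pause (ms) between retries.
    conf.setInt("hbase.client.pause", 1000);

    // The table from the logs above; operations through this handle give up
    // after the configured timeout and retries.
    HTable table = new HTable(conf, "NAM_CLUSTERKEYS3");
    try {
      // ... scans/puts go here ...
    } finally {
      table.close();
    }
  }
}

Tightening these only makes the client fail faster; it does not do anything about the effectively-dead regionserver itself.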