I checked the regionservers' .out log and found that they were producing an out of memory exception. I found a couple of fatal memory bugs in my code and now everything seems to be fine. I don't know how I forgot to look at the .out log file.
On Fri, Jul 20, 2012 at 9:35 AM, Ted Yu <[email protected]> wrote: > Can you check the following config param to see if its value is high enough > ? > > <name>hbase.zookeeper.property.maxClientCnxns</name> > > Cheers > > On Fri, Jul 20, 2012 at 6:23 AM, Kevin <[email protected]> wrote: > > > In zookeeper I see that regionserver connections are timing out. I open > an > > HTable, call coprocessorExec, then I close the HTable. This is done in > the > > for-loop. I'm not sure why the regionservers are timing out. I think > don't > > think anymore it's a client-side issue but maybe a server-side issue with > > the regionservers. I don't know what else to do to debug it further. > > > > 2012-07-19 07:01:26,738 [myid:] - INFO [NIOServerCxn.Factory: > > 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1000] - Closed socket connection for > > client /10.10.23.124:46863 which had sessionid 0x138763251670a3d > > 2012-07-19 07:01:56,000 [myid:] - INFO > [SessionTracker:ZooKeeperServer@325 > > ] > > - Expiring session 0x138763251670a37, timeout of 40000ms exceeded > > 2012-07-19 07:01:56,000 [myid:] - INFO [ProcessThread(sid:0 > > cport:-1)::PrepRequestProcessor@476] - Processed session termination for > > sessionid: 0x138763251670a37 > > 2012-07-19 07:02:00,000 [myid:] - INFO > [SessionTracker:ZooKeeperServer@325 > > ] > > - Expiring session 0x138763251670a3d, timeout of 40000ms exceeded > > 2012-07-19 07:02:00,000 [myid:] - INFO > [SessionTracker:ZooKeeperServer@325 > > ] > > - Expiring session 0x138763251670a39, timeout of 40000ms exceeded > > 2012-07-19 07:02:00,001 [myid:] - INFO > [SessionTracker:ZooKeeperServer@325 > > ] > > - Expiring session 0x138763251670a3f, timeout of 40000ms exceeded > > 2012-07-19 07:02:00,001 [myid:] - INFO [ProcessThread(sid:0 > > cport:-1)::PrepRequestProcessor@476] - Processed session termination for > > sessionid: 0x138763251670a3d > > 2012-07-19 07:02:00,001 [myid:] - INFO [ProcessThread(sid:0 > > cport:-1)::PrepRequestProcessor@476] - Processed session termination for > > sessionid: 0x138763251670a39 > > 2012-07-19 07:02:00,001 [myid:] - INFO [ProcessThread(sid:0 > > cport:-1)::PrepRequestProcessor@476] - Processed session termination for > > sessionid: 0x138763251670a3f > > 2012-07-19 07:02:04,000 [myid:] - INFO > [SessionTracker:ZooKeeperServer@325 > > ] > > - Expiring session 0x138763251670a38, timeout of 40000ms exceeded > > 2012-07-19 07:02:04,000 [myid:] - INFO > [SessionTracker:ZooKeeperServer@325 > > ] > > - Expiring session 0x138763251670a36, timeout of 40000ms exceeded > > 2012-07-19 07:02:04,001 [myid:] - INFO > [SessionTracker:ZooKeeperServer@325 > > ] > > - Expiring session 0x138763251670a3c, timeout of 40000ms exceeded > > 2012-07-19 07:02:04,001 [myid:] - INFO > [SessionTracker:ZooKeeperServer@325 > > ] > > - Expiring session 0x138763251670a3e, timeout of 40000ms exceeded > > 2012-07-19 07:02:04,001 [myid:] - INFO [ProcessThread(sid:0 > > cport:-1)::PrepRequestProcessor@476] - Processed session termination for > > sessionid: 0x138763251670a38 > > 2012-07-19 07:02:04,001 [myid:] - INFO [ProcessThread(sid:0 > > cport:-1)::PrepRequestProcessor@476] - Processed session termination for > > sessionid: 0x138763251670a36 > > 2012-07-19 07:02:04,002 [myid:] - INFO [ProcessThread(sid:0 > > cport:-1)::PrepRequestProcessor@476] - Processed session termination for > > sessionid: 0x138763251670a3c > > 2012-07-19 07:02:04,002 [myid:] - INFO [ProcessThread(sid:0 > > cport:-1)::PrepRequestProcessor@476] - Processed session termination for > > sessionid: 0x138763251670a3e > > > > > > On Thu, Jul 19, 2012 at 4:55 PM, Kevin <[email protected]> > wrote: > > > > > That's it. That's the end of the regionserver log. In the master's web > UI > > > the regionserver is in the table labeled "Dead Region Servers." > > > > > > In the master's log there is: > > > 2012-07-19 07:02:04,016 INFO > > > org.apache.hadoop.hbase.zookeeper.RegionServerTracker: RegionServer > > > ephemeral node deleted, processing expiration > > [slave2,60020,1342694622535] > > > 2012-07-19 07:02:04,025 DEBUG > > > org.apache.hadoop.hbase.master.AssignmentManager: based on AM, current > > > region=-ROOT-,,0.70236052 is on server=slave2,60020,1342694622535 > server > > > being checked: slave2,60020,1342694622535 > > > 2012-07-19 07:02:04,025 DEBUG > > > org.apache.hadoop.hbase.master.AssignmentManager: based on AM, current > > > region=.META.,,1.1028785192 is on server=slave2,60020,1342694622535 > > server > > > being checked: slave2,60020,1342694622535 > > > 2012-07-19 07:02:04,027 DEBUG > > > org.apache.hadoop.hbase.master.ServerManager: > > > Added=slave2,60020,1342694622535 to dead servers, submitted shutdown > > > handler to be executed, root=true, meta=true > > > 2012-07-19 07:02:04,027 INFO > > > org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Splitting > > > logs for slave2,60020,1342694622535 > > > > > > The timestamps are different than above but it's from the same > > application > > > just at a different time than I sent before. The message would look the > > > same only with the timestamps being 2012-07-19 09:49ish > > > > > > > > > On Thu, Jul 19, 2012 at 4:50 PM, Ted Yu <[email protected]> wrote: > > > > > >> Can you paste more of the region server log after 09:49:18,551 (till > the > > >> region server died) ? > > >> > > >> Thanks > > >> > > >> On Thu, Jul 19, 2012 at 1:46 PM, Kevin <[email protected]> > > wrote: > > >> > > >> > The log snippet just before the regionservers die look like this: > > >> > > > >> > 2012-07-19 09:49:18,551 INFO project.coproc.IndexEndpoint: putting > new > > >> > rowkey > > >> > 2012-07-19 09:49:18,551 INFO project.coproc.IndexEndpoint: new > rowkey > > >> put > > >> > 2012-07-19 09:49:18,551 INFO project.coproc.IndexEndpoint: coproc > > time: > > >> > 1227 ms > > >> > 2012-07-19 09:49:18,551 INFO project.coproc.IndexEndpoint: closing > > >> scanner > > >> > 2012-07-19 09:49:18,551 INFO project.coproc.IndexEndpoint: scanner > > >> closed > > >> > <after this log statement in the endpoint code is the return > > statement> > > >> > > > >> > A coprocessorExec call may be from 3-20 seconds after the previous > (it > > >> > depends how long the last call took). But I know the endpoints are > > >> > finishing their code fast because throughout the log each "coproc > > time:" > > >> > statement is under 5 seconds. > > >> > > > >> > I am using CDH4b2, which uses HBase 0.92.1. > > >> > > > >> > On Thu, Jul 19, 2012 at 4:35 PM, Ted Yu <[email protected]> > wrote: > > >> > > > >> > > Kevin: > > >> > > Can you pastebin the log snippet from region server just before it > > >> died ? > > >> > > > > >> > > How frequent were your coprocessorExec() calls ? > > >> > > What HBase version were you using ? > > >> > > > > >> > > Thanks > > >> > > > > >> > > On Thu, Jul 19, 2012 at 12:44 PM, Kevin < > [email protected]> > > >> > wrote: > > >> > > > > >> > > > Hi, > > >> > > > > > >> > > > I'm using endpoint coprocessors to do intense scans in parallel > on > > >> some > > >> > > > tables. I log the time it takes for each coprocessor to finish > its > > >> job > > >> > on > > >> > > > the region. Each coprocessor rarely takes longer than a few > > seconds, > > >> > > > maximum of 5 seconds (there are only 5 regions on the tables > right > > >> > now). > > >> > > As > > >> > > > my cluster grows with data the call HTable.coprocessorExec takes > > >> longer > > >> > > and > > >> > > > longer but the coprocessors themselves finish quickly (under 5 > > >> > seconds). > > >> > > > Eventually I see all my regionservers die because the > > >> coprocessorExec > > >> > > call > > >> > > > timed out and zookeeper kills the connection, which makes the > > >> > > regionserver > > >> > > > die. > > >> > > > > > >> > > > In terms of code structure, the coprocessorExec call is done > > inside > > >> a > > >> > > > for-loop. The for-loop iterates over a List of objects to help > > form > > >> > > filters > > >> > > > for the endpoint and then calls the coprocessorExec once per > > object > > >> > > > processed. > > >> > > > > > >> > > > What would be the bottleneck? Is calling the coprocessor like > this > > >> in a > > >> > > > for-loop loading the regions down and not allowing them time to > do > > >> GC? > > >> > Is > > >> > > > there a way to ping a table and judge if it'll be ready for the > > >> > endpoint > > >> > > > call? > > >> > > > > > >> > > > Thanks, > > >> > > > -Kevin > > >> > > > > > >> > > > > >> > > > >> > > > > > > > > >
