In ZooKeeper I see that the regionserver connections are timing out. I open an HTable, call coprocessorExec, and then close the HTable; this is all done inside a for-loop. I'm not sure why the regionservers are timing out. I no longer think it's a client-side issue; it may be a server-side issue with the regionservers. I don't know what else to do to debug this further.
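For reference, the client loop has roughly the following shape. This is a simplified sketch, not my exact code: IndexProtocol, buildIndex(), and the table name are placeholders standing in for my real endpoint interface.

import java.io.IOException;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.coprocessor.Batch;
import org.apache.hadoop.hbase.ipc.CoprocessorProtocol;

// Placeholder for the protocol my IndexEndpoint implements; the real
// interface takes filter-building arguments rather than a plain String.
interface IndexProtocol extends CoprocessorProtocol {
  Long buildIndex(String filterSource) throws IOException;
}

public class IndexClient {
  public void runEndpoints(Configuration conf, List<String> inputs) throws Throwable {
    for (final String input : inputs) {
      HTable table = new HTable(conf, "mytable");   // opened each iteration
      try {
        // Blocks until the endpoint has run on every region of the table
        // (null start/end keys = all regions).
        Map<byte[], Long> results = table.coprocessorExec(
            IndexProtocol.class, null, null,
            new Batch.Call<IndexProtocol, Long>() {
              public Long call(IndexProtocol endpoint) throws IOException {
                return endpoint.buildIndex(input);  // filter formed from input
              }
            });
      } finally {
        table.close();                              // closed each iteration
      }
    }
  }
}

Note that coprocessorExec blocks until every region has responded, so the calls never overlap; each iteration starts only after the previous one returns. Here is the ZooKeeper log from when the sessions expired: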
2012-07-19 07:01:26,738 [myid:] - INFO [NIOServerCxn.Factory: 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1000] - Closed socket connection for client /10.10.23.124:46863 which had sessionid 0x138763251670a3d
2012-07-19 07:01:56,000 [myid:] - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x138763251670a37, timeout of 40000ms exceeded
2012-07-19 07:01:56,000 [myid:] - INFO [ProcessThread(sid:0 cport:-1)::PrepRequestProcessor@476] - Processed session termination for sessionid: 0x138763251670a37
2012-07-19 07:02:00,000 [myid:] - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x138763251670a3d, timeout of 40000ms exceeded
2012-07-19 07:02:00,000 [myid:] - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x138763251670a39, timeout of 40000ms exceeded
2012-07-19 07:02:00,001 [myid:] - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x138763251670a3f, timeout of 40000ms exceeded
2012-07-19 07:02:00,001 [myid:] - INFO [ProcessThread(sid:0 cport:-1)::PrepRequestProcessor@476] - Processed session termination for sessionid: 0x138763251670a3d
2012-07-19 07:02:00,001 [myid:] - INFO [ProcessThread(sid:0 cport:-1)::PrepRequestProcessor@476] - Processed session termination for sessionid: 0x138763251670a39
2012-07-19 07:02:00,001 [myid:] - INFO [ProcessThread(sid:0 cport:-1)::PrepRequestProcessor@476] - Processed session termination for sessionid: 0x138763251670a3f
2012-07-19 07:02:04,000 [myid:] - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x138763251670a38, timeout of 40000ms exceeded
2012-07-19 07:02:04,000 [myid:] - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x138763251670a36, timeout of 40000ms exceeded
2012-07-19 07:02:04,001 [myid:] - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x138763251670a3c, timeout of 40000ms exceeded
2012-07-19 07:02:04,001 [myid:] - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x138763251670a3e, timeout of 40000ms exceeded
2012-07-19 07:02:04,001 [myid:] - INFO [ProcessThread(sid:0 cport:-1)::PrepRequestProcessor@476] - Processed session termination for sessionid: 0x138763251670a38
2012-07-19 07:02:04,001 [myid:] - INFO [ProcessThread(sid:0 cport:-1)::PrepRequestProcessor@476] - Processed session termination for sessionid: 0x138763251670a36
2012-07-19 07:02:04,002 [myid:] - INFO [ProcessThread(sid:0 cport:-1)::PrepRequestProcessor@476] - Processed session termination for sessionid: 0x138763251670a3c
2012-07-19 07:02:04,002 [myid:] - INFO [ProcessThread(sid:0 cport:-1)::PrepRequestProcessor@476] - Processed session termination for sessionid: 0x138763251670a3e

On Thu, Jul 19, 2012 at 4:55 PM, Kevin <[email protected]> wrote:

> That's it. That's the end of the regionserver log. In the master's web UI
> the regionserver is in the table labeled "Dead Region Servers."
>
> In the master's log there is:
>
> 2012-07-19 07:02:04,016 INFO org.apache.hadoop.hbase.zookeeper.RegionServerTracker: RegionServer ephemeral node deleted, processing expiration [slave2,60020,1342694622535]
> 2012-07-19 07:02:04,025 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: based on AM, current region=-ROOT-,,0.70236052 is on server=slave2,60020,1342694622535 server being checked: slave2,60020,1342694622535
> 2012-07-19 07:02:04,025 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: based on AM, current region=.META.,,1.1028785192 is on server=slave2,60020,1342694622535 server being checked: slave2,60020,1342694622535
> 2012-07-19 07:02:04,027 DEBUG org.apache.hadoop.hbase.master.ServerManager: Added=slave2,60020,1342694622535 to dead servers, submitted shutdown handler to be executed, root=true, meta=true
> 2012-07-19 07:02:04,027 INFO org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Splitting logs for slave2,60020,1342694622535
>
> The timestamps are different from the ones above, but it's the same
> application, just run at a different time than what I sent before. The
> messages would look the same, only with timestamps around 2012-07-19 09:49.
>
> On Thu, Jul 19, 2012 at 4:50 PM, Ted Yu <[email protected]> wrote:
>
>> Can you paste more of the region server log after 09:49:18,551 (till the
>> region server died)?
>>
>> Thanks
>>
>> On Thu, Jul 19, 2012 at 1:46 PM, Kevin <[email protected]> wrote:
>>
>> > The log snippet just before the regionservers die looks like this:
>> >
>> > 2012-07-19 09:49:18,551 INFO project.coproc.IndexEndpoint: putting new rowkey
>> > 2012-07-19 09:49:18,551 INFO project.coproc.IndexEndpoint: new rowkey put
>> > 2012-07-19 09:49:18,551 INFO project.coproc.IndexEndpoint: coproc time: 1227 ms
>> > 2012-07-19 09:49:18,551 INFO project.coproc.IndexEndpoint: closing scanner
>> > 2012-07-19 09:49:18,551 INFO project.coproc.IndexEndpoint: scanner closed
>> > <after this log statement in the endpoint code is the return statement>
>> >
>> > A coprocessorExec call may come 3-20 seconds after the previous one (it
>> > depends on how long the last call took). But I know the endpoints are
>> > finishing their code quickly, because throughout the log each "coproc
>> > time:" statement is under 5 seconds.
>> >
>> > I am using CDH4b2, which uses HBase 0.92.1.
>> >
>> > On Thu, Jul 19, 2012 at 4:35 PM, Ted Yu <[email protected]> wrote:
>> >
>> > > Kevin:
>> > > Can you pastebin the log snippet from the region server just before
>> > > it died?
>> > >
>> > > How frequent were your coprocessorExec() calls?
>> > > What HBase version were you using?
>> > >
>> > > Thanks
>> > >
>> > > On Thu, Jul 19, 2012 at 12:44 PM, Kevin <[email protected]> wrote:
>> > >
>> > > > Hi,
>> > > >
>> > > > I'm using endpoint coprocessors to do intense scans in parallel on
>> > > > some tables. I log the time it takes for each coprocessor to finish
>> > > > its job on the region. Each coprocessor rarely takes longer than a
>> > > > few seconds, with a maximum of 5 seconds (there are only 5 regions
>> > > > on the tables right now). As my cluster grows with data, the
>> > > > HTable.coprocessorExec call takes longer and longer, but the
>> > > > coprocessors themselves finish quickly (under 5 seconds).
>> > > > Eventually I see all my regionservers die because the
>> > > > coprocessorExec call timed out and ZooKeeper killed the connection,
>> > > > which makes the regionserver die.
>> > > >
>> > > > In terms of code structure, the coprocessorExec call is done inside
>> > > > a for-loop. The for-loop iterates over a List of objects that help
>> > > > form filters for the endpoint, and calls coprocessorExec once per
>> > > > object processed.
>> > > >
>> > > > What would be the bottleneck? Is calling the coprocessor like this
>> > > > in a for-loop loading down the regions and not allowing them time
>> > > > to do GC? Is there a way to ping a table and judge whether it'll be
>> > > > ready for the endpoint call?
>> > > >
>> > > > Thanks,
>> > > > -Kevin
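One more observation, prompted by the "timeout of 40000ms exceeded" lines above: 40000 ms looks like ZooKeeper's default session-timeout cap rather than anything HBase asked for. If I understand the defaults correctly, ZooKeeper clamps negotiated session timeouts to maxSessionTimeout, which defaults to 20 * tickTime (40 s with the usual tickTime of 2000 ms), so setting zookeeper.session.timeout higher in hbase-site.xml wouldn't take effect unless the ZooKeeper side is raised as well. Something like this (values are just an example):

# zoo.cfg
tickTime=2000
maxSessionTimeout=90000

<!-- hbase-site.xml -->
<property>
  <name>zookeeper.session.timeout</name>
  <value>90000</value>
</property>

That wouldn't explain what is stalling the regionservers for 40+ seconds in the first place (long GC pauses are the usual suspect), but it would give them more headroom before ZooKeeper declares their sessions dead.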
