I checked the regionservers' .out log and found that they were producing an
out of memory exception. I found a couple of fatal memory bugs in my code
and now everything seems to be fine. I don't know how I forgot to look at
the .out log file.

On Fri, Jul 20, 2012 at 9:35 AM, Ted Yu <[email protected]> wrote:

> Can you check the following config param to see if its value is high enough
> ?
>
>     <name>hbase.zookeeper.property.maxClientCnxns</name>
>
> Cheers
>
> On Fri, Jul 20, 2012 at 6:23 AM, Kevin <[email protected]> wrote:
>
> > In zookeeper I see that regionserver connections are timing out. I open
> an
> > HTable, call coprocessorExec, then I close the HTable. This is done in
> the
> > for-loop. I'm not sure why the regionservers are timing out. I think
> don't
> > think anymore it's a client-side issue but maybe a server-side issue with
> > the regionservers. I don't know what else to do to debug it further.
> >
> > 2012-07-19 07:01:26,738 [myid:] - INFO  [NIOServerCxn.Factory:
> > 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1000] - Closed socket connection for
> > client /10.10.23.124:46863 which had sessionid 0x138763251670a3d
> > 2012-07-19 07:01:56,000 [myid:] - INFO
>  [SessionTracker:ZooKeeperServer@325
> > ]
> > - Expiring session 0x138763251670a37, timeout of 40000ms exceeded
> > 2012-07-19 07:01:56,000 [myid:] - INFO  [ProcessThread(sid:0
> > cport:-1)::PrepRequestProcessor@476] - Processed session termination for
> > sessionid: 0x138763251670a37
> > 2012-07-19 07:02:00,000 [myid:] - INFO
>  [SessionTracker:ZooKeeperServer@325
> > ]
> > - Expiring session 0x138763251670a3d, timeout of 40000ms exceeded
> > 2012-07-19 07:02:00,000 [myid:] - INFO
>  [SessionTracker:ZooKeeperServer@325
> > ]
> > - Expiring session 0x138763251670a39, timeout of 40000ms exceeded
> > 2012-07-19 07:02:00,001 [myid:] - INFO
>  [SessionTracker:ZooKeeperServer@325
> > ]
> > - Expiring session 0x138763251670a3f, timeout of 40000ms exceeded
> > 2012-07-19 07:02:00,001 [myid:] - INFO  [ProcessThread(sid:0
> > cport:-1)::PrepRequestProcessor@476] - Processed session termination for
> > sessionid: 0x138763251670a3d
> > 2012-07-19 07:02:00,001 [myid:] - INFO  [ProcessThread(sid:0
> > cport:-1)::PrepRequestProcessor@476] - Processed session termination for
> > sessionid: 0x138763251670a39
> > 2012-07-19 07:02:00,001 [myid:] - INFO  [ProcessThread(sid:0
> > cport:-1)::PrepRequestProcessor@476] - Processed session termination for
> > sessionid: 0x138763251670a3f
> > 2012-07-19 07:02:04,000 [myid:] - INFO
>  [SessionTracker:ZooKeeperServer@325
> > ]
> > - Expiring session 0x138763251670a38, timeout of 40000ms exceeded
> > 2012-07-19 07:02:04,000 [myid:] - INFO
>  [SessionTracker:ZooKeeperServer@325
> > ]
> > - Expiring session 0x138763251670a36, timeout of 40000ms exceeded
> > 2012-07-19 07:02:04,001 [myid:] - INFO
>  [SessionTracker:ZooKeeperServer@325
> > ]
> > - Expiring session 0x138763251670a3c, timeout of 40000ms exceeded
> > 2012-07-19 07:02:04,001 [myid:] - INFO
>  [SessionTracker:ZooKeeperServer@325
> > ]
> > - Expiring session 0x138763251670a3e, timeout of 40000ms exceeded
> > 2012-07-19 07:02:04,001 [myid:] - INFO  [ProcessThread(sid:0
> > cport:-1)::PrepRequestProcessor@476] - Processed session termination for
> > sessionid: 0x138763251670a38
> > 2012-07-19 07:02:04,001 [myid:] - INFO  [ProcessThread(sid:0
> > cport:-1)::PrepRequestProcessor@476] - Processed session termination for
> > sessionid: 0x138763251670a36
> > 2012-07-19 07:02:04,002 [myid:] - INFO  [ProcessThread(sid:0
> > cport:-1)::PrepRequestProcessor@476] - Processed session termination for
> > sessionid: 0x138763251670a3c
> > 2012-07-19 07:02:04,002 [myid:] - INFO  [ProcessThread(sid:0
> > cport:-1)::PrepRequestProcessor@476] - Processed session termination for
> > sessionid: 0x138763251670a3e
> >
> >
> > On Thu, Jul 19, 2012 at 4:55 PM, Kevin <[email protected]>
> wrote:
> >
> > > That's it. That's the end of the regionserver log. In the master's web
> UI
> > > the regionserver is in the table labeled "Dead Region Servers."
> > >
> > > In the master's log there is:
> > > 2012-07-19 07:02:04,016 INFO
> > > org.apache.hadoop.hbase.zookeeper.RegionServerTracker: RegionServer
> > > ephemeral node deleted, processing expiration
> > [slave2,60020,1342694622535]
> > > 2012-07-19 07:02:04,025 DEBUG
> > > org.apache.hadoop.hbase.master.AssignmentManager: based on AM, current
> > > region=-ROOT-,,0.70236052 is on server=slave2,60020,1342694622535
> server
> > > being checked: slave2,60020,1342694622535
> > > 2012-07-19 07:02:04,025 DEBUG
> > > org.apache.hadoop.hbase.master.AssignmentManager: based on AM, current
> > > region=.META.,,1.1028785192 is on server=slave2,60020,1342694622535
> > server
> > > being checked: slave2,60020,1342694622535
> > > 2012-07-19 07:02:04,027 DEBUG
> > > org.apache.hadoop.hbase.master.ServerManager:
> > > Added=slave2,60020,1342694622535 to dead servers, submitted shutdown
> > > handler to be executed, root=true, meta=true
> > > 2012-07-19 07:02:04,027 INFO
> > > org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Splitting
> > > logs for slave2,60020,1342694622535
> > >
> > > The timestamps are different than above but it's from the same
> > application
> > > just at a different time than I sent before. The message would look the
> > > same only with the timestamps being 2012-07-19 09:49ish
> > >
> > >
> > > On Thu, Jul 19, 2012 at 4:50 PM, Ted Yu <[email protected]> wrote:
> > >
> > >> Can you paste more of the region server log after 09:49:18,551 (till
> the
> > >> region server died) ?
> > >>
> > >> Thanks
> > >>
> > >> On Thu, Jul 19, 2012 at 1:46 PM, Kevin <[email protected]>
> > wrote:
> > >>
> > >> > The log snippet just before the regionservers die look like this:
> > >> >
> > >> > 2012-07-19 09:49:18,551 INFO project.coproc.IndexEndpoint: putting
> new
> > >> > rowkey
> > >> > 2012-07-19 09:49:18,551 INFO project.coproc.IndexEndpoint: new
> rowkey
> > >> put
> > >> > 2012-07-19 09:49:18,551 INFO project.coproc.IndexEndpoint: coproc
> > time:
> > >> > 1227 ms
> > >> > 2012-07-19 09:49:18,551 INFO project.coproc.IndexEndpoint: closing
> > >> scanner
> > >> > 2012-07-19 09:49:18,551 INFO project.coproc.IndexEndpoint: scanner
> > >> closed
> > >> > <after this log statement in the endpoint code is the return
> > statement>
> > >> >
> > >> > A coprocessorExec call may be from 3-20 seconds after the previous
> (it
> > >> > depends how long the last call took). But I know the endpoints are
> > >> > finishing their code fast because throughout the log each "coproc
> > time:"
> > >> > statement is under 5 seconds.
> > >> >
> > >> > I am using CDH4b2, which uses HBase 0.92.1.
> > >> >
> > >> > On Thu, Jul 19, 2012 at 4:35 PM, Ted Yu <[email protected]>
> wrote:
> > >> >
> > >> > > Kevin:
> > >> > > Can you pastebin the log snippet from region server just before it
> > >> died ?
> > >> > >
> > >> > > How frequent were your coprocessorExec() calls ?
> > >> > > What HBase version were you using ?
> > >> > >
> > >> > > Thanks
> > >> > >
> > >> > > On Thu, Jul 19, 2012 at 12:44 PM, Kevin <
> [email protected]>
> > >> > wrote:
> > >> > >
> > >> > > > Hi,
> > >> > > >
> > >> > > > I'm using endpoint coprocessors to do intense scans in parallel
> on
> > >> some
> > >> > > > tables. I log the time it takes for each coprocessor to finish
> its
> > >> job
> > >> > on
> > >> > > > the region. Each coprocessor rarely takes longer than a few
> > seconds,
> > >> > > > maximum of 5 seconds (there are only 5 regions on the tables
> right
> > >> > now).
> > >> > > As
> > >> > > > my cluster grows with data the call HTable.coprocessorExec takes
> > >> longer
> > >> > > and
> > >> > > > longer but the coprocessors themselves finish quickly (under 5
> > >> > seconds).
> > >> > > > Eventually I see all my regionservers die because the
> > >> coprocessorExec
> > >> > > call
> > >> > > > timed out and zookeeper kills the connection, which makes the
> > >> > > regionserver
> > >> > > > die.
> > >> > > >
> > >> > > > In terms of code structure, the coprocessorExec call is done
> > inside
> > >> a
> > >> > > > for-loop. The for-loop iterates over a List of objects to help
> > form
> > >> > > filters
> > >> > > > for the endpoint and then calls the coprocessorExec once per
> > object
> > >> > > > processed.
> > >> > > >
> > >> > > > What would be the bottleneck? Is calling the coprocessor like
> this
> > >> in a
> > >> > > > for-loop loading the regions down and not allowing them time to
> do
> > >> GC?
> > >> > Is
> > >> > > > there a way to ping a table and judge if it'll be ready for the
> > >> > endpoint
> > >> > > > call?
> > >> > > >
> > >> > > > Thanks,
> > >> > > > -Kevin
> > >> > > >
> > >> > >
> > >> >
> > >>
> > >
> > >
> >
>

Reply via email to