The log snippet just before the regionservers die looks like this:

2012-07-19 09:49:18,551 INFO project.coproc.IndexEndpoint: putting new rowkey
2012-07-19 09:49:18,551 INFO project.coproc.IndexEndpoint: new rowkey put
2012-07-19 09:49:18,551 INFO project.coproc.IndexEndpoint: coproc time: 1227 ms
2012-07-19 09:49:18,551 INFO project.coproc.IndexEndpoint: closing scanner
2012-07-19 09:49:18,551 INFO project.coproc.IndexEndpoint: scanner closed
<after this log statement in the endpoint code is the return statement>

A coprocessorExec call may come anywhere from 3 to 20 seconds after the
previous one (it depends on how long the last call took). But I know the
endpoints are finishing their code fast, because throughout the log each
"coproc time:" statement is under 5 seconds.

I am using CDH4b2, which uses HBase 0.92.1.

On Thu, Jul 19, 2012 at 4:35 PM, Ted Yu <[email protected]> wrote:

> Kevin:
> Can you pastebin the log snippet from region server just before it died ?
>
> How frequent were your coprocessorExec() calls ?
> What HBase version were you using ?
>
> Thanks
>
> On Thu, Jul 19, 2012 at 12:44 PM, Kevin <[email protected]> wrote:
>
> > Hi,
> >
> > I'm using endpoint coprocessors to do intense scans in parallel on some
> > tables. I log the time it takes for each coprocessor to finish its job on
> > the region. Each coprocessor rarely takes longer than a few seconds,
> > maximum of 5 seconds (there are only 5 regions on the tables right now).
> > As my cluster grows with data the call HTable.coprocessorExec takes
> > longer and longer but the coprocessors themselves finish quickly (under
> > 5 seconds). Eventually I see all my regionservers die because the
> > coprocessorExec call timed out and zookeeper kills the connection, which
> > makes the regionserver die.
> >
> > In terms of code structure, the coprocessorExec call is done inside a
> > for-loop. The for-loop iterates over a List of objects to help form
> > filters for the endpoint and then calls the coprocessorExec once per
> > object processed.
> >
> > What would be the bottleneck? Is calling the coprocessor like this in a
> > for-loop loading the regions down and not allowing them time to do GC?
> > Is there a way to ping a table and judge if it'll be ready for the
> > endpoint call?
> >
> > Thanks,
> > -Kevin
