Yes, that is what I would expect, but the client is stuck in the Object.wait() 
call and doesn't get notified until the RPC timeout passes.
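
To make the distinction concrete, here is a minimal sketch of a wait that is bounded by a deadline instead of relying on a notification that may never come. It is not the HBaseClient code: PendingCall, complete() and await() are hypothetical names invented for illustration, and the sketch assumes a reader thread calls complete() when a response arrives.

```java
import java.util.concurrent.TimeoutException;

/** Hypothetical stand-in for an in-flight RPC call; NOT the actual HBaseClient class. */
final class PendingCall {
    private boolean done;
    private Object value;

    /** Assumed to be called by the connection's reader thread when a response arrives. */
    synchronized void complete(Object v) {
        value = v;
        done = true;
        notifyAll();
    }

    /** Waits for the response, but gives up once timeoutMillis has elapsed. */
    synchronized Object await(long timeoutMillis)
            throws InterruptedException, TimeoutException {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        // Loop on the condition to tolerate spurious wakeups.
        while (!done) {
            long remainingNanos = deadline - System.nanoTime();
            if (remainingNanos <= 0) {
                throw new TimeoutException("no response within " + timeoutMillis + " ms");
            }
            // Add 1 ms so we never call wait(0), which would wait forever.
            wait(remainingNanos / 1_000_000L + 1);
        }
        return value;
    }
}
```

A bounded wait like this only guarantees that the caller gets control back; it does not by itself answer what the timeout should be for a long-running scan.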

On Dec 18, 2012, at 12:27 PM, "Mesika, Asaf" <[email protected]> wrote:

> One thing I don't get:
> If the RS went down, then the RPC connection should have been reset, thus 
> causing the client to interrupt, right? It shouldn't be a matter of timeout 
> at all.
> 
> On Dec 17, 2012, at 7:18 PM, Bryan Keller wrote:
> 
>> It seems there was a cascading effect. The regionservers were busy 
>> scanning a table, which resulted in some long GCs. The GCs were long 
>> enough to trigger the ZooKeeper timeout on at least one regionserver, which 
>> resulted in the regionserver shutting itself down. That in turn left the 
>> client stuck in the Object.wait() call, which only returned after the very 
>> long RPC timeout.
>> 
>> I have done a fair amount of work optimizing the GCs, and I increased the 
>> regionserver timeouts, which should help with the regionserver shutdowns. 
>> But if a regionserver does shut down for some other reason, this will still 
>> result in the Object.wait() hang.
>> 
>> One approach might be to have the regionservers send back a keep-alive, or 
>> progress, message during a scan, and that message would reset the RPC timer. 
>> The regionserver could do this every x number of rows processed server-side. 
>> Then the RPC timeout could be something more sensible rather than being set 
>> to the longest time it takes to scan a region.
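
The keep-alive idea above can be sketched roughly as follows. This is only an illustration under stated assumptions, not an existing HBase mechanism: ProgressDeadline, onProgress(), isExpired() and shouldSendProgress() are hypothetical names, and the clock is passed in explicitly so the behavior is easy to check.

```java
/** Hypothetical client-side RPC timer; progress messages push the deadline out. */
final class ProgressDeadline {
    private final long timeoutNanos;
    private long deadlineNanos;

    ProgressDeadline(long timeoutMillis, long nowNanos) {
        this.timeoutNanos = timeoutMillis * 1_000_000L;
        this.deadlineNanos = nowNanos + timeoutNanos;
    }

    /** Called whenever a keep-alive or progress message arrives from the server. */
    synchronized void onProgress(long nowNanos) {
        deadlineNanos = nowNanos + timeoutNanos;
    }

    /** True once a full timeout has passed without any progress message. */
    synchronized boolean isExpired(long nowNanos) {
        return nowNanos - deadlineNanos >= 0;
    }

    /** Server side (assumed): true on every intervalRows-th row scanned. */
    static boolean shouldSendProgress(long rowsScanned, int intervalRows) {
        return rowsScanned > 0 && rowsScanned % intervalRows == 0;
    }
}
```

With a timer like this, the timeout only needs to cover the gap between two progress messages, not the time to scan a whole region.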
>> 
>> HBASE-5416 looks useful: it will make scans faster. The problem I'm 
>> encountering will still be present, but perhaps I could set the RPC timeout 
>> a bit lower. HBASE-6313 might fix the hang, in which case I could live with 
>> the longer RPC timeout setting.
>> 
>> 
>> On Dec 14, 2012, at 9:49 PM, Ted Yu <[email protected]> wrote:
>> 
>>> Bryan:
>>> 
>>> bq. My only thought would be to forego using filters
>>> Please keep using filters.
>>> 
>>> Sergey and I are working on HBASE-5416: Improve performance of scans with
>>> some kind of filters
>>> This feature allows you to specify one column family as being essential.
>>> The other column family is only returned to the client when the essential
>>> column family matches. I wonder if this may be of help to you.
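
As a rough illustration of the idea described above (and only of the idea, not the HBASE-5416 patch or any HBase API), the sketch below evaluates a filter against the essential family first and loads the other family only for rows that pass. EssentialFamilyScan, readRow() and the loader callback are hypothetical names, and plain maps stand in for column families.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;
import java.util.function.Predicate;

/** Conceptual sketch only; maps of qualifier to value stand in for column families. */
final class EssentialFamilyScan {
    /**
     * Returns the merged row if the essential family passes the filter, else null.
     * loadOtherFamily is invoked lazily, so filtered-out rows never pay for it.
     */
    static Map<String, String> readRow(
            String rowKey,
            Map<String, String> essentialCells,
            Predicate<Map<String, String>> filter,
            Function<String, Map<String, String>> loadOtherFamily) {
        if (!filter.test(essentialCells)) {
            return null; // row filtered out; the other family is never read
        }
        Map<String, String> row = new HashMap<>(essentialCells);
        row.putAll(loadOtherFamily.apply(rowKey));
        return row;
    }
}
```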
>>> 
>>> You mentioned a regionserver going down or being busy. I assume it was not
>>> often that regionserver(s) went down. For the busy regionserver, did you
>>> try running jstack on the regionserver process?
>>> 
>>> Thanks
>>> 
>>> On Fri, Dec 14, 2012 at 2:59 PM, Bryan Keller <[email protected]> wrote:
>>> 
>>>> I have encountered a problem with HBaseClient.call() hanging. This occurs
>>>> when one of my regionservers goes down while performing a table scan.
>>>> 
>>>> What exacerbates this problem is that the scan I am performing uses
>>>> filters, and the region size of the table is large (4 GB). Because of this,
>>>> it can take several minutes for a row to be returned when calling
>>>> scanner.next(). Apparently there is no keep-alive message being sent back
>>>> to the scanner while the regionserver is busy, so I had to increase the
>>>> hbase.rpc.timeout value to a large number (60 min); otherwise the next()
>>>> call will time out waiting for the regionserver to send something back.
>>>> 
>>>> The result is that this HBaseClient.call() hang is made much worse,
>>>> because it won't time out for 60 minutes.
>>>> 
>>>> I have a couple of questions:
>>>> 
>>>> 1. Any thoughts on why the HBaseClient.call() is getting stuck? I noticed
>>>> that call.wait() is not using any timeout, so it will wait indefinitely
>>>> until interrupted externally.
>>>> 
>>>> 2. Is there a solution where I do not need to set hbase.rpc.timeout to a
>>>> very large number? My only thought would be to forego using filters and do
>>>> the filtering client side, which seems pretty inefficient.
>>>> 
>>>> Here is a stack dump of the thread that was hung:
>>>> 
>>>> Thread 10609: (state = BLOCKED)
>>>> - java.lang.Object.wait(long) @bci=0 (Interpreted frame)
>>>> - java.lang.Object.wait() @bci=2, line=485 (Interpreted frame)
>>>> - org.apache.hadoop.hbase.ipc.HBaseClient.call(org.apache.hadoop.io.Writable, java.net.InetSocketAddress, java.lang.Class, org.apache.hadoop.hbase.security.User, int) @bci=51, line=904 (Interpreted frame)
>>>> - org.apache.hadoop.hbase.ipc.WritableRpcEngine$Invoker.invoke(java.lang.Object, java.lang.reflect.Method, java.lang.Object[]) @bci=52, line=150 (Interpreted frame)
>>>> - $Proxy12.next(long, int) @bci=26 (Interpreted frame)
>>>> - org.apache.hadoop.hbase.client.ScannerCallable.call() @bci=72, line=92 (Interpreted frame)
>>>> - org.apache.hadoop.hbase.client.ScannerCallable.call() @bci=1, line=42 (Interpreted frame)
>>>> - org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionServerWithRetries(org.apache.hadoop.hbase.client.ServerCallable) @bci=36, line=1325 (Interpreted frame)
>>>> - org.apache.hadoop.hbase.client.HTable$ClientScanner.next() @bci=117, line=1299 (Compiled frame)
>>>> - org.apache.hadoop.hbase.mapreduce.TableRecordReaderImpl.nextKeyValue() @bci=41, line=150 (Interpreted frame)
>>>> - org.apache.hadoop.hbase.mapreduce.TableRecordReader.nextKeyValue() @bci=4, line=142 (Interpreted frame)
>>>> - org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue() @bci=4, line=458 (Interpreted frame)
>>>> - org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue() @bci=4, line=76 (Interpreted frame)
>>>> - org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue() @bci=4, line=85 (Interpreted frame)
>>>> - org.apache.hadoop.mapreduce.Mapper.run(org.apache.hadoop.mapreduce.Mapper$Context) @bci=6, line=139 (Interpreted frame)
>>>> - org.apache.hadoop.mapred.MapTask.runNewMapper(org.apache.hadoop.mapred.JobConf, org.apache.hadoop.mapreduce.split.JobSplit$TaskSplitIndex, org.apache.hadoop.mapred.TaskUmbilicalProtocol, org.apache.hadoop.mapred.Task$TaskReporter) @bci=201, line=645 (Interpreted frame)
>>>> - org.apache.hadoop.mapred.MapTask.run(org.apache.hadoop.mapred.JobConf, org.apache.hadoop.mapred.TaskUmbilicalProtocol) @bci=100, line=325 (Interpreted frame)
>>>> - org.apache.hadoop.mapred.Child$4.run() @bci=29, line=268 (Interpreted frame)
>>>> - java.security.AccessController.doPrivileged(java.security.PrivilegedExceptionAction, java.security.AccessControlContext) @bci=0 (Interpreted frame)
>>>> - javax.security.auth.Subject.doAs(javax.security.auth.Subject, java.security.PrivilegedExceptionAction) @bci=42, line=396 (Interpreted frame)
>>>> - org.apache.hadoop.security.UserGroupInformation.doAs(java.security.PrivilegedExceptionAction) @bci=14, line=1332 (Interpreted frame)
>>>> - org.apache.hadoop.mapred.Child.main(java.lang.String[]) @bci=776, line=262 (Interpreted frame)
>>>> 
>>>> 
>> 
> 
