Thanks Sergey for your response, it helped a lot indeed. It looks like the problem is located in the RS that keeps SYSTEM.CATALOG. In that RS, all RPC handlers but one are in a waiting state, blocked on a row lock (see stack trace below).
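For reference, this is a minimal sketch of how the parked handlers could be counted from a saved jstack dump of that RS (the dump file name, the "defaultRpcServer.handler" thread-name pattern and the matched frame are just illustrative assumptions, not something from our actual tooling):

    // Sketch: count RPC handler threads that are parked inside
    // HRegion.getRowLockInternal in a saved jstack dump of the region server.
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class CountParkedHandlers {
        public static void main(String[] args) throws IOException {
            // Hypothetical dump file, e.g. produced with jstack on the RS pid.
            String dump = args.length > 0 ? args[0] : "rs-threaddump.txt";
            int handlers = 0;          // RPC handler threads seen in the dump
            int onRowLock = 0;         // handlers currently in getRowLockInternal
            boolean isHandler = false; // current thread section is an RPC handler
            boolean parked = false;    // current handler is waiting on the row lock
            for (String line : Files.readAllLines(Paths.get(dump))) {
                if (line.startsWith("\"")) {   // jstack starts each thread with "name"
                    if (isHandler && parked) onRowLock++;
                    isHandler = line.contains("defaultRpcServer.handler");
                    parked = false;
                    if (isHandler) handlers++;
                } else if (isHandler && line.contains("HRegion.getRowLockInternal")) {
                    parked = true;
                }
            }
            if (isHandler && parked) onRowLock++;   // flush the last thread section
            System.out.println(onRowLock + " of " + handlers
                    + " RPC handlers waiting in getRowLockInternal");
        }
    }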
So I think that basically all deletes are being processed in sequence. I also noticed around 400 threads named htable-poolXXXXX-t1 waiting on condition in the same RS (all other RSs are normal). These threads disappear when the delete process finishes. This problem started happening after an HBase cluster reboot, but... why?

"B.defaultRpcServer.handler=0,queue=0,port=60020" daemon prio=10 tid=0x00007f142e96c000 nid=0x68bb waiting on condition [0x00007f0ba7289000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for <0x00007f102be9b3a0> (a java.util.concurrent.CountDownLatch$Sync)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1033)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1326)
        at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:282)
        at org.apache.hadoop.hbase.regionserver.HRegion.getRowLockInternal(HRegion.java:4830)
        at org.apache.hadoop.hbase.regionserver.HRegion.getRowLock(HRegion.java:4800)
        at org.apache.hadoop.hbase.regionserver.HRegion.getRowLock(HRegion.java:4853)
        at org.apache.phoenix.coprocessor.MetaDataEndpointImpl.doGetTable(MetaDataEndpointImpl.java:2386)
        at org.apache.phoenix.coprocessor.MetaDataEndpointImpl.doGetTable(MetaDataEndpointImpl.java:2354)
        at org.apache.phoenix.coprocessor.MetaDataEndpointImpl.getTable(MetaDataEndpointImpl.java:436)
        at org.apache.phoenix.coprocessor.generated.MetaDataProtos$MetaDataService.callMethod(MetaDataProtos.java:11609)
        at org.apache.hadoop.hbase.regionserver.HRegion.execService(HRegion.java:7395)
        at org.apache.hadoop.hbase.regionserver.RSRpcServices.execServiceOnRegion(RSRpcServices.java:1776)
        at org.apache.hadoop.hbase.regionserver.RSRpcServices.execService(RSRpcServices.java:1758)
        at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:32209)
        at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2034)
        at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:107)
        at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
        at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
        at java.lang.Thread.run(Thread.java:745)

Many thanks!!

Pedro

On 18 August 2017 at 21:15, Sergey Soldatov <sergeysolda...@gmail.com> wrote:

> Hi Pedro,
>
> Usually that kind of behavior should be reflected in the region server
> logs. Try turning on DEBUG level and check what exactly the RS is doing
> during that time. You can also take a thread dump of the RS during the
> execution and see what the RPC handlers are doing. One thing that should
> be checked first is the RPC handlers: if they are all busy, you may
> consider increasing the number of handlers. If you have an RPC scheduler
> and controller configured, double check that the regular handlers are
> used, not the IndexRPC ones (there was a bug where the client was sending
> all RPCs with index priority). If you see that, remove the controller
> factory property on the client side.
>
> Thanks,
> Sergey
>
> On Fri, Aug 18, 2017 at 4:46 AM, Pedro Boado <pedro.bo...@gmail.com>
> wrote:
>
>> Hi all,
>>
>> We have two HBase 1.0 clusters running the same process in parallel,
>> effectively keeping the same data in both Phoenix tables.
>>
>> This process feeds data into Phoenix 4.5 via HFile, and once the data
>> is loaded a Spark process deletes a few thousand rows from the tables
>> (secondary indexing is disabled in our installation).
>>
>> After an HBase restart (no config changes involved), one of the
>> clusters has started running these deletes too slowly (the fast run
>> takes 5 min and the slow one around 1 hour). More worryingly, while the
>> process is running, Phoenix queries take hundreds of seconds instead of
>> being sub-second (even opening sqlline is very slow).
>>
>> We've almost run out of ideas trying to find the cause of this
>> behaviour. There are no evident GC pauses, CPU usage is normal, HDFS IO
>> is normal, memory usage is normal, etc.
>>
>> As soon as the delete process finishes, Phoenix goes back to normal
>> behaviour.
>>
>> Does anybody have any ideas about potential causes of this behaviour?
>>
>> Many thanks!!
>>
>> Pedro.

--
Un saludo.
Pedro Boado.