Hello,

We recently upgraded our staging cluster, which runs on S3-backed EMR, from
HBase 1.3.0 to HBase 1.4.9. After the upgrade, running a test load produced a
sudden spike in CPU utilization on one of the nodes in the cluster (jumping
from 20% to 60% all at once). Looking at per-thread CPU time and a thread dump
of the application, we found exactly two threads, each with enormous CPU
usage, that had been running since the start of the spike. Both appear to be
stuck in the CellComparator compare method. We turned off all requests to the
cluster, but the CPU usage and those threads remain in the same state. I've
pasted some data and stack traces below and also attached full thread dumps.
We did not have this issue when running HBase 1.3.

What top shows:
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
7198 hbase     20   0 7944m 2.6g  40m S 213.0 17.9 135:39.08 java
4755 root      20   0 4329m 262m  23m S  0.7  1.7   0:30.52 java
6630 yarn      20   0 4014m 430m  36m S  0.3  2.9   0:48.25 java
    1 root      20   0 19680 2708 2324 S  0.0  0.0   0:01.37 init
    2 root      20   0     0    0    0 S  0.0  0.0   0:00.00 kthreadd
    4 root       0 -20     0    0    0 I  0.0  0.0   0:00.00 kworker/0:0H
    6 root       0 -20     0    0    0 I  0.0  0.0   0:00.00 mm_percpu_wq
    7 root      20   0     0    0    0 S  0.0  0.0   0:01.47 ksoftirqd/0
    8 root      20   0     0    0    0 I  0.0  0.0   0:02.74 rcu_sched
    9 root      20   0     0    0    0 I  0.0  0.0   0:00.00 rcu_bh
   10 root      RT   0     0    0    0 S  0.0  0.0   0:00.04 migration/0
   11 root      RT   0     0    0    0 S  0.0  0.0   0:00.01 watchdog/0
   12 root      20   0     0    0    0 S  0.0  0.0   0:00.00 cpuhp/0
   13 root      20   0     0    0    0 S  0.0  0.0   0:00.00 cpuhp/1
   14 root      RT   0     0    0    0 S  0.0  0.0   0:00.01 watchdog/1
   15 root      RT   0     0    0    0 S  0.0  0.0   0:00.03 migration/1
   16 root      20   0     0    0    0 S  0.0  0.0   0:00.03 ksoftirqd/1
   18 root       0 -20     0    0    0 I  0.0  0.0   0:00.00 kworker/1:0H

A breakdown of the threads:

sudo top -n 1 -H -p 7198
top - 23:25:50 up  1:50,  2 users,  load average: 2.11, 2.28, 2.27
Tasks: 232 total,   2 running, 230 sleeping,   0 stopped,   0 zombie
Cpu(s): 36.5%us,  1.0%sy,  0.0%ni, 62.0%id,  0.1%wa,  0.0%hi,  0.2%si,  0.2%st
Mem:  15394140k total, 10347008k used,  5047132k free,    60708k buffers
Swap:        0k total,        0k used,        0k free,  5358004k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
7402 hbase     20   0 7944m 2.7g  40m R 92.8 18.7  65:12.87 java
7399 hbase     20   0 7944m 2.7g  40m R 90.8 18.7  65:17.01 java
7257 hbase     20   0 7944m 2.7g  40m S  7.9 18.7   4:01.49 java
7258 hbase     20   0 7944m 2.7g  40m S  7.9 18.7   4:01.30 java
7259 hbase     20   0 7944m 2.7g  40m S  7.9 18.7   4:01.46 java
7260 hbase     20   0 7944m 2.7g  40m S  7.9 18.7   4:01.69 java
10443 hbase     20   0 7944m 2.7g  40m S  2.0 18.7   0:14.43 java
7198 hbase     20   0 7944m 2.7g  40m S  0.0 18.7   0:00.01 java
7256 hbase     20   0 7944m 2.7g  40m S  0.0 18.7   0:06.54 java
7261 hbase     20   0 7944m 2.7g  40m S  0.0 18.7   0:13.84 java
7262 hbase     20   0 7944m 2.7g  40m S  0.0 18.7   0:41.45 java
7263 hbase     20   0 7944m 2.7g  40m S  0.0 18.7   0:00.48 java
7264 hbase     20   0 7944m 2.7g  40m S  0.0 18.7   0:00.41 java
7265 hbase     20   0 7944m 2.7g  40m S  0.0 18.7   0:00.00 java
7266 hbase     20   0 7944m 2.7g  40m S  0.0 18.7   0:00.00 java
7267 hbase     20   0 7944m 2.7g  40m S  0.0 18.7   0:33.95 java
7268 hbase     20   0 7944m 2.7g  40m S  0.0 18.7   0:32.26 java
7269 hbase     20   0 7944m 2.7g  40m S  0.0 18.7   0:08.90 java
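
To match the two hot thread IDs above (7402 and 7399) against the thread
dump, we converted them to hex, since the dump reports each thread's native
ID as a hex nid. A quick sketch of the conversion (any hex conversion works
just as well):

public class TidToNid {
    public static void main(String[] args) {
        // Linux thread IDs from `top -H`, converted to the hex nid values
        // found in the jstack output: 7402 -> 0x1cea, 7399 -> 0x1ce7.
        for (int tid : new int[] {7402, 7399}) {
            System.out.println(tid + " -> 0x" + Integer.toHexString(tid));
        }
    }
}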

Stacktrace for PID 7402:
RpcServer.default.FPBQ.Fifo.handler=94,queue=4,port=16020
priority:5 - threadId:0x00007f0ec5ef7000 - nativeId:0x1cea - nativeId (decimal):7402 - state:RUNNABLE
stackTrace:
java.lang.Thread.State: RUNNABLE
at org.apache.hadoop.hbase.CellComparator.compareColumns(CellComparator.java:158)
at org.apache.hadoop.hbase.CellComparator.compareWithoutRow(CellComparator.java:224)
at org.apache.hadoop.hbase.CellComparator.compare(CellComparator.java:66)
at org.apache.hadoop.hbase.KeyValue$KVComparator.compare(KeyValue.java:2026)
at org.apache.hadoop.hbase.filter.FilterListBase.compareCell(FilterListBase.java:86)
at org.apache.hadoop.hbase.filter.FilterListWithOR.getNextCellHint(FilterListWithOR.java:371)
at org.apache.hadoop.hbase.filter.FilterList.getNextCellHint(FilterList.java:265)
at org.apache.hadoop.hbase.regionserver.querymatcher.UserScanQueryMatcher.getNextKeyHint(UserScanQueryMatcher.java:96)
at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:686)
at org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:152)
at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.populateResult(HRegion.java:6290)
at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextInternal(HRegion.java:6450)
at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:6222)
at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:2847)
- locked <0x00000006964deaf0> (a org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl)
at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:3096)
at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:36613)
at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2380)
at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:297)
at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:277)
Locked ownable synchronizers:
- None

Stacktrace for PID 7399:
priority:5 - threadId:0x00007f0ec5ef1000 - nativeId:0x1ce7 - nativeId (decimal):7399 - state:RUNNABLE
stackTrace:
java.lang.Thread.State: RUNNABLE
at org.apache.hadoop.hbase.CellComparator.compareRows(CellComparator.java:187)
at org.apache.hadoop.hbase.CellComparator.compare(CellComparator.java:63)
at org.apache.hadoop.hbase.KeyValue$KVComparator.compare(KeyValue.java:2026)
at org.apache.hadoop.hbase.filter.FilterListBase.compareCell(FilterListBase.java:86)
at org.apache.hadoop.hbase.filter.FilterListWithOR.getNextCellHint(FilterListWithOR.java:371)
at org.apache.hadoop.hbase.filter.FilterList.getNextCellHint(FilterList.java:265)
at org.apache.hadoop.hbase.regionserver.querymatcher.UserScanQueryMatcher.getNextKeyHint(UserScanQueryMatcher.java:96)
at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:686)
at org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:152)
at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.populateResult(HRegion.java:6290)
at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextInternal(HRegion.java:6450)
at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:6222)
at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:2847)
- locked <0x00000006ae08a770> (a org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl)
at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:3096)
at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:36613)
at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2380)
at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:297)
at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:277)
Locked ownable synchronizers:
- None
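
For context on how a scan reaches this code path:
FilterListWithOR.getNextCellHint, visible in both traces, is (as far as we
understand) exercised when a scan's FilterList uses Operator.MUST_PASS_ONE
and its sub-filters return seek hints. Our actual filters are more involved;
the sketch below is purely illustrative (the table name and prefixes are made
up) but issues the same kind of scan:

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.filter.ColumnPrefixFilter;
import org.apache.hadoop.hbase.filter.FilterList;
import org.apache.hadoop.hbase.util.Bytes;

public class OrFilterScan {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection();
             Table table = conn.getTable(TableName.valueOf("test_table"))) {
            // MUST_PASS_ONE makes the region server evaluate the list via
            // FilterListWithOR; ColumnPrefixFilter supplies seek hints.
            FilterList orList = new FilterList(FilterList.Operator.MUST_PASS_ONE,
                new ColumnPrefixFilter(Bytes.toBytes("a")),
                new ColumnPrefixFilter(Bytes.toBytes("b")));
            Scan scan = new Scan();
            scan.setFilter(orList);
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result r : scanner) {
                    // Drain results; the runaway CPU shows up server-side.
                }
            }
        }
    }
}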

Does anyone know what is going on? Is there possibly an infinite loop or some 
other mechanism causing these threads to get stuck?

Thanks,
Srinidhi

