Hello,

We recently upgraded our staging cluster, which runs on S3-backed EMR, from HBase 1.3.0 to HBase 1.4.9. After doing so and running a test load, we noticed a sudden spike in CPU utilization on one of the nodes in the cluster (jumping from 20% to 60% all at once). After looking at the CPU time of the threads and taking a thread dump of the application, we found exactly two threads with enormous CPU usage that had been running since the start of the spike. They both appear to be stuck in the CellComparator compare method. We turned off all requests to the cluster, but the CPU usage and those threads remain in the same state. I've pasted some data and stack traces below and also attached full thread dumps. We did not have this issue when running HBase 1.3.
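In case it helps when reading the dumps: top -H reports native thread IDs in decimal, while the thread-dump entries carry the hex form (nativeId), so 7402 corresponds to 0x1cea and 7399 to 0x1ce7. A throwaway converter we used for cross-checking (nothing HBase-specific, class name is just for illustration):

    public class TidToNid {
      public static void main(String[] args) {
        // Print each decimal LWP id from "top -H" alongside its hex form,
        // which is how jstack-style dumps label threads (nid / nativeId).
        for (String tid : args) {
          long id = Long.parseLong(tid.trim());
          System.out.println(id + " -> 0x" + Long.toHexString(id));
        }
      }
    }

Run as, e.g., "java TidToNid 7402 7399".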
What top shows:

  PID USER   PR  NI  VIRT  RES  SHR S  %CPU %MEM     TIME+ COMMAND
 7198 hbase  20   0 7944m 2.6g  40m S 213.0 17.9 135:39.08 java
 4755 root   20   0 4329m 262m  23m S   0.7  1.7   0:30.52 java
 6630 yarn   20   0 4014m 430m  36m S   0.3  2.9   0:48.25 java
    1 root   20   0 19680 2708 2324 S   0.0  0.0   0:01.37 init
    2 root   20   0     0    0    0 S   0.0  0.0   0:00.00 kthreadd
    4 root    0 -20     0    0    0 I   0.0  0.0   0:00.00 kworker/0:0H
    6 root    0 -20     0    0    0 I   0.0  0.0   0:00.00 mm_percpu_wq
    7 root   20   0     0    0    0 S   0.0  0.0   0:01.47 ksoftirqd/0
    8 root   20   0     0    0    0 I   0.0  0.0   0:02.74 rcu_sched
    9 root   20   0     0    0    0 I   0.0  0.0   0:00.00 rcu_bh
   10 root   RT   0     0    0    0 S   0.0  0.0   0:00.04 migration/0
   11 root   RT   0     0    0    0 S   0.0  0.0   0:00.01 watchdog/0
   12 root   20   0     0    0    0 S   0.0  0.0   0:00.00 cpuhp/0
   13 root   20   0     0    0    0 S   0.0  0.0   0:00.00 cpuhp/1
   14 root   RT   0     0    0    0 S   0.0  0.0   0:00.01 watchdog/1
   15 root   RT   0     0    0    0 S   0.0  0.0   0:00.03 migration/1
   16 root   20   0     0    0    0 S   0.0  0.0   0:00.03 ksoftirqd/1
   18 root    0 -20     0    0    0 I   0.0  0.0   0:00.00 kworker/1:0H

A breakdown of the threads (sudo top -n 1 -H -p 7198):

top - 23:25:50 up 1:50, 2 users, load average: 2.11, 2.28, 2.27
Tasks: 232 total, 2 running, 230 sleeping, 0 stopped, 0 zombie
Cpu(s): 36.5%us, 1.0%sy, 0.0%ni, 62.0%id, 0.1%wa, 0.0%hi, 0.2%si, 0.2%st
Mem: 15394140k total, 10347008k used, 5047132k free, 60708k buffers
Swap: 0k total, 0k used, 0k free, 5358004k cached

  PID USER   PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+ COMMAND
 7402 hbase  20   0 7944m 2.7g  40m R 92.8 18.7 65:12.87 java
 7399 hbase  20   0 7944m 2.7g  40m R 90.8 18.7 65:17.01 java
 7257 hbase  20   0 7944m 2.7g  40m S  7.9 18.7  4:01.49 java
 7258 hbase  20   0 7944m 2.7g  40m S  7.9 18.7  4:01.30 java
 7259 hbase  20   0 7944m 2.7g  40m S  7.9 18.7  4:01.46 java
 7260 hbase  20   0 7944m 2.7g  40m S  7.9 18.7  4:01.69 java
10443 hbase  20   0 7944m 2.7g  40m S  2.0 18.7  0:14.43 java
 7198 hbase  20   0 7944m 2.7g  40m S  0.0 18.7  0:00.01 java
 7256 hbase  20   0 7944m 2.7g  40m S  0.0 18.7  0:06.54 java
 7261 hbase  20   0 7944m 2.7g  40m S  0.0 18.7  0:13.84 java
 7262 hbase  20   0 7944m 2.7g  40m S  0.0 18.7  0:41.45 java
 7263 hbase  20   0 7944m 2.7g  40m S  0.0 18.7  0:00.48 java
 7264 hbase  20   0 7944m 2.7g  40m S  0.0 18.7  0:00.41 java
 7265 hbase  20   0 7944m 2.7g  40m S  0.0 18.7  0:00.00 java
 7266 hbase  20   0 7944m 2.7g  40m S  0.0 18.7  0:00.00 java
 7267 hbase  20   0 7944m 2.7g  40m S  0.0 18.7  0:33.95 java
 7268 hbase  20   0 7944m 2.7g  40m S  0.0 18.7  0:32.26 java
 7269 hbase  20   0 7944m 2.7g  40m S  0.0 18.7  0:08.90 java

Stacktrace for PID 7402:

RpcServer.default.FPBQ.Fifo.handler=94,queue=4,port=16020 priority:5 - threadId:0x00007f0ec5ef7000 - nativeId:0x1cea - nativeId (decimal):7402 - state:RUNNABLE
stackTrace:
java.lang.Thread.State: RUNNABLE
    at org.apache.hadoop.hbase.CellComparator.compareColumns(CellComparator.java:158)
    at org.apache.hadoop.hbase.CellComparator.compareWithoutRow(CellComparator.java:224)
    at org.apache.hadoop.hbase.CellComparator.compare(CellComparator.java:66)
    at org.apache.hadoop.hbase.KeyValue$KVComparator.compare(KeyValue.java:2026)
    at org.apache.hadoop.hbase.filter.FilterListBase.compareCell(FilterListBase.java:86)
    at org.apache.hadoop.hbase.filter.FilterListWithOR.getNextCellHint(FilterListWithOR.java:371)
    at org.apache.hadoop.hbase.filter.FilterList.getNextCellHint(FilterList.java:265)
    at org.apache.hadoop.hbase.regionserver.querymatcher.UserScanQueryMatcher.getNextKeyHint(UserScanQueryMatcher.java:96)
    at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:686)
    at org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:152)
    at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.populateResult(HRegion.java:6290)
    at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextInternal(HRegion.java:6450)
    at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:6222)
    at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:2847)
    - locked <0x00000006964deaf0> (a org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl)
    at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:3096)
    at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:36613)
    at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2380)
    at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
    at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:297)
    at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:277)
Locked ownable synchronizers:
    - None

Stacktrace for PID 7399:

priority:5 - threadId:0x00007f0ec5ef1000 - nativeId:0x1ce7 - nativeId (decimal):7399 - state:RUNNABLE
stackTrace:
java.lang.Thread.State: RUNNABLE
    at org.apache.hadoop.hbase.CellComparator.compareRows(CellComparator.java:187)
    at org.apache.hadoop.hbase.CellComparator.compare(CellComparator.java:63)
    at org.apache.hadoop.hbase.KeyValue$KVComparator.compare(KeyValue.java:2026)
    at org.apache.hadoop.hbase.filter.FilterListBase.compareCell(FilterListBase.java:86)
    at org.apache.hadoop.hbase.filter.FilterListWithOR.getNextCellHint(FilterListWithOR.java:371)
    at org.apache.hadoop.hbase.filter.FilterList.getNextCellHint(FilterList.java:265)
    at org.apache.hadoop.hbase.regionserver.querymatcher.UserScanQueryMatcher.getNextKeyHint(UserScanQueryMatcher.java:96)
    at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:686)
    at org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:152)
    at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.populateResult(HRegion.java:6290)
    at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextInternal(HRegion.java:6450)
    at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:6222)
    at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:2847)
    - locked <0x00000006ae08a770> (a org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl)
    at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:3096)
    at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:36613)
    at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2380)
    at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
    at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:297)
    at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:277)
Locked ownable synchronizers:
    - None

Does anyone know what is going on? Is there possibly an infinite loop or some other mechanism causing these threads to get stuck?

Thanks,
Srinidhi
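P.S. For context on the code path in the traces: both hot threads are under FilterList.getNextCellHint / FilterListWithOR.getNextCellHint, which is driven by scans whose filter is a FilterList built with Operator.MUST_PASS_ONE (OR) and whose member filters supply seek hints. The sketch below is purely illustrative of that shape; the table name and the ColumnPrefixFilter choices are placeholders, not our exact workload.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.filter.ColumnPrefixFilter;
    import org.apache.hadoop.hbase.filter.FilterList;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HintedOrScanSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("some_table"))) { // placeholder table name
          // An OR (MUST_PASS_ONE) FilterList whose members provide seek hints;
          // ColumnPrefixFilter is just one example of a hint-providing filter.
          FilterList orList = new FilterList(FilterList.Operator.MUST_PASS_ONE,
              new ColumnPrefixFilter(Bytes.toBytes("attr_")),
              new ColumnPrefixFilter(Bytes.toBytes("meta_")));
          Scan scan = new Scan();
          scan.setFilter(orList);
          // On the region server, this scan shape goes through
          // UserScanQueryMatcher.getNextKeyHint -> FilterList.getNextCellHint
          // -> FilterListWithOR.getNextCellHint, as in the stack traces above.
          long rows = 0;
          try (ResultScanner rs = table.getScanner(scan)) {
            for (Result r : rs) {
              rows++;
            }
          }
          System.out.println("rows scanned: " + rows);
        }
      }
    }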