Hey HBase users,

I've been struggling with a weird issue. Our team has a table which currently stores a large number of versions per row, and we want to apply a schema change that constrains both the number and the age of the versions kept:

```
alter 'api_grains',
  {NAME => 'g',   MIN_VERSIONS => 5, VERSIONS => 500, TTL => 7257600},
  {NAME => 'isg', MIN_VERSIONS => 5, VERSIONS => 500, TTL => 7257600}
```

When I apply this change to the (large) table on our 5.2.0 (CDH5) cluster, the alter seems to be applied across all regions without problems.
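For what it's worth, "applied across all regions" is based on nothing more exotic than the standard shell checks, roughly something like this:

```
# Standard hbase shell checks after the online alter -- nothing unusual here.
alter_status 'api_grains'   # reports how many regions have picked up the new schema
describe 'api_grains'       # confirms MIN_VERSIONS / VERSIONS / TTL on 'g' and 'isg'
```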
Almost immediately after the alter finishes, however, I consistently see the region servers surface the following error:

```
Unexpected throwable object
org.apache.hadoop.hbase.io.hfile.AbstractHFileReader$NotSeekedException: Not seeked to a key/value
    at org.apache.hadoop.hbase.io.hfile.AbstractHFileReader$Scanner.assertSeeked(AbstractHFileReader.java:313)
    at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$ScannerV2.next(HFileReaderV2.java:878)
    at org.apache.hadoop.hbase.regionserver.StoreFileScanner.next(StoreFileScanner.java:181)
    at org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:108)
    at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:588)
    at org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:147)
    at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.populateResult(HRegion.java:5775)
    at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextInternal(HRegion.java:5931)
    at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:5709)
    at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.next(HRegion.java:5685)
    at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.next(HRegion.java:5671)
    at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:6904)
    at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:6862)
    at org.apache.hadoop.hbase.regionserver.RSRpcServices.get(RSRpcServices.java:2010)
    at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:33644)
    at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2191)
    at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:112)
    at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:183)
    at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:163)
```

In other words, the region server seems not to have properly set up its scanners before reading its own HFiles. The error appears in the logs of many region servers in the cluster and occurs continuously. It breaks the service that queries this table and keeps happening until I restore a snapshot taken before the schema change. The issue is reproducible (I've triggered it about 8 times in our preprod environments), and restoring a pre-change snapshot always resolves it.

While the region servers are throwing these exceptions, I don't see any other indication that HBase is in poor health: there are no regions in transition, hbck doesn't report anything interesting, and other tables seem unaffected.

To confirm that the issue is not actually the HFiles themselves being malformed, I took a snapshot of the table while it was in the "broken" state. After exporting it to a different environment, I confirmed that, at a minimum, I can run Spark or Hadoop jobs over the files in the snapshot without encountering any issues. So I believe the files themselves are fine, because they are readable by the HFile input formats.
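For reference, the export was along these lines, and for anyone who wants to spot-check individual files without spinning up a Spark/MapReduce job, the HFile tool can do the same kind of readability check. The snapshot name, destination, and paths below are illustrative rather than the real ones:

```
# Export the snapshot taken while the table was in the "broken" state
# (snapshot name and destination are placeholders).
hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot \
  -snapshot api_grains-broken \
  -copy-to hdfs://other-cluster:8020/hbase

# Spot-check an individual HFile from the exported snapshot; -p prints the
# key/values, i.e. it actually scans the file contents. Region and file
# names are placeholders.
hbase hfile -v -p -f \
  hdfs://other-cluster:8020/hbase/archive/data/default/api_grains/<region>/g/<hfile>
```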
A further source of confusion is that we have recently run extremely similar `alter` commands against other tables in the same cluster without issue.

If anyone can comment on how a region server might get into such a state (where it doesn't properly initialize and seek an HFile reader), or how that state could be related to specific table admin operations, please share any insights you may have. I understand that, given the older version we're running, it may be tempting to recommend that we upgrade to 2.1 and report back if our issue is unresolved. Please understand that we're running a large cluster which supports high-throughput, customer-facing services, and that such a migration is a substantial project. If you make such a recommendation, please point to a specific issue or bug which has been resolved in more recent versions.

Thanks,
Aaron