Repeated split causes HRegionServer failures and breaks table
--
Key: HBASE-5665
URL: https://issues.apache.org/jira/browse/HBASE-5665
Project: HBase
Issue Type: Bug
Components: regionserver
Affects Versions: 0.92.1, 0.92.0
Reporter: Cosmin Lehene
Priority: Blocker
Repeated splits on large tables (two consecutive splits suffice) essentially
break the table (and the cluster) beyond recovery.
The regionserver doing the split dies, and the master gets into an infinite
loop trying to assign regions whose files appear to be missing from HDFS.
The table can be disabled once. Upon trying to re-enable it, it remains in
an intermediate state forever.
I was able to reproduce this consistently on a smaller table.
{code}
hbase(main):030:0> (0..1).each{|x| put 't1', "#{x}", 'f1:t', 'dd'}
hbase(main):030:0> (0..1000).each{|x| split 't1', "#{x*10}"}
{code}
Running overlapping splits in parallel (e.g. at "#{x*10+1}", "#{x*10+2}", ...)
reproduces the issue almost instantly and consistently.
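A minimal sketch of how the overlapping split points could be generated and issued in parallel from the shell. The helper names `overlapping_split_keys` and `each_in_parallel` are hypothetical (they are not part of HBase), and calling the shell's `split` command from several threads is an assumption that was not verified here.

```ruby
# Hypothetical helper: builds split points that fall close together,
# e.g. "1", "2", "11", "12", "21", "22" for rounds = 3, offsets = [1, 2].
def overlapping_split_keys(rounds, offsets = [1, 2])
  (0...rounds).flat_map { |x| offsets.map { |o| (x * 10 + o).to_s } }
end

# Hypothetical helper: calls the given block once per key, each call on
# its own thread, and waits for all threads to finish.
def each_in_parallel(keys, &block)
  keys.map { |k| Thread.new(k) { |key| block.call(key) } }.each(&:join)
end

# Assumed usage inside the HBase shell (not verified):
#   each_in_parallel(overlapping_split_keys(100)) { |k| split 't1', k }
```

Passing the split call as a block keeps the helpers independent of the shell, so the key generation can be checked in plain Ruby before pointing it at a test cluster.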
{code}
2012-03-28 10:57:16,320 INFO org.apache.hadoop.hbase.catalog.MetaEditor: Offlined parent region t1,,1332957435767.2fb0473f4e71339e88dab0ee0d4dffa1. in META
2012-03-28 10:57:16,321 DEBUG org.apache.hadoop.hbase.regionserver.CompactSplitThread: Split requested for t1,5,1332957435767.648d30de55a5cec6fc2f56dcb3c7eee1.. compaction_queue=(0:1), split_queue=10
2012-03-28 10:57:16,343 INFO org.apache.hadoop.hbase.regionserver.SplitRequest: Running rollback/cleanup of failed split of t1,,1332957435767.2fb0473f4e71339e88dab0ee0d4dffa1.; Failed ld2,60020,1332957343833-daughterOpener=2469c5650ea2aeed631eb85d3cdc3124
java.io.IOException: Failed ld2,60020,1332957343833-daughterOpener=2469c5650ea2aeed631eb85d3cdc3124
    at org.apache.hadoop.hbase.regionserver.SplitTransaction.openDaughters(SplitTransaction.java:363)
    at org.apache.hadoop.hbase.regionserver.SplitTransaction.execute(SplitTransaction.java:451)
    at org.apache.hadoop.hbase.regionserver.SplitRequest.run(SplitRequest.java:67)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.FileNotFoundException: File does not exist: /hbase/t1/589c44cabba419c6ad8c9b427e5894e3.2fb0473f4e71339e88dab0ee0d4dffa1/f1/d62a852c25ad44e09518e102ca557237
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1822)
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.<init>(DFSClient.java:1813)
    at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:544)
    at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:187)
    at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:456)
    at org.apache.hadoop.hbase.io.hfile.HFile.createReader(HFile.java:341)
    at org.apache.hadoop.hbase.regionserver.StoreFile$Reader.<init>(StoreFile.java:1008)
    at org.apache.hadoop.hbase.io.HalfStoreFileReader.<init>(HalfStoreFileReader.java:65)
    at org.apache.hadoop.hbase.regionserver.StoreFile.open(StoreFile.java:467)
    at org.apache.hadoop.hbase.regionserver.StoreFile.createReader(StoreFile.java:548)
    at org.apache.hadoop.hbase.regionserver.Store.loadStoreFiles(Store.java:284)
    at org.apache.hadoop.hbase.regionserver.Store.<init>(Store.java:221)
    at org.apache.hadoop.hbase.regionserver.HRegion.instantiateHStore(HRegion.java:2511)
    at org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:450)
    at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:3229)
    at org.apache.hadoop.hbase.regionserver.SplitTransaction.openDaughterRegion(SplitTransaction.java:504)
    at org.apache.hadoop.hbase.regionserver.SplitTransaction$DaughterOpener.run(SplitTransaction.java:484)
    ... 1 more
2012-03-28 10:57:16,345 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server ld2,60020,1332957343833: Abort; we got an error after point-of-no-return
{code}
http://hastebin.com/diqinibajo.avrasm