[jira] [Updated] (HBASE-5665) Repeated split causes HRegionServer failures and breaks table

2012-04-02 Thread stack (Updated) (JIRA)

 [ https://issues.apache.org/jira/browse/HBASE-5665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-5665:
-------------------------

       Resolution: Fixed
    Fix Version/s: 0.94.0
                   0.92.2
     Hadoop Flags: Reviewed
           Status: Resolved  (was: Patch Available)

Committed to 0.92, 0.94, and trunk.  Thanks Cosmin and Matteo.

 Repeated split causes HRegionServer failures and breaks table
 --------------------------------------------------------------

                  Key: HBASE-5665
                  URL: https://issues.apache.org/jira/browse/HBASE-5665
              Project: HBase
           Issue Type: Bug
           Components: regionserver
     Affects Versions: 0.92.0, 0.92.1
             Reporter: Cosmin Lehene
             Assignee: Cosmin Lehene
             Priority: Blocker
              Fix For: 0.92.2, 0.94.0
          Attachments: 5665trunk.v2.patch, HBASE-5665-0.92.patch, HBASE-5665-trunk.patch



[jira] [Updated] (HBASE-5665) Repeated split causes HRegionServer failures and breaks table

2012-04-02 Thread stack (Updated) (JIRA)

 [ https://issues.apache.org/jira/browse/HBASE-5665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-5665:
-------------------------

    Attachment: 5665trunk.v2.patch

Same as the last patch but with the javadoc fixed: isAvailable means the region 
is neither closed nor closing.
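
For context, the check that javadoc describes amounts to something like the 
following (a sketch assuming the 0.92-era HRegion closed/closing AtomicBoolean 
flags; not the literal patch contents):

{code}
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch only: models the 0.92-era HRegion availability flags.
class RegionAvailability {
  final AtomicBoolean closed = new AtomicBoolean(false);
  final AtomicBoolean closing = new AtomicBoolean(false);

  boolean isClosed()  { return closed.get(); }
  boolean isClosing() { return closing.get(); }

  /** @return true if the region is available, i.e. not closed and not closing */
  boolean isAvailable() { return !isClosed() && !isClosing(); }
}
{code}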


[jira] [Updated] (HBASE-5665) Repeated split causes HRegionServer failures and breaks table

2012-04-01 Thread Matteo Bertozzi (Updated) (JIRA)

 [ https://issues.apache.org/jira/browse/HBASE-5665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matteo Bertozzi updated HBASE-5665:
-----------------------------------

    Attachment: HBASE-5665-trunk.patch


[jira] [Updated] (HBASE-5665) Repeated split causes HRegionServer failures and breaks table

2012-03-29 Thread Cosmin Lehene (Updated) (JIRA)

 [ https://issues.apache.org/jira/browse/HBASE-5665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cosmin Lehene updated HBASE-5665:
---------------------------------

    Attachment: HBASE-5665-0.92.patch

Adding patch.
From the Reference.java documentation:

{code}
 * Note, a region is itself not splitable if it has instances of store file
 * references.  References are cleaned up by compactions.
{code}

SplitTransaction.prepare() should check if the parent region has references.

I added a unit test and patch. Funny enough, HRegion.hasReferences was already 
implemented but only used from a unit test.

I think recursive references wouldn't be hard to support if there were a good 
reason to have them in the first place.
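
A guard along those lines would look roughly like the sketch below (it builds 
on the existing HRegion.hasReferences() mentioned above; the rest of prepare() 
is elided and the committed patch may differ in detail):

{code}
  // Sketch of the proposed SplitTransaction.prepare() guard; split-row
  // computation and daughter HRegionInfo creation are elided.
  public boolean prepare() {
    if (this.parent.hasReferences()) {
      // References are only cleaned up by compaction; splitting now would
      // create second-generation references that cannot be resolved.
      return false;
    }
    // ... existing prepare() logic ...
    return true;
  }
{code}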


[jira] [Updated] (HBASE-5665) Repeated split causes HRegionServer failures and breaks table

2012-03-29 Thread Cosmin Lehene (Updated) (JIRA)

 [ https://issues.apache.org/jira/browse/HBASE-5665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cosmin Lehene updated HBASE-5665:
---------------------------------

    Affects Version/s: 0.94.1
                       0.96.0
                       0.94.0

0.94 and trunk seem to suffer from this as well; they do not check whether the 
parent has references.


[jira] [Updated] (HBASE-5665) Repeated split causes HRegionServer failures and breaks table

2012-03-29 Thread Cosmin Lehene (Updated) (JIRA)

 [ https://issues.apache.org/jira/browse/HBASE-5665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cosmin Lehene updated HBASE-5665:
---------------------------------

    Affects Version/s:     (was: 0.94.1)
                           (was: 0.96.0)
                           (was: 0.94.0)
               Status: Patch Available  (was: Open)


[jira] [Updated] (HBASE-5665) Repeated split causes HRegionServer failures and breaks table

2012-03-28 Thread Cosmin Lehene (Updated) (JIRA)

 [ https://issues.apache.org/jira/browse/HBASE-5665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cosmin Lehene updated HBASE-5665:
---------------------------------

    Description: 
Repeated splits on large tables (2 consecutive would suffice) will essentially 
break the table (and the cluster), unrecoverably.
The regionserver doing the split dies, and the master gets into an infinite 
loop trying to assign regions whose files appear to be missing from HDFS.

The table can be disabled once; upon trying to re-enable it, it will remain in 
an intermediate state forever.

I was able to reproduce this consistently on a smaller table.

{code}
hbase(main):030:0> (0..1).each{|x| put 't1', "#{x}", 'f1:t', 'dd'}
hbase(main):030:0> (0..1000).each{|x| split 't1', "#{x*10}"}
{code}

Running overlapping splits in parallel (e.g. "#{x*10+1}", "#{x*10+2}", ...) will 
reproduce the issue almost instantly and consistently.
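
For reference, a client driver issuing such overlapping splits could look like 
the sketch below (the HBaseAdmin split(table, splitPoint) call mirrors what the 
shell's split command does; the thread count and offsets are illustrative 
assumptions, not from the original report):

{code}
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class OverlappingSplits {
  public static void main(String[] args) throws Exception {
    final HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
    Thread[] workers = new Thread[3];
    for (int i = 0; i < workers.length; i++) {
      final int offset = i + 1; // +1, +2, +3 produce overlapping split points
      workers[i] = new Thread(new Runnable() {
        public void run() {
          for (int x = 0; x <= 1000; x++) {
            try {
              // Request a split while splits from the other threads (and the
              // daughters they created) may still be in flight.
              admin.split("t1", String.valueOf(x * 10 + offset));
            } catch (Exception e) {
              e.printStackTrace(); // individual requests may be rejected
            }
          }
        }
      });
      workers[i].start();
    }
    for (Thread w : workers) {
      w.join();
    }
  }
}
{code}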

{code}
2012-03-28 10:57:16,320 INFO org.apache.hadoop.hbase.catalog.MetaEditor: Offlined parent region t1,,1332957435767.2fb0473f4e71339e88dab0ee0d4dffa1. in META
2012-03-28 10:57:16,321 DEBUG org.apache.hadoop.hbase.regionserver.CompactSplitThread: Split requested for t1,5,1332957435767.648d30de55a5cec6fc2f56dcb3c7eee1..  compaction_queue=(0:1), split_queue=10
2012-03-28 10:57:16,343 INFO org.apache.hadoop.hbase.regionserver.SplitRequest: Running rollback/cleanup of failed split of t1,,1332957435767.2fb0473f4e71339e88dab0ee0d4dffa1.; Failed ld2,60020,1332957343833-daughterOpener=2469c5650ea2aeed631eb85d3cdc3124
java.io.IOException: Failed ld2,60020,1332957343833-daughterOpener=2469c5650ea2aeed631eb85d3cdc3124
    at org.apache.hadoop.hbase.regionserver.SplitTransaction.openDaughters(SplitTransaction.java:363)
    at org.apache.hadoop.hbase.regionserver.SplitTransaction.execute(SplitTransaction.java:451)
    at org.apache.hadoop.hbase.regionserver.SplitRequest.run(SplitRequest.java:67)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.FileNotFoundException: File does not exist: /hbase/t1/589c44cabba419c6ad8c9b427e5894e3.2fb0473f4e71339e88dab0ee0d4dffa1/f1/d62a852c25ad44e09518e102ca557237
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1822)
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.<init>(DFSClient.java:1813)
    at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:544)
    at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:187)
    at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:456)
    at org.apache.hadoop.hbase.io.hfile.HFile.createReader(HFile.java:341)
    at org.apache.hadoop.hbase.regionserver.StoreFile$Reader.<init>(StoreFile.java:1008)
    at org.apache.hadoop.hbase.io.HalfStoreFileReader.<init>(HalfStoreFileReader.java:65)
    at org.apache.hadoop.hbase.regionserver.StoreFile.open(StoreFile.java:467)
    at org.apache.hadoop.hbase.regionserver.StoreFile.createReader(StoreFile.java:548)
    at org.apache.hadoop.hbase.regionserver.Store.loadStoreFiles(Store.java:284)
    at org.apache.hadoop.hbase.regionserver.Store.<init>(Store.java:221)
    at org.apache.hadoop.hbase.regionserver.HRegion.instantiateHStore(HRegion.java:2511)
    at org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:450)
    at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:3229)
    at org.apache.hadoop.hbase.regionserver.SplitTransaction.openDaughterRegion(SplitTransaction.java:504)
    at org.apache.hadoop.hbase.regionserver.SplitTransaction$DaughterOpener.run(SplitTransaction.java:484)
    ... 1 more
2012-03-28 10:57:16,345 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server ld2,60020,1332957343833: Abort; we got an error after point-of-no-return
{code}

http://hastebin.com/diqinibajo.avrasm

later edit:

(I'm using the last 4 characters of each name)
Region 94e3 has storefile 7237.
Region 94e3 gets split into daughters a: ffa1 and b: eee1.
Daughter region ffa1 gets split into daughters a: 3124 and b: dc77.
ffa1 has a reference, 7237.94e3, for its store file.
When ffa1 gets split, it creates another reference: 7237.94e3.ffa1.
When SplitTransaction runs execute(), it will try to open that reference 
(openDaughters above), matching it left to right as [storefile].[region]
{code}
^([0-9a-f]+)(?:\\.(.+))?$
{code}
and will attempt to go to /hbase/t1/[region], which resolves to 
/hbase/t1/94e3.ffa1/f1/7237 - a path that obviously doesn't exist, so it fails.

This seems like a design problem: we should either stop a split when the path 
is a reference, or be able to recursively resolve reference paths (e.g. parse 
right to left).
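
To make the failure mode concrete, here is a small, self-contained sketch of 
the left-to-right parse described above (the regex is the one quoted; the 
class and helper names are illustrative, not HBase source):

{code}
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RefNameParse {
  // The left-to-right reference-name pattern quoted above:
  // group(1) = store file name, group(2) = parent region encoded name.
  private static final Pattern REF_NAME =
      Pattern.compile("^([0-9a-f]+)(?:\\.(.+))?$");

  // Illustrative resolver: reference 7237.94e3 in region ffa1 should point
  // at /hbase/t1/94e3/f1/7237 (store file 7237 of parent region 94e3).
  static String resolve(String table, String family, String refName) {
    Matcher m = REF_NAME.matcher(refName);
    if (!m.matches()) throw new IllegalArgumentException(refName);
    String storeFile = m.group(1);
    String region = m.group(2); // null when the name is a plain store file
    return "/hbase/" + table + "/" + region + "/" + family + "/" + storeFile;
  }

  public static void main(String[] args) {
    // First-generation reference: resolves correctly.
    System.out.println(resolve("t1", "f1", "7237.94e3"));
    // -> /hbase/t1/94e3/f1/7237

    // Second-generation reference created by re-splitting daughter ffa1:
    // left-to-right matching lumps "94e3.ffa1" into the region group,
    // yielding a directory that does not exist -> FileNotFoundException.
    System.out.println(resolve("t1", "f1", "7237.94e3.ffa1"));
    // -> /hbase/t1/94e3.ffa1/f1/7237  (broken path)
  }
}
{code}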