[jira] [Commented] (HBASE-5665) Repeated split causes HRegionServer failures and breaks table

2012-04-04 Thread Hudson (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13246905#comment-13246905
 ] 

Hudson commented on HBASE-5665:
---

Integrated in HBase-0.94-security #7 (See 
[https://builds.apache.org/job/HBase-0.94-security/7/])
HBASE-5665 Repeated split causes HRegionServer failures and breaks table 
(Revision 1308547)

 Result = SUCCESS
stack : 
Files : 
* /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java
* /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
* /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/SplitTransaction.java
* /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/regionserver/TestSplitTransaction.java


 Repeated split causes HRegionServer failures and breaks table 
 --

 Key: HBASE-5665
 URL: https://issues.apache.org/jira/browse/HBASE-5665
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Affects Versions: 0.92.0, 0.92.1
Reporter: Cosmin Lehene
Assignee: Cosmin Lehene
Priority: Blocker
 Fix For: 0.92.2, 0.94.0

 Attachments: 5665trunk.v2.patch, HBASE-5665-0.92.patch, 
 HBASE-5665-trunk.patch


 Repeated splits on large tables (two consecutive splits suffice) 
 essentially break the table (and the cluster) unrecoverably.
 The regionserver doing the split dies, and the master gets stuck in an 
 infinite loop trying to assign regions whose files appear to be missing 
 from HDFS.
 The table can be disabled once. Upon trying to re-enable it, it remains 
 in an intermediate state forever.
 I was able to reproduce this consistently on a smaller table.
 {code}
 hbase(main):030:0> (0..1).each{|x| put 't1', "#{x}", 'f1:t', 'dd'}
 hbase(main):030:0> (0..1000).each{|x| split 't1', "#{x*10}"}
 {code}
 Running overlapping splits in parallel (e.g. #{x*10+1}, #{x*10+2}... ) 
 will reproduce the issue almost instantly and consistently. 
 {code}
 2012-03-28 10:57:16,320 INFO org.apache.hadoop.hbase.catalog.MetaEditor: Offlined parent region t1,,1332957435767.2fb0473f4e71339e88dab0ee0d4dffa1. in META
 2012-03-28 10:57:16,321 DEBUG org.apache.hadoop.hbase.regionserver.CompactSplitThread: Split requested for t1,5,1332957435767.648d30de55a5cec6fc2f56dcb3c7eee1..  compaction_queue=(0:1), split_queue=10
 2012-03-28 10:57:16,343 INFO org.apache.hadoop.hbase.regionserver.SplitRequest: Running rollback/cleanup of failed split of t1,,1332957435767.2fb0473f4e71339e88dab0ee0d4dffa1.; Failed ld2,60020,1332957343833-daughterOpener=2469c5650ea2aeed631eb85d3cdc3124
 java.io.IOException: Failed ld2,60020,1332957343833-daughterOpener=2469c5650ea2aeed631eb85d3cdc3124
 at org.apache.hadoop.hbase.regionserver.SplitTransaction.openDaughters(SplitTransaction.java:363)
 at org.apache.hadoop.hbase.regionserver.SplitTransaction.execute(SplitTransaction.java:451)
 at org.apache.hadoop.hbase.regionserver.SplitRequest.run(SplitRequest.java:67)
 at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 at java.lang.Thread.run(Thread.java:662)
 Caused by: java.io.FileNotFoundException: File does not exist: /hbase/t1/589c44cabba419c6ad8c9b427e5894e3.2fb0473f4e71339e88dab0ee0d4dffa1/f1/d62a852c25ad44e09518e102ca557237
 at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1822)
 at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.<init>(DFSClient.java:1813)
 at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:544)
 at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:187)
 at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:456)
 at org.apache.hadoop.hbase.io.hfile.HFile.createReader(HFile.java:341)
 at org.apache.hadoop.hbase.regionserver.StoreFile$Reader.<init>(StoreFile.java:1008)
 at org.apache.hadoop.hbase.io.HalfStoreFileReader.<init>(HalfStoreFileReader.java:65)
 at org.apache.hadoop.hbase.regionserver.StoreFile.open(StoreFile.java:467)
 at org.apache.hadoop.hbase.regionserver.StoreFile.createReader(StoreFile.java:548)
 at org.apache.hadoop.hbase.regionserver.Store.loadStoreFiles(Store.java:284)
 at org.apache.hadoop.hbase.regionserver.Store.<init>(Store.java:221)
 at org.apache.hadoop.hbase.regionserver.HRegion.instantiateHStore(HRegion.java:2511)
 at org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:450)
 at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:3229)
 at 
 

[jira] [Commented] (HBASE-5665) Repeated split causes HRegionServer failures and breaks table

2012-04-04 Thread Hudson (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13246965#comment-13246965
 ] 

Hudson commented on HBASE-5665:
---

Integrated in HBase-0.92-security #104 (See 
[https://builds.apache.org/job/HBase-0.92-security/104/])
HBASE-5665 Repeated split causes HRegionServer failures and breaks table 
(Revision 1308549)

 Result = FAILURE
stack : 
Files : 
* /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java
* /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
* /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/regionserver/SplitTransaction.java
* /hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/regionserver/TestSplitTransaction.java



[jira] [Commented] (HBASE-5665) Repeated split causes HRegionServer failures and breaks table

2012-04-02 Thread Hudson (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13244713#comment-13244713
 ] 

Hudson commented on HBASE-5665:
---

Integrated in HBase-0.94 #79 (See 
[https://builds.apache.org/job/HBase-0.94/79/])
HBASE-5665 Repeated split causes HRegionServer failures and breaks table 
(Revision 1308547)

 Result = FAILURE
stack : 
Files : 
* /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java
* /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
* /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/SplitTransaction.java
* /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/regionserver/TestSplitTransaction.java



[jira] [Commented] (HBASE-5665) Repeated split causes HRegionServer failures and breaks table

2012-04-02 Thread Hudson (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13244830#comment-13244830
 ] 

Hudson commented on HBASE-5665:
---

Integrated in HBase-TRUNK #2704 (See 
[https://builds.apache.org/job/HBase-TRUNK/2704/])
HBASE-5665 Repeated split causes HRegionServer failures and breaks table 
(Revision 1308545)

 Result = FAILURE
stack : 
Files : 
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/SplitTransaction.java
* /hbase/trunk/src/test/java/org/apache/hadoop/hbase/regionserver/TestSplitTransaction.java



[jira] [Commented] (HBASE-5665) Repeated split causes HRegionServer failures and breaks table

2012-04-02 Thread Hudson (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13244858#comment-13244858
 ] 

Hudson commented on HBASE-5665:
---

Integrated in HBase-0.92 #351 (See 
[https://builds.apache.org/job/HBase-0.92/351/])
HBASE-5665 Repeated split causes HRegionServer failures and breaks table 
(Revision 1308549)

 Result = FAILURE
stack : 
Files : 
* /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java
* /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
* /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/regionserver/SplitTransaction.java
* /hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/regionserver/TestSplitTransaction.java



[jira] [Commented] (HBASE-5665) Repeated split causes HRegionServer failures and breaks table

2012-04-02 Thread Hudson (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13245008#comment-13245008
 ] 

Hudson commented on HBASE-5665:
---

Integrated in HBase-TRUNK-security #156 (See 
[https://builds.apache.org/job/HBase-TRUNK-security/156/])
HBASE-5665 Repeated split causes HRegionServer failures and breaks table 
(Revision 1308545)

 Result = FAILURE
stack : 
Files : 
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/SplitTransaction.java
* /hbase/trunk/src/test/java/org/apache/hadoop/hbase/regionserver/TestSplitTransaction.java



[jira] [Commented] (HBASE-5665) Repeated split causes HRegionServer failures and breaks table

2012-04-01 Thread Matteo Bertozzi (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13243771#comment-13243771
 ] 

Matteo Bertozzi commented on HBASE-5665:


Can we also add a couple of methods to the region, like isSplittable() and 
isAvailable()?
{code}
boolean isAvailable() {
  return !isClosed() && !isClosing();
}

boolean isSplittable() {
  return isAvailable() && !hasReferences();
}
{code}

Just to avoid similar problems in the future...
For example, in HRegionServer both getMostLoadedRegions() and closeUserRegions() 
do the same isAvailable() check...
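
For illustration only, here is a minimal, self-contained sketch of how the two guards proposed above might behave. RegionSketch is a hypothetical stand-in, not the real HRegion class; the markClosing(), compactAwayReferences() and requestSplit() helpers are assumptions made up for the example, and the idea that references go away after a compaction is likewise assumed here rather than taken from the patch.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical stand-in for a region; NOT the real HRegion class. */
class RegionSketch {
    private final AtomicBoolean closed = new AtomicBoolean(false);
    private final AtomicBoolean closing = new AtomicBoolean(false);
    // Assumption: true while the region still holds reference files
    // pointing at its parent's store files.
    private volatile boolean hasReferences;

    RegionSketch(boolean hasReferences) {
        this.hasReferences = hasReferences;
    }

    boolean isClosed() { return closed.get(); }
    boolean isClosing() { return closing.get(); }
    boolean hasReferences() { return hasReferences; }

    // The two guards from the comment above.
    boolean isAvailable() { return !isClosed() && !isClosing(); }
    boolean isSplittable() { return isAvailable() && !hasReferences(); }

    // Hypothetical helpers, used only to drive the example.
    void markClosing() { closing.set(true); }
    void compactAwayReferences() { hasReferences = false; }

    /** Returns true only if a split would be allowed to start. */
    boolean requestSplit() {
        return isSplittable();
    }
}
```

With this sketch, a region created with `new RegionSketch(true)` refuses requestSplit() until compactAwayReferences() has been called, and any region refuses it once markClosing() has been called.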

 Repeated split causes HRegionServer failures and breaks table 
 --

 Key: HBASE-5665
 URL: https://issues.apache.org/jira/browse/HBASE-5665
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Affects Versions: 0.92.0, 0.92.1
Reporter: Cosmin Lehene
Assignee: Cosmin Lehene
Priority: Blocker
 Attachments: HBASE-5665-0.92.patch


 Repeated splits on large tables (two consecutive splits suffice) 
 essentially break the table (and the cluster) unrecoverably.
 The regionserver doing the split dies, and the master gets stuck in an 
 infinite loop trying to assign regions whose files appear to be missing 
 from HDFS.
 The table can be disabled once. Upon trying to re-enable it, it remains 
 in an intermediate state forever.
 I was able to reproduce this consistently on a smaller table.
 {code}
 hbase(main):030:0> (0..1).each{|x| put 't1', "#{x}", 'f1:t', 'dd'}
 hbase(main):030:0> (0..1000).each{|x| split 't1', "#{x*10}"}
 {code}
 Running overlapping splits in parallel (e.g. #{x*10+1}, #{x*10+2}... ) 
 will reproduce the issue almost instantly and consistently. 
 {code}
 2012-03-28 10:57:16,320 INFO org.apache.hadoop.hbase.catalog.MetaEditor: Offlined parent region t1,,1332957435767.2fb0473f4e71339e88dab0ee0d4dffa1. in META
 2012-03-28 10:57:16,321 DEBUG org.apache.hadoop.hbase.regionserver.CompactSplitThread: Split requested for t1,5,1332957435767.648d30de55a5cec6fc2f56dcb3c7eee1..  compaction_queue=(0:1), split_queue=10
 2012-03-28 10:57:16,343 INFO org.apache.hadoop.hbase.regionserver.SplitRequest: Running rollback/cleanup of failed split of t1,,1332957435767.2fb0473f4e71339e88dab0ee0d4dffa1.; Failed ld2,60020,1332957343833-daughterOpener=2469c5650ea2aeed631eb85d3cdc3124
 java.io.IOException: Failed ld2,60020,1332957343833-daughterOpener=2469c5650ea2aeed631eb85d3cdc3124
 at org.apache.hadoop.hbase.regionserver.SplitTransaction.openDaughters(SplitTransaction.java:363)
 at org.apache.hadoop.hbase.regionserver.SplitTransaction.execute(SplitTransaction.java:451)
 at org.apache.hadoop.hbase.regionserver.SplitRequest.run(SplitRequest.java:67)
 at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 at java.lang.Thread.run(Thread.java:662)
 Caused by: java.io.FileNotFoundException: File does not exist: /hbase/t1/589c44cabba419c6ad8c9b427e5894e3.2fb0473f4e71339e88dab0ee0d4dffa1/f1/d62a852c25ad44e09518e102ca557237
 at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1822)
 at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.<init>(DFSClient.java:1813)
 at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:544)
 at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:187)
 at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:456)
 at org.apache.hadoop.hbase.io.hfile.HFile.createReader(HFile.java:341)
 at org.apache.hadoop.hbase.regionserver.StoreFile$Reader.<init>(StoreFile.java:1008)
 at org.apache.hadoop.hbase.io.HalfStoreFileReader.<init>(HalfStoreFileReader.java:65)
 at org.apache.hadoop.hbase.regionserver.StoreFile.open(StoreFile.java:467)
 at org.apache.hadoop.hbase.regionserver.StoreFile.createReader(StoreFile.java:548)
 at org.apache.hadoop.hbase.regionserver.Store.loadStoreFiles(Store.java:284)
 at org.apache.hadoop.hbase.regionserver.Store.<init>(Store.java:221)
 at org.apache.hadoop.hbase.regionserver.HRegion.instantiateHStore(HRegion.java:2511)
 at org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:450)
 at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:3229)
 at org.apache.hadoop.hbase.regionserver.SplitTransaction.openDaughterRegion(SplitTransaction.java:504)
 at org.apache.hadoop.hbase.regionserver.SplitTransaction$DaughterOpener.run(SplitTransaction.java:484)
 ... 1 more
 2012-03-28 10:57:16,345 FATAL 
 

[jira] [Commented] (HBASE-5665) Repeated split causes HRegionServer failures and breaks table

2012-04-01 Thread Ted Yu (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13243842#comment-13243842
 ] 

Ted Yu commented on HBASE-5665:
---

HBASE-5665-trunk.patch looks good.

 Repeated split causes HRegionServer failures and breaks table 
 --

 Key: HBASE-5665
 URL: https://issues.apache.org/jira/browse/HBASE-5665
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Affects Versions: 0.92.0, 0.92.1
Reporter: Cosmin Lehene
Assignee: Cosmin Lehene
Priority: Blocker
 Attachments: HBASE-5665-0.92.patch, HBASE-5665-trunk.patch



[jira] [Commented] (HBASE-5665) Repeated split causes HRegionServer failures and breaks table

2012-04-01 Thread Hadoop QA (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13243898#comment-13243898
 ] 

Hadoop QA commented on HBASE-5665:
--

-1 overall.  Here are the results of testing the latest attachment 
  
http://issues.apache.org/jira/secure/attachment/12520847/HBASE-5665-trunk.patch
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 3 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

 -1 core tests.  The patch failed these unit tests:
 

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/1362//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/1362//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/1362//console

This message is automatically generated.

 Repeated split causes HRegionServer failures and breaks table 
 --

 Key: HBASE-5665
 URL: https://issues.apache.org/jira/browse/HBASE-5665
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Affects Versions: 0.92.0, 0.92.1
Reporter: Cosmin Lehene
Assignee: Cosmin Lehene
Priority: Blocker
 Attachments: HBASE-5665-0.92.patch, HBASE-5665-trunk.patch



[jira] [Commented] (HBASE-5665) Repeated split causes HRegionServer failures and breaks table

2012-03-29 Thread stack (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13241556#comment-13241556
 ] 

stack commented on HBASE-5665:
--

Or, is this a problem only with forced splits?  It doesn't happen when we split 
'naturally' because we'll check for references?

 Repeated split causes HRegionServer failures and breaks table 
 --

 Key: HBASE-5665
 URL: https://issues.apache.org/jira/browse/HBASE-5665
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Affects Versions: 0.92.0, 0.92.1, 0.94.0, 0.96.0, 0.94.1
Reporter: Cosmin Lehene
Assignee: Cosmin Lehene
Priority: Blocker
 Attachments: HBASE-5665-0.92.patch



[jira] [Commented] (HBASE-5665) Repeated split causes HRegionServer failures and breaks table

2012-03-29 Thread Cosmin Lehene (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13241586#comment-13241586
 ] 

Cosmin Lehene commented on HBASE-5665:
--

Indeed it seems to be a problem with forced splits. I'm not sure though if the 
natural splits are safe - they seem to be, but I need to test that too. 

RegionSplitPolicy.getSplitPoint() calls Store.getSplitPoint()
Store.getSplitPoint seems to do the check. 

{code}
for (StoreFile sf : storefiles) {
  if (sf.isReference()) {
    // Should already be enforced since we return false in this case
    assert false : "getSplitPoint() called on a region that can't split!";
    return null;
  }
{code}

BTW, we also have Store.hasReferences()
{code}
  private boolean hasReferences(Collection<StoreFile> files) {
    if (files != null && files.size() > 0) {
      for (StoreFile hsf: files) {
        if (hsf.isReference()) {
          return true;
        }
      }
    }
    return false;
  }

{code}


However, here's the code in HRegion.checkSplit().
If there's an explicit split point, it returns before it ever gets to the reference check.

{code}
  public byte[] checkSplit() {
    // Can't split META
    if (getRegionInfo().isMetaRegion()) {
      if (shouldForceSplit()) {
        LOG.warn("Cannot split meta regions in HBase 0.20 and above");
      }
      return null;
    }

    if (this.explicitSplitPoint != null) {
      return this.explicitSplitPoint;
    }

    if (!splitPolicy.shouldSplit()) {
      return null;
    }

    byte[] ret = splitPolicy.getSplitPoint();

    if (ret != null) {
      try {
        checkRow(ret, "calculated split");
      } catch (IOException e) {
        LOG.error("Ignoring invalid split", e);
        return null;
      }
    }
    return ret;
  }
{code}

Multiple return points + a ret variable - this could use some polishing too :)

I'm a bit puzzled about the natural split, because I've seen the problem with 
a forced split from the UI, where I don't think we provide an explicit split point. 
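
To make the gap concrete, here is a minimal, self-contained sketch of the control flow described above. It is not HBase code and not the attached patch: FileStub, checkSplitUnguarded and checkSplitGuarded are hypothetical names, and the "guarded" variant only illustrates one possible ordering of the checks (reference check first), which is an assumption rather than the fix that was committed.

```java
import java.util.Arrays;
import java.util.List;

/** Simplified model of the split check; none of these types are HBase classes. */
public class SplitCheckSketch {

  /** Hypothetical store file stub: only records whether it is a reference. */
  static final class FileStub {
    final boolean reference;
    FileStub(boolean reference) { this.reference = reference; }
    boolean isReference() { return reference; }
  }

  /** Same idea as the hasReferences() quoted above, on the stub type. */
  static boolean hasReferences(List<FileStub> files) {
    if (files == null) {
      return false;
    }
    for (FileStub f : files) {
      if (f.isReference()) {
        return true;
      }
    }
    return false;
  }

  /** Mirrors the quoted flow: an explicit split point is returned before any reference check. */
  static byte[] checkSplitUnguarded(byte[] explicitSplitPoint, List<FileStub> files) {
    if (explicitSplitPoint != null) {
      return explicitSplitPoint;
    }
    // Placeholder for the policy-computed split point.
    return hasReferences(files) ? null : new byte[] {'m'};
  }

  /** Assumed alternative ordering: refuse any split while reference files remain. */
  static byte[] checkSplitGuarded(byte[] explicitSplitPoint, List<FileStub> files) {
    if (hasReferences(files)) {
      return null;
    }
    return explicitSplitPoint != null ? explicitSplitPoint : new byte[] {'m'};
  }

  public static void main(String[] args) {
    // A freshly split daughter still holds a reference to its parent's file.
    List<FileStub> daughterFiles = Arrays.asList(new FileStub(true));
    byte[] forced = new byte[] {'5'};
    System.out.println("unguarded allows split: "
        + (checkSplitUnguarded(forced, daughterFiles) != null));
    System.out.println("guarded allows split: "
        + (checkSplitGuarded(forced, daughterFiles) != null));
  }
}
```

In this model, a forced split of a daughter that still holds reference files goes through in the unguarded variant and is rejected in the guarded one, which matches the forced-split-only behaviour discussed in this thread.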

Cosmin

 Repeated split causes HRegionServer failures and breaks table 
 --

 Key: HBASE-5665
 URL: https://issues.apache.org/jira/browse/HBASE-5665
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Affects Versions: 0.92.0, 0.92.1, 0.94.0, 0.96.0, 0.94.1
Reporter: Cosmin Lehene
Assignee: Cosmin Lehene
Priority: Blocker
 Attachments: HBASE-5665-0.92.patch



[jira] [Commented] (HBASE-5665) Repeated split causes HRegionServer failures and breaks table

2012-03-29 Thread Cosmin Lehene (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13241593#comment-13241593
 ] 

Cosmin Lehene commented on HBASE-5665:
--

BTW - I don't think getSplitPoint should do that check, and we also shouldn't 
have two places where we check for references - perhaps we should have another 
JIRA to fix this in trunk?

 Repeated split causes HRegionServer failures and breaks table 
 --

 Key: HBASE-5665
 URL: https://issues.apache.org/jira/browse/HBASE-5665
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Affects Versions: 0.92.0, 0.92.1
Reporter: Cosmin Lehene
Assignee: Cosmin Lehene
Priority: Blocker
 Attachments: HBASE-5665-0.92.patch



[jira] [Commented] (HBASE-5665) Repeated split causes HRegionServer failures and breaks table

2012-03-29 Thread Hadoop QA (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13241611#comment-13241611
 ] 

Hadoop QA commented on HBASE-5665:
--

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12520458/HBASE-5665-0.92.patch
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 3 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

 -1 core tests.  The patch failed these unit tests:
 

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/1347//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/1347//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/1347//console

This message is automatically generated.

 Repeated split causes HRegionServer failures and breaks table 
 --

 Key: HBASE-5665
 URL: https://issues.apache.org/jira/browse/HBASE-5665
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Affects Versions: 0.92.0, 0.92.1
Reporter: Cosmin Lehene
Assignee: Cosmin Lehene
Priority: Blocker
 Attachments: HBASE-5665-0.92.patch



[jira] [Commented] (HBASE-5665) Repeated split causes HRegionServer failures and breaks table

2012-03-29 Thread stack (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13241765#comment-13241765
 ] 

stack commented on HBASE-5665:
--

Nice test Cosmin

 Repeated split causes HRegionServer failures and breaks table 
 --

 Key: HBASE-5665
 URL: https://issues.apache.org/jira/browse/HBASE-5665
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Affects Versions: 0.92.0, 0.92.1
Reporter: Cosmin Lehene
Assignee: Cosmin Lehene
Priority: Blocker
 Attachments: HBASE-5665-0.92.patch


 Repeated splits on large tables (2 consecutive would suffice) will 
 essentially break the table (and the cluster) unrecoverably.
 The regionserver doing the split dies and the master gets into an 
 infinite loop trying to assign regions whose files seem to be missing 
 from HDFS.
 The table can be disabled once. Upon trying to re-enable it, it will remain 
 in an intermediate state forever.
 I was able to reproduce this consistently on a smaller table.
 {code}
 hbase(main):030:0> (0..1).each{|x| put 't1', "#{x}", 'f1:t', 'dd'}
 hbase(main):030:0> (0..1000).each{|x| split 't1', "#{x*10}"}
 {code}
 Running overlapping splits in parallel (e.g. "#{x*10+1}", "#{x*10+2}"... ) 
 will reproduce the issue almost instantly and consistently. 
 {code}
 2012-03-28 10:57:16,320 INFO org.apache.hadoop.hbase.catalog.MetaEditor: Offlined parent region t1,,1332957435767.2fb0473f4e71339e88dab0ee0d4dffa1. in META
 2012-03-28 10:57:16,321 DEBUG org.apache.hadoop.hbase.regionserver.CompactSplitThread: Split requested for t1,5,1332957435767.648d30de55a5cec6fc2f56dcb3c7eee1..  compaction_queue=(0:1), split_queue=10
 2012-03-28 10:57:16,343 INFO org.apache.hadoop.hbase.regionserver.SplitRequest: Running rollback/cleanup of failed split of t1,,1332957435767.2fb0473f4e71339e88dab0ee0d4dffa1.; Failed ld2,60020,1332957343833-daughterOpener=2469c5650ea2aeed631eb85d3cdc3124
 java.io.IOException: Failed ld2,60020,1332957343833-daughterOpener=2469c5650ea2aeed631eb85d3cdc3124
     at org.apache.hadoop.hbase.regionserver.SplitTransaction.openDaughters(SplitTransaction.java:363)
     at org.apache.hadoop.hbase.regionserver.SplitTransaction.execute(SplitTransaction.java:451)
     at org.apache.hadoop.hbase.regionserver.SplitRequest.run(SplitRequest.java:67)
     at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
     at java.lang.Thread.run(Thread.java:662)
 Caused by: java.io.FileNotFoundException: File does not exist: /hbase/t1/589c44cabba419c6ad8c9b427e5894e3.2fb0473f4e71339e88dab0ee0d4dffa1/f1/d62a852c25ad44e09518e102ca557237
     at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1822)
     at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.<init>(DFSClient.java:1813)
     at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:544)
     at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:187)
     at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:456)
     at org.apache.hadoop.hbase.io.hfile.HFile.createReader(HFile.java:341)
     at org.apache.hadoop.hbase.regionserver.StoreFile$Reader.<init>(StoreFile.java:1008)
     at org.apache.hadoop.hbase.io.HalfStoreFileReader.<init>(HalfStoreFileReader.java:65)
     at org.apache.hadoop.hbase.regionserver.StoreFile.open(StoreFile.java:467)
     at org.apache.hadoop.hbase.regionserver.StoreFile.createReader(StoreFile.java:548)
     at org.apache.hadoop.hbase.regionserver.Store.loadStoreFiles(Store.java:284)
     at org.apache.hadoop.hbase.regionserver.Store.<init>(Store.java:221)
     at org.apache.hadoop.hbase.regionserver.HRegion.instantiateHStore(HRegion.java:2511)
     at org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:450)
     at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:3229)
     at org.apache.hadoop.hbase.regionserver.SplitTransaction.openDaughterRegion(SplitTransaction.java:504)
     at org.apache.hadoop.hbase.regionserver.SplitTransaction$DaughterOpener.run(SplitTransaction.java:484)
     ... 1 more
 2012-03-28 10:57:16,345 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server ld2,60020,1332957343833: Abort; we got an error after point-of-no-return
 {code}
 http://hastebin.com/diqinibajo.avrasm
 later edit:
 (I'm using the last 4 characters from each string)
 Region 94e3 has storefile 7237
 Region 94e3 gets split into daughters a: ffa1 and b: eee1
 Daughter region ffa1 gets split into daughters a: 3124 and b: dc77