[jira] [Commented] (HBASE-15019) Replication stuck when HDFS is restarted

2016-01-29 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15123743#comment-15123743
 ] 

Hudson commented on HBASE-15019:


FAILURE: Integrated in HBase-1.0 #1139 (See 
[https://builds.apache.org/job/HBase-1.0/1139/])
HBASE-15019 Replication stuck when HDFS is restarted. (matteo.bertozzi: rev 
9c42beaa3423e1476aa87e56f59168ed5ce0f461)
* hbase-server/src/main/java/org/apache/hadoop/hbase/wal/WALFactory.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/util/LeaseNotRecoveredException.java


> Replication stuck when HDFS is restarted
> 
>
> Key: HBASE-15019
> URL: https://issues.apache.org/jira/browse/HBASE-15019
> Project: HBase
> Issue Type: Bug
> Components: Replication, wal
> Affects Versions: 2.0.0, 1.2.0, 1.1.2, 1.0.3, 0.98.16.1
> Reporter: Matteo Bertozzi
> Assignee: Matteo Bertozzi
> Fix For: 2.0.0, 1.2.0, 1.3.0, 1.1.4, 1.0.4, 0.98.18
>
> Attachments: HBASE-15019-v0_branch-1.2.patch, HBASE-15019-v1.patch, 
> HBASE-15019-v1_0.98.patch, HBASE-15019-v1_branch-1.2.patch, 
> HBASE-15019-v2.patch, HBASE-15019-v3.patch, HBASE-15019-v4.patch
>
>
> The RS is working normally and writing to the WAL.
> HDFS is killed and restarted, and the RS tries to roll the WAL.
> The close fails, but the roll succeeds (because HDFS is now up) and
> everything works.
> {noformat}
> 2015-12-11 21:52:28,058 ERROR org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter: Got IOException while writing trailer
> java.io.IOException: All datanodes 10.51.30.152:50010 are bad. Aborting...
>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1147)
>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:945)
>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:496)
> 2015-12-11 21:52:28,059 ERROR org.apache.hadoop.hbase.regionserver.wal.FSHLog: Failed close of HLog writer
> java.io.IOException: All datanodes 10.51.30.152:50010 are bad. Aborting...
>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1147)
>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:945)
>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:496)
> 2015-12-11 21:52:28,059 WARN org.apache.hadoop.hbase.regionserver.wal.FSHLog: Riding over HLog close failure! error count=1
> {noformat}
> The problem is on the replication side. The log that we rolled, and were not
> able to close, is waiting for a lease recovery.
> {noformat}
> 2015-12-11 21:16:31,909 ERROR org.apache.hadoop.hbase.regionserver.wal.HLogFactory: Can't open after 267 attempts and 301124ms
> {noformat}
> The WALFactory notifies us about that, but there is nothing on the RS side
> that performs the WAL recovery.
> {noformat}
> 2015-12-11 21:11:30,921 WARN org.apache.hadoop.hbase.regionserver.wal.HLogFactory: Lease should have recovered. This is not expected. Will retry
> java.io.IOException: Cannot obtain block length for LocatedBlock{BP-1547065147-10.51.30.152-1446756937665:blk_1073801614_61243; getBlockSize()=83; corrupt=false; offset=0; locs=[10.51.30.154:50010, 10.51.30.152:50010, 10.51.30.155:50010]}
>   at org.apache.hadoop.hdfs.DFSInputStream.readBlockLength(DFSInputStream.java:358)
>   at org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:300)
>   at org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:237)
>   at org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:230)
>   at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1448)
>   at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:301)
>   at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:297)
>   at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:297)
>   at org.apache.hadoop.fs.FilterFileSystem.open(FilterFileSystem.java:161)
>   at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:766)
>   at org.apache.hadoop.hbase.regionserver.wal.HLogFactory.createReader(HLogFactory.java:116)
>   at org.apache.hadoop.hbase.regionserver.wal.HLogFactory.createReader(HLogFactory.java:89)
>   at org.apache.hadoop.hbase.regionserver.wal.HLogFactory.createReader(HLogFactory.java:77)
>   at 
> 
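
The failure mode described in the issue (a reader that keeps retrying an open on a WAL whose lease was never recovered) can be sketched in a few lines. This is only an illustration under stated assumptions, not the HBASE-15019 patch: `FakeFileSystem`, `openWithRecovery`, the example path, and the local `LeaseNotRecoveredException` class are hypothetical stand-ins (the commit file list only shows that a class of that name was added), and no real HDFS or HBase API is called.

```java
import java.io.IOException;

public class LeaseRecoverySketch {

    /** Hypothetical stand-in for the exception class named in the commit file list. */
    static class LeaseNotRecoveredException extends IOException {
        LeaseNotRecoveredException(String message) {
            super(message);
        }
    }

    /** Hypothetical in-memory model of a file whose lease has not been recovered yet. */
    static class FakeFileSystem {
        private boolean leaseRecovered = false;
        int openAttempts = 0;

        /** Fails until recoverLease() has been called, like the stuck reader in the logs. */
        String open(String path) throws IOException {
            openAttempts++;
            if (!leaseRecovered) {
                throw new LeaseNotRecoveredException("Cannot obtain block length for " + path);
            }
            return "reader:" + path;
        }

        void recoverLease(String path) {
            leaseRecovered = true;
        }
    }

    /**
     * Opens the file and, when the open fails because the lease was not recovered,
     * triggers recovery before retrying instead of retrying blindly.
     */
    static String openWithRecovery(FakeFileSystem fs, String path, int maxAttempts)
            throws IOException {
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return fs.open(path);
            } catch (LeaseNotRecoveredException e) {
                last = e;
                fs.recoverLease(path); // the step that was missing on the RS side
            }
        }
        throw last != null ? last : new IOException("maxAttempts must be positive");
    }

    public static void main(String[] args) throws IOException {
        FakeFileSystem fs = new FakeFileSystem();
        // Hypothetical path, used only for the demonstration.
        System.out.println(openWithRecovery(fs, "/hbase/WALs/example", 3));
        System.out.println("open attempts: " + fs.openAttempts);
    }
}
```

Without the `recoverLease` call in the catch block, the loop would exhaust its attempts and rethrow, which mirrors the "Can't open after 267 attempts" message in the log above.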

[jira] [Commented] (HBASE-15019) Replication stuck when HDFS is restarted

2016-01-29 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15123768#comment-15123768
 ] 

Hudson commented on HBASE-15019:


FAILURE: Integrated in HBase-0.98-matrix #290 (See 
[https://builds.apache.org/job/HBase-0.98-matrix/290/])
HBASE-15019 Replication stuck when HDFS is restarted. (matteo.bertozzi: rev 
60c6b6df104030995754bb1470a0d5d3e20cf220)
* hbase-server/src/main/java/org/apache/hadoop/hbase/util/LeaseNotRecoveredException.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/HLogFactory.java



[jira] [Commented] (HBASE-15019) Replication stuck when HDFS is restarted

2016-01-29 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15123811#comment-15123811
 ] 

Hudson commented on HBASE-15019:


FAILURE: Integrated in HBase-0.98-on-Hadoop-1.1 #1164 (See 
[https://builds.apache.org/job/HBase-0.98-on-Hadoop-1.1/1164/])
HBASE-15019 Replication stuck when HDFS is restarted. (matteo.bertozzi: rev 
60c6b6df104030995754bb1470a0d5d3e20cf220)
* hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/HLogFactory.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/util/LeaseNotRecoveredException.java



[jira] [Commented] (HBASE-15019) Replication stuck when HDFS is restarted

2016-01-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15122141#comment-15122141
 ] 

Hudson commented on HBASE-15019:


SUCCESS: Integrated in HBase-1.3-IT #466 (See 
[https://builds.apache.org/job/HBase-1.3-IT/466/])
HBASE-15019 Replication stuck when HDFS is restarted. (matteo.bertozzi: rev 
67c2fc7cd62f5d53da633f08d5a3c93600ac86f0)
* hbase-server/src/main/java/org/apache/hadoop/hbase/wal/WALFactory.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/util/LeaseNotRecoveredException.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java



[jira] [Commented] (HBASE-15019) Replication stuck when HDFS is restarted

2016-01-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15122388#comment-15122388
 ] 

Hudson commented on HBASE-15019:


SUCCESS: Integrated in HBase-1.2 #523 (See 
[https://builds.apache.org/job/HBase-1.2/523/])
HBASE-15019 Replication stuck when HDFS is restarted. (matteo.bertozzi: rev 
778c9730b3403f4b330578b44cce3f56d19cf25e)
* hbase-server/src/main/java/org/apache/hadoop/hbase/util/LeaseNotRecoveredException.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/wal/WALFactory.java



[jira] [Commented] (HBASE-15019) Replication stuck when HDFS is restarted

2016-01-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15122440#comment-15122440
 ] 

Hudson commented on HBASE-15019:


FAILURE: Integrated in HBase-1.3 #520 (See 
[https://builds.apache.org/job/HBase-1.3/520/])
HBASE-15019 Replication stuck when HDFS is restarted. (matteo.bertozzi: rev 
67c2fc7cd62f5d53da633f08d5a3c93600ac86f0)
* hbase-server/src/main/java/org/apache/hadoop/hbase/wal/WALFactory.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/util/LeaseNotRecoveredException.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java



[jira] [Commented] (HBASE-15019) Replication stuck when HDFS is restarted

2016-01-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15122203#comment-15122203
 ] 

Hudson commented on HBASE-15019:


SUCCESS: Integrated in HBase-1.2-IT #413 (See 
[https://builds.apache.org/job/HBase-1.2-IT/413/])
HBASE-15019 Replication stuck when HDFS is restarted. (matteo.bertozzi: rev 
778c9730b3403f4b330578b44cce3f56d19cf25e)
* hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/util/LeaseNotRecoveredException.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/wal/WALFactory.java


> Replication stuck when HDFS is restarted
> 
>
> Key: HBASE-15019
> URL: https://issues.apache.org/jira/browse/HBASE-15019
> Project: HBase
>  Issue Type: Bug
>  Components: Replication, wal
>Affects Versions: 2.0.0, 1.2.0, 1.1.2, 1.0.3, 0.98.16.1
>Reporter: Matteo Bertozzi
>Assignee: Matteo Bertozzi
> Fix For: 2.0.0, 1.2.0, 1.3.0, 1.1.4, 1.0.4, 0.98.18
>
> Attachments: HBASE-15019-v0_branch-1.2.patch, HBASE-15019-v1.patch, 
> HBASE-15019-v1_0.98.patch, HBASE-15019-v1_branch-1.2.patch, 
> HBASE-15019-v2.patch, HBASE-15019-v3.patch, HBASE-15019-v4.patch
>
>
> RS is normally working and writing on the WAL.
> HDFS is killed and restarted, and the RS try to do a roll.
> The close fail, but the roll succeed (because hdfs is now up) and everything 
> works.
> {noformat}
> 2015-12-11 21:52:28,058 ERROR 
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter: Got IOException 
> while writing trailer
> java.io.IOException: All datanodes 10.51.30.152:50010 are bad. Aborting...
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1147)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:945)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:496)
> 2015-12-11 21:52:28,059 ERROR 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog: Failed close of HLog writer
> java.io.IOException: All datanodes 10.51.30.152:50010 are bad. Aborting...
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1147)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:945)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:496)
> 2015-12-11 21:52:28,059 WARN org.apache.hadoop.hbase.regionserver.wal.FSHLog: 
> Riding over HLog close failure! error count=1
> {noformat}
> The problem is on the replication side. that log we rolled and we were not 
> able to close
> is waiting for a lease recovery.
> {noformat}
> 2015-12-11 21:16:31,909 ERROR 
> org.apache.hadoop.hbase.regionserver.wal.HLogFactory: Can't open after 267 
> attempts and 301124ms 
> {noformat}
> The WALFactory notifies us about that, but nothing on the RS side 
> performs the WAL recovery.
> {noformat}
> 2015-12-11 21:11:30,921 WARN 
> org.apache.hadoop.hbase.regionserver.wal.HLogFactory: Lease should have 
> recovered. This is not expected. Will retry
> java.io.IOException: Cannot obtain block length for 
> LocatedBlock{BP-1547065147-10.51.30.152-1446756937665:blk_1073801614_61243; 
> getBlockSize()=83; corrupt=false; offset=0; locs=[10.51.30.154:50010, 
> 10.51.30.152:50010, 10.51.30.155:50010]}
>   at 
> org.apache.hadoop.hdfs.DFSInputStream.readBlockLength(DFSInputStream.java:358)
>   at 
> org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:300)
>   at org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:237)
>   at org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:230)
>   at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1448)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:301)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:297)
>   at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:297)
>   at org.apache.hadoop.fs.FilterFileSystem.open(FilterFileSystem.java:161)
>   at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:766)
>   at 
> org.apache.hadoop.hbase.regionserver.wal.HLogFactory.createReader(HLogFactory.java:116)
>   at 
> org.apache.hadoop.hbase.regionserver.wal.HLogFactory.createReader(HLogFactory.java:89)
>   at 
> org.apache.hadoop.hbase.regionserver.wal.HLogFactory.createReader(HLogFactory.java:77)
>   at 
> 
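
The description quoted above says that the WAL reader keeps retrying while nothing on the RS side asks HDFS to recover the lease. A minimal Java sketch of the alternative, recovering the lease from inside the reader's retry loop, might look like the following. Everything here is a stand-in written for illustration: `WalFileSystem`, `FakeFs`, and `openWithLeaseRecovery` are hypothetical names, and `LeaseNotRecoveredException` is a local class that only borrows the name of the file listed in the commit. This is not the code from the attached patches.

```java
import java.io.IOException;

/** Illustrative only: stand-ins for the HDFS/HBase types, not the real classes. */
public class LeaseRecoverySketch {

  /** Local stand-in that borrows the name of the class listed in the commit. */
  static class LeaseNotRecoveredException extends IOException {
    LeaseNotRecoveredException(String msg) { super(msg); }
  }

  /** Hypothetical minimal view of a file system that holds WAL files. */
  interface WalFileSystem {
    /** Opens the WAL for reading; fails while the old writer's lease is held. */
    String open(String path) throws IOException;
    /** Asks the file system to recover the lease; true once the file is closed. */
    boolean recoverLease(String path) throws IOException;
  }

  /**
   * Tries to open a WAL. If the open fails because the lease was never
   * recovered, trigger recovery explicitly instead of only retrying.
   */
  static String openWithLeaseRecovery(WalFileSystem fs, String path, int maxAttempts)
      throws IOException {
    IOException last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return fs.open(path);
      } catch (LeaseNotRecoveredException e) {
        last = e;
        // Waiting alone does not help here: nobody else will recover the lease.
        fs.recoverLease(path);
      }
    }
    throw new IOException(
        "Can't open " + path + " after " + maxAttempts + " attempts", last);
  }

  /** In-memory fake: the file stays unreadable until recoverLease is called. */
  static class FakeFs implements WalFileSystem {
    boolean leaseHeld = true;
    int openCalls = 0;

    public String open(String path) throws IOException {
      openCalls++;
      if (leaseHeld) {
        throw new LeaseNotRecoveredException("Cannot obtain block length for " + path);
      }
      return "reader:" + path;
    }

    public boolean recoverLease(String path) {
      leaseHeld = false;
      return true;
    }
  }

  public static void main(String[] args) throws IOException {
    FakeFs fs = new FakeFs();
    System.out.println(openWithLeaseRecovery(fs, "/hbase/WALs/example", 3));
    System.out.println("open calls: " + fs.openCalls);
  }
}
```

With the fake file system, the first open fails, the loop requests lease recovery, and the second open succeeds. In a real deployment lease recovery is asynchronous and can take time, so a production version would presumably also need a pause between attempts; that detail is left out of this sketch.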

[jira] [Commented] (HBASE-15019) Replication stuck when HDFS is restarted

2016-01-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15122559#comment-15122559
 ] 

Hudson commented on HBASE-15019:


SUCCESS: Integrated in HBase-Trunk_matrix #666 (See 
[https://builds.apache.org/job/HBase-Trunk_matrix/666/])
HBASE-15019 Replication stuck when HDFS is restarted. (matteo.bertozzi: rev 
8a217da8fd3990f9880270eb1e50d8f87d1e92fb)
* 
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java
* 
hbase-server/src/main/java/org/apache/hadoop/hbase/util/LeaseNotRecoveredException.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/wal/WALFactory.java



[jira] [Commented] (HBASE-15019) Replication stuck when HDFS is restarted

2016-01-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15122552#comment-15122552
 ] 

Hudson commented on HBASE-15019:


FAILURE: Integrated in HBase-1.1-JDK8 #1735 (See 
[https://builds.apache.org/job/HBase-1.1-JDK8/1735/])
HBASE-15019 Replication stuck when HDFS is restarted. (matteo.bertozzi: rev 
5041485aa5c1ecfaa4697b8d0b8a78d027ceaa8a)
* hbase-server/src/main/java/org/apache/hadoop/hbase/wal/WALFactory.java
* 
hbase-server/src/main/java/org/apache/hadoop/hbase/util/LeaseNotRecoveredException.java
* 
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java



[jira] [Commented] (HBASE-15019) Replication stuck when HDFS is restarted

2016-01-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15122551#comment-15122551
 ] 

Hudson commented on HBASE-15019:


FAILURE: Integrated in HBase-1.1-JDK7 #1648 (See 
[https://builds.apache.org/job/HBase-1.1-JDK7/1648/])
HBASE-15019 Replication stuck when HDFS is restarted. (matteo.bertozzi: rev 
5041485aa5c1ecfaa4697b8d0b8a78d027ceaa8a)
* 
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/wal/WALFactory.java
* 
hbase-server/src/main/java/org/apache/hadoop/hbase/util/LeaseNotRecoveredException.java



[jira] [Commented] (HBASE-15019) Replication stuck when HDFS is restarted

2016-01-24 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15114242#comment-15114242
 ] 

Hadoop QA commented on HBASE-15019:
---

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s 
{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} hbaseanti {color} | {color:green} 0m 
0s {color} | {color:green} Patch does not have any anti-patterns. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s 
{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 2m 
35s {color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 30s 
{color} | {color:green} master passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 33s 
{color} | {color:green} master passed with JDK v1.7.0_91 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 4m 
15s {color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
16s {color} | {color:green} master passed {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 1m 49s 
{color} | {color:red} hbase-server in master has 1 extant Findbugs warnings. 
{color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 25s 
{color} | {color:green} master passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 34s 
{color} | {color:green} master passed with JDK v1.7.0_91 {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 
43s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 30s 
{color} | {color:green} the patch passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 30s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 34s 
{color} | {color:green} the patch passed with JDK v1.7.0_91 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 34s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 4m 
16s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
16s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 
0s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green} 
21m 15s {color} | {color:green} Patch does not cause any errors with Hadoop 
2.4.0 2.4.1 2.5.0 2.5.1 2.5.2 2.6.1 2.6.2 2.6.3 2.7.1. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 3s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 25s 
{color} | {color:green} the patch passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 32s 
{color} | {color:green} the patch passed with JDK v1.7.0_91 {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 77m 27s 
{color} | {color:green} hbase-server in the patch passed with JDK v1.8.0_66. 
{color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 78m 39s {color} 
| {color:red} hbase-server in the patch failed with JDK v1.7.0_91. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 
16s {color} | {color:green} Patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 198m 28s {color} 
| {color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=1.9.1 Server=1.9.1 Image:yetus/hbase:date2016-01-24 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12783597/HBASE-15019-v4.patch |
| JIRA Issue | HBASE-15019 |
| Optional Tests |  asflicense  javac  javadoc  unit  findbugs  hadoopcheck  
hbaseanti  checkstyle  compile  |
| uname | Linux b3a146136232 3.13.0-36-lowlatency 

[jira] [Commented] (HBASE-15019) Replication stuck when HDFS is restarted

2016-01-21 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15110361#comment-15110361
 ] 

Hadoop QA commented on HBASE-15019:
---

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:green}+1{color} | {color:green} hbaseanti {color} | {color:green} 0m 
0s {color} | {color:green} Patch does not have any anti-patterns. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s 
{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 2m 
28s {color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 35s 
{color} | {color:green} master passed with JDK v1.8.0 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 31s 
{color} | {color:green} master passed with JDK v1.7.0_79 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 4m 
10s {color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
15s {color} | {color:green} master passed {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 1m 47s 
{color} | {color:red} hbase-server in master has 1 extant Findbugs warnings. 
{color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 25s 
{color} | {color:green} master passed with JDK v1.8.0 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 31s 
{color} | {color:green} master passed with JDK v1.7.0_79 {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 
41s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 35s 
{color} | {color:green} the patch passed with JDK v1.8.0 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 35s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 31s 
{color} | {color:green} the patch passed with JDK v1.7.0_79 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 31s 
{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 4m 6s 
{color} | {color:red} Patch generated 1 new checkstyle issues in hbase-server 
(total was 17, now 18). {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
15s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 
0s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green} 
20m 52s {color} | {color:green} Patch does not cause any errors with Hadoop 
2.4.0 2.4.1 2.5.0 2.5.1 2.5.2 2.6.1 2.6.2 2.6.3 2.7.1. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 
58s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 25s 
{color} | {color:green} the patch passed with JDK v1.8.0 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 32s 
{color} | {color:green} the patch passed with JDK v1.7.0_79 {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 85m 23s 
{color} | {color:green} hbase-server in the patch passed with JDK v1.8.0. 
{color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 82m 49s {color} 
| {color:red} hbase-server in the patch failed with JDK v1.7.0_79. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 
16s {color} | {color:green} Patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 209m 27s {color} 
| {color:black} {color} |
\\
\\
|| Reason || Tests ||
| JDK v1.7.0_79 Failed junit tests | 
hadoop.hbase.regionserver.TestRegionMergeTransactionOnCluster |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12783523/HBASE-15019-v3.patch |
| JIRA Issue | HBASE-15019 |
| Optional Tests |  asflicense  javac  javadoc  unit  findbugs  hadoopcheck  
hbaseanti  checkstyle  compile  |
| uname | Linux asf901.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT 

[jira] [Commented] (HBASE-15019) Replication stuck when HDFS is restarted

2016-01-21 Thread Sean Busbey (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15110996#comment-15110996
 ] 

Sean Busbey commented on HBASE-15019:
-

The test failures don't look related. What do you think, [~stack]?

> Replication stuck when HDFS is restarted
> 
>
> Key: HBASE-15019
> URL: https://issues.apache.org/jira/browse/HBASE-15019
> Project: HBase
> Issue Type: Bug
> Components: Replication, wal
> Affects Versions: 2.0.0, 1.2.0, 1.1.2, 1.0.3, 0.98.16.1
> Reporter: Matteo Bertozzi
> Assignee: Matteo Bertozzi
> Fix For: 2.0.0, 1.2.0, 1.3.0
>
> Attachments: HBASE-15019-v0_branch-1.2.patch, HBASE-15019-v1.patch, 
> HBASE-15019-v1_0.98.patch, HBASE-15019-v1_branch-1.2.patch, 
> HBASE-15019-v2.patch, HBASE-15019-v3.patch, HBASE-15019-v4.patch
>
>
> The RS is working normally and writing to the WAL.
> HDFS is killed and restarted, and the RS tries to roll the WAL.
> The close fails, but the roll succeeds (because HDFS is now up) and everything 
> works.
> {noformat}
> 2015-12-11 21:52:28,058 ERROR 
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter: Got IOException 
> while writing trailer
> java.io.IOException: All datanodes 10.51.30.152:50010 are bad. Aborting...
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1147)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:945)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:496)
> 2015-12-11 21:52:28,059 ERROR 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog: Failed close of HLog writer
> java.io.IOException: All datanodes 10.51.30.152:50010 are bad. Aborting...
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1147)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:945)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:496)
> 2015-12-11 21:52:28,059 WARN org.apache.hadoop.hbase.regionserver.wal.FSHLog: 
> Riding over HLog close failure! error count=1
> {noformat}
> The problem is on the replication side: the log that we rolled but could not 
> close is still waiting for lease recovery.
> {noformat}
> 2015-12-11 21:16:31,909 ERROR 
> org.apache.hadoop.hbase.regionserver.wal.HLogFactory: Can't open after 267 
> attempts and 301124ms 
> {noformat}
> The WALFactory notifies us about that, but nothing on the RS side 
> performs the WAL recovery.
> {noformat}
> 2015-12-11 21:11:30,921 WARN 
> org.apache.hadoop.hbase.regionserver.wal.HLogFactory: Lease should have 
> recovered. This is not expected. Will retry
> java.io.IOException: Cannot obtain block length for 
> LocatedBlock{BP-1547065147-10.51.30.152-1446756937665:blk_1073801614_61243; 
> getBlockSize()=83; corrupt=false; offset=0; locs=[10.51.30.154:50010, 
> 10.51.30.152:50010, 10.51.30.155:50010]}
>   at 
> org.apache.hadoop.hdfs.DFSInputStream.readBlockLength(DFSInputStream.java:358)
>   at 
> org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:300)
>   at org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:237)
>   at org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:230)
>   at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1448)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:301)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:297)
>   at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:297)
>   at org.apache.hadoop.fs.FilterFileSystem.open(FilterFileSystem.java:161)
>   at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:766)
>   at 
> org.apache.hadoop.hbase.regionserver.wal.HLogFactory.createReader(HLogFactory.java:116)
>   at 
> org.apache.hadoop.hbase.regionserver.wal.HLogFactory.createReader(HLogFactory.java:89)
>   at 
> org.apache.hadoop.hbase.regionserver.wal.HLogFactory.createReader(HLogFactory.java:77)
>   at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationHLogReaderManager.openReader(ReplicationHLogReaderManager.java:68)
>   at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:508)
>   at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:321)
> {noformat}
> The only way to trigger a WAL recovery is to restart and force the master to 
> trigger the lease recovery on WAL split. 
> But there is a case where restarting will not help. If 

[jira] [Commented] (HBASE-15019) Replication stuck when HDFS is restarted

2016-01-21 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15110978#comment-15110978
 ] 

Hadoop QA commented on HBASE-15019:
---

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:green}+1{color} | {color:green} hbaseanti {color} | {color:green} 0m 0s {color} | {color:green} Patch does not have any anti-patterns. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s {color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 2m 42s {color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 54s {color} | {color:green} master passed with JDK v1.8.0 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 36s {color} | {color:green} master passed with JDK v1.7.0_79 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 4m 13s {color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 16s {color} | {color:green} master passed {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 2m 3s {color} | {color:red} hbase-server in master has 1 extant Findbugs warnings. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 36s {color} | {color:green} master passed with JDK v1.8.0 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 41s {color} | {color:green} master passed with JDK v1.7.0_79 {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 45s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 57s {color} | {color:green} the patch passed with JDK v1.8.0 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 57s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 35s {color} | {color:green} the patch passed with JDK v1.7.0_79 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 35s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 4m 24s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 15s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green} 23m 18s {color} | {color:green} Patch does not cause any errors with Hadoop 2.4.0 2.4.1 2.5.0 2.5.1 2.5.2 2.6.1 2.6.2 2.6.3 2.7.1. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 31s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 45s {color} | {color:green} the patch passed with JDK v1.8.0 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 55s {color} | {color:green} the patch passed with JDK v1.7.0_79 {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 105m 21s {color} | {color:red} hbase-server in the patch failed with JDK v1.8.0. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 0m 22s {color} | {color:red} hbase-server in the patch failed with JDK v1.7.0_79. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 8s {color} | {color:green} Patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 152m 43s {color} | {color:black} {color} |
\\
\\
|| Reason || Tests ||
| JDK v1.8.0 Timed out junit tests | org.apache.hadoop.hbase.coprocessor.TestRegionObserverInterface |
|   | org.apache.hadoop.hbase.snapshot.TestMobSecureExportSnapshot |
|   | org.apache.hadoop.hbase.snapshot.TestMobExportSnapshot |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12783597/HBASE-15019-v4.patch |
| JIRA Issue | HBASE-15019 |
| Optional Tests |  asflicense  javac  javadoc  unit  findbugs  hadoopcheck  hbaseanti  checkstyle  compile  |
| uname | Linux 

[jira] [Commented] (HBASE-15019) Replication stuck when HDFS is restarted

2016-01-21 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15112009#comment-15112009
 ] 

Hadoop QA commented on HBASE-15019:
---

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:red}-1{color} | {color:red} docker {color} | {color:red} 18m 23s {color} | {color:red} Docker failed to build yetus/hbase:date2016-01-22. {color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12783597/HBASE-15019-v4.patch |
| JIRA Issue | HBASE-15019 |
| Powered by | Apache Yetus 0.2.0-SNAPSHOT   http://yetus.apache.org |
| Console output | https://builds.apache.org/job/PreCommit-HBASE-Build/250/console |


This message was automatically generated.



> Replication stuck when HDFS is restarted
> 
>
> Key: HBASE-15019
> URL: https://issues.apache.org/jira/browse/HBASE-15019
> Project: HBase
>  Issue Type: Bug
>  Components: Replication, wal
>Affects Versions: 2.0.0, 1.2.0, 1.1.2, 1.0.3, 0.98.16.1
>Reporter: Matteo Bertozzi
>Assignee: Matteo Bertozzi
> Fix For: 2.0.0, 1.2.0, 1.3.0
>
> Attachments: HBASE-15019-v0_branch-1.2.patch, HBASE-15019-v1.patch, 
> HBASE-15019-v1_0.98.patch, HBASE-15019-v1_branch-1.2.patch, 
> HBASE-15019-v2.patch, HBASE-15019-v3.patch, HBASE-15019-v4.patch
>
>

[jira] [Commented] (HBASE-15019) Replication stuck when HDFS is restarted

2016-01-20 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15110118#comment-15110118
 ] 

Hadoop QA commented on HBASE-15019:
---

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:red}-1{color} | {color:red} patch {color} | {color:red} 0m 3s {color} | {color:red} HBASE-15019 does not apply to master. Rebase required? Wrong Branch? See https://yetus.apache.org/documentation/latest/precommit-patchnames for help. {color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12780069/HBASE-15019-v2.patch |
| JIRA Issue | HBASE-15019 |
| Powered by | Apache Yetus 0.1.0   http://yetus.apache.org |
| Console output | https://builds.apache.org/job/PreCommit-HBASE-Build/222/console |


This message was automatically generated.



> Replication stuck when HDFS is restarted
> 
>
> Key: HBASE-15019
> URL: https://issues.apache.org/jira/browse/HBASE-15019
> Project: HBase
>  Issue Type: Bug
>  Components: Replication, wal
>Affects Versions: 2.0.0, 1.2.0, 1.1.2, 1.0.3, 0.98.16.1
>Reporter: Matteo Bertozzi
>Assignee: Matteo Bertozzi
> Fix For: 2.0.0, 1.2.0, 1.3.0
>
> Attachments: HBASE-15019-v0_branch-1.2.patch, HBASE-15019-v1.patch, 
> HBASE-15019-v1_0.98.patch, HBASE-15019-v1_branch-1.2.patch, 
> HBASE-15019-v2.patch
>
>

[jira] [Commented] (HBASE-15019) Replication stuck when HDFS is restarted

2016-01-19 Thread Sean Busbey (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15108070#comment-15108070
 ] 

Sean Busbey commented on HBASE-15019:
-

Are you still trying to get this in for 1.2, [~mbertozzi]?

> Replication stuck when HDFS is restarted
> 
>
> Key: HBASE-15019
> URL: https://issues.apache.org/jira/browse/HBASE-15019
> Project: HBase
>  Issue Type: Bug
>  Components: Replication, wal
>Affects Versions: 2.0.0, 1.2.0, 1.1.2, 1.0.3, 0.98.16.1
>Reporter: Matteo Bertozzi
>Assignee: Matteo Bertozzi
> Fix For: 2.0.0, 1.2.0, 1.3.0
>
> Attachments: HBASE-15019-v0_branch-1.2.patch, HBASE-15019-v1.patch, 
> HBASE-15019-v1_0.98.patch, HBASE-15019-v1_branch-1.2.patch, 
> HBASE-15019-v2.patch
>
>

[jira] [Commented] (HBASE-15019) Replication stuck when HDFS is restarted

2016-01-19 Thread Matteo Bertozzi (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15108111#comment-15108111
 ] 

Matteo Bertozzi commented on HBASE-15019:
-

I just need a +1

> Replication stuck when HDFS is restarted
> 
>
> Key: HBASE-15019
> URL: https://issues.apache.org/jira/browse/HBASE-15019
> Project: HBase
>  Issue Type: Bug
>  Components: Replication, wal
>Affects Versions: 2.0.0, 1.2.0, 1.1.2, 1.0.3, 0.98.16.1
>Reporter: Matteo Bertozzi
>Assignee: Matteo Bertozzi
> Fix For: 2.0.0, 1.2.0, 1.3.0
>
> Attachments: HBASE-15019-v0_branch-1.2.patch, HBASE-15019-v1.patch, 
> HBASE-15019-v1_0.98.patch, HBASE-15019-v1_branch-1.2.patch, 
> HBASE-15019-v2.patch
>
>

[jira] [Commented] (HBASE-15019) Replication stuck when HDFS is restarted

2015-12-30 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15075699#comment-15075699
 ] 

Hadoop QA commented on HBASE-15019:
---

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12780069/HBASE-15019-v2.patch
  against master branch at commit c1b6d47e7974a5d9d75933bab9a28572e9d95c14.
  ATTACHMENT ID: 12780069

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include any new or modified tests.
Please justify why no new tests are needed for this patch.
Also please list what manual steps were performed to verify this patch.

{color:green}+1 hadoop versions{color}. The patch compiles with all 
supported hadoop versions (2.4.0 2.4.1 2.5.0 2.5.1 2.5.2 2.6.0 2.6.1 2.7.0 
2.7.1)

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 protoc{color}.  The applied patch does not increase the 
total number of protoc compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:red}-1 checkstyle{color}.  The applied patch generated 
new checkstyle errors. Check build console for list of new errors.

{color:green}+1 findbugs{color}.  The patch does not introduce any  new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 lineLengths{color}.  The patch does not introduce lines 
longer than 100

{color:green}+1 site{color}.  The mvn post-site goal succeeds with this 
patch.

{color:green}+1 core tests{color}.  The patch passed unit tests in .

{color:green}+1 zombies{color}. No zombie tests found running at the end of 
the build.

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/17090//testReport/
Release Findbugs (version 2.0.3)warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/17090//artifact/patchprocess/newFindbugsWarnings.html
Checkstyle Errors: 
https://builds.apache.org/job/PreCommit-HBASE-Build/17090//artifact/patchprocess/checkstyle-aggregate.html

Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/17090//console

This message is automatically generated.

> Replication stuck when HDFS is restarted
> 
>
> Key: HBASE-15019
> URL: https://issues.apache.org/jira/browse/HBASE-15019
> Project: HBase
>  Issue Type: Bug
>  Components: Replication, wal
>Affects Versions: 2.0.0, 1.2.0, 1.1.2, 1.0.3, 0.98.16.1
>Reporter: Matteo Bertozzi
>Assignee: Matteo Bertozzi
> Attachments: HBASE-15019-v0_branch-1.2.patch, HBASE-15019-v1.patch, 
> HBASE-15019-v1_0.98.patch, HBASE-15019-v1_branch-1.2.patch, 
> HBASE-15019-v2.patch
>
>

[jira] [Commented] (HBASE-15019) Replication stuck when HDFS is restarted

2015-12-30 Thread Matteo Bertozzi (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15075528#comment-15075528
 ] 

Matteo Bertozzi commented on HBASE-15019:
-

The patches for all the branches are the same, apart from some whitespace conflicts.

> Replication stuck when HDFS is restarted
> 
>
> Key: HBASE-15019
> URL: https://issues.apache.org/jira/browse/HBASE-15019
> Project: HBase
>  Issue Type: Bug
>  Components: Replication, wal
>Affects Versions: 2.0.0, 1.2.0, 1.1.2, 1.0.3, 0.98.16.1
>Reporter: Matteo Bertozzi
>Assignee: Matteo Bertozzi
> Attachments: HBASE-15019-v0_branch-1.2.patch, HBASE-15019-v1.patch, 
> HBASE-15019-v1_0.98.patch, HBASE-15019-v1_branch-1.2.patch
>
>
> The RS is working normally and writing to the WAL.
> HDFS is killed and restarted, and the RS tries to do a roll.
> The close fails, but the roll succeeds (because HDFS is now up) and everything 
> works.
> {noformat}
> 2015-12-11 21:52:28,058 ERROR 
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter: Got IOException 
> while writing trailer
> java.io.IOException: All datanodes 10.51.30.152:50010 are bad. Aborting...
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1147)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:945)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:496)
> 2015-12-11 21:52:28,059 ERROR 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog: Failed close of HLog writer
> java.io.IOException: All datanodes 10.51.30.152:50010 are bad. Aborting...
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1147)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:945)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:496)
> 2015-12-11 21:52:28,059 WARN org.apache.hadoop.hbase.regionserver.wal.FSHLog: 
> Riding over HLog close failure! error count=1
> {noformat}
> The problem is on the replication side: the log that we rolled but were not 
> able to close is waiting for a lease recovery.
> {noformat}
> 2015-12-11 21:16:31,909 ERROR 
> org.apache.hadoop.hbase.regionserver.wal.HLogFactory: Can't open after 267 
> attempts and 301124ms 
> {noformat}
> The WALFactory notifies us about that, but there is nothing on the RS side that 
> performs the WAL recovery.
> {noformat}
> 2015-12-11 21:11:30,921 WARN 
> org.apache.hadoop.hbase.regionserver.wal.HLogFactory: Lease should have 
> recovered. This is not expected. Will retry
> java.io.IOException: Cannot obtain block length for 
> LocatedBlock{BP-1547065147-10.51.30.152-1446756937665:blk_1073801614_61243; 
> getBlockSize()=83; corrupt=false; offset=0; locs=[10.51.30.154:50010, 
> 10.51.30.152:50010, 10.51.30.155:50010]}
>   at 
> org.apache.hadoop.hdfs.DFSInputStream.readBlockLength(DFSInputStream.java:358)
>   at 
> org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:300)
>   at org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:237)
>   at org.apache.hadoop.hdfs.DFSInputStream.(DFSInputStream.java:230)
>   at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1448)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:301)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:297)
>   at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:297)
>   at org.apache.hadoop.fs.FilterFileSystem.open(FilterFileSystem.java:161)
>   at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:766)
>   at 
> org.apache.hadoop.hbase.regionserver.wal.HLogFactory.createReader(HLogFactory.java:116)
>   at 
> org.apache.hadoop.hbase.regionserver.wal.HLogFactory.createReader(HLogFactory.java:89)
>   at 
> org.apache.hadoop.hbase.regionserver.wal.HLogFactory.createReader(HLogFactory.java:77)
>   at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationHLogReaderManager.openReader(ReplicationHLogReaderManager.java:68)
>   at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:508)
>   at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:321)
> {noformat}
> The only way to trigger a WAL recovery is to restart and force the master to 
> trigger the lease recovery on WAL split. 
> But there is a case where restarting will not help: if the RS keeps 
> rolling and flushing, the unclosed WAL will be moved to the archive, and at 
> 

[jira] [Commented] (HBASE-15019) Replication stuck when HDFS is restarted

2015-12-30 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1507#comment-1507
 ] 

Ted Yu commented on HBASE-15019:


lgtm
{code}
recoverLease(conf, currentPath);
{code}
Should the return value from recoverLease() be checked?


[jira] [Commented] (HBASE-15019) Replication stuck when HDFS is restarted

2015-12-30 Thread Matteo Bertozzi (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15075560#comment-15075560
 ] 

Matteo Bertozzi commented on HBASE-15019:
-

nah, I can change the method to void... the return value there is meaningless: we 
don't know if the lease was recovered or not anyway, and we will retry the open 
on the next run. 
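
The reasoning above (request lease recovery on a best-effort basis, ignore the outcome, and let the next run retry the open) can be sketched as follows. This is a minimal illustration and not the patch itself: `WalFs`, `FakeFs`, `tryOpen` and `runsUntilOpen` are hypothetical names, and the fake file system assumes that recovery completes after a fixed number of requests.

```java
import java.io.IOException;

final class LeaseRetrySketch {
    /** Hypothetical stand-in for the file system calls made by the replication reader. */
    interface WalFs {
        void open(String path) throws IOException; // fails while the lease is unrecovered
        void recoverLease(String path);            // best effort, reports nothing back
    }

    /** Fake file system (assumption): open succeeds once enough recoveries were requested. */
    static final class FakeFs implements WalFs {
        private final int recoveriesNeeded;
        private int recoveriesRequested = 0;

        FakeFs(int recoveriesNeeded) {
            this.recoveriesNeeded = recoveriesNeeded;
        }

        @Override
        public void open(String path) throws IOException {
            if (recoveriesRequested < recoveriesNeeded) {
                throw new IOException("Cannot obtain block length for " + path);
            }
        }

        @Override
        public void recoverLease(String path) {
            recoveriesRequested++;
        }
    }

    /** One run: try to open; on failure ask for lease recovery and give up for now. */
    static boolean tryOpen(WalFs fs, String path) {
        try {
            fs.open(path);
            return true;
        } catch (IOException e) {
            fs.recoverLease(path); // outcome unknown, so nothing to check here
            return false;
        }
    }

    /** Repeats tryOpen; returns the run on which the open succeeded, or -1. */
    static int runsUntilOpen(WalFs fs, String path, int maxRuns) {
        for (int run = 1; run <= maxRuns; run++) {
            if (tryOpen(fs, path)) {
                return run;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(runsUntilOpen(new FakeFs(2), "wal.1", 10));
    }
}
```

Because the caller never branches on the result of the recovery request, a void method loses nothing: success only becomes visible when a later open stops failing.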


[jira] [Commented] (HBASE-15019) Replication stuck when HDFS is restarted

2015-12-30 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15075564#comment-15075564
 ] 

Ted Yu commented on HBASE-15019:


Changing return type to void would make the usage clearer.

Thanks


[jira] [Commented] (HBASE-15019) Replication stuck when HDFS is restarted

2015-12-30 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15075675#comment-15075675
 ] 

Hadoop QA commented on HBASE-15019:
---

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12780064/HBASE-15019-v1.patch
  against master branch at commit c1b6d47e7974a5d9d75933bab9a28572e9d95c14.
  ATTACHMENT ID: 12780064

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 hadoop versions{color}. The patch compiles with all 
supported hadoop versions (2.4.0 2.4.1 2.5.0 2.5.1 2.5.2 2.6.0 2.6.1 2.7.0 
2.7.1)

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 protoc{color}.  The applied patch does not increase the 
total number of protoc compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:red}-1 checkstyle{color}.  The applied patch generated 
new checkstyle errors. Check build console for list of new errors.

{color:green}+1 findbugs{color}.  The patch does not introduce any  new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 lineLengths{color}.  The patch does not introduce lines 
longer than 100

{color:green}+1 site{color}.  The mvn post-site goal succeeds with this 
patch.

{color:green}+1 core tests{color}.  The patch passed unit tests in .

{color:green}+1 zombies{color}. No zombie tests found running at the end of 
the build.

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/17088//testReport/
Release Findbugs (version 2.0.3) warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/17088//artifact/patchprocess/newFindbugsWarnings.html
Checkstyle Errors: 
https://builds.apache.org/job/PreCommit-HBASE-Build/17088//artifact/patchprocess/checkstyle-aggregate.html

Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/17088//console

This message is automatically generated.


[jira] [Commented] (HBASE-15019) Replication stuck when HDFS is restarted

2015-12-23 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15070509#comment-15070509
 ] 

Hadoop QA commented on HBASE-15019:
---

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12779343/HBASE-15019-v0_branch-1.2.patch
  against branch-1.2 branch at commit 04de427e57d144caf5a9cde3664dac780ed763ab.
  ATTACHMENT ID: 12779343

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 hadoop versions{color}. The patch compiles with all 
supported hadoop versions (2.4.0 2.4.1 2.5.0 2.5.1 2.5.2 2.6.0 2.6.1 2.7.0 
2.7.1)

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 protoc{color}.  The applied patch does not increase the 
total number of protoc compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 checkstyle{color}. The applied patch does not generate new 
checkstyle errors.

{color:green}+1 findbugs{color}.  The patch does not introduce any  new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 lineLengths{color}.  The patch does not introduce lines 
longer than 100

{color:green}+1 site{color}.  The mvn post-site goal succeeds with this 
patch.

{color:green}+1 core tests{color}.  The patch passed unit tests in .

{color:green}+1 zombies{color}. No zombie tests found running at the end of 
the build.

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/17004//testReport/
Release Findbugs (version 2.0.3) warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/17004//artifact/patchprocess/newFindbugsWarnings.html
Checkstyle Errors: 
https://builds.apache.org/job/PreCommit-HBASE-Build/17004//artifact/patchprocess/checkstyle-aggregate.html

  Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/17004//console

This message is automatically generated.


[jira] [Commented] (HBASE-15019) Replication stuck when HDFS is restarted

2015-12-22 Thread Matteo Bertozzi (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15068637#comment-15068637
 ] 

Matteo Bertozzi commented on HBASE-15019:
-

so, on the RS we know when we failed to close a WAL and we know when the open 
problem is caused by a file that was not closed (lease recovery not called).
 * We can abort the RS once we get stuck on this problem, and the split will 
take care of it.
 * We may try to call recoverLease() on the RS if replication is not able 
to open the file. in the worst case (e.g. a partition) the Master and the RS 
will fight for the lease and we will get stuck there.
 * We can add an rpc to the master and ask it to do the lease recovery, so that 
only the master does lease recovery and we avoid the RS and the Master fighting 
over it.
 * We can keep a list of unclosed streams in the FSHLog and try to close them 
every once in a while. If we are able to append to a new file, we should be able 
to close the old one.

the first option is the easiest one, but we kill the RS. 
the second one is probably a no-go since it may cause a deadlock in the 
worst case.
the third one requires a new rpc, which may be ok but meh...
the fourth looks like a hack but it is simple and isolated. the only problem 
with it is the strong assumption that we are able to close a stream that we 
are still holding even after an hdfs restart (which seems to work, i'm 
trying to test it) 
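
A minimal sketch of the fourth option, for illustration only. `UnclosedWriterTracker` and its methods are hypothetical names, not FSHLog code, and the sketch assumes that calling close() again on a tracked writer really retries the close, which is exactly the assumption that still has to be tested.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Deque;

final class UnclosedWriterTracker {
    // Writers whose close() failed and that should be retried later.
    private final Deque<Closeable> unclosed = new ArrayDeque<>();

    /** Closes the writer; on failure remembers it for a later retry. Returns true on success. */
    synchronized boolean closeOrTrack(Closeable writer) {
        try {
            writer.close();
            return true;
        } catch (IOException e) {
            unclosed.add(writer);
            return false;
        }
    }

    /** Retries each tracked writer once; returns how many are still unclosed afterwards. */
    synchronized int retryPending() {
        int pending = unclosed.size();
        for (int i = 0; i < pending; i++) {
            Closeable writer = unclosed.poll();
            try {
                writer.close();
            } catch (IOException e) {
                unclosed.add(writer); // still failing, keep it for the next round
            }
        }
        return unclosed.size();
    }

    public static void main(String[] args) {
        // Hypothetical writer whose first close() fails and whose second one succeeds.
        int[] failuresLeft = {1};
        Closeable flaky = () -> {
            if (failuresLeft[0]-- > 0) {
                throw new IOException("hdfs down");
            }
        };
        UnclosedWriterTracker tracker = new UnclosedWriterTracker();
        System.out.println(tracker.closeOrTrack(flaky)); // first close fails, writer is tracked
        System.out.println(tracker.retryPending());      // retry succeeds, nothing left pending
    }
}
```

A periodic chore (or the next log roll) would call retryPending(); the tracker itself is isolated from the rest of the WAL code, which is what makes this option simple.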


[jira] [Commented] (HBASE-15019) Replication stuck when HDFS is restarted

2015-12-22 Thread Matteo Bertozzi (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15068941#comment-15068941
 ] 

Matteo Bertozzi commented on HBASE-15019:
-

unfortunately the 4th option doesn't seem to work. calling close() a second time 
will report success but it is not really doing anything. the first 
close() sets the closed flag to true, and on the next call to close() the 
exception is dropped.
https://github.com/apache/hadoop/blob/branch-2.6/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java#L2209
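
The behaviour described here can be modelled in a few lines. This is a deliberately simplified model and not the DFSOutputStream code: `ModelStream`, `pipelineHealthy` and `fileCompleted` are hypothetical names, and the real class carries more state than a single flag.

```java
import java.io.Closeable;
import java.io.IOException;

final class StickyCloseSketch {
    /** Simplified model (assumption) of a stream that marks itself closed even when close fails. */
    static final class ModelStream implements Closeable {
        private boolean closed = false;
        private boolean pipelineHealthy;
        private boolean fileCompleted = false;

        ModelStream(boolean pipelineHealthy) {
            this.pipelineHealthy = pipelineHealthy;
        }

        void setPipelineHealthy(boolean healthy) {
            this.pipelineHealthy = healthy;
        }

        boolean isFileCompleted() {
            return fileCompleted;
        }

        @Override
        public void close() throws IOException {
            if (closed) {
                return; // repeat call: reports success without retrying anything
            }
            try {
                if (!pipelineHealthy) {
                    throw new IOException("All datanodes are bad. Aborting...");
                }
                fileCompleted = true;
            } finally {
                closed = true; // set even when the close above failed
            }
        }
    }

    public static void main(String[] args) throws IOException {
        ModelStream stream = new ModelStream(false);
        try {
            stream.close();
        } catch (IOException e) {
            System.out.println("first close failed");
        }
        stream.setPipelineHealthy(true);        // hdfs is back
        stream.close();                         // returns normally
        System.out.println(stream.isFileCompleted());
    }
}
```

In this model the second close() returns normally even though the file was never completed, so retrying close() on the same stream object cannot stand in for lease recovery.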

I played a bit with DFSOutputStream, trying to keep it alive so that the file 
can be closed later, but it doesn't look like an easy fix. later I found 
that colin tried to fix the problem in HDFS-4504 but the patch never got in.

also stack pointed out that with the second option (RS calling recover lease) at 
some point the RS will get the YouAreDeadException, so even if we spin for some 
time we will never end up in an infinite loop with the two fighting over 
recovery.

> Replication stuck when HDFS is restarted
> 
>
> Key: HBASE-15019
> URL: https://issues.apache.org/jira/browse/HBASE-15019
> Project: HBase
> Issue Type: Bug
> Components: Replication, wal
> Affects Versions: 2.0.0, 1.2.0, 1.1.2, 1.0.3, 0.98.16.1
> Reporter: Matteo Bertozzi
> Assignee: Matteo Bertozzi
>
> RS is normally working and writing on the WAL.
> HDFS is killed and restarted, and the RS try to do a roll.
> The close fail, but the roll succeed (because hdfs is now up) and everything 
> works.
> {noformat}
> 2015-12-11 21:52:28,058 ERROR 
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter: Got IOException 
> while writing trailer
> java.io.IOException: All datanodes 10.51.30.152:50010 are bad. Aborting...
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1147)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:945)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:496)
> 2015-12-11 21:52:28,059 ERROR 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog: Failed close of HLog writer
> java.io.IOException: All datanodes 10.51.30.152:50010 are bad. Aborting...
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1147)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:945)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:496)
> 2015-12-11 21:52:28,059 WARN org.apache.hadoop.hbase.regionserver.wal.FSHLog: 
> Riding over HLog close failure! error count=1
> {noformat}
> The problem is on the replication side. that log we rolled and we were not 
> able to close
> is waiting for a lease recovery.
> {noformat}
> 2015-12-11 21:16:31,909 ERROR 
> org.apache.hadoop.hbase.regionserver.wal.HLogFactory: Can't open after 267 
> attempts and 301124ms 
> {noformat}
> the WALFactory notify us about that, but there is nothing on the RS side that 
> perform the WAL recovery.
> {noformat}
> 2015-12-11 21:11:30,921 WARN 
> org.apache.hadoop.hbase.regionserver.wal.HLogFactory: Lease should have 
> recovered. This is not expected. Will retry
> java.io.IOException: Cannot obtain block length for 
> LocatedBlock{BP-1547065147-10.51.30.152-1446756937665:blk_1073801614_61243; 
> getBlockSize()=83; corrupt=false; offset=0; locs=[10.51.30.154:50010, 
> 10.51.30.152:50010, 10.51.30.155:50010]}
>   at 
> org.apache.hadoop.hdfs.DFSInputStream.readBlockLength(DFSInputStream.java:358)
>   at 
> org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:300)
>   at org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:237)
>   at org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:230)
>   at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1448)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:301)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:297)
>   at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:297)
>   at org.apache.hadoop.fs.FilterFileSystem.open(FilterFileSystem.java:161)
>   at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:766)
>   at 
> org.apache.hadoop.hbase.regionserver.wal.HLogFactory.createReader(HLogFactory.java:116)
>   at 
> org.apache.hadoop.hbase.regionserver.wal.HLogFactory.createReader(HLogFactory.java:89)
>   at 
> 

[jira] [Commented] (HBASE-15019) Replication stuck when HDFS is restarted

2015-12-22 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15069163#comment-15069163
 ] 

Hadoop QA commented on HBASE-15019:
---

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12779163/HBASE-15019-v0_branch-1.2.patch
  against branch-1.2 branch at commit 1af98f255132ef6716a1f6ba1d8d71a36ea38840.
  ATTACHMENT ID: 12779163

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 hadoop versions{color}. The patch compiles with all 
supported hadoop versions (2.4.0 2.4.1 2.5.0 2.5.1 2.5.2 2.6.0 2.6.1 2.7.0 
2.7.1)

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 protoc{color}.  The applied patch does not increase the 
total number of protoc compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 checkstyle{color}. The applied patch does not generate new 
checkstyle errors.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 lineLengths{color}.  The patch does not introduce lines 
longer than 100

{color:green}+1 site{color}.  The mvn post-site goal succeeds with this 
patch.

{color:red}-1 core tests{color}.  The patch failed these unit tests:
 

{color:green}+1 zombies{color}. No zombie tests found running at the end of 
the build.

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/16985//testReport/
Release Findbugs (version 2.0.3) warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/16985//artifact/patchprocess/newFindbugsWarnings.html
Checkstyle Errors: 
https://builds.apache.org/job/PreCommit-HBASE-Build/16985//artifact/patchprocess/checkstyle-aggregate.html

  Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/16985//console

This message is automatically generated.

> Replication stuck when HDFS is restarted
> 
>
> Key: HBASE-15019
> URL: https://issues.apache.org/jira/browse/HBASE-15019
> Project: HBase
> Issue Type: Bug
> Components: Replication, wal
> Affects Versions: 2.0.0, 1.2.0, 1.1.2, 1.0.3, 0.98.16.1
> Reporter: Matteo Bertozzi
> Assignee: Matteo Bertozzi
> Attachments: HBASE-15019-v0_branch-1.2.patch
>
>

[jira] [Commented] (HBASE-15019) Replication stuck when HDFS is restarted

2015-12-21 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15066796#comment-15066796
 ] 

stack commented on HBASE-15019:
---

bq. since we know that the RS is still going, should we try to recover the 
lease on the RS side? is it better/safer to trigger an abort on the RS, so we 
have only the master doing lease recovery?

Would be sweet if we could avoid killing the RS. Can we figure out the explicit 
case of a replicating RS that comes across a log that needs lease recovery AND 
the log was written by this RS instance AND HDFS is healthy now -- i.e. the 
subsequent WAL log, the one that came after the one that needs recovering, is 
good...

I suppose the only scenario where the subsequent log is good would be this case 
where HDFS has been restarted underneath us. There are likely probes we can 
run to recognize this particular case and, only here, let the RS do lease 
recovery.
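
The three conditions above could be folded into a single guard. The sketch 
below is only an illustration of that idea: LeaseRecoveryGate and its 
parameter names are hypothetical, and how each flag would actually be probed 
is left open.

```java
// Hypothetical guard: none of these names come from the HBase code base.
final class LeaseRecoveryGate {
  private LeaseRecoveryGate() {
  }

  /**
   * Returns true only for the narrow "HDFS restarted underneath us" case:
   * the WAL needs lease recovery, this RS instance wrote it, and the WAL
   * that came after it is readable (taken as a sign that HDFS is healthy).
   */
  static boolean shouldRecoverOnRegionServer(
      boolean needsLeaseRecovery,
      boolean writtenByThisServerInstance,
      boolean subsequentWalIsGood) {
    return needsLeaseRecovery && writtenByThisServerInstance && subsequentWalIsGood;
  }
}
```

In every other combination the RS would leave lease recovery alone.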

IIRC, if a process asks the NN to recover a lease, recovery is taking a while, 
and the process then asks the NN again to recover the lease, the NN on receipt 
of the second call restarts the whole lease recovery process from scratch. 
The lease never recovers if the client process retries its lease recovery 
request at an interval shorter than the time the recovery takes.
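
A small simulation of the restart behaviour as recalled above (it models that 
recollection, not confirmed NameNode internals). LeaseRecoverySim and its 
parameters are made-up names, and time is logical, so nothing sleeps.

```java
// Toy simulation with hypothetical names; not based on NameNode source code.
final class LeaseRecoverySim {
  private LeaseRecoverySim() {
  }

  /**
   * The client re-requests recovery every retryIntervalMs. Each request that
   * arrives before recovery has finished restarts the recovery clock.
   * Returns the time at which recovery completes, or -1 if it does not
   * complete by deadlineMs.
   */
  static long recover(long retryIntervalMs, long recoveryDurationMs, long deadlineMs) {
    if (retryIntervalMs <= 0) {
      throw new IllegalArgumentException("retryIntervalMs must be positive");
    }
    long lastRequestAt = 0; // recovery restarts from the most recent request
    for (long now = 0; now <= deadlineMs; now += retryIntervalMs) {
      if (now - lastRequestAt >= recoveryDurationMs) {
        return lastRequestAt + recoveryDurationMs; // finished before this retry
      }
      lastRequestAt = now; // asking again starts recovery over from scratch
    }
    return -1;
  }
}
```

With a recovery that takes 4000 ms, recover(1000, 4000, 300000) returns -1, 
while recover(5000, 4000, 300000) returns 4000. Under this model the retry 
interval has to be at least as long as the recovery itself.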

If a Master and a RS got into a situation where they were both trying to 
recover a lease, they could end up fighting each other and frustrating lease 
recovery totally.



> Replication stuck when HDFS is restarted
> 
>
> Key: HBASE-15019
> URL: https://issues.apache.org/jira/browse/HBASE-15019
> Project: HBase
>  Issue Type: Bug
>  Components: Replication, wal
>Affects Versions: 2.0.0, 1.2.0, 1.1.2, 1.0.3, 0.98.16.1
>Reporter: Matteo Bertozzi
>Assignee: Matteo Bertozzi
>