[jira] [Commented] (HBASE-8207) Replication could have data loss when machine name contains hyphen -
[ https://issues.apache.org/jira/browse/HBASE-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13758704#comment-13758704 ] Hudson commented on HBASE-8207: --- FAILURE: Integrated in HBase-0.92-security #150 (See [https://builds.apache.org/job/HBase-0.92-security/150/]) HBASE-9154. [0.92] Backport HBASE-8207 Replication could have data loss when machine name contains hyphen - (Jeffrey) (apurtell: rev 1511424) * /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/replication/ReplicationZookeeper.java * /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java * /hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/replication/TestMasterReplication.java * /hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/replication/TestMultiSlaveReplication.java * /hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/replication/TestReplication.java Replication could have data loss when machine name contains hyphen - -- Key: HBASE-8207 URL: https://issues.apache.org/jira/browse/HBASE-8207 Project: HBase Issue Type: Bug Components: Replication Affects Versions: 0.94.6, 0.95.0 Reporter: Jeffrey Zhong Assignee: Jeffrey Zhong Priority: Critical Fix For: 0.98.0, 0.94.7, 0.95.0 Attachments: 8207-trunk-addendum.txt, 8207_v3.patch, failed.txt, hbase-8207-0.94-v1.patch, hbase-8207.patch, hbase-8207_v1.patch, hbase-8207_v2.patch, hbase-8207_v2.patch, HBASE-8212-94.patch In the recent test case TestReplication* failures, I'm finally able to find the cause(or one of causes) for its intermittent failures. When a machine name contains -, it breaks the function ReplicationSource.checkIfQueueRecovered. It causes the following issue: deadRegionServers list is way off so that replication doesn't wait for log splitting finish for a wal file and move on to the next one(data loss) You can see that replication use those weird paths constructed from deadRegionServers to check a file existence {code} 2013-03-26 21:26:51,385 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/1.compute.internal,52170,1364333181125/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,386 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/1.compute.internal,52170,1364333181125-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,387 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/west/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,389 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/west-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,391 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/156.us/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,394 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/156.us-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,396 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/0/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,398 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location
[jira] [Commented] (HBASE-8207) Replication could have data loss when machine name contains hyphen -
[ https://issues.apache.org/jira/browse/HBASE-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13732747#comment-13732747 ] Hudson commented on HBASE-8207: --- FAILURE: Integrated in HBase-0.92 #614 (See [https://builds.apache.org/job/HBase-0.92/614/]) HBASE-9154. [0.92] Backport HBASE-8207 Replication could have data loss when machine name contains hyphen - (Jeffrey) (apurtell: rev 1511424) * /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/replication/ReplicationZookeeper.java * /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java * /hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/replication/TestMasterReplication.java * /hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/replication/TestMultiSlaveReplication.java * /hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/replication/TestReplication.java Replication could have data loss when machine name contains hyphen - -- Key: HBASE-8207 URL: https://issues.apache.org/jira/browse/HBASE-8207 Project: HBase Issue Type: Bug Components: Replication Affects Versions: 0.94.6, 0.95.0 Reporter: Jeffrey Zhong Assignee: Jeffrey Zhong Priority: Critical Fix For: 0.98.0, 0.94.7, 0.95.0 Attachments: 8207-trunk-addendum.txt, 8207_v3.patch, failed.txt, hbase-8207-0.94-v1.patch, hbase-8207.patch, hbase-8207_v1.patch, hbase-8207_v2.patch, hbase-8207_v2.patch, HBASE-8212-94.patch In the recent test case TestReplication* failures, I'm finally able to find the cause(or one of causes) for its intermittent failures. When a machine name contains -, it breaks the function ReplicationSource.checkIfQueueRecovered. It causes the following issue: deadRegionServers list is way off so that replication doesn't wait for log splitting finish for a wal file and move on to the next one(data loss) You can see that replication use those weird paths constructed from deadRegionServers to check a file existence {code} 2013-03-26 21:26:51,385 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/1.compute.internal,52170,1364333181125/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,386 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/1.compute.internal,52170,1364333181125-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,387 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/west/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,389 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/west-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,391 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/156.us/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,394 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/156.us-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,396 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/0/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,398 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location
[jira] [Commented] (HBASE-8207) Replication could have data loss when machine name contains hyphen -
[ https://issues.apache.org/jira/browse/HBASE-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13622120#comment-13622120 ] Hudson commented on HBASE-8207: --- Integrated in HBase-TRUNK-on-Hadoop-2.0.0 #476 (See [https://builds.apache.org/job/HBase-TRUNK-on-Hadoop-2.0.0/476/]) HBASE-8207 Don't use hdfs append during lease recovery (Revision 1463957) Result = FAILURE nkeywal : Files : * /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/util/FSHDFSUtils.java * /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/wal/TestHLog.java Replication could have data loss when machine name contains hyphen - -- Key: HBASE-8207 URL: https://issues.apache.org/jira/browse/HBASE-8207 Project: HBase Issue Type: Bug Components: Replication Affects Versions: 0.95.0, 0.94.6 Reporter: Jeffrey Zhong Assignee: Jeffrey Zhong Priority: Critical Fix For: 0.95.0, 0.98.0, 0.94.7 Attachments: 8207-trunk-addendum.txt, 8207_v3.patch, failed.txt, hbase-8207-0.94-v1.patch, hbase-8207.patch, hbase-8207_v1.patch, hbase-8207_v2.patch, hbase-8207_v2.patch, HBASE-8212-94.patch In the recent test case TestReplication* failures, I'm finally able to find the cause(or one of causes) for its intermittent failures. When a machine name contains -, it breaks the function ReplicationSource.checkIfQueueRecovered. It causes the following issue: deadRegionServers list is way off so that replication doesn't wait for log splitting finish for a wal file and move on to the next one(data loss) You can see that replication use those weird paths constructed from deadRegionServers to check a file existence {code} 2013-03-26 21:26:51,385 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/1.compute.internal,52170,1364333181125/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,386 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/1.compute.internal,52170,1364333181125-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,387 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/west/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,389 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/west-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,391 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/156.us/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,394 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/156.us-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,396 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/0/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,398 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/0-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 {code} This happened in the recent test failure in http://54.241.6.143/job/HBase-0.94/org.apache.hbase$hbase/21/testReport/junit/org.apache.hadoop.hbase.replication/TestReplicationQueueFailover/queueFailover/?auto_refresh=false
[jira] [Commented] (HBASE-8207) Replication could have data loss when machine name contains hyphen -
[ https://issues.apache.org/jira/browse/HBASE-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13623142#comment-13623142 ] Hudson commented on HBASE-8207: --- Integrated in HBase-0.94-security-on-Hadoop-23 #13 (See [https://builds.apache.org/job/HBase-0.94-security-on-Hadoop-23/13/]) HBASE-8207 Addendum fixes the typo in extractDeadServersFromZNodeString() (Revision 1462554) HBASE-8207 Replication could have data loss when machine name contains hyphen - (Jeffrey) (Revision 1462518) Result = FAILURE tedyu : Files : * /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java tedyu : Files : * /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/replication/ReplicationZookeeper.java * /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java * /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/replication/TestReplicationBase.java * /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/replication/regionserver/TestReplicationSourceManager.java Replication could have data loss when machine name contains hyphen - -- Key: HBASE-8207 URL: https://issues.apache.org/jira/browse/HBASE-8207 Project: HBase Issue Type: Bug Components: Replication Affects Versions: 0.95.0, 0.94.6 Reporter: Jeffrey Zhong Assignee: Jeffrey Zhong Priority: Critical Fix For: 0.95.0, 0.98.0, 0.94.7 Attachments: 8207-trunk-addendum.txt, 8207_v3.patch, failed.txt, hbase-8207-0.94-v1.patch, hbase-8207.patch, hbase-8207_v1.patch, hbase-8207_v2.patch, hbase-8207_v2.patch, HBASE-8212-94.patch In the recent test case TestReplication* failures, I'm finally able to find the cause(or one of causes) for its intermittent failures. When a machine name contains -, it breaks the function ReplicationSource.checkIfQueueRecovered. It causes the following issue: deadRegionServers list is way off so that replication doesn't wait for log splitting finish for a wal file and move on to the next one(data loss) You can see that replication use those weird paths constructed from deadRegionServers to check a file existence {code} 2013-03-26 21:26:51,385 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/1.compute.internal,52170,1364333181125/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,386 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/1.compute.internal,52170,1364333181125-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,387 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/west/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,389 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/west-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,391 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/156.us/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,394 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/156.us-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,396 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/0/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,398 INFO
[jira] [Commented] (HBASE-8207) Replication could have data loss when machine name contains hyphen -
[ https://issues.apache.org/jira/browse/HBASE-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13621489#comment-13621489 ] Hudson commented on HBASE-8207: --- Integrated in hbase-0.95 #121 (See [https://builds.apache.org/job/hbase-0.95/121/]) HBASE-8207 Don't use hdfs append during lease recovery (Revision 1463958) Result = SUCCESS nkeywal : Files : * /hbase/branches/0.95/hbase-server/src/main/java/org/apache/hadoop/hbase/util/FSHDFSUtils.java * /hbase/branches/0.95/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/wal/TestHLog.java Replication could have data loss when machine name contains hyphen - -- Key: HBASE-8207 URL: https://issues.apache.org/jira/browse/HBASE-8207 Project: HBase Issue Type: Bug Components: Replication Affects Versions: 0.95.0, 0.94.6 Reporter: Jeffrey Zhong Assignee: Jeffrey Zhong Priority: Critical Fix For: 0.95.0, 0.98.0, 0.94.7 Attachments: 8207-trunk-addendum.txt, 8207_v3.patch, failed.txt, hbase-8207-0.94-v1.patch, hbase-8207.patch, hbase-8207_v1.patch, hbase-8207_v2.patch, hbase-8207_v2.patch, HBASE-8212-94.patch In the recent test case TestReplication* failures, I'm finally able to find the cause(or one of causes) for its intermittent failures. When a machine name contains -, it breaks the function ReplicationSource.checkIfQueueRecovered. It causes the following issue: deadRegionServers list is way off so that replication doesn't wait for log splitting finish for a wal file and move on to the next one(data loss) You can see that replication use those weird paths constructed from deadRegionServers to check a file existence {code} 2013-03-26 21:26:51,385 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/1.compute.internal,52170,1364333181125/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,386 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/1.compute.internal,52170,1364333181125-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,387 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/west/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,389 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/west-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,391 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/156.us/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,394 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/156.us-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,396 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/0/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,398 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/0-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 {code} This happened in the recent test failure in http://54.241.6.143/job/HBase-0.94/org.apache.hbase$hbase/21/testReport/junit/org.apache.hadoop.hbase.replication/TestReplicationQueueFailover/queueFailover/?auto_refresh=false Search for {code}
[jira] [Commented] (HBASE-8207) Replication could have data loss when machine name contains hyphen -
[ https://issues.apache.org/jira/browse/HBASE-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13621496#comment-13621496 ] Hudson commented on HBASE-8207: --- Integrated in HBase-TRUNK #4009 (See [https://builds.apache.org/job/HBase-TRUNK/4009/]) HBASE-8207 Don't use hdfs append during lease recovery (Revision 1463957) Result = FAILURE nkeywal : Files : * /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/util/FSHDFSUtils.java * /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/wal/TestHLog.java Replication could have data loss when machine name contains hyphen - -- Key: HBASE-8207 URL: https://issues.apache.org/jira/browse/HBASE-8207 Project: HBase Issue Type: Bug Components: Replication Affects Versions: 0.95.0, 0.94.6 Reporter: Jeffrey Zhong Assignee: Jeffrey Zhong Priority: Critical Fix For: 0.95.0, 0.98.0, 0.94.7 Attachments: 8207-trunk-addendum.txt, 8207_v3.patch, failed.txt, hbase-8207-0.94-v1.patch, hbase-8207.patch, hbase-8207_v1.patch, hbase-8207_v2.patch, hbase-8207_v2.patch, HBASE-8212-94.patch In the recent test case TestReplication* failures, I'm finally able to find the cause(or one of causes) for its intermittent failures. When a machine name contains -, it breaks the function ReplicationSource.checkIfQueueRecovered. It causes the following issue: deadRegionServers list is way off so that replication doesn't wait for log splitting finish for a wal file and move on to the next one(data loss) You can see that replication use those weird paths constructed from deadRegionServers to check a file existence {code} 2013-03-26 21:26:51,385 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/1.compute.internal,52170,1364333181125/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,386 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/1.compute.internal,52170,1364333181125-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,387 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/west/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,389 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/west-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,391 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/156.us/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,394 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/156.us-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,396 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/0/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,398 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/0-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 {code} This happened in the recent test failure in http://54.241.6.143/job/HBase-0.94/org.apache.hbase$hbase/21/testReport/junit/org.apache.hadoop.hbase.replication/TestReplicationQueueFailover/queueFailover/?auto_refresh=false Search for {code} File does
[jira] [Commented] (HBASE-8207) Replication could have data loss when machine name contains hyphen -
[ https://issues.apache.org/jira/browse/HBASE-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13621772#comment-13621772 ] Hudson commented on HBASE-8207: --- Integrated in hbase-0.95-on-hadoop2 #53 (See [https://builds.apache.org/job/hbase-0.95-on-hadoop2/53/]) HBASE-8207 Don't use hdfs append during lease recovery (Revision 1463958) Result = FAILURE nkeywal : Files : * /hbase/branches/0.95/hbase-server/src/main/java/org/apache/hadoop/hbase/util/FSHDFSUtils.java * /hbase/branches/0.95/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/wal/TestHLog.java Replication could have data loss when machine name contains hyphen - -- Key: HBASE-8207 URL: https://issues.apache.org/jira/browse/HBASE-8207 Project: HBase Issue Type: Bug Components: Replication Affects Versions: 0.95.0, 0.94.6 Reporter: Jeffrey Zhong Assignee: Jeffrey Zhong Priority: Critical Fix For: 0.95.0, 0.98.0, 0.94.7 Attachments: 8207-trunk-addendum.txt, 8207_v3.patch, failed.txt, hbase-8207-0.94-v1.patch, hbase-8207.patch, hbase-8207_v1.patch, hbase-8207_v2.patch, hbase-8207_v2.patch, HBASE-8212-94.patch In the recent test case TestReplication* failures, I'm finally able to find the cause(or one of causes) for its intermittent failures. When a machine name contains -, it breaks the function ReplicationSource.checkIfQueueRecovered. It causes the following issue: deadRegionServers list is way off so that replication doesn't wait for log splitting finish for a wal file and move on to the next one(data loss) You can see that replication use those weird paths constructed from deadRegionServers to check a file existence {code} 2013-03-26 21:26:51,385 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/1.compute.internal,52170,1364333181125/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,386 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/1.compute.internal,52170,1364333181125-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,387 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/west/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,389 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/west-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,391 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/156.us/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,394 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/156.us-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,396 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/0/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,398 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/0-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 {code} This happened in the recent test failure in http://54.241.6.143/job/HBase-0.94/org.apache.hbase$hbase/21/testReport/junit/org.apache.hadoop.hbase.replication/TestReplicationQueueFailover/queueFailover/?auto_refresh=false
[jira] [Commented] (HBASE-8207) Replication could have data loss when machine name contains hyphen -
[ https://issues.apache.org/jira/browse/HBASE-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13617372#comment-13617372 ] Ted Yu commented on HBASE-8207: --- There is a long line: {code} +SortedMapString, SortedSetString testMap = rz1.copyQueuesFromRSUsingMulti(server.getServerName().getServerName()); {code} [~jdcryans]: Do you have more comments ? Thanks Replication could have data loss when machine name contains hyphen - -- Key: HBASE-8207 URL: https://issues.apache.org/jira/browse/HBASE-8207 Project: HBase Issue Type: Bug Components: Replication Affects Versions: 0.95.0, 0.94.6 Reporter: Jeffrey Zhong Assignee: Jeffrey Zhong Priority: Critical Fix For: 0.95.0, 0.98.0, 0.94.7 Attachments: failed.txt, hbase-8207-0.94-v1.patch, hbase-8207.patch, hbase-8207_v1.patch, hbase-8207_v2.patch, hbase-8207_v2.patch, HBASE-8212-94.patch In the recent test case TestReplication* failures, I'm finally able to find the cause(or one of causes) for its intermittent failures. When a machine name contains -, it breaks the function ReplicationSource.checkIfQueueRecovered. It causes the following issue: deadRegionServers list is way off so that replication doesn't wait for log splitting finish for a wal file and move on to the next one(data loss) You can see that replication use those weird paths constructed from deadRegionServers to check a file existence {code} 2013-03-26 21:26:51,385 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/1.compute.internal,52170,1364333181125/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,386 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/1.compute.internal,52170,1364333181125-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,387 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/west/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,389 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/west-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,391 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/156.us/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,394 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/156.us-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,396 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/0/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,398 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/0-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 {code} This happened in the recent test failure in http://54.241.6.143/job/HBase-0.94/org.apache.hbase$hbase/21/testReport/junit/org.apache.hadoop.hbase.replication/TestReplicationQueueFailover/queueFailover/?auto_refresh=false Search for {code} File does not exist: hdfs://localhost:52882/user/ec2-user/hbase/.oldlogs/ip-10-197-0-156.us-west-1.compute.internal%2C52170%2C1364333181125.1364333199540 {code} After 10 times retries, replication source gave up and move on to the next
[jira] [Commented] (HBASE-8207) Replication could have data loss when machine name contains hyphen -
[ https://issues.apache.org/jira/browse/HBASE-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13617410#comment-13617410 ] Jean-Daniel Cryans commented on HBASE-8207: --- +1, love the added test. Replication could have data loss when machine name contains hyphen - -- Key: HBASE-8207 URL: https://issues.apache.org/jira/browse/HBASE-8207 Project: HBase Issue Type: Bug Components: Replication Affects Versions: 0.95.0, 0.94.6 Reporter: Jeffrey Zhong Assignee: Jeffrey Zhong Priority: Critical Fix For: 0.95.0, 0.98.0, 0.94.7 Attachments: failed.txt, hbase-8207-0.94-v1.patch, hbase-8207.patch, hbase-8207_v1.patch, hbase-8207_v2.patch, hbase-8207_v2.patch, HBASE-8212-94.patch In the recent test case TestReplication* failures, I'm finally able to find the cause(or one of causes) for its intermittent failures. When a machine name contains -, it breaks the function ReplicationSource.checkIfQueueRecovered. It causes the following issue: deadRegionServers list is way off so that replication doesn't wait for log splitting finish for a wal file and move on to the next one(data loss) You can see that replication use those weird paths constructed from deadRegionServers to check a file existence {code} 2013-03-26 21:26:51,385 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/1.compute.internal,52170,1364333181125/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,386 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/1.compute.internal,52170,1364333181125-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,387 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/west/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,389 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/west-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,391 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/156.us/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,394 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/156.us-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,396 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/0/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,398 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/0-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 {code} This happened in the recent test failure in http://54.241.6.143/job/HBase-0.94/org.apache.hbase$hbase/21/testReport/junit/org.apache.hadoop.hbase.replication/TestReplicationQueueFailover/queueFailover/?auto_refresh=false Search for {code} File does not exist: hdfs://localhost:52882/user/ec2-user/hbase/.oldlogs/ip-10-197-0-156.us-west-1.compute.internal%2C52170%2C1364333181125.1364333199540 {code} After 10 times retries, replication source gave up and move on to the next file. Data loss happens. Since lots of EC2 machine names contain - including our Jenkin servers, this is a high impact issue. -- This message is
[jira] [Commented] (HBASE-8207) Replication could have data loss when machine name contains hyphen -
[ https://issues.apache.org/jira/browse/HBASE-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13617468#comment-13617468 ] Ted Yu commented on HBASE-8207: --- Integrated to 0.94, 0.95 and trunk. Thanks for the patch, Jeffrey. Thanks for the reviews, Lars, J-D and Jieshan. Replication could have data loss when machine name contains hyphen - -- Key: HBASE-8207 URL: https://issues.apache.org/jira/browse/HBASE-8207 Project: HBase Issue Type: Bug Components: Replication Affects Versions: 0.95.0, 0.94.6 Reporter: Jeffrey Zhong Assignee: Jeffrey Zhong Priority: Critical Fix For: 0.95.0, 0.98.0, 0.94.7 Attachments: 8207_v3.patch, failed.txt, hbase-8207-0.94-v1.patch, hbase-8207.patch, hbase-8207_v1.patch, hbase-8207_v2.patch, hbase-8207_v2.patch, HBASE-8212-94.patch In the recent test case TestReplication* failures, I'm finally able to find the cause(or one of causes) for its intermittent failures. When a machine name contains -, it breaks the function ReplicationSource.checkIfQueueRecovered. It causes the following issue: deadRegionServers list is way off so that replication doesn't wait for log splitting finish for a wal file and move on to the next one(data loss) You can see that replication use those weird paths constructed from deadRegionServers to check a file existence {code} 2013-03-26 21:26:51,385 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/1.compute.internal,52170,1364333181125/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,386 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/1.compute.internal,52170,1364333181125-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,387 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/west/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,389 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/west-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,391 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/156.us/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,394 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/156.us-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,396 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/0/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,398 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/0-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 {code} This happened in the recent test failure in http://54.241.6.143/job/HBase-0.94/org.apache.hbase$hbase/21/testReport/junit/org.apache.hadoop.hbase.replication/TestReplicationQueueFailover/queueFailover/?auto_refresh=false Search for {code} File does not exist: hdfs://localhost:52882/user/ec2-user/hbase/.oldlogs/ip-10-197-0-156.us-west-1.compute.internal%2C52170%2C1364333181125.1364333199540 {code} After 10 times retries, replication source gave up and move on to the next file. Data loss happens. Since lots of EC2 machine names contain -
[jira] [Commented] (HBASE-8207) Replication could have data loss when machine name contains hyphen -
[ https://issues.apache.org/jira/browse/HBASE-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13617535#comment-13617535 ] Enis Soztutar commented on HBASE-8207: -- Noticed a typo: extracDeadServersFromZNodeString - extractDeadServersFromZNodeString. Replication could have data loss when machine name contains hyphen - -- Key: HBASE-8207 URL: https://issues.apache.org/jira/browse/HBASE-8207 Project: HBase Issue Type: Bug Components: Replication Affects Versions: 0.95.0, 0.94.6 Reporter: Jeffrey Zhong Assignee: Jeffrey Zhong Priority: Critical Fix For: 0.95.0, 0.98.0, 0.94.7 Attachments: 8207_v3.patch, failed.txt, hbase-8207-0.94-v1.patch, hbase-8207.patch, hbase-8207_v1.patch, hbase-8207_v2.patch, hbase-8207_v2.patch, HBASE-8212-94.patch In the recent test case TestReplication* failures, I'm finally able to find the cause(or one of causes) for its intermittent failures. When a machine name contains -, it breaks the function ReplicationSource.checkIfQueueRecovered. It causes the following issue: deadRegionServers list is way off so that replication doesn't wait for log splitting finish for a wal file and move on to the next one(data loss) You can see that replication use those weird paths constructed from deadRegionServers to check a file existence {code} 2013-03-26 21:26:51,385 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/1.compute.internal,52170,1364333181125/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,386 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/1.compute.internal,52170,1364333181125-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,387 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/west/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,389 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/west-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,391 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/156.us/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,394 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/156.us-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,396 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/0/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,398 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/0-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 {code} This happened in the recent test failure in http://54.241.6.143/job/HBase-0.94/org.apache.hbase$hbase/21/testReport/junit/org.apache.hadoop.hbase.replication/TestReplicationQueueFailover/queueFailover/?auto_refresh=false Search for {code} File does not exist: hdfs://localhost:52882/user/ec2-user/hbase/.oldlogs/ip-10-197-0-156.us-west-1.compute.internal%2C52170%2C1364333181125.1364333199540 {code} After 10 times retries, replication source gave up and move on to the next file. Data loss happens. Since lots of EC2 machine names contain - including our Jenkin
[jira] [Commented] (HBASE-8207) Replication could have data loss when machine name contains hyphen -
[ https://issues.apache.org/jira/browse/HBASE-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13617549#comment-13617549 ] Ted Yu commented on HBASE-8207: --- Method name corrected in 0.94, 0.95 and trunk. Thanks for the finding, Enis. Replication could have data loss when machine name contains hyphen - -- Key: HBASE-8207 URL: https://issues.apache.org/jira/browse/HBASE-8207 Project: HBase Issue Type: Bug Components: Replication Affects Versions: 0.95.0, 0.94.6 Reporter: Jeffrey Zhong Assignee: Jeffrey Zhong Priority: Critical Fix For: 0.95.0, 0.98.0, 0.94.7 Attachments: 8207-trunk-addendum.txt, 8207_v3.patch, failed.txt, hbase-8207-0.94-v1.patch, hbase-8207.patch, hbase-8207_v1.patch, hbase-8207_v2.patch, hbase-8207_v2.patch, HBASE-8212-94.patch In the recent test case TestReplication* failures, I'm finally able to find the cause(or one of causes) for its intermittent failures. When a machine name contains -, it breaks the function ReplicationSource.checkIfQueueRecovered. It causes the following issue: deadRegionServers list is way off so that replication doesn't wait for log splitting finish for a wal file and move on to the next one(data loss) You can see that replication use those weird paths constructed from deadRegionServers to check a file existence {code} 2013-03-26 21:26:51,385 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/1.compute.internal,52170,1364333181125/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,386 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/1.compute.internal,52170,1364333181125-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,387 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/west/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,389 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/west-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,391 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/156.us/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,394 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/156.us-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,396 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/0/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,398 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/0-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 {code} This happened in the recent test failure in http://54.241.6.143/job/HBase-0.94/org.apache.hbase$hbase/21/testReport/junit/org.apache.hadoop.hbase.replication/TestReplicationQueueFailover/queueFailover/?auto_refresh=false Search for {code} File does not exist: hdfs://localhost:52882/user/ec2-user/hbase/.oldlogs/ip-10-197-0-156.us-west-1.compute.internal%2C52170%2C1364333181125.1364333199540 {code} After 10 times retries, replication source gave up and move on to the next file. Data loss happens. Since lots of EC2 machine names contain - including our Jenkin
[jira] [Commented] (HBASE-8207) Replication could have data loss when machine name contains hyphen -
[ https://issues.apache.org/jira/browse/HBASE-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13617768#comment-13617768 ] Hudson commented on HBASE-8207: --- Integrated in HBase-TRUNK-on-Hadoop-2.0.0 #468 (See [https://builds.apache.org/job/HBase-TRUNK-on-Hadoop-2.0.0/468/]) HBASE-8207 Addendum fixes the typo in extractDeadServersFromZNodeString() (Revision 1462552) HBASE-8207 Replication could have data loss when machine name contains hyphen - (Jeffrey) (Revision 1462515) Result = FAILURE tedyu : Files : * /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java tedyu : Files : * /hbase/trunk/hbase-client/src/main/java/org/apache/hadoop/hbase/replication/ReplicationZookeeper.java * /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java * /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/replication/regionserver/TestReplicationSourceManager.java Replication could have data loss when machine name contains hyphen - -- Key: HBASE-8207 URL: https://issues.apache.org/jira/browse/HBASE-8207 Project: HBase Issue Type: Bug Components: Replication Affects Versions: 0.95.0, 0.94.6 Reporter: Jeffrey Zhong Assignee: Jeffrey Zhong Priority: Critical Fix For: 0.95.0, 0.98.0, 0.94.7 Attachments: 8207-trunk-addendum.txt, 8207_v3.patch, failed.txt, hbase-8207-0.94-v1.patch, hbase-8207.patch, hbase-8207_v1.patch, hbase-8207_v2.patch, hbase-8207_v2.patch, HBASE-8212-94.patch In the recent test case TestReplication* failures, I'm finally able to find the cause(or one of causes) for its intermittent failures. When a machine name contains -, it breaks the function ReplicationSource.checkIfQueueRecovered. It causes the following issue: deadRegionServers list is way off so that replication doesn't wait for log splitting finish for a wal file and move on to the next one(data loss) You can see that replication use those weird paths constructed from deadRegionServers to check a file existence {code} 2013-03-26 21:26:51,385 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/1.compute.internal,52170,1364333181125/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,386 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/1.compute.internal,52170,1364333181125-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,387 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/west/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,389 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/west-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,391 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/156.us/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,394 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/156.us-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,396 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/0/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,398 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125]
[jira] [Commented] (HBASE-8207) Replication could have data loss when machine name contains hyphen -
[ https://issues.apache.org/jira/browse/HBASE-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13617785#comment-13617785 ] Hudson commented on HBASE-8207: --- Integrated in HBase-0.94-security #129 (See [https://builds.apache.org/job/HBase-0.94-security/129/]) HBASE-8207 Addendum fixes the typo in extractDeadServersFromZNodeString() (Revision 1462554) HBASE-8207 Replication could have data loss when machine name contains hyphen - (Jeffrey) (Revision 1462518) Result = FAILURE tedyu : Files : * /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java tedyu : Files : * /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/replication/ReplicationZookeeper.java * /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java * /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/replication/TestReplicationBase.java * /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/replication/regionserver/TestReplicationSourceManager.java Replication could have data loss when machine name contains hyphen - -- Key: HBASE-8207 URL: https://issues.apache.org/jira/browse/HBASE-8207 Project: HBase Issue Type: Bug Components: Replication Affects Versions: 0.95.0, 0.94.6 Reporter: Jeffrey Zhong Assignee: Jeffrey Zhong Priority: Critical Fix For: 0.95.0, 0.98.0, 0.94.7 Attachments: 8207-trunk-addendum.txt, 8207_v3.patch, failed.txt, hbase-8207-0.94-v1.patch, hbase-8207.patch, hbase-8207_v1.patch, hbase-8207_v2.patch, hbase-8207_v2.patch, HBASE-8212-94.patch In the recent test case TestReplication* failures, I'm finally able to find the cause(or one of causes) for its intermittent failures. When a machine name contains -, it breaks the function ReplicationSource.checkIfQueueRecovered. It causes the following issue: deadRegionServers list is way off so that replication doesn't wait for log splitting finish for a wal file and move on to the next one(data loss) You can see that replication use those weird paths constructed from deadRegionServers to check a file existence {code} 2013-03-26 21:26:51,385 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/1.compute.internal,52170,1364333181125/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,386 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/1.compute.internal,52170,1364333181125-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,387 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/west/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,389 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/west-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,391 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/156.us/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,394 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/156.us-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,396 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/0/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,398 INFO
[jira] [Commented] (HBASE-8207) Replication could have data loss when machine name contains hyphen -
[ https://issues.apache.org/jira/browse/HBASE-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13617807#comment-13617807 ] Hudson commented on HBASE-8207: --- Integrated in hbase-0.95-on-hadoop2 #47 (See [https://builds.apache.org/job/hbase-0.95-on-hadoop2/47/]) HBASE-8207 Addendum fixes the typo in extractDeadServersFromZNodeString() (Revision 1462553) HBASE-8207 Replication could have data loss when machine name contains hyphen - (Jeffrey) (Revision 1462516) Result = FAILURE tedyu : Files : * /hbase/branches/0.95/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java tedyu : Files : * /hbase/branches/0.95/hbase-client/src/main/java/org/apache/hadoop/hbase/replication/ReplicationZookeeper.java * /hbase/branches/0.95/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java * /hbase/branches/0.95/hbase-server/src/test/java/org/apache/hadoop/hbase/replication/regionserver/TestReplicationSourceManager.java Replication could have data loss when machine name contains hyphen - -- Key: HBASE-8207 URL: https://issues.apache.org/jira/browse/HBASE-8207 Project: HBase Issue Type: Bug Components: Replication Affects Versions: 0.95.0, 0.94.6 Reporter: Jeffrey Zhong Assignee: Jeffrey Zhong Priority: Critical Fix For: 0.95.0, 0.98.0, 0.94.7 Attachments: 8207-trunk-addendum.txt, 8207_v3.patch, failed.txt, hbase-8207-0.94-v1.patch, hbase-8207.patch, hbase-8207_v1.patch, hbase-8207_v2.patch, hbase-8207_v2.patch, HBASE-8212-94.patch In the recent test case TestReplication* failures, I'm finally able to find the cause(or one of causes) for its intermittent failures. When a machine name contains -, it breaks the function ReplicationSource.checkIfQueueRecovered. It causes the following issue: deadRegionServers list is way off so that replication doesn't wait for log splitting finish for a wal file and move on to the next one(data loss) You can see that replication use those weird paths constructed from deadRegionServers to check a file existence {code} 2013-03-26 21:26:51,385 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/1.compute.internal,52170,1364333181125/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,386 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/1.compute.internal,52170,1364333181125-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,387 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/west/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,389 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/west-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,391 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/156.us/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,394 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/156.us-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,396 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/0/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,398 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125]
[jira] [Commented] (HBASE-8207) Replication could have data loss when machine name contains hyphen -
[ https://issues.apache.org/jira/browse/HBASE-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13617848#comment-13617848 ] Hudson commented on HBASE-8207: --- Integrated in hbase-0.95 #113 (See [https://builds.apache.org/job/hbase-0.95/113/]) HBASE-8207 Addendum fixes the typo in extractDeadServersFromZNodeString() (Revision 1462553) HBASE-8207 Replication could have data loss when machine name contains hyphen - (Jeffrey) (Revision 1462516) Result = FAILURE tedyu : Files : * /hbase/branches/0.95/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java tedyu : Files : * /hbase/branches/0.95/hbase-client/src/main/java/org/apache/hadoop/hbase/replication/ReplicationZookeeper.java * /hbase/branches/0.95/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java * /hbase/branches/0.95/hbase-server/src/test/java/org/apache/hadoop/hbase/replication/regionserver/TestReplicationSourceManager.java Replication could have data loss when machine name contains hyphen - -- Key: HBASE-8207 URL: https://issues.apache.org/jira/browse/HBASE-8207 Project: HBase Issue Type: Bug Components: Replication Affects Versions: 0.95.0, 0.94.6 Reporter: Jeffrey Zhong Assignee: Jeffrey Zhong Priority: Critical Fix For: 0.95.0, 0.98.0, 0.94.7 Attachments: 8207-trunk-addendum.txt, 8207_v3.patch, failed.txt, hbase-8207-0.94-v1.patch, hbase-8207.patch, hbase-8207_v1.patch, hbase-8207_v2.patch, hbase-8207_v2.patch, HBASE-8212-94.patch In the recent test case TestReplication* failures, I'm finally able to find the cause(or one of causes) for its intermittent failures. When a machine name contains -, it breaks the function ReplicationSource.checkIfQueueRecovered. It causes the following issue: deadRegionServers list is way off so that replication doesn't wait for log splitting finish for a wal file and move on to the next one(data loss) You can see that replication use those weird paths constructed from deadRegionServers to check a file existence {code} 2013-03-26 21:26:51,385 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/1.compute.internal,52170,1364333181125/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,386 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/1.compute.internal,52170,1364333181125-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,387 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/west/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,389 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/west-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,391 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/156.us/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,394 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/156.us-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,396 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/0/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,398 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125]
[jira] [Commented] (HBASE-8207) Replication could have data loss when machine name contains hyphen -
[ https://issues.apache.org/jira/browse/HBASE-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13617861#comment-13617861 ] Hudson commented on HBASE-8207: --- Integrated in HBase-0.94 #927 (See [https://builds.apache.org/job/HBase-0.94/927/]) HBASE-8207 Addendum fixes the typo in extractDeadServersFromZNodeString() (Revision 1462554) HBASE-8207 Replication could have data loss when machine name contains hyphen - (Jeffrey) (Revision 1462518) Result = FAILURE tedyu : Files : * /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java tedyu : Files : * /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/replication/ReplicationZookeeper.java * /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java * /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/replication/TestReplicationBase.java * /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/replication/regionserver/TestReplicationSourceManager.java Replication could have data loss when machine name contains hyphen - -- Key: HBASE-8207 URL: https://issues.apache.org/jira/browse/HBASE-8207 Project: HBase Issue Type: Bug Components: Replication Affects Versions: 0.95.0, 0.94.6 Reporter: Jeffrey Zhong Assignee: Jeffrey Zhong Priority: Critical Fix For: 0.95.0, 0.98.0, 0.94.7 Attachments: 8207-trunk-addendum.txt, 8207_v3.patch, failed.txt, hbase-8207-0.94-v1.patch, hbase-8207.patch, hbase-8207_v1.patch, hbase-8207_v2.patch, hbase-8207_v2.patch, HBASE-8212-94.patch In the recent test case TestReplication* failures, I'm finally able to find the cause(or one of causes) for its intermittent failures. When a machine name contains -, it breaks the function ReplicationSource.checkIfQueueRecovered. It causes the following issue: deadRegionServers list is way off so that replication doesn't wait for log splitting finish for a wal file and move on to the next one(data loss) You can see that replication use those weird paths constructed from deadRegionServers to check a file existence {code} 2013-03-26 21:26:51,385 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/1.compute.internal,52170,1364333181125/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,386 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/1.compute.internal,52170,1364333181125-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,387 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/west/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,389 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/west-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,391 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/156.us/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,394 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/156.us-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,396 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/0/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,398 INFO
[jira] [Commented] (HBASE-8207) Replication could have data loss when machine name contains hyphen -
[ https://issues.apache.org/jira/browse/HBASE-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13617907#comment-13617907 ] Hudson commented on HBASE-8207: --- Integrated in HBase-TRUNK #4000 (See [https://builds.apache.org/job/HBase-TRUNK/4000/]) HBASE-8207 Addendum fixes the typo in extractDeadServersFromZNodeString() (Revision 1462552) HBASE-8207 Replication could have data loss when machine name contains hyphen - (Jeffrey) (Revision 1462515) Result = FAILURE tedyu : Files : * /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java tedyu : Files : * /hbase/trunk/hbase-client/src/main/java/org/apache/hadoop/hbase/replication/ReplicationZookeeper.java * /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java * /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/replication/regionserver/TestReplicationSourceManager.java Replication could have data loss when machine name contains hyphen - -- Key: HBASE-8207 URL: https://issues.apache.org/jira/browse/HBASE-8207 Project: HBase Issue Type: Bug Components: Replication Affects Versions: 0.95.0, 0.94.6 Reporter: Jeffrey Zhong Assignee: Jeffrey Zhong Priority: Critical Fix For: 0.95.0, 0.98.0, 0.94.7 Attachments: 8207-trunk-addendum.txt, 8207_v3.patch, failed.txt, hbase-8207-0.94-v1.patch, hbase-8207.patch, hbase-8207_v1.patch, hbase-8207_v2.patch, hbase-8207_v2.patch, HBASE-8212-94.patch In the recent test case TestReplication* failures, I'm finally able to find the cause(or one of causes) for its intermittent failures. When a machine name contains -, it breaks the function ReplicationSource.checkIfQueueRecovered. It causes the following issue: deadRegionServers list is way off so that replication doesn't wait for log splitting finish for a wal file and move on to the next one(data loss) You can see that replication use those weird paths constructed from deadRegionServers to check a file existence {code} 2013-03-26 21:26:51,385 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/1.compute.internal,52170,1364333181125/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,386 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/1.compute.internal,52170,1364333181125-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,387 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/west/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,389 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/west-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,391 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/156.us/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,394 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/156.us-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,396 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/0/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,398 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible
[jira] [Commented] (HBASE-8207) Replication could have data loss when machine name contains hyphen -
[ https://issues.apache.org/jira/browse/HBASE-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13616141#comment-13616141 ] Jieshan Bean commented on HBASE-8207: - We found the same problem in our test environment, attaching the logs for your reference: {noformat} 2013-03-25 04:51:20,929 INFO [ReplicationExecutor-0.replicationSource,1-160-172-0-130,60020,1364199883591] NB dead servers : 4 org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:517) 2013-03-25 04:51:20,929 INFO [ReplicationExecutor-0.replicationSource,1-160-172-0-130,60020,1364199883591] Possible location hdfs://hacluster/hbase/.logs/130,60020,1364199883591/160-172-0-130%252C60020%252C1364199883591.1364200564291 org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:528) 2013-03-25 04:51:20,930 INFO [ReplicationExecutor-0.replicationSource,1-160-172-0-130,60020,1364199883591] Possible location hdfs://hacluster/hbase/.logs/130,60020,1364199883591-splitting/160-172-0-130%252C60020%252C1364199883591.1364200564291 org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:528) 2013-03-25 04:51:20,932 INFO [ReplicationExecutor-0.replicationSource,1-160-172-0-130,60020,1364199883591] Possible location hdfs://hacluster/hbase/.logs/0/160-172-0-130%252C60020%252C1364199883591.1364200564291 org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:528) 2013-03-25 04:51:20,934 INFO [ReplicationExecutor-0.replicationSource,1-160-172-0-130,60020,1364199883591] Possible location hdfs://hacluster/hbase/.logs/0-splitting/160-172-0-130%252C60020%252C1364199883591.1364200564291 org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:528) 2013-03-25 04:51:20,935 INFO [ReplicationExecutor-0.replicationSource,1-160-172-0-130,60020,1364199883591] Possible location hdfs://hacluster/hbase/.logs/172/160-172-0-130%252C60020%252C1364199883591.1364200564291 org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:528) 2013-03-25 04:51:20,937 INFO [ReplicationExecutor-0.replicationSource,1-160-172-0-130,60020,1364199883591] Possible location hdfs://hacluster/hbase/.logs/172-splitting/160-172-0-130%252C60020%252C1364199883591.1364200564291 org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:528) 2013-03-25 04:51:20,938 INFO [ReplicationExecutor-0.replicationSource,1-160-172-0-130,60020,1364199883591] Possible location hdfs://hacluster/hbase/.logs/160/160-172-0-130%252C60020%252C1364199883591.1364200564291 org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:528) 2013-03-25 04:51:20,939 INFO [ReplicationExecutor-0.replicationSource,1-160-172-0-130,60020,1364199883591] Possible location hdfs://hacluster/hbase/.logs/160-splitting/160-172-0-130%252C60020%252C1364199883591.1364200564291 org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:528) 2013-03-25 04:51:20,941 WARN [ReplicationExecutor-0.replicationSource,1-160-172-0-130,60020,1364199883591] 1-160-172-0-130,60020,1364199883591 Got: org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:563) java.io.IOException: File from recovered queue is nowhere to be found at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:545) at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:311) Caused by: java.io.FileNotFoundException: File does not exist: hdfs://hacluster/hbase/.oldlogs/160-172-0-130%2C60020%2C1364199883591.1364200564291 at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:752) at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1692) at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1716) at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.init(SequenceFileLogReader.java:55) at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.init(SequenceFileLogReader.java:177) at org.apache.hadoop.hbase.regionserver.wal.HLog.getReader(HLog.java:728) at org.apache.hadoop.hbase.replication.regionserver.ReplicationHLogReaderManager.openReader(ReplicationHLogReaderManager.java:67) at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:511) ... 1 more {noformat} Replication could have data loss when machine name contains hyphen -
[jira] [Commented] (HBASE-8207) Replication could have data loss when machine name contains hyphen -
[ https://issues.apache.org/jira/browse/HBASE-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13616469#comment-13616469 ] Andrew Purtell commented on HBASE-8207: --- Nice find. This has been vexing. Replication could have data loss when machine name contains hyphen - -- Key: HBASE-8207 URL: https://issues.apache.org/jira/browse/HBASE-8207 Project: HBase Issue Type: Bug Components: Replication Affects Versions: 0.95.0, 0.94.6 Reporter: Jeffrey Zhong Assignee: Jeffrey Zhong Priority: Critical Fix For: 0.95.0, 0.98.0, 0.94.7 Attachments: failed.txt, HBASE-8212-94.patch In the recent test case TestReplication* failures, I'm finally able to find the cause(or one of causes) for its intermittent failures. When a machine name contains -, it breaks the function ReplicationSource.checkIfQueueRecovered. It causes the following issue: deadRegionServers list is way off so that replication doesn't wait for log splitting finish for a wal file and move on to the next one(data loss) You can see that replication use those weird paths constructed from deadRegionServers to check a file existence {code} 2013-03-26 21:26:51,385 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/1.compute.internal,52170,1364333181125/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,386 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/1.compute.internal,52170,1364333181125-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,387 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/west/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,389 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/west-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,391 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/156.us/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,394 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/156.us-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,396 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/0/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,398 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/0-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 {code} This happened in the recent test failure in http://54.241.6.143/job/HBase-0.94/org.apache.hbase$hbase/21/testReport/junit/org.apache.hadoop.hbase.replication/TestReplicationQueueFailover/queueFailover/?auto_refresh=false Search for {code} File does not exist: hdfs://localhost:52882/user/ec2-user/hbase/.oldlogs/ip-10-197-0-156.us-west-1.compute.internal%2C52170%2C1364333181125.1364333199540 {code} After 10 times retries, replication source gave up and move on to the next file. Data loss happens. Since lots of EC2 machine names contain - including our Jenkin servers, this is a high impact issue. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more
[jira] [Commented] (HBASE-8207) Replication could have data loss when machine name contains hyphen -
[ https://issues.apache.org/jira/browse/HBASE-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13616592#comment-13616592 ] Lars Hofhansl commented on HBASE-8207: -- +1 on Jeffrey's patch. Is this needed? {code} +// Sleep fixed interval to wait for log splitting work get done +this.sleepForRetries(waiting for log splitting is done, 1); {code} Replication could have data loss when machine name contains hyphen - -- Key: HBASE-8207 URL: https://issues.apache.org/jira/browse/HBASE-8207 Project: HBase Issue Type: Bug Components: Replication Affects Versions: 0.95.0, 0.94.6 Reporter: Jeffrey Zhong Assignee: Jeffrey Zhong Priority: Critical Fix For: 0.95.0, 0.98.0, 0.94.7 Attachments: failed.txt, hbase-8207.patch, HBASE-8212-94.patch In the recent test case TestReplication* failures, I'm finally able to find the cause(or one of causes) for its intermittent failures. When a machine name contains -, it breaks the function ReplicationSource.checkIfQueueRecovered. It causes the following issue: deadRegionServers list is way off so that replication doesn't wait for log splitting finish for a wal file and move on to the next one(data loss) You can see that replication use those weird paths constructed from deadRegionServers to check a file existence {code} 2013-03-26 21:26:51,385 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/1.compute.internal,52170,1364333181125/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,386 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/1.compute.internal,52170,1364333181125-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,387 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/west/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,389 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/west-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,391 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/156.us/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,394 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/156.us-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,396 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/0/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,398 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/0-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 {code} This happened in the recent test failure in http://54.241.6.143/job/HBase-0.94/org.apache.hbase$hbase/21/testReport/junit/org.apache.hadoop.hbase.replication/TestReplicationQueueFailover/queueFailover/?auto_refresh=false Search for {code} File does not exist: hdfs://localhost:52882/user/ec2-user/hbase/.oldlogs/ip-10-197-0-156.us-west-1.compute.internal%2C52170%2C1364333181125.1364333199540 {code} After 10 times retries, replication source gave up and move on to the next file. Data loss happens. Since lots of EC2 machine names contain -
[jira] [Commented] (HBASE-8207) Replication could have data loss when machine name contains hyphen -
[ https://issues.apache.org/jira/browse/HBASE-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13616599#comment-13616599 ] Ted Yu commented on HBASE-8207: --- @Jeffrey: Can you upload patch for 0.94 ? I wonder if we should integrate Jeff's patch to 0.94 first where we can verify on EC2 cluster that the problem is fixed. Thanks Replication could have data loss when machine name contains hyphen - -- Key: HBASE-8207 URL: https://issues.apache.org/jira/browse/HBASE-8207 Project: HBase Issue Type: Bug Components: Replication Affects Versions: 0.95.0, 0.94.6 Reporter: Jeffrey Zhong Assignee: Jeffrey Zhong Priority: Critical Fix For: 0.95.0, 0.98.0, 0.94.7 Attachments: failed.txt, hbase-8207.patch, HBASE-8212-94.patch In the recent test case TestReplication* failures, I'm finally able to find the cause(or one of causes) for its intermittent failures. When a machine name contains -, it breaks the function ReplicationSource.checkIfQueueRecovered. It causes the following issue: deadRegionServers list is way off so that replication doesn't wait for log splitting finish for a wal file and move on to the next one(data loss) You can see that replication use those weird paths constructed from deadRegionServers to check a file existence {code} 2013-03-26 21:26:51,385 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/1.compute.internal,52170,1364333181125/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,386 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/1.compute.internal,52170,1364333181125-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,387 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/west/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,389 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/west-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,391 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/156.us/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,394 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/156.us-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,396 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/0/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,398 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/0-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 {code} This happened in the recent test failure in http://54.241.6.143/job/HBase-0.94/org.apache.hbase$hbase/21/testReport/junit/org.apache.hadoop.hbase.replication/TestReplicationQueueFailover/queueFailover/?auto_refresh=false Search for {code} File does not exist: hdfs://localhost:52882/user/ec2-user/hbase/.oldlogs/ip-10-197-0-156.us-west-1.compute.internal%2C52170%2C1364333181125.1364333199540 {code} After 10 times retries, replication source gave up and move on to the next file. Data loss happens. Since lots of EC2 machine names contain - including our Jenkin servers, this is a high impact
[jira] [Commented] (HBASE-8207) Replication could have data loss when machine name contains hyphen -
[ https://issues.apache.org/jira/browse/HBASE-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13616617#comment-13616617 ] Ted Yu commented on HBASE-8207: --- @Jeff: Patch is great: we avoid incompatibilities for the next 0.94 release. Actually we can use your patch for 0.94 and switch to new delimiter in 0.95 / 0.98 so that code maintenance work is less. {code} + // Since servername can contains - like ip-10-46-221-101.ec2.internal, so we need skip some {code} 'can contains - like' - 'can contain -, such as' 'we need skip' - 'we need to skip' {code} + // 2-ip-10-46-221-101.ec2.internal,52170,1364333181125-ip-10-46-221-101.ec2.internal,52171, + // 1364333181127-... {code} Normally we don't want long line. But for the above sample, we'd better keep it on the same line. {code} +// extract dead servers +if (parts.length 1) { {code} Since there is no else block, you can revert the condition and bail out early. {code} + // valid server name delimiter - has to be after , {code} 'after ,' - 'after , in ServerName' {code} +LOG.error(Found invalid server name: + serverName); {code} The above log sentence is the same for first and second server name candidates. It would be nice to use slightly different terms. {code} + if(startIndex len - 1){ {code} I think we should add a log statement for the else block of the above check - basically there is no second server name in that case. One general minor comment is that you should insert space between if and the following left paren. Replication could have data loss when machine name contains hyphen - -- Key: HBASE-8207 URL: https://issues.apache.org/jira/browse/HBASE-8207 Project: HBase Issue Type: Bug Components: Replication Affects Versions: 0.95.0, 0.94.6 Reporter: Jeffrey Zhong Assignee: Jeffrey Zhong Priority: Critical Fix For: 0.95.0, 0.98.0, 0.94.7 Attachments: failed.txt, hbase-8207.patch, HBASE-8212-94.patch In the recent test case TestReplication* failures, I'm finally able to find the cause(or one of causes) for its intermittent failures. When a machine name contains -, it breaks the function ReplicationSource.checkIfQueueRecovered. It causes the following issue: deadRegionServers list is way off so that replication doesn't wait for log splitting finish for a wal file and move on to the next one(data loss) You can see that replication use those weird paths constructed from deadRegionServers to check a file existence {code} 2013-03-26 21:26:51,385 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/1.compute.internal,52170,1364333181125/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,386 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/1.compute.internal,52170,1364333181125-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,387 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/west/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,389 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/west-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,391 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/156.us/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,394 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/156.us-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,396 INFO
[jira] [Commented] (HBASE-8207) Replication could have data loss when machine name contains hyphen -
[ https://issues.apache.org/jira/browse/HBASE-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13616623#comment-13616623 ] Jean-Daniel Cryans commented on HBASE-8207: --- Mmm sorry about this one, guys. I guess that's what happens when you test only in one environment. bq. 1) When replication is waiting for log splitting complete, there is no sleep in between so we keep hitting hdfs name node Since the HLog is always somewhere, AFAIK we don't wait for log splitting to finish. About the patch, it seems that the extract dead servers logic should itself be extracted into its own method. Replication could have data loss when machine name contains hyphen - -- Key: HBASE-8207 URL: https://issues.apache.org/jira/browse/HBASE-8207 Project: HBase Issue Type: Bug Components: Replication Affects Versions: 0.95.0, 0.94.6 Reporter: Jeffrey Zhong Assignee: Jeffrey Zhong Priority: Critical Fix For: 0.95.0, 0.98.0, 0.94.7 Attachments: failed.txt, hbase-8207.patch, HBASE-8212-94.patch In the recent test case TestReplication* failures, I'm finally able to find the cause(or one of causes) for its intermittent failures. When a machine name contains -, it breaks the function ReplicationSource.checkIfQueueRecovered. It causes the following issue: deadRegionServers list is way off so that replication doesn't wait for log splitting finish for a wal file and move on to the next one(data loss) You can see that replication use those weird paths constructed from deadRegionServers to check a file existence {code} 2013-03-26 21:26:51,385 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/1.compute.internal,52170,1364333181125/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,386 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/1.compute.internal,52170,1364333181125-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,387 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/west/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,389 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/west-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,391 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/156.us/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,394 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/156.us-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,396 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/0/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,398 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/0-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 {code} This happened in the recent test failure in http://54.241.6.143/job/HBase-0.94/org.apache.hbase$hbase/21/testReport/junit/org.apache.hadoop.hbase.replication/TestReplicationQueueFailover/queueFailover/?auto_refresh=false Search for {code} File does not exist:
[jira] [Commented] (HBASE-8207) Replication could have data loss when machine name contains hyphen -
[ https://issues.apache.org/jira/browse/HBASE-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13616858#comment-13616858 ] Ted Yu commented on HBASE-8207: --- {code} +ReplicationSource.extracDeadServersFromZNodeString(parts[1], this.deadRegionServers); {code} Class name, ReplicationSource, is not needed, right ? {code} + public void testNodeFailoveDeadServerParsing() throws Exception { {code} Typo: the 'ove' in Failover Replication could have data loss when machine name contains hyphen - -- Key: HBASE-8207 URL: https://issues.apache.org/jira/browse/HBASE-8207 Project: HBase Issue Type: Bug Components: Replication Affects Versions: 0.95.0, 0.94.6 Reporter: Jeffrey Zhong Assignee: Jeffrey Zhong Priority: Critical Fix For: 0.95.0, 0.98.0, 0.94.7 Attachments: failed.txt, hbase-8207.patch, hbase-8207_v1.patch, HBASE-8212-94.patch In the recent test case TestReplication* failures, I'm finally able to find the cause(or one of causes) for its intermittent failures. When a machine name contains -, it breaks the function ReplicationSource.checkIfQueueRecovered. It causes the following issue: deadRegionServers list is way off so that replication doesn't wait for log splitting finish for a wal file and move on to the next one(data loss) You can see that replication use those weird paths constructed from deadRegionServers to check a file existence {code} 2013-03-26 21:26:51,385 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/1.compute.internal,52170,1364333181125/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,386 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/1.compute.internal,52170,1364333181125-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,387 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/west/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,389 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/west-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,391 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/156.us/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,394 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/156.us-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,396 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/0/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,398 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/0-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 {code} This happened in the recent test failure in http://54.241.6.143/job/HBase-0.94/org.apache.hbase$hbase/21/testReport/junit/org.apache.hadoop.hbase.replication/TestReplicationQueueFailover/queueFailover/?auto_refresh=false Search for {code} File does not exist: hdfs://localhost:52882/user/ec2-user/hbase/.oldlogs/ip-10-197-0-156.us-west-1.compute.internal%2C52170%2C1364333181125.1364333199540 {code} After 10 times retries, replication source gave up and move on to the next file.
[jira] [Commented] (HBASE-8207) Replication could have data loss when machine name contains hyphen -
[ https://issues.apache.org/jira/browse/HBASE-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13617039#comment-13617039 ] Lars Hofhansl commented on HBASE-8207: -- Looks good to me. replication.source.size.capacity is small to cause lot's of activity, 10k is OK. +1 on this one. Replication could have data loss when machine name contains hyphen - -- Key: HBASE-8207 URL: https://issues.apache.org/jira/browse/HBASE-8207 Project: HBase Issue Type: Bug Components: Replication Affects Versions: 0.95.0, 0.94.6 Reporter: Jeffrey Zhong Assignee: Jeffrey Zhong Priority: Critical Fix For: 0.95.0, 0.98.0, 0.94.7 Attachments: failed.txt, hbase-8207-0.94-v1.patch, hbase-8207.patch, hbase-8207_v1.patch, hbase-8207_v2.patch, hbase-8207_v2.patch, HBASE-8212-94.patch In the recent test case TestReplication* failures, I'm finally able to find the cause(or one of causes) for its intermittent failures. When a machine name contains -, it breaks the function ReplicationSource.checkIfQueueRecovered. It causes the following issue: deadRegionServers list is way off so that replication doesn't wait for log splitting finish for a wal file and move on to the next one(data loss) You can see that replication use those weird paths constructed from deadRegionServers to check a file existence {code} 2013-03-26 21:26:51,385 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/1.compute.internal,52170,1364333181125/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,386 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/1.compute.internal,52170,1364333181125-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,387 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/west/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,389 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/west-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,391 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/156.us/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,394 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/156.us-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,396 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/0/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,398 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/0-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 {code} This happened in the recent test failure in http://54.241.6.143/job/HBase-0.94/org.apache.hbase$hbase/21/testReport/junit/org.apache.hadoop.hbase.replication/TestReplicationQueueFailover/queueFailover/?auto_refresh=false Search for {code} File does not exist: hdfs://localhost:52882/user/ec2-user/hbase/.oldlogs/ip-10-197-0-156.us-west-1.compute.internal%2C52170%2C1364333181125.1364333199540 {code} After 10 times retries, replication source gave up and move on to the next file. Data loss happens. Since lots of EC2 machine names contain -
[jira] [Commented] (HBASE-8207) Replication could have data loss when machine name contains hyphen -
[ https://issues.apache.org/jira/browse/HBASE-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13617056#comment-13617056 ] Jieshan Bean commented on HBASE-8207: - New patch also looks good to me. Is it neccessary to add restrictions on peer-id when calling add_peer? Replication could have data loss when machine name contains hyphen - -- Key: HBASE-8207 URL: https://issues.apache.org/jira/browse/HBASE-8207 Project: HBase Issue Type: Bug Components: Replication Affects Versions: 0.95.0, 0.94.6 Reporter: Jeffrey Zhong Assignee: Jeffrey Zhong Priority: Critical Fix For: 0.95.0, 0.98.0, 0.94.7 Attachments: failed.txt, hbase-8207-0.94-v1.patch, hbase-8207.patch, hbase-8207_v1.patch, hbase-8207_v2.patch, hbase-8207_v2.patch, HBASE-8212-94.patch In the recent test case TestReplication* failures, I'm finally able to find the cause(or one of causes) for its intermittent failures. When a machine name contains -, it breaks the function ReplicationSource.checkIfQueueRecovered. It causes the following issue: deadRegionServers list is way off so that replication doesn't wait for log splitting finish for a wal file and move on to the next one(data loss) You can see that replication use those weird paths constructed from deadRegionServers to check a file existence {code} 2013-03-26 21:26:51,385 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/1.compute.internal,52170,1364333181125/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,386 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/1.compute.internal,52170,1364333181125-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,387 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/west/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,389 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/west-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,391 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/156.us/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,394 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/156.us-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,396 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/0/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,398 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/0-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 {code} This happened in the recent test failure in http://54.241.6.143/job/HBase-0.94/org.apache.hbase$hbase/21/testReport/junit/org.apache.hadoop.hbase.replication/TestReplicationQueueFailover/queueFailover/?auto_refresh=false Search for {code} File does not exist: hdfs://localhost:52882/user/ec2-user/hbase/.oldlogs/ip-10-197-0-156.us-west-1.compute.internal%2C52170%2C1364333181125.1364333199540 {code} After 10 times retries, replication source gave up and move on to the next file. Data loss happens. Since lots of EC2 machine names contain - including our Jenkin
[jira] [Commented] (HBASE-8207) Replication could have data loss when machine name contains hyphen -
[ https://issues.apache.org/jira/browse/HBASE-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13617071#comment-13617071 ] Lars Hofhansl commented on HBASE-8207: -- I'd rather not do that (at least in 0.94) lest we get that wrong. Replication could have data loss when machine name contains hyphen - -- Key: HBASE-8207 URL: https://issues.apache.org/jira/browse/HBASE-8207 Project: HBase Issue Type: Bug Components: Replication Affects Versions: 0.95.0, 0.94.6 Reporter: Jeffrey Zhong Assignee: Jeffrey Zhong Priority: Critical Fix For: 0.95.0, 0.98.0, 0.94.7 Attachments: failed.txt, hbase-8207-0.94-v1.patch, hbase-8207.patch, hbase-8207_v1.patch, hbase-8207_v2.patch, hbase-8207_v2.patch, HBASE-8212-94.patch In the recent test case TestReplication* failures, I'm finally able to find the cause(or one of causes) for its intermittent failures. When a machine name contains -, it breaks the function ReplicationSource.checkIfQueueRecovered. It causes the following issue: deadRegionServers list is way off so that replication doesn't wait for log splitting finish for a wal file and move on to the next one(data loss) You can see that replication use those weird paths constructed from deadRegionServers to check a file existence {code} 2013-03-26 21:26:51,385 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/1.compute.internal,52170,1364333181125/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,386 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/1.compute.internal,52170,1364333181125-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,387 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/west/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,389 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/west-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,391 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/156.us/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,394 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/156.us-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,396 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/0/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,398 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/0-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 {code} This happened in the recent test failure in http://54.241.6.143/job/HBase-0.94/org.apache.hbase$hbase/21/testReport/junit/org.apache.hadoop.hbase.replication/TestReplicationQueueFailover/queueFailover/?auto_refresh=false Search for {code} File does not exist: hdfs://localhost:52882/user/ec2-user/hbase/.oldlogs/ip-10-197-0-156.us-west-1.compute.internal%2C52170%2C1364333181125.1364333199540 {code} After 10 times retries, replication source gave up and move on to the next file. Data loss happens. Since lots of EC2 machine names contain - including our Jenkin servers, this is a high impact
[jira] [Commented] (HBASE-8207) Replication could have data loss when machine name contains hyphen -
[ https://issues.apache.org/jira/browse/HBASE-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13617090#comment-13617090 ] Hadoop QA commented on HBASE-8207: -- {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12576001/hbase-8207_v2.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified tests. {color:green}+1 hadoop2.0{color}. The patch compiles against the hadoop 2.0 profile. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 lineLengths{color}. The patch introduces lines longer than 100 {color:red}-1 site{color}. The patch appears to cause mvn site goal to fail. {color:green}+1 core tests{color}. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/5044//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5044//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5044//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-client.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5044//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5044//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5044//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5044//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5044//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5044//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/5044//console This message is automatically generated. Replication could have data loss when machine name contains hyphen - -- Key: HBASE-8207 URL: https://issues.apache.org/jira/browse/HBASE-8207 Project: HBase Issue Type: Bug Components: Replication Affects Versions: 0.95.0, 0.94.6 Reporter: Jeffrey Zhong Assignee: Jeffrey Zhong Priority: Critical Fix For: 0.95.0, 0.98.0, 0.94.7 Attachments: failed.txt, hbase-8207-0.94-v1.patch, hbase-8207.patch, hbase-8207_v1.patch, hbase-8207_v2.patch, hbase-8207_v2.patch, HBASE-8212-94.patch In the recent test case TestReplication* failures, I'm finally able to find the cause(or one of causes) for its intermittent failures. When a machine name contains -, it breaks the function ReplicationSource.checkIfQueueRecovered. It causes the following issue: deadRegionServers list is way off so that replication doesn't wait for log splitting finish for a wal file and move on to the next one(data loss) You can see that replication use those weird paths constructed from deadRegionServers to check a file existence {code} 2013-03-26 21:26:51,385 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/1.compute.internal,52170,1364333181125/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,386 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/1.compute.internal,52170,1364333181125-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,387 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125]
[jira] [Commented] (HBASE-8207) Replication could have data loss when machine name contains hyphen -
[ https://issues.apache.org/jira/browse/HBASE-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13615941#comment-13615941 ] Lars Hofhansl commented on HBASE-8207: -- [~jdcryans] FYI. Replication could have data loss when machine name contains hyphen - -- Key: HBASE-8207 URL: https://issues.apache.org/jira/browse/HBASE-8207 Project: HBase Issue Type: Bug Components: Replication Affects Versions: 0.95.0, 0.94.6 Reporter: Jeffrey Zhong Assignee: Jeffrey Zhong Priority: Critical Fix For: 0.95.0, 0.98.0, 0.94.7 Attachments: failed.txt In the recent test case TestReplication* failures, I'm finally able to find the cause(or one of causes) for its intermittent failures. When a machine name contains -, it breaks the function ReplicationSource.checkIfQueueRecovered. It causes the following issue: deadRegionServers list is way off so that replication doesn't wait for log splitting finish for a wal file and move on to the next one(data loss) You can see that replication use those weird paths constructed from deadRegionServers to check a file existence {code} 2013-03-26 21:26:51,385 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/1.compute.internal,52170,1364333181125/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,386 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/1.compute.internal,52170,1364333181125-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,387 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/west/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,389 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/west-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,391 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/156.us/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,394 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/156.us-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,396 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/0/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,398 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/0-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 {code} This happened in the recent test failure in http://54.241.6.143/job/HBase-0.94/org.apache.hbase$hbase/21/testReport/junit/org.apache.hadoop.hbase.replication/TestReplicationQueueFailover/queueFailover/?auto_refresh=false Search for {code} File does not exist: hdfs://localhost:52882/user/ec2-user/hbase/.oldlogs/ip-10-197-0-156.us-west-1.compute.internal%2C52170%2C1364333181125.1364333199540 {code} After 10 times retries, replication source gave up and move on to the next file. Data loss happens. Since lots of EC2 machine names contain - including our Jenkin servers, this is a high impact issue. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see:
[jira] [Commented] (HBASE-8207) Replication could have data loss when machine name contains hyphen -
[ https://issues.apache.org/jira/browse/HBASE-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13615950#comment-13615950 ] Enis Soztutar commented on HBASE-8207: -- I think the problem comes from: ReplicationZookeeper.copyQueuesFromRSUsingMulti {code} src/main/java/org/apache/hadoop/hbase/replication//ReplicationZookeeper.java: String newPeerId = peerId + - + znode; src/main/java/org/apache/hadoop/hbase/replication//regionserver/ReplicationSource.java: String[] parts = peerClusterZnode.split(-); {code} Fixing it should not be that hard. Jeffrey, this is a production bug as well as test, right? Replication could have data loss when machine name contains hyphen - -- Key: HBASE-8207 URL: https://issues.apache.org/jira/browse/HBASE-8207 Project: HBase Issue Type: Bug Components: Replication Affects Versions: 0.95.0, 0.94.6 Reporter: Jeffrey Zhong Assignee: Jeffrey Zhong Priority: Critical Fix For: 0.95.0, 0.98.0, 0.94.7 Attachments: failed.txt In the recent test case TestReplication* failures, I'm finally able to find the cause(or one of causes) for its intermittent failures. When a machine name contains -, it breaks the function ReplicationSource.checkIfQueueRecovered. It causes the following issue: deadRegionServers list is way off so that replication doesn't wait for log splitting finish for a wal file and move on to the next one(data loss) You can see that replication use those weird paths constructed from deadRegionServers to check a file existence {code} 2013-03-26 21:26:51,385 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/1.compute.internal,52170,1364333181125/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,386 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/1.compute.internal,52170,1364333181125-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,387 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/west/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,389 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/west-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,391 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/156.us/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,394 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/156.us-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,396 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/0/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,398 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/0-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 {code} This happened in the recent test failure in http://54.241.6.143/job/HBase-0.94/org.apache.hbase$hbase/21/testReport/junit/org.apache.hadoop.hbase.replication/TestReplicationQueueFailover/queueFailover/?auto_refresh=false Search for {code} File does not exist:
[jira] [Commented] (HBASE-8207) Replication could have data loss when machine name contains hyphen -
[ https://issues.apache.org/jira/browse/HBASE-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13615952#comment-13615952 ] Ted Yu commented on HBASE-8207: --- How about using '=' as the separator ? Replication could have data loss when machine name contains hyphen - -- Key: HBASE-8207 URL: https://issues.apache.org/jira/browse/HBASE-8207 Project: HBase Issue Type: Bug Components: Replication Affects Versions: 0.95.0, 0.94.6 Reporter: Jeffrey Zhong Assignee: Jeffrey Zhong Priority: Critical Fix For: 0.95.0, 0.98.0, 0.94.7 Attachments: failed.txt In the recent test case TestReplication* failures, I'm finally able to find the cause(or one of causes) for its intermittent failures. When a machine name contains -, it breaks the function ReplicationSource.checkIfQueueRecovered. It causes the following issue: deadRegionServers list is way off so that replication doesn't wait for log splitting finish for a wal file and move on to the next one(data loss) You can see that replication use those weird paths constructed from deadRegionServers to check a file existence {code} 2013-03-26 21:26:51,385 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/1.compute.internal,52170,1364333181125/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,386 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/1.compute.internal,52170,1364333181125-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,387 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/west/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,389 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/west-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,391 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/156.us/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,394 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/156.us-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,396 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/0/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,398 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/0-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 {code} This happened in the recent test failure in http://54.241.6.143/job/HBase-0.94/org.apache.hbase$hbase/21/testReport/junit/org.apache.hadoop.hbase.replication/TestReplicationQueueFailover/queueFailover/?auto_refresh=false Search for {code} File does not exist: hdfs://localhost:52882/user/ec2-user/hbase/.oldlogs/ip-10-197-0-156.us-west-1.compute.internal%2C52170%2C1364333181125.1364333199540 {code} After 10 times retries, replication source gave up and move on to the next file. Data loss happens. Since lots of EC2 machine names contain - including our Jenkin servers, this is a high impact issue. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see:
[jira] [Commented] (HBASE-8207) Replication could have data loss when machine name contains hyphen -
[ https://issues.apache.org/jira/browse/HBASE-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13615957#comment-13615957 ] Jimmy Xiang commented on HBASE-8207: We need to make sure there is a proper migration plan, at least mentioning the fix in release notes. Replication could have data loss when machine name contains hyphen - -- Key: HBASE-8207 URL: https://issues.apache.org/jira/browse/HBASE-8207 Project: HBase Issue Type: Bug Components: Replication Affects Versions: 0.95.0, 0.94.6 Reporter: Jeffrey Zhong Assignee: Jeffrey Zhong Priority: Critical Fix For: 0.95.0, 0.98.0, 0.94.7 Attachments: failed.txt In the recent test case TestReplication* failures, I'm finally able to find the cause(or one of causes) for its intermittent failures. When a machine name contains -, it breaks the function ReplicationSource.checkIfQueueRecovered. It causes the following issue: deadRegionServers list is way off so that replication doesn't wait for log splitting finish for a wal file and move on to the next one(data loss) You can see that replication use those weird paths constructed from deadRegionServers to check a file existence {code} 2013-03-26 21:26:51,385 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/1.compute.internal,52170,1364333181125/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,386 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/1.compute.internal,52170,1364333181125-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,387 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/west/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,389 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/west-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,391 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/156.us/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,394 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/156.us-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,396 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/0/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,398 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/0-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 {code} This happened in the recent test failure in http://54.241.6.143/job/HBase-0.94/org.apache.hbase$hbase/21/testReport/junit/org.apache.hadoop.hbase.replication/TestReplicationQueueFailover/queueFailover/?auto_refresh=false Search for {code} File does not exist: hdfs://localhost:52882/user/ec2-user/hbase/.oldlogs/ip-10-197-0-156.us-west-1.compute.internal%2C52170%2C1364333181125.1364333199540 {code} After 10 times retries, replication source gave up and move on to the next file. Data loss happens. Since lots of EC2 machine names contain - including our Jenkin servers, this is a high impact issue. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please
[jira] [Commented] (HBASE-8207) Replication could have data loss when machine name contains hyphen -
[ https://issues.apache.org/jira/browse/HBASE-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13615960#comment-13615960 ] Jimmy Xiang commented on HBASE-8207: A non-printable/configurable delimiter? Replication could have data loss when machine name contains hyphen - -- Key: HBASE-8207 URL: https://issues.apache.org/jira/browse/HBASE-8207 Project: HBase Issue Type: Bug Components: Replication Affects Versions: 0.95.0, 0.94.6 Reporter: Jeffrey Zhong Assignee: Jeffrey Zhong Priority: Critical Fix For: 0.95.0, 0.98.0, 0.94.7 Attachments: failed.txt In the recent test case TestReplication* failures, I'm finally able to find the cause(or one of causes) for its intermittent failures. When a machine name contains -, it breaks the function ReplicationSource.checkIfQueueRecovered. It causes the following issue: deadRegionServers list is way off so that replication doesn't wait for log splitting finish for a wal file and move on to the next one(data loss) You can see that replication use those weird paths constructed from deadRegionServers to check a file existence {code} 2013-03-26 21:26:51,385 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/1.compute.internal,52170,1364333181125/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,386 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/1.compute.internal,52170,1364333181125-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,387 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/west/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,389 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/west-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,391 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/156.us/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,394 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/156.us-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,396 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/0/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,398 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/0-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 {code} This happened in the recent test failure in http://54.241.6.143/job/HBase-0.94/org.apache.hbase$hbase/21/testReport/junit/org.apache.hadoop.hbase.replication/TestReplicationQueueFailover/queueFailover/?auto_refresh=false Search for {code} File does not exist: hdfs://localhost:52882/user/ec2-user/hbase/.oldlogs/ip-10-197-0-156.us-west-1.compute.internal%2C52170%2C1364333181125.1364333199540 {code} After 10 times retries, replication source gave up and move on to the next file. Data loss happens. Since lots of EC2 machine names contain - including our Jenkin servers, this is a high impact issue. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA,
[jira] [Commented] (HBASE-8207) Replication could have data loss when machine name contains hyphen -
[ https://issues.apache.org/jira/browse/HBASE-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13615963#comment-13615963 ] Jean-Marc Spaggiari commented on HBASE-8207: Any thing which is not allowed as an host name might be acceptable, like pipe, % or like [~jxiang] proposed, a non printable delimiter. But if we need to print that on the logs, then the pipe option might be better? also, if this is stored in ZK, then indeed, we might need to keep the existing - + new delimiter for the reads, and the new delimiter for the writes... Replication could have data loss when machine name contains hyphen - -- Key: HBASE-8207 URL: https://issues.apache.org/jira/browse/HBASE-8207 Project: HBase Issue Type: Bug Components: Replication Affects Versions: 0.95.0, 0.94.6 Reporter: Jeffrey Zhong Assignee: Jeffrey Zhong Priority: Critical Fix For: 0.95.0, 0.98.0, 0.94.7 Attachments: failed.txt In the recent test case TestReplication* failures, I'm finally able to find the cause(or one of causes) for its intermittent failures. When a machine name contains -, it breaks the function ReplicationSource.checkIfQueueRecovered. It causes the following issue: deadRegionServers list is way off so that replication doesn't wait for log splitting finish for a wal file and move on to the next one(data loss) You can see that replication use those weird paths constructed from deadRegionServers to check a file existence {code} 2013-03-26 21:26:51,385 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/1.compute.internal,52170,1364333181125/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,386 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/1.compute.internal,52170,1364333181125-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,387 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/west/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,389 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/west-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,391 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/156.us/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,394 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/156.us-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,396 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/0/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,398 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/0-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 {code} This happened in the recent test failure in http://54.241.6.143/job/HBase-0.94/org.apache.hbase$hbase/21/testReport/junit/org.apache.hadoop.hbase.replication/TestReplicationQueueFailover/queueFailover/?auto_refresh=false Search for {code} File does not exist: hdfs://localhost:52882/user/ec2-user/hbase/.oldlogs/ip-10-197-0-156.us-west-1.compute.internal%2C52170%2C1364333181125.1364333199540 {code} After 10 times
[jira] [Commented] (HBASE-8207) Replication could have data loss when machine name contains hyphen -
[ https://issues.apache.org/jira/browse/HBASE-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13615964#comment-13615964 ] Jeffrey Zhong commented on HBASE-8207: -- [~enis] This could be a production issue if a user setups HBase with replication enabled in EC2 like env. [~jxiang] I thought about to change to another delimiter : which is NOT allowed by machine name nor hdfs file name. While it will cause a little bit migration issue. Since the znode value could be either a peerId or peerId-servername-servername... and server name has to contain , Therefore, I'm planning to keep the current way just patch the function checkIfQueueRecovered so before treating - as valid server name delimiter we have to see a , before. In 0.96 or onwords we could use a different char like : which isn't allowed in either machine name or hdfs. Thanks, -Jeffrey Replication could have data loss when machine name contains hyphen - -- Key: HBASE-8207 URL: https://issues.apache.org/jira/browse/HBASE-8207 Project: HBase Issue Type: Bug Components: Replication Affects Versions: 0.95.0, 0.94.6 Reporter: Jeffrey Zhong Assignee: Jeffrey Zhong Priority: Critical Fix For: 0.95.0, 0.98.0, 0.94.7 Attachments: failed.txt In the recent test case TestReplication* failures, I'm finally able to find the cause(or one of causes) for its intermittent failures. When a machine name contains -, it breaks the function ReplicationSource.checkIfQueueRecovered. It causes the following issue: deadRegionServers list is way off so that replication doesn't wait for log splitting finish for a wal file and move on to the next one(data loss) You can see that replication use those weird paths constructed from deadRegionServers to check a file existence {code} 2013-03-26 21:26:51,385 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/1.compute.internal,52170,1364333181125/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,386 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/1.compute.internal,52170,1364333181125-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,387 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/west/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,389 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/west-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,391 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/156.us/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,394 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/156.us-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,396 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/0/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,398 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/0-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 {code} This happened in the recent test failure in
[jira] [Commented] (HBASE-8207) Replication could have data loss when machine name contains hyphen -
[ https://issues.apache.org/jira/browse/HBASE-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13615966#comment-13615966 ] Lars Hofhansl commented on HBASE-8207: -- The same is in ReplicationZookeeper.copyQueuesFromRS, so at least it is nothing new. Replication could have data loss when machine name contains hyphen - -- Key: HBASE-8207 URL: https://issues.apache.org/jira/browse/HBASE-8207 Project: HBase Issue Type: Bug Components: Replication Affects Versions: 0.95.0, 0.94.6 Reporter: Jeffrey Zhong Assignee: Jeffrey Zhong Priority: Critical Fix For: 0.95.0, 0.98.0, 0.94.7 Attachments: failed.txt In the recent test case TestReplication* failures, I'm finally able to find the cause(or one of causes) for its intermittent failures. When a machine name contains -, it breaks the function ReplicationSource.checkIfQueueRecovered. It causes the following issue: deadRegionServers list is way off so that replication doesn't wait for log splitting finish for a wal file and move on to the next one(data loss) You can see that replication use those weird paths constructed from deadRegionServers to check a file existence {code} 2013-03-26 21:26:51,385 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/1.compute.internal,52170,1364333181125/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,386 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/1.compute.internal,52170,1364333181125-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,387 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/west/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,389 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/west-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,391 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/156.us/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,394 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/156.us-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,396 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/0/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,398 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/0-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 {code} This happened in the recent test failure in http://54.241.6.143/job/HBase-0.94/org.apache.hbase$hbase/21/testReport/junit/org.apache.hadoop.hbase.replication/TestReplicationQueueFailover/queueFailover/?auto_refresh=false Search for {code} File does not exist: hdfs://localhost:52882/user/ec2-user/hbase/.oldlogs/ip-10-197-0-156.us-west-1.compute.internal%2C52170%2C1364333181125.1364333199540 {code} After 10 times retries, replication source gave up and move on to the next file. Data loss happens. Since lots of EC2 machine names contain - including our Jenkin servers, this is a high impact issue. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your
[jira] [Commented] (HBASE-8207) Replication could have data loss when machine name contains hyphen -
[ https://issues.apache.org/jira/browse/HBASE-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13615969#comment-13615969 ] Lars Hofhansl commented on HBASE-8207: -- I like ':' as separator. We can do this in the backwards compatible way: Since we know that ':' will not be in a valid hostname it won't be in the concatenated strings. So when read the data from ZK in ReplicationSource, we check whether we find a ':', if we do we split on that, otherwise we split on '-' as we do now. We will always write in the new format. Rolling restarts will still be a problem, though. Replication could have data loss when machine name contains hyphen - -- Key: HBASE-8207 URL: https://issues.apache.org/jira/browse/HBASE-8207 Project: HBase Issue Type: Bug Components: Replication Affects Versions: 0.95.0, 0.94.6 Reporter: Jeffrey Zhong Assignee: Jeffrey Zhong Priority: Critical Fix For: 0.95.0, 0.98.0, 0.94.7 Attachments: failed.txt In the recent test case TestReplication* failures, I'm finally able to find the cause(or one of causes) for its intermittent failures. When a machine name contains -, it breaks the function ReplicationSource.checkIfQueueRecovered. It causes the following issue: deadRegionServers list is way off so that replication doesn't wait for log splitting finish for a wal file and move on to the next one(data loss) You can see that replication use those weird paths constructed from deadRegionServers to check a file existence {code} 2013-03-26 21:26:51,385 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/1.compute.internal,52170,1364333181125/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,386 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/1.compute.internal,52170,1364333181125-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,387 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/west/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,389 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/west-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,391 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/156.us/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,394 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/156.us-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,396 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/0/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,398 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/0-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 {code} This happened in the recent test failure in http://54.241.6.143/job/HBase-0.94/org.apache.hbase$hbase/21/testReport/junit/org.apache.hadoop.hbase.replication/TestReplicationQueueFailover/queueFailover/?auto_refresh=false Search for {code} File does not exist:
[jira] [Commented] (HBASE-8207) Replication could have data loss when machine name contains hyphen -
[ https://issues.apache.org/jira/browse/HBASE-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13615971#comment-13615971 ] Enis Soztutar commented on HBASE-8207: -- BTW, for future reviews, we should ensure that carrying data around znodes names, and file names, and string parsing etc should be avoided, unless there is a valid reason. Replication could have data loss when machine name contains hyphen - -- Key: HBASE-8207 URL: https://issues.apache.org/jira/browse/HBASE-8207 Project: HBase Issue Type: Bug Components: Replication Affects Versions: 0.95.0, 0.94.6 Reporter: Jeffrey Zhong Assignee: Jeffrey Zhong Priority: Critical Fix For: 0.95.0, 0.98.0, 0.94.7 Attachments: failed.txt In the recent test case TestReplication* failures, I'm finally able to find the cause(or one of causes) for its intermittent failures. When a machine name contains -, it breaks the function ReplicationSource.checkIfQueueRecovered. It causes the following issue: deadRegionServers list is way off so that replication doesn't wait for log splitting finish for a wal file and move on to the next one(data loss) You can see that replication use those weird paths constructed from deadRegionServers to check a file existence {code} 2013-03-26 21:26:51,385 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/1.compute.internal,52170,1364333181125/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,386 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/1.compute.internal,52170,1364333181125-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,387 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/west/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,389 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/west-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,391 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/156.us/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,394 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/156.us-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,396 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/0/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,398 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/0-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 {code} This happened in the recent test failure in http://54.241.6.143/job/HBase-0.94/org.apache.hbase$hbase/21/testReport/junit/org.apache.hadoop.hbase.replication/TestReplicationQueueFailover/queueFailover/?auto_refresh=false Search for {code} File does not exist: hdfs://localhost:52882/user/ec2-user/hbase/.oldlogs/ip-10-197-0-156.us-west-1.compute.internal%2C52170%2C1364333181125.1364333199540 {code} After 10 times retries, replication source gave up and move on to the next file. Data loss happens. Since lots of EC2 machine names contain - including our Jenkin servers, this is a high impact issue. -- This message is
[jira] [Commented] (HBASE-8207) Replication could have data loss when machine name contains hyphen -
[ https://issues.apache.org/jira/browse/HBASE-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13615977#comment-13615977 ] Ted Yu commented on HBASE-8207: --- I wonder why trunk builds didn't encounter such problem. From http://54.241.6.143/job/HBase-TRUNK/45/console: {code} Running org.apache.hadoop.hbase.replication.TestReplicationQueueFailoverCompressed Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 66.645 sec ... Running org.apache.hadoop.hbase.replication.TestReplicationQueueFailover Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 77.707 sec {code} Replication could have data loss when machine name contains hyphen - -- Key: HBASE-8207 URL: https://issues.apache.org/jira/browse/HBASE-8207 Project: HBase Issue Type: Bug Components: Replication Affects Versions: 0.95.0, 0.94.6 Reporter: Jeffrey Zhong Assignee: Jeffrey Zhong Priority: Critical Fix For: 0.95.0, 0.98.0, 0.94.7 Attachments: failed.txt In the recent test case TestReplication* failures, I'm finally able to find the cause(or one of causes) for its intermittent failures. When a machine name contains -, it breaks the function ReplicationSource.checkIfQueueRecovered. It causes the following issue: deadRegionServers list is way off so that replication doesn't wait for log splitting finish for a wal file and move on to the next one(data loss) You can see that replication use those weird paths constructed from deadRegionServers to check a file existence {code} 2013-03-26 21:26:51,385 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/1.compute.internal,52170,1364333181125/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,386 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/1.compute.internal,52170,1364333181125-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,387 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/west/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,389 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/west-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,391 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/156.us/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,394 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/156.us-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,396 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/0/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,398 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/0-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 {code} This happened in the recent test failure in http://54.241.6.143/job/HBase-0.94/org.apache.hbase$hbase/21/testReport/junit/org.apache.hadoop.hbase.replication/TestReplicationQueueFailover/queueFailover/?auto_refresh=false Search for {code} File does not exist:
[jira] [Commented] (HBASE-8207) Replication could have data loss when machine name contains hyphen -
[ https://issues.apache.org/jira/browse/HBASE-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13616060#comment-13616060 ] Jeffrey Zhong commented on HBASE-8207: -- [~te...@apache.org] It depends if log splitting can finish in time before replication gives up. In our test case setup, log splitting normally completes within 1-2 secs while replication takes about 5 secs to give up. So in most cases, the test runs fine. In local dev env, we don't have machine name containing - so we can't even reproduce it. Since we just changed Jenkins, I can't find more build history. From my flaky test detector tool(HBASE-8018) result in trunk which I ran on March 5. We can see replication in trunk is flaky at that time: _HBase-TRUNK (from last 10 runs)_ Failed Test Cases 3908 3909 3910 3912 3913 3914 3915 3916 org.apache.hadoop.hbase.replication.testreplicationqueuefailover.queuefailover 1 1 1 -1 0 -1 0 1 org.apache.hadoop.hbase.replication.testreplicationqueuefailovercompressed.queuefailover 1 1 1 -1 0 -1 0 1 _HBase-0.95 (from last 10 runs configurable)_ Failed Test Cases 21 22 23 24 25 27 org.apache.hadoop.hbase.replication.testreplicationqueuefailover.queuefailover 1 -1 0 1 -1 0 org.apache.hadoop.hbase.replication.testreplicationqueuefailovercompressed.queuefailover 0 1 -1 0 -1 0 Replication could have data loss when machine name contains hyphen - -- Key: HBASE-8207 URL: https://issues.apache.org/jira/browse/HBASE-8207 Project: HBase Issue Type: Bug Components: Replication Affects Versions: 0.95.0, 0.94.6 Reporter: Jeffrey Zhong Assignee: Jeffrey Zhong Priority: Critical Fix For: 0.95.0, 0.98.0, 0.94.7 Attachments: failed.txt In the recent test case TestReplication* failures, I'm finally able to find the cause(or one of causes) for its intermittent failures. When a machine name contains -, it breaks the function ReplicationSource.checkIfQueueRecovered. It causes the following issue: deadRegionServers list is way off so that replication doesn't wait for log splitting finish for a wal file and move on to the next one(data loss) You can see that replication use those weird paths constructed from deadRegionServers to check a file existence {code} 2013-03-26 21:26:51,385 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/1.compute.internal,52170,1364333181125/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,386 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/1.compute.internal,52170,1364333181125-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,387 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/west/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,389 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/west-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,391 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/156.us/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,394 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/156.us-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540 2013-03-26 21:26:51,396 INFO