[jira] [Commented] (HBASE-8207) Replication could have data loss when machine name contains hyphen -

2013-09-04 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13758704#comment-13758704
 ] 

Hudson commented on HBASE-8207:
---

FAILURE: Integrated in HBase-0.92-security #150 (See 
[https://builds.apache.org/job/HBase-0.92-security/150/])
HBASE-9154. [0.92] Backport HBASE-8207 Replication could have data loss when 
machine name contains hyphen - (Jeffrey) (apurtell: rev 1511424)
* 
/hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/replication/ReplicationZookeeper.java
* 
/hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java
* 
/hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/replication/TestMasterReplication.java
* 
/hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/replication/TestMultiSlaveReplication.java
* 
/hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/replication/TestReplication.java


 Replication could have data loss when machine name contains hyphen -
 --

 Key: HBASE-8207
 URL: https://issues.apache.org/jira/browse/HBASE-8207
 Project: HBase
  Issue Type: Bug
  Components: Replication
Affects Versions: 0.94.6, 0.95.0
Reporter: Jeffrey Zhong
Assignee: Jeffrey Zhong
Priority: Critical
 Fix For: 0.98.0, 0.94.7, 0.95.0

 Attachments: 8207-trunk-addendum.txt, 8207_v3.patch, failed.txt, 
 hbase-8207-0.94-v1.patch, hbase-8207.patch, hbase-8207_v1.patch, 
 hbase-8207_v2.patch, hbase-8207_v2.patch, HBASE-8212-94.patch


 In the recent TestReplication* test case failures, I was finally able to find 
 the cause (or one of the causes) of the intermittent failures.
 When a machine name contains a hyphen, it breaks 
 ReplicationSource.checkIfQueueRecovered and causes the following issue: the 
 deadRegionServers list is way off, so replication doesn't wait for log 
 splitting to finish for a WAL file and moves on to the next one (data loss).
 You can see that replication uses these weird paths, constructed from 
 deadRegionServers, to check for a file's existence:
 {code}
 2013-03-26 21:26:51,385 INFO  
 [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125]
  regionserver.ReplicationSource(524): Possible location 
 hdfs://localhost:52882/user/ec2-user/hbase/.logs/1.compute.internal,52170,1364333181125/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540
 2013-03-26 21:26:51,386 INFO  
 [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125]
  regionserver.ReplicationSource(524): Possible location 
 hdfs://localhost:52882/user/ec2-user/hbase/.logs/1.compute.internal,52170,1364333181125-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540
 2013-03-26 21:26:51,387 INFO  
 [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125]
  regionserver.ReplicationSource(524): Possible location 
 hdfs://localhost:52882/user/ec2-user/hbase/.logs/west/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540
 2013-03-26 21:26:51,389 INFO  
 [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125]
  regionserver.ReplicationSource(524): Possible location 
 hdfs://localhost:52882/user/ec2-user/hbase/.logs/west-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540
 2013-03-26 21:26:51,391 INFO  
 [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125]
  regionserver.ReplicationSource(524): Possible location 
 hdfs://localhost:52882/user/ec2-user/hbase/.logs/156.us/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540
 2013-03-26 21:26:51,394 INFO  
 [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125]
  regionserver.ReplicationSource(524): Possible location 
 hdfs://localhost:52882/user/ec2-user/hbase/.logs/156.us-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540
 2013-03-26 21:26:51,396 INFO  
 [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125]
  regionserver.ReplicationSource(524): Possible location 
 hdfs://localhost:52882/user/ec2-user/hbase/.logs/0/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540
 2013-03-26 21:26:51,398 INFO  
 [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125]
  regionserver.ReplicationSource(524): Possible location 
 hdfs://localhost:52882/user/ec2-user/hbase/.logs/0-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540
 {code}
 This happened in the recent test failure in 
 http://54.241.6.143/job/HBase-0.94/org.apache.hbase$hbase/21/testReport/junit/org.apache.hadoop.hbase.replication/TestReplicationQueueFailover/queueFailover/?auto_refresh=false
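The weird candidate paths in the log come from splitting the recovered-queue ZooKeeper node name on "-": the name embeds the dead server as "hostname,port,timestamp", so a hyphenated hostname like ip-10-197-0-156.us-west-1.compute.internal gets shredded into fragments such as "156.us", "west", and "0". The following is a minimal sketch of the mis-parse and a shape-based alternative, assuming a simplified single-server queue-name format; QueueNameParsing, naiveParse, and patternParse are illustrative names, not the actual HBase code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QueueNameParsing {

    // Naive parse, roughly the buggy behavior: drop the leading peer id and
    // treat every "-"-separated token as a dead region server name.
    static List<String> naiveParse(String znode) {
        String[] parts = znode.split("-");
        // Fragments of a hyphenated hostname come out as separate "servers".
        return new ArrayList<>(Arrays.asList(parts).subList(1, parts.length));
    }

    // Shape-based parse: a dead server entry always looks like
    // "hostname,port,timestamp", so match that shape instead of splitting on "-".
    static List<String> patternParse(String znode) {
        // hostname (which may itself contain "-" and "."), then ",port,timestamp"
        Pattern serverName = Pattern.compile("[^,]+,\\d+,\\d+");
        Matcher m = serverName.matcher(znode.substring(znode.indexOf('-') + 1));
        List<String> servers = new ArrayList<>();
        while (m.find()) {
            servers.add(m.group());
        }
        return servers;
    }

    public static void main(String[] args) {
        // Recovered-queue node name as in the log: peer id "2" + dead server name
        String znode =
            "2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125";
        System.out.println(naiveParse(znode));   // hostname fragments
        System.out.println(patternParse(znode)); // one full server name
    }
}
```

Running main on the queue name from the log produces the same fragments ("1.compute.internal,52170,1364333181125", "west", "156.us", "0", ...) that show up in the "Possible location" paths above. As an aside, the %252C in the WAL file names is a comma percent-encoded twice: "," becomes %2C, whose "%" is then re-encoded as %25.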

[jira] [Commented] (HBASE-8207) Replication could have data loss when machine name contains hyphen -

2013-08-07 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13732747#comment-13732747
 ] 

Hudson commented on HBASE-8207:
---

FAILURE: Integrated in HBase-0.92 #614 (See 
[https://builds.apache.org/job/HBase-0.92/614/])
HBASE-9154. [0.92] Backport HBASE-8207 Replication could have data loss when 
machine name contains hyphen - (Jeffrey) (apurtell: rev 1511424)
* 
/hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/replication/ReplicationZookeeper.java
* 
/hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java
* 
/hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/replication/TestMasterReplication.java
* 
/hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/replication/TestMultiSlaveReplication.java
* 
/hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/replication/TestReplication.java


[jira] [Commented] (HBASE-8207) Replication could have data loss when machine name contains hyphen -

2013-04-04 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13622120#comment-13622120
 ] 

Hudson commented on HBASE-8207:
---

Integrated in HBase-TRUNK-on-Hadoop-2.0.0 #476 (See 
[https://builds.apache.org/job/HBase-TRUNK-on-Hadoop-2.0.0/476/])
HBASE-8207 Don't use hdfs append during lease recovery (Revision 1463957)

 Result = FAILURE
nkeywal : 
Files : 
* 
/hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/util/FSHDFSUtils.java
* 
/hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/wal/TestHLog.java


[jira] [Commented] (HBASE-8207) Replication could have data loss when machine name contains hyphen -

2013-04-04 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13623142#comment-13623142
 ] 

Hudson commented on HBASE-8207:
---

Integrated in HBase-0.94-security-on-Hadoop-23 #13 (See 
[https://builds.apache.org/job/HBase-0.94-security-on-Hadoop-23/13/])
HBASE-8207 Addendum fixes the typo in extractDeadServersFromZNodeString() 
(Revision 1462554)
HBASE-8207 Replication could have data loss when machine name contains hyphen 
- (Jeffrey) (Revision 1462518)

 Result = FAILURE
tedyu : 
Files : 
* 
/hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java

tedyu : 
Files : 
* 
/hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/replication/ReplicationZookeeper.java
* 
/hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java
* 
/hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/replication/TestReplicationBase.java
* 
/hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/replication/regionserver/TestReplicationSourceManager.java


[jira] [Commented] (HBASE-8207) Replication could have data loss when machine name contains hyphen -

2013-04-03 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13621489#comment-13621489
 ] 

Hudson commented on HBASE-8207:
---

Integrated in hbase-0.95 #121 (See 
[https://builds.apache.org/job/hbase-0.95/121/])
HBASE-8207 Don't use hdfs append during lease recovery (Revision 1463958)

 Result = SUCCESS
nkeywal : 
Files : 
* 
/hbase/branches/0.95/hbase-server/src/main/java/org/apache/hadoop/hbase/util/FSHDFSUtils.java
* 
/hbase/branches/0.95/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/wal/TestHLog.java


[jira] [Commented] (HBASE-8207) Replication could have data loss when machine name contains hyphen -

2013-04-03 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13621496#comment-13621496
 ] 

Hudson commented on HBASE-8207:
---

Integrated in HBase-TRUNK #4009 (See 
[https://builds.apache.org/job/HBase-TRUNK/4009/])
HBASE-8207 Don't use hdfs append during lease recovery (Revision 1463957)

 Result = FAILURE
nkeywal : 
Files : 
* 
/hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/util/FSHDFSUtils.java
* 
/hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/wal/TestHLog.java


[jira] [Commented] (HBASE-8207) Replication could have data loss when machine name contains hyphen -

2013-04-03 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13621772#comment-13621772
 ] 

Hudson commented on HBASE-8207:
---

Integrated in hbase-0.95-on-hadoop2 #53 (See 
[https://builds.apache.org/job/hbase-0.95-on-hadoop2/53/])
HBASE-8207 Don't use hdfs append during lease recovery (Revision 1463958)

 Result = FAILURE
nkeywal : 
Files : 
* 
/hbase/branches/0.95/hbase-server/src/main/java/org/apache/hadoop/hbase/util/FSHDFSUtils.java
* 
/hbase/branches/0.95/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/wal/TestHLog.java


 Replication could have data loss when machine name contains hyphen -
 --

 Key: HBASE-8207
 URL: https://issues.apache.org/jira/browse/HBASE-8207
 Project: HBase
  Issue Type: Bug
  Components: Replication
Affects Versions: 0.95.0, 0.94.6
Reporter: Jeffrey Zhong
Assignee: Jeffrey Zhong
Priority: Critical
 Fix For: 0.95.0, 0.98.0, 0.94.7

 Attachments: 8207-trunk-addendum.txt, 8207_v3.patch, failed.txt, 
 hbase-8207-0.94-v1.patch, hbase-8207.patch, hbase-8207_v1.patch, 
 hbase-8207_v2.patch, hbase-8207_v2.patch, HBASE-8212-94.patch


 In the recent test case TestReplication* failures, I'm finally able to find 
 the cause(or one of causes) for its intermittent failures.
 When a machine name contains -, it breaks the function 
 ReplicationSource.checkIfQueueRecovered. It causes the following issue:
 deadRegionServers list is way off so that replication doesn't wait for log 
 splitting finish for a wal file and move on to the next one(data loss)
 You can see that replication use those weird paths constructed from 
 deadRegionServers to check a file existence
 {code}
 2013-03-26 21:26:51,385 INFO  
 [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125]
  regionserver.ReplicationSource(524): Possible location 
 hdfs://localhost:52882/user/ec2-user/hbase/.logs/1.compute.internal,52170,1364333181125/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540
 2013-03-26 21:26:51,386 INFO  
 [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125]
  regionserver.ReplicationSource(524): Possible location 
 hdfs://localhost:52882/user/ec2-user/hbase/.logs/1.compute.internal,52170,1364333181125-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540
 2013-03-26 21:26:51,387 INFO  
 [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125]
  regionserver.ReplicationSource(524): Possible location 
 hdfs://localhost:52882/user/ec2-user/hbase/.logs/west/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540
 2013-03-26 21:26:51,389 INFO  
 [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125]
  regionserver.ReplicationSource(524): Possible location 
 hdfs://localhost:52882/user/ec2-user/hbase/.logs/west-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540
 2013-03-26 21:26:51,391 INFO  
 [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125]
  regionserver.ReplicationSource(524): Possible location 
 hdfs://localhost:52882/user/ec2-user/hbase/.logs/156.us/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540
 2013-03-26 21:26:51,394 INFO  
 [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125]
  regionserver.ReplicationSource(524): Possible location 
 hdfs://localhost:52882/user/ec2-user/hbase/.logs/156.us-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540
 2013-03-26 21:26:51,396 INFO  
 [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125]
  regionserver.ReplicationSource(524): Possible location 
 hdfs://localhost:52882/user/ec2-user/hbase/.logs/0/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540
 2013-03-26 21:26:51,398 INFO  
 [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125]
  regionserver.ReplicationSource(524): Possible location 
 hdfs://localhost:52882/user/ec2-user/hbase/.logs/0-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540
 {code}
 This happened in the recent test failure at 
 http://54.241.6.143/job/HBase-0.94/org.apache.hbase$hbase/21/testReport/junit/org.apache.hadoop.hbase.replication/TestReplicationQueueFailover/queueFailover/?auto_refresh=false
 Search for 
 {code}
 File does not exist: 
 hdfs://localhost:52882/user/ec2-user/hbase/.oldlogs/ip-10-197-0-156.us-west-1.compute.internal%2C52170%2C1364333181125.1364333199540
 {code}
 After 10 retries, the replication source gives up and moves on to the next 
 file. Data loss happens. 
 Since many EC2 machine names contain a hyphen, including our Jenkins servers, 
 this is a high-impact issue.

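The root cause is that checkIfQueueRecovered split the recovered-queue znode name on every "-", even though hyphens are legal inside hostnames. For illustration only (the class and method names below are hypothetical, not the actual patch), a hyphen-safe parse can match each "hostname,port,startcode" server name as a whole unit instead of splitting on hyphens:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DeadServerParser {

    // A region server name has the form "hostname,port,startcode". Hostnames
    // may themselves contain hyphens, so splitting the whole znode name on
    // "-" shatters names like "ip-10-197-0-156.us-west-1.compute.internal".
    private static final Pattern SERVER_NAME = Pattern.compile("[^,]+,\\d+,\\d+");

    // Parses a recovered-queue znode name such as
    // "2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125"
    // (peer id, then the dead server names, all joined by "-"). Assumes the
    // peer id itself contains no hyphen.
    public static List<String> extractDeadServers(String znodeName) {
        List<String> servers = new ArrayList<>();
        int firstHyphen = znodeName.indexOf('-');
        if (firstHyphen < 0) {
            return servers; // not a recovered queue, no dead servers encoded
        }
        Matcher m = SERVER_NAME.matcher(znodeName.substring(firstHyphen + 1));
        while (m.find()) {
            String server = m.group();
            // The joining "-" between two server names gets glued onto the
            // front of every match after the first; strip it off.
            if (server.startsWith("-")) {
                server = server.substring(1);
            }
            servers.add(server);
        }
        return servers;
    }
}
```

For the znode above this yields the single server name ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125, rather than the fragments (1.compute.internal, west, 156.us, 0, ...) visible in the "Possible location" log lines.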
[jira] [Commented] (HBASE-8207) Replication could have data loss when machine name contains hyphen -

2013-03-29 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13617372#comment-13617372
 ] 

Ted Yu commented on HBASE-8207:
---

There is a long line:
{code}
+SortedMap<String, SortedSet<String>> testMap = 
rz1.copyQueuesFromRSUsingMulti(server.getServerName().getServerName());
{code}

[~jdcryans]:
Do you have more comments?

Thanks


[jira] [Commented] (HBASE-8207) Replication could have data loss when machine name contains hyphen -

2013-03-29 Thread Jean-Daniel Cryans (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13617410#comment-13617410
 ] 

Jean-Daniel Cryans commented on HBASE-8207:
---

+1, love the added test.


[jira] [Commented] (HBASE-8207) Replication could have data loss when machine name contains hyphen -

2013-03-29 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13617468#comment-13617468
 ] 

Ted Yu commented on HBASE-8207:
---

Integrated to 0.94, 0.95 and trunk.

Thanks for the patch, Jeffrey.

Thanks for the reviews, Lars, J-D and Jieshan.


[jira] [Commented] (HBASE-8207) Replication could have data loss when machine name contains hyphen -

2013-03-29 Thread Enis Soztutar (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13617535#comment-13617535
 ] 

Enis Soztutar commented on HBASE-8207:
--

Noticed a typo: extracDeadServersFromZNodeString -> 
extractDeadServersFromZNodeString.


[jira] [Commented] (HBASE-8207) Replication could have data loss when machine name contains hyphen -

2013-03-29 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13617549#comment-13617549
 ] 

Ted Yu commented on HBASE-8207:
---

Method name corrected in 0.94, 0.95 and trunk.

Thanks for the finding, Enis.


[jira] [Commented] (HBASE-8207) Replication could have data loss when machine name contains hyphen -

2013-03-29 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13617768#comment-13617768
 ] 

Hudson commented on HBASE-8207:
---

Integrated in HBase-TRUNK-on-Hadoop-2.0.0 #468 (See 
[https://builds.apache.org/job/HBase-TRUNK-on-Hadoop-2.0.0/468/])
HBASE-8207 Addendum fixes the typo in extractDeadServersFromZNodeString() 
(Revision 1462552)
HBASE-8207 Replication could have data loss when machine name contains hyphen 
- (Jeffrey) (Revision 1462515)

 Result = FAILURE
tedyu : 
Files : 
* 
/hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java

tedyu : 
Files : 
* 
/hbase/trunk/hbase-client/src/main/java/org/apache/hadoop/hbase/replication/ReplicationZookeeper.java
* 
/hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java
* 
/hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/replication/regionserver/TestReplicationSourceManager.java



[jira] [Commented] (HBASE-8207) Replication could have data loss when machine name contains hyphen -

2013-03-29 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13617785#comment-13617785
 ] 

Hudson commented on HBASE-8207:
---

Integrated in HBase-0.94-security #129 (See 
[https://builds.apache.org/job/HBase-0.94-security/129/])
HBASE-8207 Addendum fixes the typo in extractDeadServersFromZNodeString() 
(Revision 1462554)
HBASE-8207 Replication could have data loss when machine name contains hyphen 
- (Jeffrey) (Revision 1462518)

 Result = FAILURE
tedyu : 
Files : 
* 
/hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java

tedyu : 
Files : 
* 
/hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/replication/ReplicationZookeeper.java
* 
/hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java
* 
/hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/replication/TestReplicationBase.java
* 
/hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/replication/regionserver/TestReplicationSourceManager.java



[jira] [Commented] (HBASE-8207) Replication could have data loss when machine name contains hyphen -

2013-03-29 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13617807#comment-13617807
 ] 

Hudson commented on HBASE-8207:
---

Integrated in hbase-0.95-on-hadoop2 #47 (See 
[https://builds.apache.org/job/hbase-0.95-on-hadoop2/47/])
HBASE-8207 Addendum fixes the typo in extractDeadServersFromZNodeString() 
(Revision 1462553)
HBASE-8207 Replication could have data loss when machine name contains hyphen 
- (Jeffrey) (Revision 1462516)

 Result = FAILURE
tedyu : 
Files : 
* 
/hbase/branches/0.95/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java

tedyu : 
Files : 
* 
/hbase/branches/0.95/hbase-client/src/main/java/org/apache/hadoop/hbase/replication/ReplicationZookeeper.java
* 
/hbase/branches/0.95/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java
* 
/hbase/branches/0.95/hbase-server/src/test/java/org/apache/hadoop/hbase/replication/regionserver/TestReplicationSourceManager.java



[jira] [Commented] (HBASE-8207) Replication could have data loss when machine name contains hyphen -

2013-03-29 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13617848#comment-13617848
 ] 

Hudson commented on HBASE-8207:
---

Integrated in hbase-0.95 #113 (See 
[https://builds.apache.org/job/hbase-0.95/113/])
HBASE-8207 Addendum fixes the typo in extractDeadServersFromZNodeString() 
(Revision 1462553)
HBASE-8207 Replication could have data loss when machine name contains hyphen 
- (Jeffrey) (Revision 1462516)

 Result = FAILURE
tedyu : 
Files : 
* 
/hbase/branches/0.95/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java

tedyu : 
Files : 
* 
/hbase/branches/0.95/hbase-client/src/main/java/org/apache/hadoop/hbase/replication/ReplicationZookeeper.java
* 
/hbase/branches/0.95/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java
* 
/hbase/branches/0.95/hbase-server/src/test/java/org/apache/hadoop/hbase/replication/regionserver/TestReplicationSourceManager.java



[jira] [Commented] (HBASE-8207) Replication could have data loss when machine name contains hyphen -

2013-03-29 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13617861#comment-13617861
 ] 

Hudson commented on HBASE-8207:
---

Integrated in HBase-0.94 #927 (See 
[https://builds.apache.org/job/HBase-0.94/927/])
HBASE-8207 Addendum fixes the typo in extractDeadServersFromZNodeString() 
(Revision 1462554)
HBASE-8207 Replication could have data loss when machine name contains hyphen 
- (Jeffrey) (Revision 1462518)

 Result = FAILURE
tedyu : 
Files : 
* 
/hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java

tedyu : 
Files : 
* 
/hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/replication/ReplicationZookeeper.java
* 
/hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java
* 
/hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/replication/TestReplicationBase.java
* 
/hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/replication/regionserver/TestReplicationSourceManager.java



[jira] [Commented] (HBASE-8207) Replication could have data loss when machine name contains hyphen -

2013-03-29 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13617907#comment-13617907
 ] 

Hudson commented on HBASE-8207:
---

Integrated in HBase-TRUNK #4000 (See 
[https://builds.apache.org/job/HBase-TRUNK/4000/])
HBASE-8207 Addendum fixes the typo in extractDeadServersFromZNodeString() 
(Revision 1462552)
HBASE-8207 Replication could have data loss when machine name contains hyphen 
- (Jeffrey) (Revision 1462515)

 Result = FAILURE
tedyu : 
Files : 
* 
/hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java

tedyu : 
Files : 
* 
/hbase/trunk/hbase-client/src/main/java/org/apache/hadoop/hbase/replication/ReplicationZookeeper.java
* 
/hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java
* 
/hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/replication/regionserver/TestReplicationSourceManager.java



[jira] [Commented] (HBASE-8207) Replication could have data loss when machine name contains hyphen -

2013-03-28 Thread Jieshan Bean (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13616141#comment-13616141
 ] 

Jieshan Bean commented on HBASE-8207:
-

We found the same problem in our test environment, attaching the logs for your 
reference:
{noformat}
2013-03-25 04:51:20,929 INFO  
[ReplicationExecutor-0.replicationSource,1-160-172-0-130,60020,1364199883591] 
NB dead servers : 4 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:517)
2013-03-25 04:51:20,929 INFO  
[ReplicationExecutor-0.replicationSource,1-160-172-0-130,60020,1364199883591] 
Possible location 
hdfs://hacluster/hbase/.logs/130,60020,1364199883591/160-172-0-130%252C60020%252C1364199883591.1364200564291
 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:528)
2013-03-25 04:51:20,930 INFO  
[ReplicationExecutor-0.replicationSource,1-160-172-0-130,60020,1364199883591] 
Possible location 
hdfs://hacluster/hbase/.logs/130,60020,1364199883591-splitting/160-172-0-130%252C60020%252C1364199883591.1364200564291
 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:528)
2013-03-25 04:51:20,932 INFO  
[ReplicationExecutor-0.replicationSource,1-160-172-0-130,60020,1364199883591] 
Possible location 
hdfs://hacluster/hbase/.logs/0/160-172-0-130%252C60020%252C1364199883591.1364200564291
 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:528)
2013-03-25 04:51:20,934 INFO  
[ReplicationExecutor-0.replicationSource,1-160-172-0-130,60020,1364199883591] 
Possible location 
hdfs://hacluster/hbase/.logs/0-splitting/160-172-0-130%252C60020%252C1364199883591.1364200564291
 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:528)
2013-03-25 04:51:20,935 INFO  
[ReplicationExecutor-0.replicationSource,1-160-172-0-130,60020,1364199883591] 
Possible location 
hdfs://hacluster/hbase/.logs/172/160-172-0-130%252C60020%252C1364199883591.1364200564291
 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:528)
2013-03-25 04:51:20,937 INFO  
[ReplicationExecutor-0.replicationSource,1-160-172-0-130,60020,1364199883591] 
Possible location 
hdfs://hacluster/hbase/.logs/172-splitting/160-172-0-130%252C60020%252C1364199883591.1364200564291
 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:528)
2013-03-25 04:51:20,938 INFO  
[ReplicationExecutor-0.replicationSource,1-160-172-0-130,60020,1364199883591] 
Possible location 
hdfs://hacluster/hbase/.logs/160/160-172-0-130%252C60020%252C1364199883591.1364200564291
 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:528)
2013-03-25 04:51:20,939 INFO  
[ReplicationExecutor-0.replicationSource,1-160-172-0-130,60020,1364199883591] 
Possible location 
hdfs://hacluster/hbase/.logs/160-splitting/160-172-0-130%252C60020%252C1364199883591.1364200564291
 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:528)
2013-03-25 04:51:20,941 WARN  
[ReplicationExecutor-0.replicationSource,1-160-172-0-130,60020,1364199883591] 
1-160-172-0-130,60020,1364199883591 Got:  
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:563)
java.io.IOException: File from recovered queue is nowhere to be found
at 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:545)
at 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:311)
Caused by: java.io.FileNotFoundException: File does not exist: 
hdfs://hacluster/hbase/.oldlogs/160-172-0-130%2C60020%2C1364199883591.1364200564291
at 
org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:752)
at 
org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1692)
at 
org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1716)
at 
org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.init(SequenceFileLogReader.java:55)
at 
org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.init(SequenceFileLogReader.java:177)
at 
org.apache.hadoop.hbase.regionserver.wal.HLog.getReader(HLog.java:728)
at 
org.apache.hadoop.hbase.replication.regionserver.ReplicationHLogReaderManager.openReader(ReplicationHLogReaderManager.java:67)
at 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:511)
... 1 more
{noformat}
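The probing pattern visible in the log above, where each candidate dead server is checked first in its `.logs` directory and then in the matching `-splitting` directory, can be sketched as follows. This is an illustrative Python sketch, not the actual Java logic in ReplicationSource.openReader, and the function name is hypothetical.

```python
import posixpath

def candidate_wal_locations(hbase_root, dead_servers, wal_name):
    # For each (possibly mis-parsed) dead server name, build the two
    # HDFS locations replication probes for the recovered WAL:
    # ".logs/<server>/<wal>" and ".logs/<server>-splitting/<wal>".
    paths = []
    for server in dead_servers:
        paths.append(posixpath.join(hbase_root, ".logs", server, wal_name))
        paths.append(posixpath.join(hbase_root, ".logs",
                                    server + "-splitting", wal_name))
    return paths
```

With a shredded server list like the one in the log (e.g. "130,60020,1364199883591", "0", "172", "160"), every probed location is bogus, which is why the reader eventually fails with "File from recovered queue is nowhere to be found".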


[jira] [Commented] (HBASE-8207) Replication could have data loss when machine name contains hyphen -

2013-03-28 Thread Andrew Purtell (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13616469#comment-13616469
 ] 

Andrew Purtell commented on HBASE-8207:
---

Nice find. This has been vexing.

 Replication could have data loss when machine name contains hyphen -
 --

 Key: HBASE-8207
 URL: https://issues.apache.org/jira/browse/HBASE-8207
 Project: HBase
  Issue Type: Bug
  Components: Replication
Affects Versions: 0.95.0, 0.94.6
Reporter: Jeffrey Zhong
Assignee: Jeffrey Zhong
Priority: Critical
 Fix For: 0.95.0, 0.98.0, 0.94.7

 Attachments: failed.txt, HBASE-8212-94.patch


 In the recent TestReplication* test failures, I was finally able to find 
 the cause (or one of the causes) of their intermittent failures.
 When a machine name contains a hyphen (-), it breaks 
 ReplicationSource.checkIfQueueRecovered, with the following consequence: the 
 deadRegionServers list is way off, so replication doesn't wait for log 
 splitting to finish for a WAL file and moves on to the next one (data loss).
 You can see that replication uses weird paths, constructed from 
 deadRegionServers, to check for a file's existence:
 {code}
 2013-03-26 21:26:51,385 INFO  
 [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125]
  regionserver.ReplicationSource(524): Possible location 
 hdfs://localhost:52882/user/ec2-user/hbase/.logs/1.compute.internal,52170,1364333181125/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540
 2013-03-26 21:26:51,386 INFO  
 [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125]
  regionserver.ReplicationSource(524): Possible location 
 hdfs://localhost:52882/user/ec2-user/hbase/.logs/1.compute.internal,52170,1364333181125-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540
 2013-03-26 21:26:51,387 INFO  
 [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125]
  regionserver.ReplicationSource(524): Possible location 
 hdfs://localhost:52882/user/ec2-user/hbase/.logs/west/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540
 2013-03-26 21:26:51,389 INFO  
 [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125]
  regionserver.ReplicationSource(524): Possible location 
 hdfs://localhost:52882/user/ec2-user/hbase/.logs/west-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540
 2013-03-26 21:26:51,391 INFO  
 [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125]
  regionserver.ReplicationSource(524): Possible location 
 hdfs://localhost:52882/user/ec2-user/hbase/.logs/156.us/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540
 2013-03-26 21:26:51,394 INFO  
 [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125]
  regionserver.ReplicationSource(524): Possible location 
 hdfs://localhost:52882/user/ec2-user/hbase/.logs/156.us-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540
 2013-03-26 21:26:51,396 INFO  
 [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125]
  regionserver.ReplicationSource(524): Possible location 
 hdfs://localhost:52882/user/ec2-user/hbase/.logs/0/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540
 2013-03-26 21:26:51,398 INFO  
 [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125]
  regionserver.ReplicationSource(524): Possible location 
 hdfs://localhost:52882/user/ec2-user/hbase/.logs/0-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540
 {code}
 This happened in the recent test failure in 
 http://54.241.6.143/job/HBase-0.94/org.apache.hbase$hbase/21/testReport/junit/org.apache.hadoop.hbase.replication/TestReplicationQueueFailover/queueFailover/?auto_refresh=false
 Search for 
 {code}
 File does not exist: 
 hdfs://localhost:52882/user/ec2-user/hbase/.oldlogs/ip-10-197-0-156.us-west-1.compute.internal%2C52170%2C1364333181125.1364333199540
 {code}
 After 10 retries, the replication source gives up and moves on to the next 
 file, so data loss happens. 
 Since many EC2 machine names contain "-", including our Jenkins servers, 
 this is a high-impact issue.
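
The bogus "Possible location" paths above fall out of splitting the recovered-queue znode name on every "-". A minimal standalone sketch (hypothetical class name, not the actual HBase code) that reproduces the fragments seen in the log:

```java
// Demo of the root cause: a naive split("-") shreds a hyphenated hostname.
public class HyphenSplitDemo {
  public static void main(String[] args) {
    // recovered-queue znode name: <peerId>-<host,port,startcode>
    String znode = "2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125";
    String[] parts = znode.split("-");
    // yields "2", "ip", "10", "197", "0", "156.us", "west",
    // "1.compute.internal,52170,1364333181125" -- matching the bogus
    // "dead server" fragments probed in the log above
    for (String p : parts) {
      System.out.println(p);
    }
  }
}
```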

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators

[jira] [Commented] (HBASE-8207) Replication could have data loss when machine name contains hyphen -

2013-03-28 Thread Lars Hofhansl (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13616592#comment-13616592
 ] 

Lars Hofhansl commented on HBASE-8207:
--

+1 on Jeffrey's patch.

Is this needed?
{code}
+// Sleep fixed interval to wait for log splitting work get done
+this.sleepForRetries("waiting for log splitting is done", 1);
{code}



[jira] [Commented] (HBASE-8207) Replication could have data loss when machine name contains hyphen -

2013-03-28 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13616599#comment-13616599
 ] 

Ted Yu commented on HBASE-8207:
---

@Jeffrey:
Can you upload patch for 0.94 ?

I wonder if we should integrate Jeff's patch into 0.94 first, where we can 
verify on an EC2 cluster that the problem is fixed.

Thanks


[jira] [Commented] (HBASE-8207) Replication could have data loss when machine name contains hyphen -

2013-03-28 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13616617#comment-13616617
 ] 

Ted Yu commented on HBASE-8207:
---

@Jeff:
The patch is great: we avoid incompatibilities for the next 0.94 release.
Actually, we can use your patch for 0.94 and switch to a new delimiter in 0.95 / 
0.98 so that there is less code maintenance work.
{code}
+  // Since servername can contains - like ip-10-46-221-101.ec2.internal, 
so we need skip some
{code}
'can contains - like' -> 'can contain -, such as'
'we need skip' -> 'we need to skip'
{code}
+  // 
2-ip-10-46-221-101.ec2.internal,52170,1364333181125-ip-10-46-221-101.ec2.internal,52171,
+  // 1364333181127-...
{code}
Normally we don't want long line. But for the above sample, we'd better keep it 
on the same line.
{code}
+// extract dead servers
+if (parts.length > 1) {
{code}
Since there is no else block, you can invert the condition and bail out early.
{code}
+  // valid server name delimiter - has to be after ,
{code}
'after ,' -> 'after , in ServerName'
{code}
+LOG.error("Found invalid server name: " + serverName);
{code}
The above log message is the same for the first and second server name 
candidates. It would be nice to use slightly different wording for each.
{code}
+  if(startIndex < len - 1){
{code}
I think we should add a log statement for the else block of the above check - 
basically there is no second server name in that case.

One general minor comment: you should insert a space between 'if' and the 
following left paren.
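
The "invert the condition and bail out early" suggestion could look like this (a sketch with hypothetical class and method names, not the actual patch):

```java
// Sketch of the early-return shape suggested in the review; names are
// illustrative, not the actual ReplicationSource code.
public class RecoveredQueueParser {
  // znode name format: <peerId>[-<deadServer>-<deadServer>...]
  public static String deadServerListPart(String znodeName) {
    String[] parts = znodeName.split("-", 2);
    if (parts.length <= 1) {
      return null; // bail out early: nothing after the peer id
    }
    return parts[1];
  }
}
```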


[jira] [Commented] (HBASE-8207) Replication could have data loss when machine name contains hyphen -

2013-03-28 Thread Jean-Daniel Cryans (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13616623#comment-13616623
 ] 

Jean-Daniel Cryans commented on HBASE-8207:
---

Mmm sorry about this one, guys. I guess that's what happens when you test only 
in one environment.

bq. 1) When replication is waiting for log splitting complete, there is no 
sleep in between so we keep hitting hdfs name node 

Since the HLog is always somewhere, AFAIK we don't wait for log splitting to 
finish.

About the patch, it seems that the "extract dead servers" logic should itself 
be extracted into its own method.
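
Such an extracted helper might look like the following, using the rule that "-" only delimits servers after a full ServerName (host,port,startcode) has been seen. This is a simplified, hypothetical version of the patch's parsing, not the actual HBase method:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the extracted helper: parse "<deadServer>-<deadServer>-..."
// where each server is "host,port,startcode" and hosts may contain "-".
// A "-" is only treated as a delimiter once two commas (a complete
// ServerName) have been seen since the last split point.
public class DeadServerExtractor {
  public static List<String> extractDeadServers(String deadServerListStr) {
    List<String> result = new ArrayList<>();
    int seenCommaCnt = 0;
    int startIndex = 0;
    int len = deadServerListStr.length();
    for (int i = 0; i < len; i++) {
      char c = deadServerListStr.charAt(i);
      if (c == ',') {
        seenCommaCnt++;
      } else if (c == '-' && seenCommaCnt == 2) {
        result.add(deadServerListStr.substring(startIndex, i));
        startIndex = i + 1;
        seenCommaCnt = 0;
      }
    }
    if (startIndex < len) {
      result.add(deadServerListStr.substring(startIndex));
    }
    return result;
  }
}
```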


[jira] [Commented] (HBASE-8207) Replication could have data loss when machine name contains hyphen -

2013-03-28 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13616858#comment-13616858
 ] 

Ted Yu commented on HBASE-8207:
---

{code}
+ReplicationSource.extracDeadServersFromZNodeString(parts[1], 
this.deadRegionServers);
{code}
Class name, ReplicationSource, is not needed, right?
{code}
+  public void testNodeFailoveDeadServerParsing() throws Exception {
{code}
Typo: 'Failove' should be 'Failover'.


[jira] [Commented] (HBASE-8207) Replication could have data loss when machine name contains hyphen -

2013-03-28 Thread Lars Hofhansl (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13617039#comment-13617039
 ] 

Lars Hofhansl commented on HBASE-8207:
--

Looks good to me. replication.source.size.capacity is small enough to cause 
lots of activity; 10k is OK.

+1 on this one.


[jira] [Commented] (HBASE-8207) Replication could have data loss when machine name contains hyphen -

2013-03-28 Thread Jieshan Bean (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13617056#comment-13617056
 ] 

Jieshan Bean commented on HBASE-8207:
-

New patch also looks good to me. Is it necessary to add restrictions on 
peer-id when calling add_peer?
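
Such a restriction could be as simple as rejecting peer ids that contain the delimiter. A hypothetical guard (not an actual HBase check):

```java
// Hypothetical add_peer guard: since the recovered-queue znode name is
// "<peerId>-<deadServer>-...", a peer id containing "-" would still be
// ambiguous to parse, so reject it up front.
public class PeerIdValidator {
  public static boolean isValidPeerId(String peerId) {
    return peerId != null && !peerId.isEmpty() && peerId.indexOf('-') < 0;
  }
}
```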


[jira] [Commented] (HBASE-8207) Replication could have data loss when machine name contains hyphen -

2013-03-28 Thread Lars Hofhansl (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13617071#comment-13617071
 ] 

Lars Hofhansl commented on HBASE-8207:
--

I'd rather not do that (at least in 0.94) lest we get that wrong.


[jira] [Commented] (HBASE-8207) Replication could have data loss when machine name contains hyphen -

2013-03-28 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13617090#comment-13617090
 ] 

Hadoop QA commented on HBASE-8207:
--

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12576001/hbase-8207_v2.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 3 new 
or modified tests.

{color:green}+1 hadoop2.0{color}.  The patch compiles against the hadoop 
2.0 profile.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 lineLengths{color}.  The patch introduces lines longer than 
100

{color:red}-1 site{color}.  The patch appears to cause mvn site goal to 
fail.

{color:green}+1 core tests{color}.  The patch passed unit tests in .

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5044//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5044//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5044//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-client.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5044//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5044//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5044//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5044//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5044//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5044//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5044//console

This message is automatically generated.


[jira] [Commented] (HBASE-8207) Replication could have data loss when machine name contains hyphen -

2013-03-27 Thread Lars Hofhansl (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13615941#comment-13615941
 ] 

Lars Hofhansl commented on HBASE-8207:
--

[~jdcryans] FYI.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: 

[jira] [Commented] (HBASE-8207) Replication could have data loss when machine name contains hyphen -

2013-03-27 Thread Enis Soztutar (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13615950#comment-13615950
 ] 

Enis Soztutar commented on HBASE-8207:
--

I think the problem comes from 
ReplicationZookeeper.copyQueuesFromRSUsingMulti:
{code}
src/main/java/org/apache/hadoop/hbase/replication/ReplicationZookeeper.java:
    String newPeerId = peerId + "-" + znode;
src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java:
    String[] parts = peerClusterZnode.split("-");
{code}

Fixing it should not be that hard. Jeffrey, this is a production bug as well as 
a test bug, right?
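
The two lines above can be exercised directly. A minimal sketch (class and 
variable names are mine, not from any patch) of why split("-") misparses a 
hyphenated EC2 hostname, and how, for the simple single-server case, splitting 
only at the first hyphen keeps the server name intact:

```java
// Illustrative sketch (names are mine, not from the patch): why splitting a
// recovered replication queue znode on "-" breaks for hyphenated hostnames.
public class HyphenSplitDemo {
    public static void main(String[] args) {
        String peerId = "2";
        // ServerName layout: host,port,startcode
        String znode =
            "ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125";
        String newPeerId = peerId + "-" + znode;

        // Buggy parse: every hyphen inside the hostname becomes a split
        // point, yielding fragments like "156.us" and "west" -- exactly the
        // bogus deadRegionServers paths seen in the issue's log excerpt.
        String[] parts = newPeerId.split("-");
        System.out.println(parts.length); // 8 fragments instead of 2

        // In the simple (non-chained) case, only the first hyphen separates
        // the peer id from the server name, so split at that one occurrence.
        int idx = newPeerId.indexOf('-');
        System.out.println(newPeerId.substring(0, idx));                // 2
        System.out.println(newPeerId.substring(idx + 1).equals(znode)); // true
    }
}
```

Note the first-hyphen split only covers a single dead server; chained failovers 
produce znodes of the form peerId-server1-server2-..., which need a smarter 
parse.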




[jira] [Commented] (HBASE-8207) Replication could have data loss when machine name contains hyphen -

2013-03-27 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13615952#comment-13615952
 ] 

Ted Yu commented on HBASE-8207:
---

How about using '=' as the separator?


[jira] [Commented] (HBASE-8207) Replication could have data loss when machine name contains hyphen -

2013-03-27 Thread Jimmy Xiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13615957#comment-13615957
 ] 

Jimmy Xiang commented on HBASE-8207:


We need to make sure there is a proper migration plan, at least mentioning the 
fix in release notes.

 Replication could have data loss when machine name contains hyphen -
 --

 Key: HBASE-8207
 URL: https://issues.apache.org/jira/browse/HBASE-8207
 Project: HBase
  Issue Type: Bug
  Components: Replication
Affects Versions: 0.95.0, 0.94.6
Reporter: Jeffrey Zhong
Assignee: Jeffrey Zhong
Priority: Critical
 Fix For: 0.95.0, 0.98.0, 0.94.7

 Attachments: failed.txt


 In the recent test case TestReplication* failures, I'm finally able to find 
 the cause(or one of causes) for its intermittent failures.
 When a machine name contains -, it breaks the function 
 ReplicationSource.checkIfQueueRecovered. It causes the following issue:
 deadRegionServers list is way off so that replication doesn't wait for log 
 splitting finish for a wal file and move on to the next one(data loss)
 You can see that replication use those weird paths constructed from 
 deadRegionServers to check a file existence
 {code}
 2013-03-26 21:26:51,385 INFO  
 [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125]
  regionserver.ReplicationSource(524): Possible location 
 hdfs://localhost:52882/user/ec2-user/hbase/.logs/1.compute.internal,52170,1364333181125/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540
 2013-03-26 21:26:51,386 INFO  
 [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125]
  regionserver.ReplicationSource(524): Possible location 
 hdfs://localhost:52882/user/ec2-user/hbase/.logs/1.compute.internal,52170,1364333181125-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540
 2013-03-26 21:26:51,387 INFO  
 [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125]
  regionserver.ReplicationSource(524): Possible location 
 hdfs://localhost:52882/user/ec2-user/hbase/.logs/west/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540
 2013-03-26 21:26:51,389 INFO  
 [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125]
  regionserver.ReplicationSource(524): Possible location 
 hdfs://localhost:52882/user/ec2-user/hbase/.logs/west-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540
 2013-03-26 21:26:51,391 INFO  
 [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125]
  regionserver.ReplicationSource(524): Possible location 
 hdfs://localhost:52882/user/ec2-user/hbase/.logs/156.us/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540
 2013-03-26 21:26:51,394 INFO  
 [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125]
  regionserver.ReplicationSource(524): Possible location 
 hdfs://localhost:52882/user/ec2-user/hbase/.logs/156.us-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540
 2013-03-26 21:26:51,396 INFO  
 [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125]
  regionserver.ReplicationSource(524): Possible location 
 hdfs://localhost:52882/user/ec2-user/hbase/.logs/0/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540
 2013-03-26 21:26:51,398 INFO  
 [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125]
  regionserver.ReplicationSource(524): Possible location 
 hdfs://localhost:52882/user/ec2-user/hbase/.logs/0-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540
 {code}
 This happened in the recent test failure in 
 http://54.241.6.143/job/HBase-0.94/org.apache.hbase$hbase/21/testReport/junit/org.apache.hadoop.hbase.replication/TestReplicationQueueFailover/queueFailover/?auto_refresh=false
 Search for 
 {code}
 File does not exist: 
 hdfs://localhost:52882/user/ec2-user/hbase/.oldlogs/ip-10-197-0-156.us-west-1.compute.internal%2C52170%2C1364333181125.1364333199540
 {code}
 After 10 retries, the replication source gave up and moved on to the next 
 file, so data loss occurred. 
 Since many EC2 machine names contain '-', including our Jenkins servers, 
 this is a high-impact issue.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators

[jira] [Commented] (HBASE-8207) Replication could have data loss when machine name contains hyphen -

2013-03-27 Thread Jimmy Xiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13615960#comment-13615960
 ] 

Jimmy Xiang commented on HBASE-8207:


A non-printable/configurable delimiter?

 Replication could have data loss when machine name contains hyphen -
 --

 Key: HBASE-8207
 URL: https://issues.apache.org/jira/browse/HBASE-8207
 Project: HBase
  Issue Type: Bug
  Components: Replication
Affects Versions: 0.95.0, 0.94.6
Reporter: Jeffrey Zhong
Assignee: Jeffrey Zhong
Priority: Critical
 Fix For: 0.95.0, 0.98.0, 0.94.7

 Attachments: failed.txt


 In the recent TestReplication* test failures, I was finally able to find 
 the cause (or one of the causes) of the intermittent failures.
 When a machine name contains '-', it breaks 
 ReplicationSource.checkIfQueueRecovered, which causes the following issue:
 the deadRegionServers list is so far off that replication doesn't wait for log 
 splitting to finish for a WAL file and moves on to the next one (data loss).
 You can see that replication uses these odd paths, constructed from 
 deadRegionServers, to check whether a file exists:
 {code}
 2013-03-26 21:26:51,385 INFO  
 [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125]
  regionserver.ReplicationSource(524): Possible location 
 hdfs://localhost:52882/user/ec2-user/hbase/.logs/1.compute.internal,52170,1364333181125/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540
 2013-03-26 21:26:51,386 INFO  
 [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125]
  regionserver.ReplicationSource(524): Possible location 
 hdfs://localhost:52882/user/ec2-user/hbase/.logs/1.compute.internal,52170,1364333181125-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540
 2013-03-26 21:26:51,387 INFO  
 [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125]
  regionserver.ReplicationSource(524): Possible location 
 hdfs://localhost:52882/user/ec2-user/hbase/.logs/west/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540
 2013-03-26 21:26:51,389 INFO  
 [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125]
  regionserver.ReplicationSource(524): Possible location 
 hdfs://localhost:52882/user/ec2-user/hbase/.logs/west-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540
 2013-03-26 21:26:51,391 INFO  
 [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125]
  regionserver.ReplicationSource(524): Possible location 
 hdfs://localhost:52882/user/ec2-user/hbase/.logs/156.us/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540
 2013-03-26 21:26:51,394 INFO  
 [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125]
  regionserver.ReplicationSource(524): Possible location 
 hdfs://localhost:52882/user/ec2-user/hbase/.logs/156.us-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540
 2013-03-26 21:26:51,396 INFO  
 [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125]
  regionserver.ReplicationSource(524): Possible location 
 hdfs://localhost:52882/user/ec2-user/hbase/.logs/0/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540
 2013-03-26 21:26:51,398 INFO  
 [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125]
  regionserver.ReplicationSource(524): Possible location 
 hdfs://localhost:52882/user/ec2-user/hbase/.logs/0-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540
 {code}
 This happened in the recent test failure in 
 http://54.241.6.143/job/HBase-0.94/org.apache.hbase$hbase/21/testReport/junit/org.apache.hadoop.hbase.replication/TestReplicationQueueFailover/queueFailover/?auto_refresh=false
 Search for 
 {code}
 File does not exist: 
 hdfs://localhost:52882/user/ec2-user/hbase/.oldlogs/ip-10-197-0-156.us-west-1.compute.internal%2C52170%2C1364333181125.1364333199540
 {code}
 After 10 retries, the replication source gave up and moved on to the next 
 file, so data loss occurred. 
 Since many EC2 machine names contain '-', including our Jenkins servers, 
 this is a high-impact issue.


[jira] [Commented] (HBASE-8207) Replication could have data loss when machine name contains hyphen -

2013-03-27 Thread Jean-Marc Spaggiari (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13615963#comment-13615963
 ] 

Jean-Marc Spaggiari commented on HBASE-8207:


Anything that is not allowed in a host name might be acceptable, like a pipe, 
'%', or, as [~jxiang] proposed, a non-printable delimiter. But if we need to 
print it in the logs, the pipe option might be better. Also, if this is 
stored in ZK, then indeed we might need to accept both the existing '-' and 
the new delimiter for reads, and use only the new delimiter for writes...
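
The constraint above can be made concrete: a character is a safe delimiter exactly when it can never occur in a hostname (RFC 952/1123 allow only letters, digits, '.' and '-'). A minimal, hypothetical helper (class and method names are mine, not from any patch):

```java
// Illustrates Jean-Marc's point: a safe queue-name delimiter is any
// character that can never appear in a hostname. RFC 952/1123 restrict
// hostnames to letters, digits, '.' and '-', so '|', '%', ':' all qualify,
// while '-' itself does not.
public class DelimiterCheck {
  public static boolean safeDelimiter(char c) {
    boolean validHostChar = Character.isLetterOrDigit(c) || c == '.' || c == '-';
    return !validHostChar;
  }
}
```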


[jira] [Commented] (HBASE-8207) Replication could have data loss when machine name contains hyphen -

2013-03-27 Thread Jeffrey Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13615964#comment-13615964
 ] 

Jeffrey Zhong commented on HBASE-8207:
--

[~enis] This could be a production issue if a user sets up HBase with 
replication enabled in an EC2-like environment.

[~jxiang] I thought about changing to another delimiter, ':', which is NOT 
allowed in machine names or HDFS file names. However, it would cause a small 
migration issue, since the znode value can be either a peerId or 
peerId-servername-servername..., and a server name always contains ','.

Therefore, I'm planning to keep the current scheme and just patch 
checkIfQueueRecovered so that '-' is treated as a valid server-name delimiter 
only when a ',' has been seen before it. In 0.96 or onwards we could use a 
different character such as ':', which isn't allowed in either machine names 
or HDFS paths.

Thanks,
-Jeffrey
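
The ','-before-'-' rule described above can be sketched as follows. This is a minimal, hypothetical illustration (class and method names are mine, not from the actual patch): because every server name has the form hostname,port,startcode, a '-' only counts as a boundary between servers once a ',' has been seen since the previous boundary.

```java
import java.util.ArrayList;
import java.util.List;

public class RecoveredQueueParser {
  // Split a recovered-queue znode name such as
  //   "2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125"
  // into dead server names. The peer id (up to the first '-') is dropped;
  // after that, a '-' is accepted as a server-name boundary only if a ','
  // has been seen since the last boundary, since every server name is
  // hostname,port,startcode and hostnames may themselves contain '-'.
  public static List<String> extractDeadServers(String znodeName) {
    int firstHyphen = znodeName.indexOf('-');
    String rest = znodeName.substring(firstHyphen + 1); // drop "peerId-"
    List<String> servers = new ArrayList<>();
    int start = 0;
    boolean seenComma = false;
    for (int i = 0; i < rest.length(); i++) {
      char c = rest.charAt(i);
      if (c == ',') {
        seenComma = true;
      } else if (c == '-' && seenComma) {
        servers.add(rest.substring(start, i));
        start = i + 1;
        seenComma = false;
      }
    }
    servers.add(rest.substring(start));
    return servers;
  }
}
```

With this rule, the hyphen-heavy EC2 hostname in the logs above parses as a single server name instead of being shredded at every '-'.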


[jira] [Commented] (HBASE-8207) Replication could have data loss when machine name contains hyphen -

2013-03-27 Thread Lars Hofhansl (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13615966#comment-13615966
 ] 

Lars Hofhansl commented on HBASE-8207:
--

The same logic is in ReplicationZookeeper.copyQueuesFromRS, so at least this 
is nothing new.



[jira] [Commented] (HBASE-8207) Replication could have data loss when machine name contains hyphen -

2013-03-27 Thread Lars Hofhansl (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13615969#comment-13615969
 ] 

Lars Hofhansl commented on HBASE-8207:
--

I like ':' as a separator. We can do this in a backward-compatible way: since 
we know ':' will never appear in a valid hostname, it won't be in the 
concatenated strings. So when we read the data from ZK in ReplicationSource, 
we check whether it contains a ':'; if it does, we split on that, otherwise 
we split on '-' as we do now. We will always write in the new format.
Rolling restarts will still be a problem, though.
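
A minimal sketch of that read-both/write-new scheme (hypothetical class and method names, not from any patch): writers always join with ':', and readers split on ':' when present, falling back to the legacy '-' split for znodes written before the upgrade.

```java
import java.util.Arrays;
import java.util.List;

public class QueueNameCodec {
  // New format: join dead server names with ':', which can never occur
  // in a valid hostname, so the encoding is unambiguous.
  public static String encode(List<String> deadServers) {
    return String.join(":", deadServers);
  }

  // Backward-compatible read: a ':' means the value was written in the
  // new format; otherwise fall back to the legacy '-' separator, which
  // is ambiguous when a hostname itself contains hyphens.
  public static List<String> decode(String value) {
    if (value.indexOf(':') >= 0) {
      return Arrays.asList(value.split(":"));
    }
    return Arrays.asList(value.split("-"));
  }
}
```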



[jira] [Commented] (HBASE-8207) Replication could have data loss when machine name contains hyphen -

2013-03-27 Thread Enis Soztutar (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13615971#comment-13615971
 ] 

Enis Soztutar commented on HBASE-8207:
--

BTW, for future reviews, we should ensure that carrying data in znode names 
and file names, and the string parsing that goes with it, is avoided unless 
there is a valid reason.



[jira] [Commented] (HBASE-8207) Replication could have data loss when machine name contains hyphen -

2013-03-27 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13615977#comment-13615977
 ] 

Ted Yu commented on HBASE-8207:
---

I wonder why trunk builds didn't encounter such a problem.
From http://54.241.6.143/job/HBase-TRUNK/45/console:
{code}
Running 
org.apache.hadoop.hbase.replication.TestReplicationQueueFailoverCompressed
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 66.645 sec
...
Running org.apache.hadoop.hbase.replication.TestReplicationQueueFailover
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 77.707 sec
{code}

 {code}
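The fragments in those "possible locations" ("west", "156.us", "0", "1.compute.internal,52170,...") line up with what a blind split on "-" would produce from the recovered-queue znode name. The sketch below illustrates that failure mode; the class and method names are illustrative, not the actual HBase code, and the safer variant is just one possible fix that anchors on the ",port,startcode" suffix of a server name, which cannot contain hyphens.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RecoveredQueueParsing {

    // Buggy approach (illustrative): split the recovered-queue znode name,
    // e.g. "2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125",
    // on "-" and treat every token after the peer id as a dead server name.
    // A hostname containing hyphens shatters into fragments like "west" or
    // "156.us" -- exactly the weird path components seen in the log above.
    static List<String> naiveDeadServers(String znode) {
        String[] parts = znode.split("-");
        return new ArrayList<>(Arrays.asList(parts).subList(1, parts.length));
    }

    // Safer approach (a sketch, not the actual patch): a server name always
    // ends in ",<port>,<startcode>", and neither port nor startcode can
    // contain a hyphen, so scan for that pattern instead of splitting blindly.
    static List<String> parseDeadServers(String znode) {
        List<String> servers = new ArrayList<>();
        String rest = znode.substring(znode.indexOf('-') + 1); // drop "<peerId>-"
        Matcher m = Pattern.compile("[^,]+,\\d+,\\d+").matcher(rest);
        while (m.find()) {
            String s = m.group();
            // a further dead server would be appended as "-<name>,...": trim it
            servers.add(s.startsWith("-") ? s.substring(1) : s);
        }
        return servers;
    }

    public static void main(String[] args) {
        String znode =
            "2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125";
        System.out.println(naiveDeadServers(znode)); // hostname fragments
        System.out.println(parseDeadServers(znode)); // one real dead server name
    }
}
```

Because the naive parse yields fragments rather than real server names, the existence checks against ".logs/<fragment>" and ".logs/<fragment>-splitting" can never find the WAL, and replication gives up waiting for log splitting, which is the data-loss window described above.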
 This happened in the recent test failure in 
 http://54.241.6.143/job/HBase-0.94/org.apache.hbase$hbase/21/testReport/junit/org.apache.hadoop.hbase.replication/TestReplicationQueueFailover/queueFailover/?auto_refresh=false
 Search for 
 {code}
 File does not exist: 
 

[jira] [Commented] (HBASE-8207) Replication could have data loss when machine name contains hyphen -

2013-03-27 Thread Jeffrey Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13616060#comment-13616060
 ] 

Jeffrey Zhong commented on HBASE-8207:
--

[~te...@apache.org] It depends on whether log splitting can finish before 
replication gives up. In our test case setup, log splitting normally completes 
within 1-2 secs while replication takes about 5 secs to give up, so in most 
cases the test runs fine. In a local dev env, we don't have machine names 
containing a hyphen, so we can't even reproduce it.

Since we just changed Jenkins, I can't find more build history. From the 
results of my flaky test detector tool (HBASE-8018), which I ran on trunk on 
March 5, we can see that replication in trunk was flaky at that time:

_HBase-TRUNK (from last 10 runs)_

Failed Test Cases                                                                          3908 3909 3910 3912 3913 3914 3915 3916
org.apache.hadoop.hbase.replication.testreplicationqueuefailover.queuefailover                1    1    1   -1    0   -1    0    1
org.apache.hadoop.hbase.replication.testreplicationqueuefailovercompressed.queuefailover      1    1    1   -1    0   -1    0    1

_HBase-0.95 (from last 10 runs configurable)_

Failed Test Cases                                                                            21   22   23   24   25   27
org.apache.hadoop.hbase.replication.testreplicationqueuefailover.queuefailover                1   -1    0    1   -1    0
org.apache.hadoop.hbase.replication.testreplicationqueuefailovercompressed.queuefailover      0    1   -1    0   -1    0


